System-wide installation
There are different ways to deploy and install Omnistat in a data center, and each system will generally require a certain level of customization. Here, we provide the basic manual steps to install the Omnistat client and server, and then provide an example of how to deploy Omnistat in a data center using Ansible. Finally, an approach for integrating with the SLURM workload manager to track user jobs is discussed.
For system-wide installation, we recommend creating a dedicated Linux user that will run the Omnistat data collector daemon (omnistat-monitor). In addition, per the architecture highlighted in Figure 1, a separate server (or VM/container) is needed to host a Prometheus server and a Grafana instance. These services can run on your cluster head-node or on a separate administrative host. Note that if the host chosen for the Prometheus server can route out externally, you can also leverage public Grafana Cloud infrastructure and forward system telemetry data to an external Grafana instance.
To reiterate, the following assumptions are made throughout the rest of this system-wide installation discussion:
Installer has sudo or elevated credentials to install software system-wide, enable systemd services, and optionally modify the local SLURM configuration.
ROCm v6.1 or newer is pre-installed on all GPU hosts.
Installer has provisioned a dedicated user (e.g. omnidc) across all desired compute nodes of their system (see the sketch below).
Installer has identified a location to host a Prometheus server (if not already present) that has network access to all compute nodes.
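A minimal sketch of the dedicated-user provisioning step on a single host is shown below; the user name matches the omnidc assumption above, while the home directory and shell are illustrative defaults that sites may adjust or replace with their own account-management tooling:
# useradd -m -s /bin/bash omnidc
# id omnidc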
Omnistat software installation
To begin, we download the Omnistat software and install necessary Python dependencies. Per the assumptions above, we leverage a dedicated user to house the software install.
Download and expand latest release version.
[omnidc@login]$ REPO=https://github.com/AMDResearch/omnistat
[omnidc@login]$ curl -OLJ ${REPO}/archive/refs/tags/v1.0.0.tar.gz
[omnidc@login]$ tar xfz omnistat-1.0.0.tar.gz
Install dependencies.
[omnidc@login]$ cd omnistat-1.0.0
[omnidc@login]$ pip install --user -r requirements.txt
At this point, we can verify basic functionality of the data collector and launch the client by hand.
Launch the data collector (omnistat-monitor) interactively.
[omnidc@login]$ ./omnistat-monitor
Launching the data collector client as described above will use a set of default configuration options housed within the omnistat/config/omnistat.default file, including use of port 8001 for the Prometheus client. If all went well, example output from running omnistat-monitor is highlighted below:
Reading configuration from /home1/omnidc/omnistat/omnistat/config/omnistat.default
Allowed query IPs = ['127.0.0.1']
Runtime library loaded from /opt/rocm-6.2.1/lib/librocm_smi64.so
SMI library API initialized
SMI version >= 6
Number of GPU devices = 4
GPU topology indexing: Scanning devices from /sys/class/kfd/kfd/topology/nodes
--> Mapping: {0: '3', 1: '2', 2: '1', 3: '0'}
--> Using primary temperature location at edge
--> Using HBM temperature location at hbm_0
--> [registered] rocm_temperature_celsius -> Temperature (C) (gauge)
--> [registered] rocm_temperature_hbm_celsius -> HBM Temperature (C) (gauge)
--> [registered] rocm_average_socket_power_watts -> Average Graphics Package Power (W) (gauge)
--> [registered] rocm_sclk_clock_mhz -> current sclk clock speed (Mhz) (gauge)
--> [registered] rocm_mclk_clock_mhz -> current mclk clock speed (Mhz) (gauge)
--> [registered] rocm_vram_total_bytes -> VRAM Total Memory (B) (gauge)
--> [registered] rocm_vram_used_percentage -> VRAM Memory in Use (%) (gauge)
--> [registered] rocm_vram_busy_percentage -> Memory controller activity (%) (gauge)
--> [registered] rocm_utilization_percentage -> GPU use (%) (gauge)
[2024-07-09 13:19:33 -0500] [2995880] [INFO] Starting gunicorn 21.2.0
[2024-07-09 13:19:33 -0500] [2995880] [INFO] Listening at: http://0.0.0.0:8001 (2995880)
[2024-07-09 13:19:33 -0500] [2995880] [INFO] Using worker: sync
[2024-07-09 13:19:33 -0500] [2995881] [INFO] Booting worker with pid: 2995881
Note
You can override the default runtime configuration file above by setting an OMNISTAT_CONFIG environment variable or by using the ./omnistat-monitor --configfile option.
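For example, either of the following invocations points the collector at a custom configuration file; the /opt/omnistat/site.config path is purely illustrative:
[omnidc@login]$ OMNISTAT_CONFIG=/opt/omnistat/site.config ./omnistat-monitor
[omnidc@login]$ ./omnistat-monitor --configfile /opt/omnistat/site.config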
While the client is running interactively, we can use a separate command shell to query the client to further confirm functionality. The output below highlights an example query response on a system with four GPUs installed (note that the metrics include unique card labels to differentiate specific GPU measurements):
[omnidc@login]$ curl localhost:8001/metrics | grep rocm | grep -v "^#"
rocm_num_gpus 4.0
rocm_temperature_celsius{card="3",location="edge"} 38.0
rocm_temperature_celsius{card="2",location="edge"} 43.0
rocm_temperature_celsius{card="1",location="edge"} 40.0
rocm_temperature_celsius{card="0",location="edge"} 54.0
rocm_average_socket_power_watts{card="3"} 35.0
rocm_average_socket_power_watts{card="2"} 33.0
rocm_average_socket_power_watts{card="1"} 35.0
rocm_average_socket_power_watts{card="0"} 35.0
...
Once local functionality has been established, you can terminate the interactive test (ctrl-c) and proceed with an automated startup procedure.
Enable systemd service
Now that the software is installed under a dedicated user and basic functionality has been confirmed, the data collector can be enabled as a permanent service. The recommended approach is to leverage systemd, and an example service file named omnistat.service is included in the distribution. The contents of the file are shown below; the User, Environment, and ExecStart lines (and possibly CPUAffinity) are the ones most likely to require local customization.
[Unit]
Description=Prometheus exporter for HPC/GPU oriented metrics
Documentation=https://amdresearch.github.io/omnistat/
Requires=network-online.target
After=network-online.target
[Service]
User=omnidc
Environment="OMNISTAT_CONFIG=/home/omnidc/omnistat/omnistat/config/omnistat.default"
CPUAffinity=0
ExecStart=/home/omnidc/omnistat/omnistat-monitor
SyslogIdentifier=omnistat
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no
Nice=19
Restart=on-failure
[Install]
WantedBy=multi-user.target
Using elevated credentials, install the omnistat.service file across all desired compute nodes (e.g. as /etc/systemd/system/omnistat.service) and enable it using systemd (systemctl enable omnistat).
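A minimal sketch of this step for a single node is shown below; the compute-00 hostname is illustrative, and larger sites will typically script this across all nodes or use a configuration management tool such as Ansible (see the example later in this section):
# scp omnistat.service compute-00:/etc/systemd/system/omnistat.service
# ssh compute-00 systemctl daemon-reload
# ssh compute-00 systemctl enable --now omnistat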
Access restriction configuration
By default, the Omnistat data collector will only respond to queries initiated from the local host where the service is running. This behavior is controlled by a runtime configuration option and generally needs to be updated to include the IP address of the companion Prometheus server in order to gather system-wide metrics (see the follow-on discussion for additional details on configuring a Prometheus server). For example, if your locally configured Prometheus instance has an IP address of 10.0.0.42, update the omnistat/config/omnistat.default runtime file (or equivalent if using a custom configfile) to include the following setting:
[omnistat.collectors]
allowed_ips = 127.0.0.1, 10.0.0.42
Note
Alternatively, you can specify allowed_ips = 0.0.0.0 to disable all access restrictions.
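Once the setting is in place and the service restarted, access can be confirmed by querying a compute node directly from the Prometheus host; the compute-00 hostname is illustrative, and the reported GPU count will match the node in question:
[admin@prometheus]$ curl -s compute-00:8001/metrics | grep rocm_num_gpus
rocm_num_gpus 4.0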
Host telemetry
As mentioned in the intro discussion, we recommend enabling the popular node-exporter client for host-level monitoring, including CPU load, host memory usage, I/O, and network traffic. This client is available in most standard distros, and we highlight common package manager installs below. Alternatively, binary distributions can be downloaded directly from the upstream node-exporter project.
For Debian-based systems:
# apt-get install prometheus-node-exporter
For RHEL:
# dnf install golang-github-prometheus-node-exporter
For SUSE:
# zypper install golang-github-prometheus-node_exporter
The relevant OS package should be installed on all desired cluster hosts and enabled for execution (e.g. systemctl enable prometheus-node-exporter on a RHEL9-based system).
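Assuming a systemd-based distro, the snippet below enables the service immediately and confirms that host metrics are exported on the default node-exporter port (9100):
# systemctl enable --now prometheus-node-exporter
# curl -s localhost:9100/metrics | grep node_load1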
Note
The default node-exporter configuration can enable a significant number of metrics per host, while the example Grafana dashboards included with Omnistat rely on only a modest subset of the available metrics. In addition, if you wish to monitor InfiniBand traffic, an additional collector module needs to be enabled. The configuration below highlights example node-exporter arguments for the /etc/default/prometheus-node-exporter file that enable InfiniBand along with the metrics referenced in the example Omnistat dashboards.
ARGS='--collector.disable-defaults --collector.loadavg --collector.diskstats
--collector.meminfo --collector.stat --collector.netdev
--collector.infiniband'
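After updating the arguments file, restart the exporter so the revised collector set takes effect (the service name may vary slightly between distros); on hosts with InfiniBand hardware, the second command should return a non-zero count:
# systemctl restart prometheus-node-exporter
# curl -s localhost:9100/metrics | grep -c infiniband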
Prometheus server
Once the omnistat-monitor daemon is configured and running system-wide, we next install and configure a Prometheus server to enable automatic telemetry collection. This server typically runs on an administrative host and can be installed via package manager, by downloading a precompiled binary, or by using a Docker image. The install steps below highlight installation via package manager, followed by a simple scrape configuration.
Install: Prometheus server (via package manager)
For Debian-based systems:
# apt-get install prometheus
For RHEL:
# dnf install golang-github-prometheus
For SUSE:
# zypper install golang-github-prometheus-prometheus
Configuration: add a scrape configuration to Prometheus to enable telemetry collection. This configuration stanza typically resides in the /etc/prometheus/prometheus.yml runtime config file and controls which nodes to poll and at what frequency. The example below highlights configuration of two Prometheus jobs. The first enables an omnistat job to poll GPU data at 30 second intervals from four separate compute nodes. The second job enables collection from the recommended node-exporter to gather host-level data at a similar frequency (the default node-exporter port is 9100). We recommend keeping the scrape_interval setting at 5 seconds or larger.
scrape_configs:
  - job_name: "omnistat"
    scrape_interval: 30s
    scrape_timeout: 5s
    static_configs:
      - targets:
        - compute-00:8001
        - compute-01:8001
        - compute-02:8001
        - compute-03:8001

  - job_name: "node"
    scrape_interval: 30s
    scrape_timeout: 5s
    static_configs:
      - targets:
        - compute-00:9100
        - compute-01:9100
        - compute-02:9100
        - compute-03:9100
Edit your server’s prometheus.yml file using the snippet above as a guide and restart the Prometheus server to enable automatic data collection. Please also ensure that the target hosts configured for the Omnistat data collection allow queries initiated from this Prometheus server as discussed in the access restriction section.
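A minimal sketch of applying the change is shown below; the promtool utility ships with most Prometheus packages and can catch YAML mistakes before the restart:
# promtool check config /etc/prometheus/prometheus.yml
# systemctl restart prometheus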
Note
You may want to adjust the Prometheus server's default storage retention policy in order to retain telemetry data longer than the default (typically 15 days). Assuming you are using a distro-provided version of Prometheus, you can modify the systemd launch process to include a --storage.tsdb.retention.time option as shown in the snippet below:
[Service]
Restart=on-failure
User=prometheus
EnvironmentFile=/etc/default/prometheus
ExecStart=/usr/bin/prometheus $ARGS --storage.tsdb.retention.time=3y
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no
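After editing the unit file (or adding an equivalent drop-in via systemctl edit prometheus), reload systemd and restart the service so the new retention setting takes effect:
# systemctl daemon-reload
# systemctl restart prometheus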
Ansible example
For production cluster or data center deployments, configuration management tools like Ansible may be useful to automate installation of Omnistat. To aid in this process, the following example highlights key elements of an Ansible role that installs the necessary Python dependencies and configures the Omnistat and node-exporter Prometheus clients. These RHEL9-based example files are provided as a starting reference for system administrators and can be adjusted to suit local conventions.
Note that this recipe assumes the existence of a dedicated non-root user to run the Omnistat exporter, templated as {{ omnistat_user }}. It also assumes that an Omnistat release has been downloaded into a local path, templated as {{ omnistat_dir }}.
- name: Set omnistat_dir
  set_fact:
    omnistat_dir: "/path/to/omnistat-repo"
    omnistat_user: "omnidc"

- name: Show omnistat dir
  debug:
    msg: "Omnistat directory -> {{ omnistat_dir }}"
    verbosity: 0

- name: Install python package dependencies
  ansible.builtin.pip:
    requirements: "{{ omnistat_dir }}/requirements.txt"
  become_user: "{{ omnistat_user }}"

- name: Install python package dependencies to support query tool
  ansible.builtin.pip:
    requirements: "{{ omnistat_dir }}/requirements-query.txt"
  become_user: "{{ omnistat_user }}"

#--
# omnistat service file
#--

- name: install omnistat service file
  ansible.builtin.template:
    src: templates/omnistat.service.j2
    dest: /etc/systemd/system/omnistat.service
    mode: '0644'

- name: omnistat service enabled
  ansible.builtin.service:
    name: omnistat
    enabled: yes
    state: started

#--
# prometheus node exporter
#--

- name: node-exporter package
  ansible.builtin.yum:
    name: golang-github-prometheus-node-exporter
    state: installed

- name: /etc/default/prometheus-node-exporter
  ansible.builtin.template:
    src: prometheus-node-exporter.j2
    dest: /etc/default/prometheus-node-exporter
    owner: root
    group: root
    mode: '0644'

- name: node-exporter service enabled
  ansible.builtin.service:
    name: prometheus-node-exporter
    enabled: yes
    state: started
[Unit]
Description=Prometheus exporter for HPC/GPU oriented metrics
Documentation=https://amdresearch.github.io/omnistat/
Requires=network-online.target
After=network-online.target
[Service]
User={{ omnistat_user }}
Environment="OMNISTAT_CONFIG={{ omnistat_dir }}/omnistat/config/omnistat.default"
CPUAffinity=0
SyslogIdentifier=omnistat
ExecStart={{ omnistat_dir }}/omnistat-monitor
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no
Nice=19
Restart=on-failure
[Install]
WantedBy=multi-user.target
ARGS='--collector.disable-defaults --collector.loadavg --collector.diskstats --collector.meminfo --collector.stat --collector.netdev --collector.infiniband'
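These tasks and templates can then be included in a site playbook and applied to the compute partition; the playbook name, inventory path, and group name below are purely illustrative:
# ansible-playbook -i inventory/cluster.ini omnistat.yml --limit compute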
SLURM Integration
An optional info-metric capability exists within Omnistat to allow collected telemetry data to be mapped to individual jobs as they are scheduled by the resource manager. Multiple options exist to implement this integration, but the recommended approach for large-scale production resources is to leverage prolog/epilog functionality within SLURM to expose relevant job information to the Omnistat data collector. The remaining portion of this section highlights the basic steps for implementing this particular strategy.
Note/Assumption: the architecture of the resource manager integration assumes that compute nodes on the cluster are allocated exclusively (i.e., multiple SLURM jobs do not share the same host).
To enable resource manager tracking on the Omnistat client side, edit the chosen runtime config file and update the [omnistat.collectors] and [omnistat.collectors.rms] sections to include the following settings:
[omnistat.collectors]
port = 8001
enable_rocm_smi = True
enable_rms = True
[omnistat.collectors.rms]
job_detection_mode = file-based
job_detection_file = /tmp/omni_rmsjobinfo
The settings above enable the resource manager collector and configure Omnistat to query the /tmp/omni_rmsjobinfo file to derive dynamic job information. This file can be generated using the omnistat-rms-env utility from within an actively running job, or during prolog execution. The resulting file contains simple JSON content as follows:
{
"RMS_TYPE": "slurm",
"RMS_JOB_ID": "74129",
"RMS_JOB_USER": "auser",
"RMS_JOB_PARTITION": "devel",
"RMS_JOB_NUM_NODES": "2",
"RMS_JOB_BATCHMODE": 1,
"RMS_STEP_ID": -1
}
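If the Omnistat runtime configuration was changed after the service was enabled, restart omnistat-monitor on the compute nodes (e.g. systemctl restart omnistat) so the new collector settings take effect. To sanity check the file-based detection mechanism before wiring up prolog/epilog scripts, the file can also be generated by hand from within an active allocation, assuming the utility writes to the default /tmp/omni_rmsjobinfo location as in the prolog example later in this section; the hostname and install path below match the earlier assumptions:
[auser@compute-00]$ /home/omnidc/omnistat/omnistat-rms-env
[auser@compute-00]$ cat /tmp/omni_rmsjobinfo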
SLURM configuration update(s)
The second step to enable resource manager integration is to augment the prolog/epilog scripts configured for your local SLURM environment to create and tear down the /tmp/omni_rmsjobinfo file. Below are example snippets that can be added to the scripts. Note that in these examples, we assume a local slurm.conf configuration where the Prolog and Epilog scripts are enabled as follows:
Prolog=/etc/slurm/slurm.prolog
Epilog=/etc/slurm/slurm.epilog
# cache job data for omnistat
OMNISTAT_DIR="/home/omnidc/omnistat"
OMNISTAT_USER=omnidc
if [ -e ${OMNISTAT_DIR}/omnistat-rms-env ];then
su ${OMNISTAT_USER} -c ${OMNISTAT_DIR}/omnistat-rms-env
fi
# remove cached job info to indicate end of job
if [ -e "/tmp/omni_rmsjobinfo" ];then
rm -f /tmp/omni_rmsjobinfo
fi
Note
To make sure the cached job data file is created immediately upon allocation of a user job (instead of at the first srun invocation), be sure to include the following setting in your local SLURM configuration:
PrologFlags=Alloc