User-mode execution

In user-mode executions, Omnistat data collectors and a companion VictoriaMetrics server are deployed temporarily on hosts assigned to a user’s job, as highlighted in Figure 2. The following assumptions are made throughout the rest of this user-mode installation discussion:

Assumptions:

ROCm v6.1 or newer is pre-installed on all GPU hosts.
Installer has access to a distributed file-system; if no distributed file-system is present, installation steps need to be repeated across all nodes.

Omnistat software installation

To begin, we download the Omnistat software and install necessary Python dependencies. Per the assumptions above, we download and install Omnistat in a path accessible from all nodes.

Download and expand latest release version.

[user@login]$ REPO=https://github.com/AMDResearch/omnistat
[user@login]$ curl -OLJ ${REPO}/archive/refs/tags/v1.5.1.tar.gz
[user@login]$ tar xfz omnistat-1.5.1.tar.gz

Install dependencies.

[user@login]$ cd omnistat-v1.5.1
[user@login]$ pip install --user -r requirements.txt
[user@login]$ pip install --user -r requirements-query.txt

Note

Omnistat can also be installed as a Python package. Create a virtual environment, and install Omnistat and its dependencies from the top directory of the release.

[user@login]$ cd omnistat-v1.5.1
[user@login]$ python -m venv ~/venv/omnistat
[user@login]$ ~/venv/omnistat/bin/python -m pip install .[query]

Download a single-node VictoriaMetrics server. Assuming a victoria-metrics server is not already present on the system, download and extract a precompiled binary from upstream. This binary can generally be stored in any directory accessible by the user, but the path to the binary will need to be known during the next section when configuring user-mode execution. Note that VictoriaMetrics provides a larger number binary releases and we typically use the victoria-metrics-linux-amd64 variant on x86_64 clusters.

Configuring user-mode Omnistat

For user-mode execution, Omnistat includes additional options in the [omnistast.usermode] section of the runtime configuration file. A portion of the default config file is highlighted below with the lines in yellow indicating settings to confirm or customize for your local environment.

Listing 9 Sample Omnistat configuration file

 [omnistat.collectors]
 port = 8001
 enable_rocm_smi = True
 enable_rms = True

 [omnistat.collectors.rms]
 job_detection_mode = file-based
 job_detection_file = /tmp/omni_rmsjobinfo_user

 [omnistat.usermode]
 victoria_binary = /path/to/victoria-metrics
 victoria_datadir = data_prom
 victoria_logfile = vic_server.log

Running Jobs

To enable user-mode data collection for a specifid job, add logic within your job script to start and stop the collection mechanism before and after running your desired application(s). Omnistat includes an omnistat-usermode utility to help automate this process and the examples below highlight the steps for simple SLURM and Flux job scripts. Note that the lines highlighted in yellow need to be customized for the local installation path.

SLURM example

Listing 10 Example SLURM job file using user-mode Omnistat with a 10 second sampling interval

#!/bin/bash
#SBATCH -N 8
#SBATCH -n 16
#SBATCH -t 02:00:00

 export OMNISTAT_CONFIG=/path/to/omnistat.config
 export OMNISTAT_DIR=/path/to/omnistat

 # Beginning of job - start data collector
 ${OMNISTAT_DIR}/omnistat-usermode --start --interval 10

 # Run application(s) as normal
 srun <options> ./a.out

 # End of job -  stop data collection, generate summary and store collected data by jobid
 ${OMNISTAT_DIR}/omnistat-usermode --stopexporters
 ${OMNISTAT_DIR}/omnistat-query --job ${SLURM_JOB_ID} --interval 10
 ${OMNISTAT_DIR}/omnistat-usermode --stopserver
 mv data_prom data_prom_${SLURM_JOB_ID}

Flux example

Listing 11 Example FLUX job file using user-mode Omnistat with a 1 second sampling interval

#!/bin/bash
#flux: -N 8
#flux: -n 16
#flux: -t 2h

jobid=`flux getattr jobid`

export OMNISTAT_CONFIG=/path/to/omnistat.config
export OMNISTAT_DIR=/path/to/omnistat

# Beginning of job - start data collector
${OMNISTAT_DIR}/omnistat-usermode --start --interval 1

# Run application(s) as normal
flux run <options> ./a.out

# End of job -  stop data collection, generate summary and store collected data by jobid
${OMNISTAT_DIR}/omnistat-usermode --stopexporters
${OMNISTAT_DIR}/omnistat-query --job ${jobid} --interval 1
${OMNISTAT_DIR}/omnistat-usermode --stopserver
mv data_prom data_prom.${jobid}

In both examples above, the omnistat-query utility is used at the end of the job to query collected telemetry (prior to shutting down the server) for the assigned jobid. This should embed an ascii summary for the job similar to the report card example mentioned in the Overview directly within the recorded job output.

Exploring results locally

To explore results previously gathered via Omnistat user-mode execution, we provide a Docker environment that will automatically launch the required data exploration services locally. This containerized environment includes Victoria Metrics to read and query the stored data, and Grafana as a visualization platform to display time series and other metrics. The following steps outline the general process to visualize user-mode results locally:

Download the latest Omnistat release and proceed to the docker directory within Omnistat.

[user@login]$ REPO=https://github.com/AMDResearch/omnistat
[user@login]$ curl -OLJ ${REPO}/archive/refs/tags/v1.5.1.tar.gz
[user@login]$ tar xfz omnistat-1.5.1.tar.gz
[user@login]$ cd omnistat-v1.5.1/docker

Copy an Omnistat database collected in usermode to the local ./data directory. Note that all the contents of the victoria_datadir configuration option (or OMNISTAT_VICTORIA_DATADIR environment variable) need to be copied recursively, typically resulting in the following hierarchy:
```
./data/cache/
./data/data/
./data/flock.lock
./data/indexdb/
./data/metadata/
./data/snapshots/
./data/tmp/
```
Start Docker environment.
```
[user@login]$ docker compose up
```
This command will download the appropriate Docker images and prepare the environment to visualize Omnistat data. If everything works as expected, the startup process will conclude with output similar to the following indicating the Omnistat dashboard is ready:
```
Attaching to omnistat
omnistat  | Executing as user 1000:1000
omnistat  | Starting Victoria Metrics using ./data
omnistat  | Scanned database in 0.19 seconds
omnistat  | .. Number of jobs in the last 365 days: 1
omnistat  | Omnistat dashboard ready: http://localhost:3000
```
Note

You can also override the default database directory by setting the DATADIR variable when starting the Docker containers, e.g:
```
[user@login]$ DATADIR=/path/to/data docker compose up
```
Access Grafana dashboard at http://localhost:3000.

Teardown: when finished with local data exploration, you can press Ctrl+C to stop the Docker environment. To completely remove the containers, issue:

[user@login]$ docker compose down

Combining Omnistat databases

To work with multiple Omnistat collections at the same time (e.g to explore telemetry collected from different jobs), these first need to be merged into a single database. Omnistat’s Docker environment provides an option to trigger a merge operation by providing a MULTIDIR path (instead of DATADIR). When starting the Docker environment with this option, all databases residing under the directory pointed to by MULTIDIR will be loaded into a common database that will be used to support visualization of multiple jobs.

As an example, the following collection directory contains two Omnistat databases under the data-{0,1} subdirectories:
```
./collection/data-0/
./collection/data-1/
```
Start the services with the MULTIDIR variable to merge multiple databases:
```
[user@login]$ MULTIDIR=./collection docker compose up
```
While the services are started, a new database named _merged will be created automatically:
```
./collection/data-0/
./collection/data-1/
./collection/_merged/
```
Once the merged database is ready, all the information from data-0 and data-1 will be visible in the local Grafana dashboard at http://localhost:3000.

Note that it is also possible to copy new databases to the same MULTIDIR directory at a later time. To merge a new database, simply stop the Docker Compose environment and start it again with the same docker compose up. Only newly copied directories will be loaded into the merged database.

Exporting time series data

To explore and process raw Omnistat data without relying on the Docker environment or a Prometheus/VictoriaMetrics server, the omnistat-query tool has an option to export all time series data to CSV files.

${OMNISTAT_DIR}/omnistat-query --job ${jobid} --interval 1 --export

Exported CSV files are stored in the current directory by default. The --export flag accepts an optional argument to write CSV files to a different location. For example, --export export-data will store exported CSV files under the export-data directory.

The export functionality will generate one or more CSV files, depending on which collectors are enabled in the Omnistat configuration, as outlined in the following table.

File	Collector	Description
`omnistat-rocm.gpu.csv`	`rocm_smi` or `amd_smi`	GPU-level utilization and telemetry.
`omnistat-network.csv`	`network`	Network bandwidth.

Exported data can be easily loaded as a data frame using tools like Pandas for further processing.

Listing 12 Python script to read exported time series as a Pandas data frame

import pandas

df = pandas.read_csv("omnistat-rocm.gpu.csv", header=[0, 1, 2], index_col=0)

# Select a single metric
df["rocm_utilization_percentage"]

# Select a single metric and node
df["rocm_utilization_percentage"]["node01"]

# Select a single metric, node, and GPU
df["rocm_utilization_percentage"]["node01"]["0"]

# Select GPU Utilization and GPU Memory Utilization for GPU ID 0 in all nodes
df.loc[:, pandas.IndexSlice[["rocm_utilization_percentage", "rocm_vram_used_percentage"], :, ["0"]]]

Listing 13 Python script to plot average GPU Utilization per node

import pandas
import matplotlib.pyplot as plt

df = pandas.read_csv("omnistat-rocm.gpu.csv", header=[0, 1, 2], index_col=0)
df.index = pandas.to_datetime(df.index)

# Create a new dataframe with node averages
node_mean_df = df["rocm_utilization_percentage"].T.groupby(level=['instance']).mean().T

node_mean_df.plot(linewidth=1)
plt.title("Mean utilization per node")
plt.xlabel("Time")
plt.ylabel("GPU Utilization (%)")
plt.show()