User-mode execution
In user-mode executions, Omnistat data collectors and a companion VictoriaMetrics server are deployed temporarily on hosts assigned to a user’s job, as highlighted in Figure 2. The following assumptions are made throughout the rest of this user-mode installation discussion:
Assumptions:
ROCm v6.1 or newer is pre-installed on all GPU hosts.
Installer has access to a distributed file-system; if no distributed file-system is present, installation steps need to be repeated across all nodes.
Omnistat software installation
To begin, we download the Omnistat software and install necessary Python dependencies. Per the assumptions above, we download and install Omnistat in a path accessible from all nodes.
Download and expand latest release version.
[user@login]$ REPO=https://github.com/AMDResearch/omnistat
[user@login]$ curl -OLJ ${REPO}/archive/refs/tags/v1.4.0.tar.gz
[user@login]$ tar xfz omnistat-1.4.0.tar.gz
Install dependencies.
[user@login]$ cd omnistat-1.4.0
[user@login]$ pip install --user -r requirements.txt
[user@login]$ pip install --user -r requirements-query.txt
Note
Omnistat can also be installed as a Python package. Create a virtual environment, and install Omnistat and its dependencies from the top directory of the release.
[user@login]$ cd omnistat-1.4.0
[user@login]$ python -m venv ~/venv/omnistat
[user@login]$ ~/venv/omnistat/bin/python -m pip install .[query]
Download a single-node VictoriaMetrics server. Assuming a victoria-metrics server is not already present on the system, download and extract a precompiled binary from upstream. This binary can generally be stored in any directory accessible by the user, but the path to the binary will need to be known in the next section when configuring user-mode execution. Note that VictoriaMetrics provides a large number of binary releases; we typically use the victoria-metrics-linux-amd64 variant on x86_64 clusters.
Configuring user-mode Omnistat
For user-mode execution, Omnistat includes additional options in the [omnistat.usermode]
section of the runtime configuration file. A portion of the default config file is highlighted below, with the lines in yellow indicating settings to confirm or customize for your local environment.
[omnistat.collectors]
port = 8001
enable_rocm_smi = True
enable_rms = True
[omnistat.collectors.rms]
job_detection_mode = file-based
job_detection_file = /tmp/omni_rmsjobinfo_user
[omnistat.usermode]
ssh_key = ~/.ssh/id_rsa
victoria_binary = /path/to/victoria-metrics
victoria_datadir = data_prom
victoria_logfile = vic_server.log
push_frequency_mins = 5
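Since the runtime configuration uses INI-style sections, a quick way to sanity-check your edits before submitting a job is to parse the file with Python's standard configparser module. The snippet below is a minimal sketch using the sample settings above embedded as a string; in practice you would point config.read() at your own omnistat.config path (the parsing approach is an assumption for illustration, not a statement about Omnistat internals).

```python
import configparser

# Sample settings mirroring the default config shown above
sample = """
[omnistat.collectors]
port = 8001
enable_rocm_smi = True
enable_rms = True

[omnistat.usermode]
ssh_key = ~/.ssh/id_rsa
victoria_binary = /path/to/victoria-metrics
victoria_datadir = data_prom
push_frequency_mins = 5
"""

config = configparser.ConfigParser()
config.read_string(sample)  # or: config.read("/path/to/omnistat.config")

# Confirm the settings a user-mode run depends on are present and typed correctly
assert config.getint("omnistat.collectors", "port") == 8001
assert config.getboolean("omnistat.collectors", "enable_rms")
print(config.get("omnistat.usermode", "victoria_binary"))
```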
Running Jobs
To enable user-mode data collection for a specified job, add logic within your job script to start and stop the collection mechanism before and after running your desired application(s). Omnistat includes an omnistat-usermode
utility to help automate this process, and the examples below highlight the steps for simple SLURM and Flux job scripts. Note that the lines highlighted in
yellow need to be customized for the local installation path.
SLURM example
#!/bin/bash
#SBATCH -N 8
#SBATCH -n 16
#SBATCH -t 02:00:00
export OMNISTAT_CONFIG=/path/to/omnistat.config
export OMNISTAT_DIR=/path/to/omnistat
# Beginning of job - start data collector
${OMNISTAT_DIR}/omnistat-usermode --start --interval 10
# Run application(s) as normal
srun <options> ./a.out
# End of job - stop data collection, generate summary and store collected data by jobid
${OMNISTAT_DIR}/omnistat-usermode --stopexporters
${OMNISTAT_DIR}/omnistat-query --job ${SLURM_JOB_ID} --interval 10
${OMNISTAT_DIR}/omnistat-usermode --stopserver
mv data_prom data_prom_${SLURM_JOB_ID}
Flux example
#!/bin/bash
#flux: -N 8
#flux: -n 16
#flux: -t 2h
jobid=`flux getattr jobid`
export OMNISTAT_CONFIG=/path/to/omnistat.config
export OMNISTAT_DIR=/path/to/omnistat
# Beginning of job - start data collector
${OMNISTAT_DIR}/omnistat-usermode --start --interval 1
# Run application(s) as normal
flux run <options> ./a.out
# End of job - stop data collection, generate summary and store collected data by jobid
${OMNISTAT_DIR}/omnistat-usermode --stopexporters
${OMNISTAT_DIR}/omnistat-query --job ${jobid} --interval 1
${OMNISTAT_DIR}/omnistat-usermode --stopserver
mv data_prom data_prom.${jobid}
In both examples above, the omnistat-query
utility is used at the end of the job to query collected telemetry (prior to shutting down the server) for the assigned jobid. This embeds an ASCII summary for the job, similar to the report card example mentioned in the Overview, directly within the recorded job output.
Exploring results locally
To explore results previously gathered via Omnistat user-mode execution, we provide a Docker environment that will automatically launch the required data exploration services locally. This containerized environment includes VictoriaMetrics to read and query the stored data, and Grafana as a visualization platform to display time series and other metrics. The following steps outline the general process to visualize user-mode results locally:
Download the latest Omnistat release and proceed to the docker directory within Omnistat.
[user@login]$ REPO=https://github.com/AMDResearch/omnistat
[user@login]$ curl -OLJ ${REPO}/archive/refs/tags/v1.4.0.tar.gz
[user@login]$ tar xfz omnistat-1.4.0.tar.gz
[user@login]$ cd omnistat-1.4.0/docker
Copy an Omnistat database collected in user-mode to the local ./data directory. Note that all the contents of the victoria_datadir configuration option (or OMNISTAT_VICTORIA_DATADIR environment variable) need to be copied recursively, typically resulting in the following hierarchy:
./data/cache/
./data/data/
./data/flock.lock
./data/indexdb/
./data/metadata/
./data/snapshots/
./data/tmp/
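A minimal sketch of this copy step is shown below. The source directory name data_prom_12345 is hypothetical, standing in for a database saved at the end of a user-mode job (e.g., by the mv data_prom data_prom_${SLURM_JOB_ID} step in the job script examples); the first mkdir only fabricates that stand-in so the example is self-contained.

```shell
# Stand-in for a database saved by a previous user-mode job
# (replace data_prom_12345 with the path to your own collection)
mkdir -p data_prom_12345/cache data_prom_12345/data data_prom_12345/indexdb data_prom_12345/metadata

# Copy the full contents recursively into the ./data directory
# read by the Docker environment
mkdir -p ./data
cp -r data_prom_12345/. ./data/
ls ./data
```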
Start Docker environment.
[user@login]$ docker compose up
This command will download the appropriate Docker images and prepare the environment to visualize Omnistat data. If everything works as expected, the startup process will conclude with output similar to the following indicating the Omnistat dashboard is ready:
Attaching to omnistat
omnistat | Executing as user 1000:1000
omnistat | Starting Victoria Metrics using ./data
omnistat | Scanned database in 0.19 seconds
omnistat | .. Number of jobs in the last 365 days: 1
omnistat | Omnistat dashboard ready: http://localhost:3000
Note
You can also override the default database directory by setting the DATADIR variable when starting the Docker containers, e.g.:
[user@login]$ DATADIR=/path/to/data docker compose up
Access Grafana dashboard at http://localhost:3000.
Teardown: when finished with local data exploration, you can press Ctrl+C
to stop the Docker environment. To completely remove the containers, issue:
[user@login]$ docker compose down
Combining Omnistat databases
To work with multiple Omnistat collections at the same time (e.g., to explore telemetry collected from different jobs), they first need to be merged into a single database. Omnistat's Docker environment provides an option to trigger a merge operation by providing a MULTIDIR path (instead of DATADIR). When starting the Docker environment with this option, all databases residing under the directory pointed to by MULTIDIR will be loaded into a common database used to support visualization of multiple jobs.
As an example, the following collection directory contains two Omnistat databases under the data-{0,1} subdirectories:
./collection/data-0/
./collection/data-1/
Start the services with the MULTIDIR variable to merge multiple databases:
[user@login]$ MULTIDIR=./collection docker compose up
When the services are started, a new database named _merged will be created automatically:
./collection/data-0/
./collection/data-1/
./collection/_merged/
Once the merged database is ready, all the information from data-0 and data-1 will be visible in the local Grafana dashboard at http://localhost:3000.
Note that it is also possible to copy new databases to the same MULTIDIR directory at a later time. To merge a new database, simply stop the Docker Compose environment and start it again with the same docker compose up command. Only newly copied directories will be loaded into the merged database.
Exporting time series data
To explore and process raw Omnistat data without relying on the Docker
environment or a Prometheus/VictoriaMetrics server, the omnistat-query
tool
has an option to export all time series data to a CSV file.
${OMNISTAT_DIR}/omnistat-query --job ${jobid} --interval 1 --export data.csv
Exported data can be easily loaded as a data frame using tools like Pandas for further processing.
import pandas
df = pandas.read_csv("data.csv", header=[0, 1, 2], index_col=0)
# Select a single metric
df["rocm_utilization_percentage"]
# Select a single metric and node
df["rocm_utilization_percentage"]["node01"]
# Select a single metric, node, and GPU
df["rocm_utilization_percentage"]["node01"]["0"]
# Select GPU Utilization and GPU Memory Utilization for GPU ID 0 in all nodes
df.loc[:, pandas.IndexSlice[["rocm_utilization_percentage", "rocm_vram_used_percentage"], :, ["0"]]]
import pandas
import matplotlib.pyplot as plt
df = pandas.read_csv("data.csv", header=[0, 1, 2], index_col=0)
df.index = pandas.to_datetime(df.index)
# Create a new dataframe with node averages
node_mean_df = df["rocm_utilization_percentage"].T.groupby(level=['instance']).mean().T
node_mean_df.plot(linewidth=1)
plt.title("Mean utilization per node")
plt.xlabel("Time")
plt.ylabel("GPU Utilization (%)")
plt.show()