User-mode execution
In user-mode executions, Omnistat data collectors and a companion Prometheus server are deployed temporarily on hosts assigned to a user’s job, as highlighted in Figure 2. The following assumptions are made throughout the rest of this user-mode installation discussion:
Assumptions:
ROCm v6.1 or newer is pre-installed on all GPU hosts.
The installer has access to a distributed file system; if no distributed file system is present, the installation steps need to be repeated on all nodes.
Omnistat software installation
To begin, we download the Omnistat software and install necessary Python dependencies. Per the assumptions above, we download and install Omnistat in a path accessible from all nodes.
Download and expand latest release version.
[user@login]$ REPO=https://github.com/AMDResearch/omnistat
[user@login]$ curl -OLJ ${REPO}/archive/refs/tags/v1.1.0.tar.gz
[user@login]$ tar xfz omnistat-1.1.0.tar.gz
Install dependencies.
[user@login]$ cd omnistat-1.1.0
[user@login]$ pip install --user -r requirements.txt
[user@login]$ pip install --user -r requirements-query.txt
Note
Omnistat can also be installed as a Python package. Create a virtual environment, and install Omnistat and its dependencies from the top directory of the release.
[user@login]$ cd omnistat-1.1.0
[user@login]$ python -m venv ~/venv/omnistat
[user@login]$ ~/venv/omnistat/bin/python -m pip install .[query]
Download Prometheus. If a prometheus server is not already present on the system, download and extract a precompiled binary. This binary can generally be stored in any directory accessible by the user, but its path is needed in the next section when configuring user-mode execution.
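As an illustration of the Prometheus download step, the following is a minimal sketch rather than an official procedure. The version number, platform string, function names, and target directory are assumptions chosen for the example; substitute a current release for your system. Nothing is downloaded until `fetch_prometheus` is called.

```shell
# Assumption: version and platform are illustrative; pick a current release.
PROM_VERSION="${PROM_VERSION:-2.45.1}"
PROM_PLATFORM="${PROM_PLATFORM:-linux-amd64}"

# Build the download URL for a precompiled Prometheus release tarball.
# $1 = version (without the leading "v"), $2 = platform
prometheus_url() {
    echo "https://github.com/prometheus/prometheus/releases/download/v$1/prometheus-$1.$2.tar.gz"
}

# Download and extract the tarball into the directory given as $1.
fetch_prometheus() {
    mkdir -p "$1" &&
        curl -fsSL "$(prometheus_url "$PROM_VERSION" "$PROM_PLATFORM")" | tar xz -C "$1"
}

# Usage (hypothetical target directory):
#   fetch_prometheus "$HOME/prometheus"
# The binary is then expected at:
#   $HOME/prometheus/prometheus-<version>.<platform>/prometheus
```

The resulting path to the `prometheus` binary is the value to record for the configuration step that follows.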
Configuring user-mode Omnistat
For user-mode execution, Omnistat includes additional options in the [omnistat.usermode] section of the runtime configuration file. A portion of the default config file is highlighted below with the lines in yellow indicating settings to confirm or customize for your local environment.
[omnistat.collectors]
port = 8001
enable_rocm_smi = True
enable_rms = True
[omnistat.collectors.rms]
job_detection_mode = file-based
job_detection_file = /tmp/omni_rmsjobinfo_user
[omnistat.usermode]
ssh_key = ~/.ssh/id_rsa
prometheus_binary = /path/to/prometheus
prometheus_datadir = data_prom
prometheus_logfile = prom_server.log
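Before submitting a job, it can help to confirm that the configured Prometheus path actually points at an executable. The helper below is a hypothetical sketch, not part of Omnistat: `get_config_value` is a name invented for this example, and it assumes the simple `key = value` layout shown above.

```shell
# Print the value of a key inside one section of an INI-style file.
# $1 = config file, $2 = section name (without brackets), $3 = key
# Hypothetical helper; assumes plain "key = value" lines as in the listing above.
get_config_value() {
    awk -F '=' -v section="[$2]" -v key="$3" '
        /^\[/ { in_section = ($0 == section); next }   # track the current section
        in_section {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)             # trim the key
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)         # trim the value
                print v
                exit
            }
        }
    ' "$1"
}

# Usage:
#   prom_bin=$(get_config_value "$OMNISTAT_CONFIG" omnistat.usermode prometheus_binary)
#   [ -x "$prom_bin" ] || echo "prometheus binary not found: $prom_bin" >&2
```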
Running a SLURM Job
In the SLURM job script, add the following lines to start and stop the data collection before and after running the application. Lines highlighted in yellow need to be customized for different installation paths.
export OMNISTAT_CONFIG=/path/to/omnistat.config
export OMNISTAT_DIR=/path/to/omnistat
# Start data collector
${OMNISTAT_DIR}/omnistat-usermode --start --interval 10
# Run application(s) as normal
srun <options> ./a.out
# End of job - generate summary report and stop data collection
${OMNISTAT_DIR}/omnistat-query --job ${SLURM_JOB_ID} --interval 10
${OMNISTAT_DIR}/omnistat-usermode --stop
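If the application exits with an error, a job script that stops at the first failure may never reach the stop command. One way to guard against that is a small wrapper, sketched below. The function name `run_with_omnistat` is an assumption made for this example; it uses only the `omnistat-usermode` and `omnistat-query` invocations shown above and expects `OMNISTAT_DIR` and `SLURM_JOB_ID` to be set.

```shell
# Run a command between Omnistat start and stop, returning the command's exit code.
# Hypothetical wrapper; uses only the invocations from the job script above.
run_with_omnistat() {
    # Start data collection; give up if the collectors cannot be started.
    "${OMNISTAT_DIR}/omnistat-usermode" --start --interval 10 || return 1

    # Run the application and remember its exit status.
    omni_status=0
    "$@" || omni_status=$?

    # Generate the summary report and stop data collection even on failure.
    "${OMNISTAT_DIR}/omnistat-query" --job "${SLURM_JOB_ID}" --interval 10
    "${OMNISTAT_DIR}/omnistat-usermode" --stop

    return "$omni_status"
}

# Usage inside the job script:
#   run_with_omnistat srun <options> ./a.out
```

Returning the application's own exit status keeps the job's final state meaningful to the scheduler while still shutting down the collectors.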
Exploring results with a local Docker environment
To explore results generated for user-mode executions of Omnistat, we provide a Docker environment that automatically launches the required services locally. These include Prometheus to read and query the stored data, and Grafana as the visualization platform to display time series and other metrics.
To explore results:
Copy Prometheus data collected with Omnistat to ./prometheus-data. The entire datadir defined in the Omnistat configuration needs to be copied (e.g. a data directory should be present under ./prometheus-data).
Start services:
[user@login]$ export PROMETHEUS_USER="$(id -u):$(id -g)"
[user@login]$ docker compose up -d
User and group IDs are exported with the PROMETHEUS_USER variable to ensure the container has the right permissions to read the local data under the ./prometheus-data directory.
Access Grafana dashboard at http://localhost:3000. Note that starting Grafana can take a few seconds.
Stop services:
[user@login]$ docker compose down
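Because Grafana can take a few seconds to start, a script that opens the dashboard right after starting the services may find it unavailable. The sketch below polls until a command succeeds. The function name `wait_until` and the attempt count are assumptions made for this example; the usage line relies on Grafana's `/api/health` endpoint.

```shell
# Retry a command up to N times, sleeping one second between attempts.
# $1 = maximum number of attempts, remaining arguments = command to run
wait_until() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        # Succeed as soon as the command exits with status 0.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        if [ "$attempts" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# Usage after "docker compose up -d":
#   wait_until 30 curl -fsS http://localhost:3000/api/health \
#       && echo "Grafana is ready" \
#       || echo "Grafana did not respond in time" >&2
```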