Overview
Welcome to the documentation area for the Omnistat project. Use the navigation links on the left-hand side of this page to access more information on installation and capabilities.
Browse the Omnistat source code on GitHub
What is Omnistat?
Omnistat provides a set of utilities that help cluster administrators and individual application developers aggregate scale-out system metrics via low-overhead sampling, either across all hosts in a cluster or on the subset of hosts associated with a specific user job. At its core, Omnistat was designed to aid the collection of key telemetry from AMD Instinct™ accelerators (on a per-GPU basis). Relevant target metrics include:
GPU utilization (occupancy)
High-bandwidth memory (HBM) usage
GPU power
GPU temperature(s)
GPU clock frequency (MHz)
GPU memory clock frequency (MHz)
Inventory information:
ROCm driver version
GPU type
GPU vBIOS version
To enable scalable collection of these metrics, Omnistat provides a Python-based Prometheus client that supplies instantaneous metric values on demand for periodic polling by a companion Prometheus server.
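As a rough illustration of this pull-based model (not Omnistat's actual implementation), the sketch below uses the standard prometheus_client Python package to expose a per-GPU utilization gauge that is sampled each time the Prometheus server scrapes the endpoint; the metric name, the card label, and the read_gpu_utilization() helper are hypothetical placeholders.

```python
# Minimal sketch of an on-demand Prometheus exporter (not Omnistat's actual code).
# The metric name, "card" label, and read_gpu_utilization() helper are hypothetical.
import time

from prometheus_client import start_http_server
from prometheus_client.core import REGISTRY, GaugeMetricFamily


def read_gpu_utilization(card: int) -> float:
    """Placeholder for a low-overhead query of the GPU driver/SMI library."""
    return 0.0


class GPUCollector:
    def collect(self):
        # Invoked on every scrape of /metrics, so values are gathered on demand
        # rather than cached on a fixed timer.
        util = GaugeMetricFamily("gpu_utilization", "GPU utilization (%)", labels=["card"])
        for card in range(8):  # assume 8 GPUs per host
            util.add_metric([str(card)], read_gpu_utilization(card))
        yield util


if __name__ == "__main__":
    REGISTRY.register(GPUCollector())
    start_http_server(8001)  # Prometheus server polls http://host:8001/metrics
    while True:
        time.sleep(60)  # keep the exporter process alive between scrapes
```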
User-mode vs System-level monitoring
Omnistat utilities can be deployed with two primary use cases in mind, which differ based on the end consumer and whether the user has administrative rights. The use cases are denoted as follows:
System-wide monitoring: requires administrative rights and is typically used to monitor all GPU hosts within a given cluster in a 24x7 mode of operation. Use this approach to support system-wide telemetry collection for all user workloads and, optionally, to provide job-level insights for systems running the SLURM workload manager.
User-mode monitoring: does not require administrative rights and can be run entirely within user space. This case is typically exercised by end application users running on production SLURM clusters who want to gather telemetry data within a single SLURM job allocation. This approach is frequently performed entirely within a command-line ssh environment, but Omnistat also includes support for downloading data after a job for visualization with a dockerized Grafana environment. Alternatively, standalone query utilities can be used to summarize collected metrics at the conclusion of a SLURM job.
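As one concrete example of this style of post-job summarization, the hypothetical sketch below queries the Prometheus HTTP API directly to average a gpu_utilization metric over a job's start/end window; it is not one of Omnistat's bundled utilities, and the server address and metric name are assumptions.

```python
# Sketch of a standalone post-job query against the Prometheus HTTP API.
# Not an Omnistat utility; the server address and metric name are assumptions.
import time

import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus server address


def mean_gpu_utilization(start: float, end: float, step: str = "30s") -> float:
    """Average a hypothetical gpu_utilization metric over a job's time window."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": "avg(gpu_utilization)", "start": start, "end": end, "step": step},
        timeout=10,
    )
    resp.raise_for_status()
    series = resp.json()["data"]["result"]
    if not series:
        return 0.0
    values = series[0]["values"]  # list of [timestamp, "value"] pairs
    return sum(float(v) for _, v in values) / len(values)


# Example: summarize the previous hour (stand-in for a job's start/end times)
print(f"Mean GPU utilization: {mean_gpu_utilization(time.time() - 3600, time.time()):.1f}%")
```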
To demonstrate the overall data collection architecture employed by Omnistat in these two modes of operation, the following diagrams highlight the data collector layout and life-cycle for both cases.
Software dependencies
The basic minimum dependencies to enable data collection via Omnistat tools in user-mode are as follows:
ROCm (v6.1.0 or newer)
Python dependencies (see top-level requirements.txt)
System administrators wishing to deploy a system-wide GPU monitoring capability with near real-time visualization will also need one or more servers to host two additional services:
Grafana - either a local instance or cloud-based infrastructure can be leveraged
Prometheus server - used to periodically poll and aggregate data from multiple compute nodes
Resource Manager Integration
Omnistat can be optionally configured to map telemetry tracking to specific job IDs when using the popular SLURM resource manager. This is accomplished by enabling a Prometheus info metric that tracks node-level job assignments and makes the following metadata available to Prometheus (a small sketch follows this list):
job ID
username
partition name
number of nodes allocated
batch vs interactive job
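A loose sketch of how such an info metric can be published with the standard prometheus_client package is shown below; it is not Omnistat's actual implementation, and the metric name, label names, and values are hypothetical.

```python
# Sketch of a Prometheus info metric carrying node-level SLURM job metadata.
# Not Omnistat's actual code; the metric name, labels, and values are hypothetical.
from prometheus_client import Info, start_http_server

job_info = Info("slurm_job", "SLURM job assignment for this node")


def update_job_info():
    # In practice these values would be derived from the local SLURM environment
    # (e.g., squeue/scontrol output) for the job currently assigned to this node.
    job_info.info(
        {
            "jobid": "123456",
            "user": "aduser",
            "partition": "mi300",
            "nodes": "4",
            "batchflag": "1",  # batch (1) vs. interactive (0) job
        }
    )


if __name__ == "__main__":
    start_http_server(8001)  # expose metrics for the Prometheus server to poll
    update_job_info()
```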
Additional details on enabling this integration are discussed in the system-mode Installation section. In addition, job-oriented dashboards leveraging this feature are included in the companion Grafana discussion.