# Overview

```eval_rst
.. toctree::
   :glob:
   :maxdepth: 4
```

Welcome to the documentation area for the **Omnistat** project. Use the navigation links on the left-hand side of this page to access more information on installation and capabilities.

[Browse Omnistat source code on GitHub](https://github.com/AMDResearch/omnistat)

## What is Omnistat?

Omnistat provides a set of utilities that help cluster administrators and individual application developers aggregate scale-out system metrics via low-overhead sampling across all hosts in a cluster or, alternatively, on the subset of hosts associated with a specific user job. At its core, Omnistat was designed to aid collection of key telemetry from AMD Instinct(tm) accelerators (on a per-GPU basis). Relevant target metrics include:

* GPU utilization (occupancy)
* High-bandwidth memory (HBM) usage
* GPU power
* GPU temperature(s)
* GPU clock frequency
* GPU memory clock frequency
* RAS information:
  * Error counts per GPU block
* Inventory information:
  * ROCm driver version
  * GPU type
  * GPU vBIOS version

To enable scalable collection of these metrics, Omnistat provides a Python-based [Prometheus](https://prometheus.io) client that supplies instantaneous metric values on demand for periodic polling by a companion Prometheus server (or a [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics) server).

(user-vs-system)=
## User-mode vs System-level monitoring

Omnistat utilities can be deployed with two primary use cases in mind, which differ based on the end consumer and whether the user has administrative rights. The use cases are denoted as follows:

1. __System-wide monitoring__: requires administrative rights and is typically used to monitor all GPU hosts within a given cluster in a 24x7 mode of operation. Use this approach to support system-wide telemetry collection for all user workloads and, optionally, to provide job-level insights for systems running the [SLURM](https://slurm.schedmd.com) or [Flux](https://flux-framework.org) workload managers.
1. __User-mode monitoring__: does not require administrative rights and can be run entirely within user space. This case is typically exercised by end application users running on production clusters under the auspices of a resource manager who want to gather telemetry data within a single job allocation. Frequently, this approach is performed entirely within a command-line `ssh` environment, but Omnistat includes support for downloading data after a job for visualization with a dockerized Grafana environment. Alternatively, standalone query utilities can be used to summarize collected metrics at the conclusion of a job. Resource managers supported by user-mode Omnistat include both [SLURM](https://github.com/SchedMD/slurm) and [Flux](https://flux-framework.org).

To demonstrate the overall data collection architecture employed by Omnistat in these two modes of operation, the following diagrams highlight the data collector layout and life-cycle for both cases.

```{figure} images/architecture_system-mode.png
---
name: fig-system-mode
---
System-wide monitoring
```

```{figure} images/architecture_user-mode.png
---
name: fig-user-mode
---
User-mode monitoring
```

In the __system-wide monitoring__ case, a system administrator enables data collectors permanently on all relevant hosts within the cluster and configures a Prometheus server to periodically poll these nodes (e.g. at 1 minute or 5 minute intervals).
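To make the collector/poller relationship concrete, the sketch below shows the generic exporter pattern that a Prometheus server scrapes, written with the `prometheus_client` Python package. The metric name, port, and `read_gpu_utilization()` helper are illustrative placeholders and do not reflect Omnistat's actual module layout.

```python
# Generic Prometheus exporter pattern (illustrative only): the metric name, port,
# and read_gpu_utilization() helper are hypothetical, not Omnistat's implementation.
import random
import time

from prometheus_client import Gauge, start_http_server

# One gauge, labeled per GPU card, updated with instantaneous samples.
gpu_utilization = Gauge("gpu_utilization_percent",
                        "Instantaneous GPU occupancy (%)", ["card"])


def read_gpu_utilization(card: int) -> float:
    """Placeholder for a real per-GPU query (e.g. via ROCm SMI bindings)."""
    return random.uniform(0.0, 100.0)


if __name__ == "__main__":
    start_http_server(8001)  # expose an HTTP /metrics endpoint for the server to poll
    while True:
        for card in range(4):
            gpu_utilization.labels(card=str(card)).set(read_gpu_utilization(card))
        time.sleep(30)  # refresh cadence; the scrape interval is configured server-side
```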
The Prometheus server typically runs on the cluster head node (or a separate administrative host) and does not require GPU resources locally. For real-time and historical queries, the system administrator also enables a Grafana instance that queries the Prometheus datastore to provide a variety of visualizations with collected data. Example visualization panels using this approach are highlighted in the [Grafana](./grafana.md) section.

In addition to enabling GPU metrics collection in the __system-wide monitoring__ case, sites may also wish to collect host-side metrics (CPU load, memory usage, etc.). Other open-source Prometheus collectors exist for this purpose and we recommend enabling the [node-exporter](https://github.com/prometheus/node_exporter) in combination with Omnistat.

Conversely, in the __user-mode__ case, Omnistat data collector(s) and a companion VictoriaMetrics server are deployed temporarily on hosts assigned to a user's job. At the end of the job, Omnistat utilities can query cached telemetry data to summarize GPU utilization details, or the data can be visualized offline after the job completes. An example command-line summary from this user-mode approach is highlighted as follows:

(query_report_card)=
```eval_rst
.. _report_card:

.. code-block:: none
   :caption: Example telemetry summary report card in user-mode.

   ----------------------------------------
   Omnistat Report Card for Job # 44092
   ----------------------------------------

   Job Overview (Num Nodes = 1, Machine = Snazzy Cluster)
    --> Start time = 2024-05-17 10:14:00
    --> End time   = 2024-05-17 10:19:00

   GPU Statistics:

           | Utilization (%)  | Memory Use (%)   | Temperature (C)  | Power (W)        |
     GPU # |    Max     Mean  |    Max     Mean  |    Max     Mean  |    Max     Mean  |
   ------------------------------------------------------------------------------------
        0  | 100.00    55.56  |  95.09    63.40  |  61.00    52.89  | 300.00   177.67  |
        1  | 100.00    44.33  |  94.97    63.32  |  64.00    54.11  | 387.00   160.22  |
        2  | 100.00    55.56  |  94.92    63.28  |  60.00    51.11  | 304.00   176.78  |
        3  | 100.00    55.56  |  94.78    63.20  |  58.00    48.89  | 354.00   184.67  |

   --
   Query execution time = 0.1 secs
   Version = {__VERSION__}
```
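A summary like the one above is produced by querying the collected time series at the end of the job. For illustration, a rough sketch of that kind of query against the standard Prometheus HTTP API (which VictoriaMetrics also supports) might look like the following; the server URL and metric name are hypothetical placeholders rather than Omnistat's actual names.

```python
# Rough sketch of post-job summarization against the Prometheus HTTP API
# (VictoriaMetrics exposes a compatible query endpoint). The server URL and
# metric name below are hypothetical placeholders, not Omnistat's actual names.
import requests

PROM_URL = "http://localhost:9090"       # hypothetical metrics server address
METRIC = "gpu_utilization_percent"       # hypothetical per-GPU gauge
JOB_END = "2024-05-17T10:19:00Z"         # end of the 5-minute example job above
WINDOW = "5m"                            # job duration used as the lookback window


def instant_query(expr: str, when: str) -> list:
    """Evaluate a PromQL expression at a single point in time."""
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": expr, "time": when})
    resp.raise_for_status()
    return resp.json()["data"]["result"]


# Report max and mean utilization per GPU over the job window, evaluated at job end.
for stat in ("max_over_time", "avg_over_time"):
    for series in instant_query(f"{stat}({METRIC}[{WINDOW}])", JOB_END):
        card = series["metric"].get("card", "?")
        print(f"GPU {card}: {stat} = {float(series['value'][1]):.2f}")
```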
## Software dependencies

The basic minimum dependencies to enable data collection via Omnistat tools in user-mode are as follows:

* [ROCm](https://rocm.docs.amd.com/en/latest) (v6.1.0 or newer)
* Python dependencies (see the top-level [requirements.txt](https://github.com/AMDResearch/omnistat/blob/main/requirements.txt))

System administrators wishing to deploy a system-wide GPU monitoring capability with near real-time visualization will also need one or more servers to host two additional services:

* [Grafana](https://github.com/grafana/grafana) - either a local instance or cloud-based infrastructure
* [Prometheus server](https://prometheus.io/docs/prometheus/latest/getting_started/) - used to periodically poll and aggregate data from multiple compute nodes

## Resource Manager Integration

Omnistat can be _optionally_ configured to map telemetry tracking to specific job IDs when using the popular [SLURM](https://github.com/SchedMD/slurm) resource manager or the new [Flux](https://flux-framework.org) framework. This is accomplished by enabling a Prometheus info metric that tracks node-level job assignments and makes the following metadata available (see the sketch at the end of this section):

* job id
* username
* partition name
* number of nodes allocated
* batch vs. interactive job

Additional details on enabling this integration are discussed in the system-mode [Installation](./installation/system-install.md/#slurm-integration) section. In addition, job-oriented dashboards leveraging this feature are included in the companion [Grafana](./grafana.md) discussion.
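The info metric referenced above follows a common Prometheus idiom: job metadata is carried as labels on a constant-valued metric. A rough sketch of that general pattern with the `prometheus_client` Python package follows; the metric name, label names, and port are hypothetical and may not match what Omnistat actually registers.

```python
# Illustrative sketch of a Prometheus "info" metric carrying job metadata as labels.
# Metric name, label names, and port are hypothetical, not Omnistat's actual names.
from prometheus_client import Info, start_http_server

# prometheus_client appends an "_info" suffix, so this is exported as "rmsjob_info".
job_info = Info("rmsjob", "Metadata for the job currently assigned to this node")


def on_job_start(jobid: str, user: str, partition: str, nodes: int, batch: bool) -> None:
    """Publish metadata for the job that was just assigned to this node."""
    # Each call replaces the previous label set; dashboards can then join other
    # GPU metrics against the currently running job id.
    job_info.info({
        "jobid": jobid,
        "user": user,
        "partition": partition,
        "nodes": str(nodes),
        "batchflag": "1" if batch else "0",
    })


if __name__ == "__main__":
    start_http_server(8001)  # hypothetical exporter port
    on_job_start("44092", "alice", "gpu-partition", 1, True)
```

Because the metric's value is always 1, dashboards and queries can attach job metadata to other series with a standard PromQL join, e.g. `gpu_utilization_percent * on(instance) group_left(jobid) rmsjob_info` (again using the hypothetical names from the sketches above).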