Developer Guide

The core telemetry collection facilities within Omnistat are oriented around GPU metrics. However, Omnistat is designed with extensibility in mind and adopts an object-oriented approach using abstract base classes in Python to facilitate implementation of multiple data collectors. This allows developers to add custom data collectors relatively easily by implementing additional child classes of the Collector class highlighted below.

Listing 15 Base class definition housed in omnistat/collector_base.py
# Base Collector class - defines required methods for all metric collectors
# implemented as a child class.

from abc import ABC, abstractmethod

class Collector(ABC):
   # Required methods to be implemented by child classes
   @abstractmethod
   def registerMetrics(self):
      """Defines desired metrics to monitor with Prometheus. Called only once."""
      pass

   @abstractmethod
   def updateMetrics(self):
      """Updates defined metrics with latest values. Called at every polling interval."""
      pass

As shown above, the base Collector class requires developers to implement two methods when adding a new data collection mechanism:

  1. registerMetrics(): this method is called once during the Omnistat startup process and defines one or more Prometheus metrics to be monitored by the new collector.

  2. updateMetrics(): this method is called during every sampling request and is tasked with updating all defined metrics with the latest measured values.

Note: developers are free to implement other supporting routines to assist in their data collection needs, but are required to implement the two named methods above.
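
To illustrate what the abstract base class enforces, the following minimal sketch re-declares a stand-in Collector (mirroring the listing above rather than importing it from Omnistat) and shows that Python refuses to instantiate a child class that omits one of the two required methods. The IncompleteCollector, CompleteCollector, and can_instantiate names are hypothetical and exist only for this illustration.

```python
from abc import ABC, abstractmethod


# Stand-in for omnistat.collector_base.Collector so the sketch is self-contained.
class Collector(ABC):
    @abstractmethod
    def registerMetrics(self):
        pass

    @abstractmethod
    def updateMetrics(self):
        pass


# Hypothetical child class that forgets to implement updateMetrics().
class IncompleteCollector(Collector):
    def registerMetrics(self):
        pass


# Hypothetical child class that implements both required methods.
class CompleteCollector(Collector):
    def registerMetrics(self):
        pass

    def updateMetrics(self):
        pass


def can_instantiate(cls):
    """Return True if cls can be instantiated, False if abstract methods remain."""
    try:
        cls()
        return True
    except TypeError:
        # Raised by the ABC machinery when an abstract method is not overridden.
        return False


if __name__ == "__main__":
    print("IncompleteCollector:", can_instantiate(IncompleteCollector))
    print("CompleteCollector:", can_instantiate(CompleteCollector))
```

In practice this means a collector with a missing method fails as soon as it is constructed, rather than later during a sampling request.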

Example collector addition

To demonstrate the high-level steps for this process, this section walks through the steps needed to create an additional collection mechanism within Omnistat to track a node-level metric. For this example, we assume a developer has already cloned the Omnistat repository locally and has all necessary Python dependencies installed per the Installation discussion.

The specific goal of this example is to extend Omnistat with a new collector that provides a gauge metric called node_uptime_secs. This metric will derive information from the /proc/uptime file to track node uptime in seconds. In addition, since it is common to include labels with Prometheus metrics, we will include a label on the node_uptime_secs metric that tracks the locally running Linux kernel version.

Note

We prefer to always embed the metric units directly into the name of the metric to avoid ambiguity.

Add runtime config option for new collector

To begin enabling optional support for this new collector, let’s first add a runtime option that can be queried during initialization to decide whether to enable the collector or not. This requires changes to the initialization method of the Monitor class of Omnistat housed within the monitor.py source file. The code snippet below highlights the addition of this new runtime option, called enable_uptime, which defaults to False (that is, the collector is not enabled by default).

Listing 16 Code modification for Monitor::__init__ method in omnistat/monitor.py (new runtime option)
self.runtimeConfig = {}

self.runtimeConfig["collector_enable_rocm_smi"] = config["omnistat.collectors"].getboolean("enable_rocm_smi", True)
self.runtimeConfig["collector_enable_rms"] = config["omnistat.collectors"].getboolean("enable_rms", False)
self.runtimeConfig["collector_enable_amd_smi"] = config["omnistat.collectors"].getboolean("enable_amd_smi", False)
self.runtimeConfig["collector_enable_uptime"] = config["omnistat.collectors"].getboolean("enable_uptime", False)
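
As a rough illustration of how the fallback value behaves, the sketch below uses Python's standard configparser module on an in-memory configuration string. It assumes the config object in the listing above is a configparser.ConfigParser; the load_collector_flags helper and the SAMPLE text are hypothetical and are not part of Omnistat.

```python
import configparser

# Hypothetical configuration text; note that enable_rms is deliberately absent.
SAMPLE = """
[omnistat.collectors]
enable_rocm_smi = True
enable_uptime = True
"""


def load_collector_flags(text):
    """Parse config text and return collector flags, applying fallbacks for missing keys."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config["omnistat.collectors"]
    return {
        # The second positional argument is the fallback used when the key is missing.
        "collector_enable_rocm_smi": section.getboolean("enable_rocm_smi", True),
        "collector_enable_rms": section.getboolean("enable_rms", False),
        "collector_enable_uptime": section.getboolean("enable_uptime", False),
    }


if __name__ == "__main__":
    for name, enabled in load_collector_flags(SAMPLE).items():
        print(name, enabled)
```

Because the fallback is False, existing configuration files that never mention enable_uptime keep their current behavior.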

Implement the uptime data collector

Next, let’s implement the actual data collection mechanism. Recall that we only need to implement the two methods required by the Collector base class provided by Omnistat; the code listing below shows a complete working example. Note that Omnistat data collectors leverage the Python Prometheus client to define Gauge metrics. In this example, we include a kernel label for the node_uptime_secs metric that is determined once from /proc/version when the metric is registered. The node uptime is determined from /proc/uptime and is updated on every call to updateMetrics().

Listing 17 Code example implementing an uptime collector: omnistat/collector_uptime.py
import logging

from prometheus_client import Gauge

from omnistat.collector_base import Collector


class NODEUptime(Collector):
    def __init__(self):
        logging.debug("Initializing node uptime event collector")
        self.__metrics = {}  # instance storage for Prometheus metrics
        self.__kernelver = None  # instance storage for kernel version

    # Required child methods
    def registerMetrics(self):

        # gather local Linux kernel to store as a label
        with open("/proc/version", "r") as f:
            self.__kernelver = f.readline().split()[2]

        metricName = "node_uptime_secs"
        description = "System uptime (secs)"
        labels = ["kernel"]
        self.__metrics[metricName] = Gauge(metricName, description, labels)
        logging.info("--> [registered] %s -> %s (gauge)" % (metricName, description))
        return

    def updateMetrics(self):
        # snarf current uptime; file contains two floats - first number is uptime in seconds
        with open("/proc/uptime", "r") as f:
            uptime = float(f.readline().split()[0])
            self.__metrics["node_uptime_secs"].labels(kernel=self.__kernelver).set(uptime)
        return
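
The two parsing steps in the listing above can be checked in isolation, without reading the /proc files. The sketch below applies the same split-based logic to sample strings; the helper names and the sample lines (including the build host and the second uptime value) are hypothetical and only illustrate the expected layout of each file.

```python
def parse_kernel_version(version_line):
    """Return the kernel release, the third whitespace-separated field of /proc/version."""
    return version_line.split()[2]


def parse_uptime_secs(uptime_line):
    """Return uptime in seconds, the first of the two floats in /proc/uptime."""
    return float(uptime_line.split()[0])


# Hypothetical sample contents shaped like the real files.
SAMPLE_VERSION = "Linux version 5.14.0-162.18.1.el9_1.x86_64 (builder@example) #1 SMP\n"
SAMPLE_UPTIME = "280345.19 1119334.02\n"

if __name__ == "__main__":
    print(parse_kernel_version(SAMPLE_VERSION))
    print(parse_uptime_secs(SAMPLE_UPTIME))
```

Keeping the parsing in small pure functions like these is one possible design choice that makes a collector easier to unit test on machines where the /proc files differ or are unavailable.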

Register the new collector

Assuming the collector code from the previous step has been saved locally as omnistat/collector_uptime.py, the final step is to register the new collector when the runtime option is enabled. This modification also amends the initialization method of the Monitor class residing in monitor.py, with the necessary changes highlighted below.

Listing 18 Code modification for Monitor::__init__ method in omnistat/monitor.py (register collector)
      if self.runtimeConfig["collector_enable_events"]:
         from omnistat.collector_events import ROCMEvents
         self.__collectors.append(ROCMEvents())

      if self.runtimeConfig["collector_enable_uptime"]:
         from omnistat.collector_uptime import NODEUptime
         self.__collectors.append(NODEUptime())
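
To make the resulting call pattern concrete, the following sketch models the lifecycle described earlier: collectors are appended only when their runtime option is enabled, registerMetrics() is called once, and updateMetrics() is called on every sampling request. This is not the actual Monitor implementation; FakeUptime, build_collectors, and run are hypothetical names used purely for illustration.

```python
class FakeUptime:
    """Hypothetical stand-in collector that only counts how often it is called."""

    def __init__(self):
        self.registered = 0
        self.updated = 0

    def registerMetrics(self):
        self.registered += 1

    def updateMetrics(self):
        self.updated += 1


def build_collectors(runtime_config):
    """Append a collector only when its runtime option is enabled."""
    collectors = []
    if runtime_config.get("collector_enable_uptime", False):
        collectors.append(FakeUptime())
    return collectors


def run(runtime_config, num_samples):
    """Register every enabled collector once, then update it once per sample."""
    collectors = build_collectors(runtime_config)
    for collector in collectors:
        collector.registerMetrics()
    for _ in range(num_samples):
        for collector in collectors:
            collector.updateMetrics()
    return collectors


if __name__ == "__main__":
    (uptime,) = run({"collector_enable_uptime": True}, num_samples=3)
    print("registered:", uptime.registered, "updated:", uptime.updated)
```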

Putting it all together

Following the three steps above to implement a new uptime data collector, we should now be able to run the omnistat-monitor data collector interactively to confirm availability of the additional metric. Since we configured this to be an optional collector that is not enabled by default, we need to first modify the runtime configuration file to enable the new option. To do this, add the highlighted line below to the local omnistat/config/omnistat.default file.

[omnistat.collectors]

port = 8001
enable_rocm_smi = True
enable_amd_smi = False
enable_rms = False
enable_uptime = True

Now, launch the data collector interactively:

[omnidc@login]$ ./omnistat-monitor

If all went well, we should see a new log message for the node_uptime_secs metric.

Reading configuration from /home1/omnidc/omnistat/omnistat/config/omnistat.default
...
GPU topology indexing: Scanning devices from /sys/class/kfd/kfd/topology/nodes
--> Mapping: {0: '3', 1: '2', 2: '1', 3: '0'}
--> Using primary temperature location at edge
--> Using HBM temperature location at hbm_0
--> [registered] rocm_temperature_celsius -> Temperature (C) (gauge)
--> [registered] rocm_temperature_hbm_celsius -> HBM Temperature (C) (gauge)
--> [registered] rocm_average_socket_power_watts -> Average Graphics Package Power (W) (gauge)
--> [registered] rocm_sclk_clock_mhz -> current sclk clock speed (Mhz) (gauge)
--> [registered] rocm_mclk_clock_mhz -> current mclk clock speed (Mhz) (gauge)
--> [registered] rocm_vram_total_bytes -> VRAM Total Memory (B) (gauge)
--> [registered] rocm_vram_used_percentage -> VRAM Memory in Use (%) (gauge)
--> [registered] rocm_vram_busy_percentage -> Memory controller activity (%) (gauge)
--> [registered] rocm_utilization_percentage -> GPU use (%) (gauge)
--> [registered] node_uptime_secs -> System uptime (secs) (gauge)

As a final test while the omnistat-monitor client is still running interactively, use a separate command shell to query the Prometheus endpoint.

[omnidc@login]$ curl localhost:8001/metrics | grep -v "^#"
rocm_num_gpus 4.0
rocm_temperature_celsius{card="3",location="edge"} 38.0
rocm_temperature_celsius{card="2",location="edge"} 43.0
rocm_temperature_celsius{card="1",location="edge"} 40.0
rocm_temperature_celsius{card="0",location="edge"} 54.0
rocm_average_socket_power_watts{card="3"} 35.0
rocm_average_socket_power_watts{card="2"} 33.0
rocm_average_socket_power_watts{card="1"} 35.0
rocm_average_socket_power_watts{card="0"} 35.0
...
node_uptime_secs{kernel="5.14.0-162.18.1.el9_1.x86_64"} 280345.19

Here we see the new metric reporting the latest node uptime along with the locally running kernel version embedded as a label. Wahoo, we did a thing.
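
The same check can also be scripted. The sketch below scans metrics text in the format shown above and extracts the samples for a given metric name. The find_samples helper is hypothetical, and the parsing is deliberately simplistic: it assumes each sample line ends with its value and carries no trailing timestamp.

```python
def find_samples(metrics_text, metric_name):
    """Return (labels_text, value) pairs for the lines that belong to metric_name."""
    samples = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is assumed to follow the last space on the line.
        name_and_labels, _, value = line.rpartition(" ")
        name, _, labels = name_and_labels.partition("{")
        if name == metric_name:
            samples.append((labels.rstrip("}"), float(value)))
    return samples


# Sample text taken from the curl output shown above.
SAMPLE_OUTPUT = """\
rocm_num_gpus 4.0
rocm_temperature_celsius{card="3",location="edge"} 38.0
rocm_temperature_celsius{card="2",location="edge"} 43.0
node_uptime_secs{kernel="5.14.0-162.18.1.el9_1.x86_64"} 280345.19
"""

if __name__ == "__main__":
    for labels, value in find_samples(SAMPLE_OUTPUT, "node_uptime_secs"):
        print(labels, value)
```

While omnistat-monitor is running, the input text could instead be fetched from the endpoint, for example with urllib.request.urlopen("http://localhost:8001/metrics").read().decode().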