
Metrix

Clean, human-readable metrics for AMD GPUs. No more cryptic hardware counters.

Why Metrix?

  • Clean Python API with modern design
  • Human-readable metrics instead of raw counters
  • Unit tested and reliable
  • 13 Memory Metrics: bandwidth, cache, coalescing, LDS, atomic latency
  • 5 Compute Metrics: FLOPS, arithmetic intensity (HBM/L2/L1), compute throughput
  • Multi-run profiling: automatic aggregation with min/max/avg statistics
  • Kernel filtering: efficient regex filtering at rocprofv3 level
  • Multiple output formats: text, JSON, CSV

Installation

pip install -e .

Quick start

# Profile with all metrics (architecture auto-detected)
metrix ./my_app
# Time only (fast)
metrix --time-only -n 10 ./my_app
# Filter kernels by name
metrix --kernel matmul ./my_app
# Custom metrics
metrix --metrics memory.l2_hit_rate,memory.coalescing_efficiency ./my_app
# Save to JSON
metrix -o results.json ./my_app

Python API

from metrix import Metrix

# Architecture is auto-detected
profiler = Metrix()
results = profiler.profile("./my_app", num_replays=5)

for kernel in results.kernels:
    print(f"{kernel.name}: {kernel.duration_us.avg:.2f} μs")
    for metric, stats in kernel.metrics.items():
        print(f"  {metric}: {stats.avg:.2f}")
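The aggregated min/max/avg statistics from multi-run profiling behave like plain numeric fields, so results are easy to post-process. A minimal sketch using stand-in data (the Stat and Kernel classes here are illustrative stand-ins, not Metrix's real types; only .avg appears above, while .min and .max are assumed from the min/max/avg feature description):

```python
from dataclasses import dataclass

@dataclass
class Stat:
    # Stand-in for an aggregated statistic (illustrative only)
    min: float
    max: float
    avg: float

@dataclass
class Kernel:
    name: str
    duration_us: Stat

kernels = [
    Kernel("softmax", Stat(12.1, 14.0, 12.8)),
    Kernel("matmul", Stat(101.2, 118.9, 107.4)),
]

# Sort by average duration, slowest first (mirrors the CLI's --top K view)
for k in sorted(kernels, key=lambda k: k.duration_us.avg, reverse=True):
    print(f"{k.name}: avg {k.duration_us.avg:.2f} μs "
          f"(min {k.duration_us.min:.2f}, max {k.duration_us.max:.2f})")
```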

Available metrics

Memory bandwidth

Metric                              Description
memory.hbm_read_bandwidth           HBM read bandwidth (GB/s)
memory.hbm_write_bandwidth          HBM write bandwidth (GB/s)
memory.hbm_bandwidth_utilization    % of peak HBM bandwidth
memory.bytes_transferred_hbm        Total bytes through HBM
memory.bytes_transferred_l2         Total bytes through L2 cache
memory.bytes_transferred_l1         Total bytes through L1 cache
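Bandwidth utilization is simply measured HBM traffic expressed as a fraction of the device's peak. A sketch of the arithmetic (the 1600 GB/s peak below is an example figure, not tied to any specific GPU):

```python
def hbm_utilization(read_gbs: float, write_gbs: float, peak_gbs: float) -> float:
    """Combined HBM read + write traffic as a percentage of peak bandwidth."""
    return (read_gbs + write_gbs) / peak_gbs * 100.0

# Example: 900 GB/s read + 300 GB/s write against a 1600 GB/s peak
print(f"{hbm_utilization(900.0, 300.0, 1600.0):.1f}%")  # 75.0%
```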

Cache performance

Metric                Description
memory.l1_hit_rate    L1 cache hit rate (%)
memory.l2_hit_rate    L2 cache hit rate (%)
memory.l2_bandwidth   L2 cache bandwidth (GB/s)

Memory access patterns

Metric                           Description
memory.coalescing_efficiency     Memory coalescing efficiency (%)
memory.global_load_efficiency    Global load efficiency (%)
memory.global_store_efficiency   Global store efficiency (%)

Local data share (LDS)

Metric                      Description
memory.lds_bank_conflicts   LDS bank conflicts per instruction

Atomic operations

Metric                  Description
memory.atomic_latency   Atomic operation latency (cycles)

Compute

Metric                             Description
compute.total_flops                Total floating-point operations performed
compute.hbm_gflops                 Compute throughput (GFLOPS)
compute.hbm_arithmetic_intensity   Ratio of FLOPs to HBM bytes (FLOP/byte)
compute.l2_arithmetic_intensity    Ratio of FLOPs to L2 bytes (FLOP/byte)
compute.l1_arithmetic_intensity    Ratio of FLOPs to L1 bytes (FLOP/byte)
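Arithmetic intensity is the ratio of floating-point work to bytes moved, and comparing it against a device's FLOP/byte "ridge point" indicates whether a kernel is compute- or memory-bound. A roofline-style sketch of that check (the peak numbers are illustrative, not taken from any particular GPU):

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte of traffic at a given memory level (HBM, L2, or L1)."""
    return flops / bytes_moved

def is_compute_bound(intensity: float, peak_gflops: float, peak_gbs: float) -> bool:
    # Ridge point of the roofline: peak compute / peak bandwidth (FLOP/byte)
    ridge = peak_gflops / peak_gbs
    return intensity > ridge

ai = arithmetic_intensity(flops=2.0e12, bytes_moved=4.0e10)  # 50 FLOP/byte
print(is_compute_bound(ai, peak_gflops=45000.0, peak_gbs=1600.0))  # True
```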

CLI reference

metrix [--version] <command> ...
metrix profile [options] <target>
--profile, -p Metric profile: quick | memory | memory_bandwidth |
memory_cache | compute (default: all metrics if omitted)
--metrics, -m Comma-separated list of metrics
--time-only Only collect timing, no hardware counters
--kernel, -k Filter kernels by name (regex, passed to rocprofv3)
--num-replays, -n Replay the application N times and aggregate (default: 10)
--aggregate Aggregate metrics by kernel name across replays
--top K Show only top K slowest kernels
--output, -o Output file (.json, .csv, .txt)
--timeout SECONDS Profiling timeout in seconds (default: 60)
--log, -l Logging level: debug | info | warning | error
--quiet, -q Quiet mode
--no-counters Omit raw counter values from output
metrix list <metrics|profiles|devices> [--category CAT]
metrix info <metric|profile> <name>

GPU architecture is auto-detected using rocminfo.
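Auto-detection boils down to pulling the gfx architecture name out of rocminfo's agent listing. A minimal sketch of how that parsing could look (the regex and sample text assume rocminfo's usual "Name: gfxNNN" lines; this is not Metrix's actual implementation):

```python
import re
from typing import Optional

def parse_gfx_arch(rocminfo_output: str) -> Optional[str]:
    """Return the first gfx architecture name found in rocminfo output."""
    match = re.search(r"\bName:\s+(gfx[0-9a-f]+)", rocminfo_output)
    return match.group(1) if match else None

sample = """
Agent 2
  Name:                    gfx90a
  Marketing Name:          AMD Instinct MI250X
"""
print(parse_gfx_arch(sample))  # gfx90a
```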

Architecture

Metrix uses a clean backend architecture where hardware counter names appear exactly once as function parameters:

@metric("memory.l2_hit_rate")
def _l2_hit_rate(self, TCC_HIT_sum, TCC_MISS_sum):
    total = TCC_HIT_sum + TCC_MISS_sum
    return (TCC_HIT_sum / total) * 100 if total > 0 else 0.0

This eliminates error-prone mapping dictionaries and makes the codebase maintainable. Adding new metrics or supporting new GPU architectures is straightforward.
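This pattern can be implemented by introspecting each metric function's parameter names to discover which counters it needs. A minimal sketch of the idea using inspect.signature (the registry and helper names here are illustrative, not Metrix's actual internals):

```python
import inspect

METRICS = {}  # metric name -> (function, required counter names)

def metric(name):
    """Register a metric; its parameter names are the counters it consumes."""
    def decorator(fn):
        counters = [p for p in inspect.signature(fn).parameters if p != "self"]
        METRICS[name] = (fn, counters)
        return fn
    return decorator

@metric("memory.l2_hit_rate")
def _l2_hit_rate(TCC_HIT_sum, TCC_MISS_sum):
    total = TCC_HIT_sum + TCC_MISS_sum
    return (TCC_HIT_sum / total) * 100 if total > 0 else 0.0

def evaluate(name, counter_values):
    fn, counters = METRICS[name]
    return fn(*(counter_values[c] for c in counters))

print(evaluate("memory.l2_hit_rate", {"TCC_HIT_sum": 90, "TCC_MISS_sum": 10}))  # 90.0
```

Because the counter names live only in the function signature, adding a metric is a single decorated function, with no separate counter-to-metric mapping to keep in sync.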

Requirements

  • Python 3.9+
  • ROCm 6.x with rocprofv3
  • pandas >= 1.5.0