
Metrix

Clean, human-readable metrics for AMD GPUs. No more cryptic hardware counters.

Why Metrix?

  • Clean Python API with modern design
  • Human-readable metrics instead of raw counters
  • Unit tested and reliable
  • 20 metrics across memory, cache, compute, and GPU utilization (availability varies by GPU architecture)
  • Multi-run profiling: automatic aggregation with min/max/avg statistics
  • Kernel filtering: efficient regex filtering at rocprofv3 level
  • Multiple output formats: text, JSON, CSV

Installation

# Run from the root of a metrix source checkout (editable install)
pip install -e .

Quick start

# Profile with all metrics (architecture auto-detected)
metrix ./my_app
# Time only (fast)
metrix --time-only -n 10 ./my_app
# Filter kernels by name
metrix --kernel matmul ./my_app
# Custom metrics
metrix --metrics memory.l2_hit_rate,memory.coalescing_efficiency ./my_app
# Save to JSON
metrix -o results.json ./my_app
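The JSON output can be post-processed in plain Python. A minimal sketch, assuming a hypothetical schema in which results are a list of per-kernel records (field names here are illustrative; the actual schema may differ):

```python
import json

# Hypothetical example of metrix JSON output; the real schema may differ.
results_json = """
[
  {"kernel": "matmul_kernel", "time_ms": 1.92, "memory.l2_hit_rate": 81.5},
  {"kernel": "reduce_kernel", "time_ms": 0.41, "memory.l2_hit_rate": 93.0}
]
"""

records = json.loads(results_json)
# Sort kernels by time, slowest first (mirrors the CLI's --top option)
slowest = sorted(records, key=lambda r: r["time_ms"], reverse=True)
for r in slowest:
    print(f'{r["kernel"]}: {r["time_ms"]:.2f} ms, '
          f'L2 hit rate {r["memory.l2_hit_rate"]:.1f}%')
```

The same records load directly into a pandas DataFrame for larger runs.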

Available metrics

Compute

Metric                              Description
compute.gpu_utilization             GPU utilization (%). gfx1201/gfx1151 only.
compute.total_flops                 Total floating-point operations performed
compute.hbm_gflops                  Compute throughput (GFLOP/s)
compute.hbm_arithmetic_intensity    Ratio of FLOPs to HBM bytes (FLOPs/Byte)
compute.l2_arithmetic_intensity     Ratio of FLOPs to L2 bytes (FLOPs/Byte)
compute.l1_arithmetic_intensity     Ratio of FLOPs to L1 bytes (FLOPs/Byte)
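The three arithmetic-intensity metrics follow the usual roofline definition: FLOPs divided by bytes moved at the corresponding memory level. A quick illustration with made-up numbers (the values are not from any real profile):

```python
# Made-up example values; metrix reports these per kernel.
total_flops = 4.0e9   # compute.total_flops
bytes_hbm   = 1.0e9   # memory.bytes_transferred_hbm
bytes_l2    = 8.0e9   # memory.bytes_transferred_l2

hbm_ai = total_flops / bytes_hbm  # compute.hbm_arithmetic_intensity
l2_ai  = total_flops / bytes_l2   # compute.l2_arithmetic_intensity
print(f"HBM AI: {hbm_ai:.1f} FLOPs/Byte, L2 AI: {l2_ai:.1f} FLOPs/Byte")
# HBM AI: 4.0 FLOPs/Byte, L2 AI: 0.5 FLOPs/Byte
```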

Memory bandwidth

Metric                              Description
memory.hbm_read_bandwidth           HBM read bandwidth (GB/s)
memory.hbm_write_bandwidth          HBM write bandwidth (GB/s)
memory.hbm_bandwidth_utilization    % of peak HBM bandwidth
memory.bytes_transferred_hbm        Total bytes through HBM
memory.bytes_transferred_l2         Total bytes through L2 cache
memory.bytes_transferred_l1         Total bytes through L1 cache
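memory.hbm_bandwidth_utilization relates achieved bandwidth to the device's peak. A sketch with made-up numbers (the peak figure is architecture dependent and purely illustrative):

```python
# Made-up values for illustration only.
read_gbps  = 350.0   # memory.hbm_read_bandwidth
write_gbps = 130.0   # memory.hbm_write_bandwidth
peak_gbps  = 960.0   # device peak HBM bandwidth (architecture dependent)

# Achieved read + write traffic as a percentage of peak
utilization = (read_gbps + write_gbps) / peak_gbps * 100.0
print(f"HBM bandwidth utilization: {utilization:.1f}%")  # 50.0%
```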

Cache performance

Metric                              Description
memory.l1_hit_rate                  L1 cache hit rate (%)
memory.l2_hit_rate                  L2 cache hit rate (%)
memory.l2_bandwidth                 L2 cache bandwidth (GB/s)
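Hit rates are the kind of derived metric metrix computes from raw counters so you don't have to. A sketch of the underlying arithmetic, with hypothetical counter values:

```python
# Hypothetical raw counter values; metrix derives hit rates
# from hardware counters like these.
l2_hits   = 9_000
l2_misses = 1_000

# Fraction of requests served from L2, as a percentage
l2_hit_rate = l2_hits / (l2_hits + l2_misses) * 100.0  # memory.l2_hit_rate
print(f"L2 hit rate: {l2_hit_rate:.1f}%")  # 90.0%
```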

Memory access patterns

Metric                              Description
memory.coalescing_efficiency        Memory coalescing efficiency (%)
memory.global_load_efficiency       Global load efficiency (%)
memory.global_store_efficiency      Global store efficiency (%)

Local data share (LDS)

Metric                              Description
memory.lds_bank_conflicts           LDS bank conflicts per access

Atomic operations

Metric                              Description
memory.atomic_latency               Atomic operation latency (cycles)

CLI reference

metrix [--version] <command> ...
metrix profile [options] <target>
--profile, -p Metric profile: quick | memory | memory_bandwidth |
memory_cache | compute (default: all metrics if omitted)
--metrics, -m Comma-separated list of metrics
--time-only Only collect timing, no hardware counters
--kernel, -k Filter kernels by name (regex, passed to rocprofv3)
--num-replays, -n Replay the application N times and aggregate (default: 10)
--aggregate Aggregate metrics by kernel name across replays
--top K Show only top K slowest kernels
--output, -o Output file (.json, .csv, .txt)
--timeout SECONDS Profiling timeout in seconds (default: 60)
--log, -l Logging level: debug | info | warning | error
--quiet, -q Quiet mode
--no-counters Omit raw counter values from output
metrix list <metrics|profiles|devices> [--category CAT]
metrix info <metric|profile> <name>

GPU architecture is auto-detected using rocminfo.

Requirements

  • Python 3.9+
  • ROCm 6.x with rocprofv3
  • pandas >= 1.5.0