# Metrix

Clean, human-readable metrics for AMD GPUs. No more cryptic hardware counters.
## Why Metrix?

- **Clean Python API** with modern design
- **Human-readable metrics** instead of raw counters
- **Unit tested** and reliable
- **13 memory metrics**: bandwidth, cache, coalescing, LDS, atomic latency
- **5 compute metrics**: FLOPS, arithmetic intensity (HBM/L2/L1), compute throughput
- **Multi-run profiling**: automatic aggregation with min/max/avg statistics
- **Kernel filtering**: efficient regex filtering at the rocprofv3 level
- **Multiple output formats**: text, JSON, CSV
## Installation
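The source does not state an install command; assuming the package is distributed on PyPI under the same name as the CLI (an assumption), installation would look like:

```shell
# Assumed PyPI package name -- install from source instead if the
# project is not published on PyPI
pip install metrix
```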
## Quick start

```shell
# Profile with all metrics (architecture auto-detected)
metrix ./my_app

# Collect timing only, averaged over 10 replays
metrix --time-only -n 10 ./my_app

# Profile only kernels whose names match "matmul"
metrix --kernel matmul ./my_app

# Collect specific metrics
metrix --metrics memory.l2_hit_rate,memory.coalescing_efficiency ./my_app

# Write results to a JSON file
metrix -o results.json ./my_app
```
## Python API

```python
from metrix import Metrix

# Architecture is auto-detected
profiler = Metrix()
results = profiler.profile("./my_app", num_replays=5)

for kernel in results.kernels:
    print(f"{kernel.name}: {kernel.duration_us.avg:.2f} μs")
    for metric, stats in kernel.metrics.items():
        print(f"{metric}: {stats.avg:.2f}")
```
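The per-metric min/max/avg statistics (as in `stats.avg` above) come from aggregating values across replays. A minimal sketch of that aggregation, using hypothetical names rather than Metrix's actual internals:

```python
from dataclasses import dataclass

@dataclass
class Stats:
    """Min/max/avg summary of one metric across N replays (illustrative)."""
    min: float
    max: float
    avg: float

def aggregate(values: list[float]) -> Stats:
    # One measurement per replay; collapse them into a single Stats record.
    return Stats(min=min(values), max=max(values), avg=sum(values) / len(values))

# e.g. an L2 hit rate measured over three replays
stats = aggregate([91.2, 93.8, 92.5])
print(f"l2_hit_rate: {stats.avg:.2f} (min {stats.min}, max {stats.max})")
```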
## Available metrics
### Memory bandwidth

| Metric | Description |
| --- | --- |
| `memory.hbm_read_bandwidth` | HBM read bandwidth (GB/s) |
| `memory.hbm_write_bandwidth` | HBM write bandwidth (GB/s) |
| `memory.hbm_bandwidth_utilization` | % of peak HBM bandwidth |
| `memory.bytes_transferred_hbm` | Total bytes through HBM |
| `memory.bytes_transferred_l2` | Total bytes through L2 cache |
| `memory.bytes_transferred_l1` | Total bytes through L1 cache |

### Cache

| Metric | Description |
| --- | --- |
| `memory.l1_hit_rate` | L1 cache hit rate (%) |
| `memory.l2_hit_rate` | L2 cache hit rate (%) |
| `memory.l2_bandwidth` | L2 cache bandwidth (GB/s) |
### Memory access patterns

| Metric | Description |
| --- | --- |
| `memory.coalescing_efficiency` | Memory coalescing efficiency (%) |
| `memory.global_load_efficiency` | Global load efficiency (%) |
| `memory.global_store_efficiency` | Global store efficiency (%) |
### Local data share (LDS)

| Metric | Description |
| --- | --- |
| `memory.lds_bank_conflicts` | LDS bank conflicts per instruction |
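To build intuition for this counter: a bank conflict occurs when several lanes of a wavefront access *different* addresses that map to the *same* LDS bank, forcing the access to serialize. An illustrative model (not Metrix code), assuming 32 banks of 4-byte words:

```python
from collections import Counter

NUM_BANKS = 32   # typical LDS bank count (assumption for illustration)
WORD_BYTES = 4   # bank width in bytes

def lds_conflicts(addresses: list[int]) -> int:
    """Extra serialized accesses caused by one wavefront's LDS operation.

    Distinct addresses landing in the same bank serialize, so the access
    takes as many passes as the most-contended bank; duplicates (broadcast
    of the same address) do not conflict.
    """
    banks = Counter((addr // WORD_BYTES) % NUM_BANKS for addr in set(addresses))
    return max(banks.values()) - 1  # 0 means conflict-free

# Stride-1 float access: each lane hits a different bank -> no conflicts
print(lds_conflicts([4 * lane for lane in range(32)]))    # 0
# Stride-32 float access: every lane maps to bank 0 -> fully serialized
print(lds_conflicts([128 * lane for lane in range(32)]))  # 31
```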
### Atomic operations

| Metric | Description |
| --- | --- |
| `memory.atomic_latency` | Atomic operation latency (cycles) |
### Compute

| Metric | Description |
| --- | --- |
| `compute.total_flops` | Total floating-point operations performed |
| `compute.hbm_gflops` | Compute throughput (GFLOPS) |
| `compute.hbm_arithmetic_intensity` | Ratio of FLOPs to HBM bytes (FLOP/byte) |
| `compute.l2_arithmetic_intensity` | Ratio of FLOPs to L2 bytes (FLOP/byte) |
| `compute.l1_arithmetic_intensity` | Ratio of FLOPs to L1 bytes (FLOP/byte) |
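A worked example of how arithmetic intensity ties these metrics together (hypothetical numbers, not Metrix output): a single-precision AXPY (`y = a*x + y`) does 2 FLOPs per element while moving 12 bytes per element through HBM (read `x`, read `y`, write `y`), and a roofline-style bound follows directly:

```python
def arithmetic_intensity(total_flops: float, bytes_transferred: float) -> float:
    """FLOPs per byte moved through a given memory level."""
    return total_flops / bytes_transferred

def attainable_gflops(ai: float, peak_bw_gbs: float, peak_gflops: float) -> float:
    """Roofline bound: min of bandwidth-limited and compute-limited throughput."""
    return min(ai * peak_bw_gbs, peak_gflops)

n = 1 << 20                                 # elements in the AXPY
ai = arithmetic_intensity(2 * n, 12 * n)    # 2 FLOPs vs 12 bytes per element
print(f"{ai:.3f} FLOP/byte")                # ~0.167 -> heavily bandwidth-bound
# With illustrative peaks of 1600 GB/s HBM bandwidth and 40 TFLOP/s compute:
print(f"{attainable_gflops(ai, 1600, 40000):.0f} GFLOPS")
```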
CLI reference
metrix [--version] <command> ...
metrix profile [options] <target>
--profile, -p Metric profile: quick | memory | memory_bandwidth |
memory_cache | compute (default: all metrics if omitted)
--metrics, -m Comma-separated list of metrics
--time-only Only collect timing, no hardware counters
--kernel, -k Filter kernels by name (regex, passed to rocprofv3)
--num-replays, -n Replay the application N times and aggregate (default: 10)
--aggregate Aggregate metrics by kernel name across replays
--top K Show only top K slowest kernels
--output, -o Output file (.json, .csv, .txt)
--timeout SECONDS Profiling timeout in seconds (default: 60)
--log, -l Logging level: debug | info | warning | error
--no-counters Omit raw counter values from output
metrix list <metrics|profiles|devices> [--category CAT]
metrix info <metric|profile> <name>
GPU architecture is auto-detected using rocminfo.
## Architecture

Metrix uses a clean backend design in which each hardware counter name appears exactly once, as a function parameter of the metric that consumes it:

```python
@metric("memory.l2_hit_rate")
def _l2_hit_rate(self, TCC_HIT_sum, TCC_MISS_sum):
    total = TCC_HIT_sum + TCC_MISS_sum
    return (TCC_HIT_sum / total) * 100 if total > 0 else 0.0
```

This eliminates error-prone counter-to-metric mapping dictionaries and keeps the codebase maintainable: adding a new metric or supporting a new GPU architecture is straightforward.
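One way such a decorator could work (a sketch under assumptions, not Metrix's actual implementation) is to read the counter names straight from the function signature with `inspect`, so the registry knows which counters to request from rocprofv3:

```python
import inspect

METRICS = {}  # metric name -> (required counter names, compute function)

def metric(name):
    """Register a metric; its counter dependencies are its parameter names."""
    def wrap(fn):
        params = list(inspect.signature(fn).parameters)
        counters = [p for p in params if p != "self"]
        METRICS[name] = (counters, fn)
        return fn
    return wrap

@metric("memory.l2_hit_rate")
def _l2_hit_rate(self, TCC_HIT_sum, TCC_MISS_sum):
    total = TCC_HIT_sum + TCC_MISS_sum
    return (TCC_HIT_sum / total) * 100 if total > 0 else 0.0

counters, fn = METRICS["memory.l2_hit_rate"]
print(counters)                                   # ['TCC_HIT_sum', 'TCC_MISS_sum']
print(fn(None, TCC_HIT_sum=90, TCC_MISS_sum=10))  # 90.0
```

Because the counter list is derived from the signature, renaming a counter in one place updates both the computation and the set of counters the profiler collects.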
## Requirements

- Python 3.9+
- ROCm 6.x with rocprofv3
- pandas >= 1.5.0