# Metrix

Clean, human-readable metrics for AMD GPUs. No more cryptic hardware counters.
## Why Metrix?
- Clean Python API with modern design
- Human-readable metrics instead of raw counters
- Unit tested and reliable
- 20 metrics across memory, cache, compute, and GPU utilization (availability varies by GPU architecture)
- Multi-run profiling: automatic aggregation with min/max/avg statistics
- Kernel filtering: efficient regex filtering at rocprofv3 level
- Multiple output formats: text, JSON, CSV
## Installation

```bash
pip install -e .
```

## Quick start

```bash
# Profile with all metrics (architecture auto-detected)
metrix ./my_app

# Time only (fast)
metrix --time-only -n 10 ./my_app

# Filter kernels by name
metrix --kernel matmul ./my_app

# Custom metrics
metrix --metrics memory.l2_hit_rate,memory.coalescing_efficiency ./my_app

# Save to JSON
metrix -o results.json ./my_app
```

Python API:

```python
from metrix import Metrix

# Architecture is auto-detected
profiler = Metrix()
results = profiler.profile("./my_app", num_replays=5)

for kernel in results.kernels:
    print(f"{kernel.name}: {kernel.duration_us.avg:.2f} us")
    for metric, stats in kernel.metrics.items():
        print(f"  {metric}: {stats.avg:.2f}")
```

## Available metrics
### Compute

| Metric | Description |
|---|---|
| `compute.gpu_utilization` | GPU utilization (%). gfx1201/gfx1151 only. |
| `compute.total_flops` | Total floating-point operations performed |
| `compute.hbm_gflops` | Compute throughput (GFLOP/s) |
| `compute.hbm_arithmetic_intensity` | Ratio of FLOPs to HBM bytes (FLOPs/Byte) |
| `compute.l2_arithmetic_intensity` | Ratio of FLOPs to L2 bytes (FLOPs/Byte) |
| `compute.l1_arithmetic_intensity` | Ratio of FLOPs to L1 bytes (FLOPs/Byte) |
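The arithmetic-intensity metrics follow the standard roofline definition: total FLOPs divided by the bytes moved through a given level of the memory hierarchy. A minimal sketch of that relationship, using illustrative numbers rather than actual Metrix output (the exact counter expressions Metrix evaluates are not shown here):

```python
# Sketch: arithmetic intensity at each memory level (roofline definition).
# Example numbers are illustrative, not measured Metrix output.

def arithmetic_intensity(total_flops: float, bytes_moved: float) -> float:
    """FLOPs per byte moved through a given memory level."""
    return total_flops / bytes_moved

total_flops = 2.0e12   # cf. compute.total_flops
bytes_hbm = 0.5e12     # cf. memory.bytes_transferred_hbm
bytes_l2 = 1.5e12      # cf. memory.bytes_transferred_l2

ai_hbm = arithmetic_intensity(total_flops, bytes_hbm)  # 4.0 FLOPs/Byte
ai_l2 = arithmetic_intensity(total_flops, bytes_l2)    # ~1.33 FLOPs/Byte
print(f"HBM AI: {ai_hbm:.2f}, L2 AI: {ai_l2:.2f}")
```

Comparing the two tells you which level of the hierarchy a kernel leans on: a high HBM intensity with a low L2 intensity suggests the L2 cache is absorbing most of the traffic.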
### Memory bandwidth

| Metric | Description |
|---|---|
| `memory.hbm_read_bandwidth` | HBM read bandwidth (GB/s) |
| `memory.hbm_write_bandwidth` | HBM write bandwidth (GB/s) |
| `memory.hbm_bandwidth_utilization` | % of peak HBM bandwidth |
| `memory.bytes_transferred_hbm` | Total bytes through HBM |
| `memory.bytes_transferred_l2` | Total bytes through L2 cache |
| `memory.bytes_transferred_l1` | Total bytes through L1 cache |
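Bandwidth utilization is conventionally the achieved read-plus-write rate over the device's peak. A sketch assuming that conventional formula (the exact expression Metrix evaluates, and the peak figure for your GPU, may differ):

```python
# Sketch: HBM bandwidth utilization as achieved rate over device peak.
# Peak bandwidth is per-GPU; 960 GB/s here is an illustrative value only.

def hbm_utilization(read_gbs: float, write_gbs: float, peak_gbs: float) -> float:
    """Achieved read+write bandwidth as a percentage of the device peak."""
    return 100.0 * (read_gbs + write_gbs) / peak_gbs

# 600 GB/s read + 200 GB/s write on a hypothetical 960 GB/s part:
print(f"{hbm_utilization(600.0, 200.0, 960.0):.1f}%")  # → 83.3%
```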
### Cache performance

| Metric | Description |
|---|---|
| `memory.l1_hit_rate` | L1 cache hit rate (%) |
| `memory.l2_hit_rate` | L2 cache hit rate (%) |
| `memory.l2_bandwidth` | L2 cache bandwidth (GB/s) |
### Memory access patterns

| Metric | Description |
|---|---|
| `memory.coalescing_efficiency` | Memory coalescing efficiency (%) |
| `memory.global_load_efficiency` | Global load efficiency (%) |
| `memory.global_store_efficiency` | Global store efficiency (%) |
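Access-pattern efficiencies of this kind are typically the ratio of bytes the kernel actually needed to bytes the hardware moved, so 100% means fully coalesced accesses with no wasted transactions. A sketch of that idea (an illustrative formula, not necessarily the exact counter expression Metrix uses):

```python
# Sketch: access efficiency as useful bytes over bytes actually moved.
# 100% = fully coalesced; low values indicate wasted memory transactions.

def access_efficiency(requested_bytes: float, transferred_bytes: float) -> float:
    """Requested bytes as a percentage of bytes the hardware transferred."""
    return 100.0 * requested_bytes / transferred_bytes

# A strided load that uses only 4 bytes of each 32-byte transaction:
print(f"{access_efficiency(4.0, 32.0):.1f}%")  # → 12.5%
```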
### Local data share (LDS)

| Metric | Description |
|---|---|
| `memory.lds_bank_conflicts` | LDS bank conflicts per access |
### Atomic operations

| Metric | Description |
|---|---|
| `memory.atomic_latency` | Atomic operation latency (cycles) |
## CLI reference

```text
metrix [--version] <command> ...

metrix profile [options] <target>
  --profile, -p       Metric profile: quick | memory | memory_bandwidth |
                      memory_cache | compute (default: all metrics if omitted)
  --metrics, -m       Comma-separated list of metrics
  --time-only         Only collect timing, no hardware counters
  --kernel, -k        Filter kernels by name (regex, passed to rocprofv3)
  --num-replays, -n   Replay the application N times and aggregate (default: 10)
  --aggregate         Aggregate metrics by kernel name across replays
  --top K             Show only the top K slowest kernels
  --output, -o        Output file (.json, .csv, .txt)
  --timeout SECONDS   Profiling timeout in seconds (default: 60)
  --log, -l           Logging level: debug | info | warning | error
  --quiet, -q         Quiet mode
  --no-counters       Omit raw counter values from output

metrix list <metrics|profiles|devices> [--category CAT]

metrix info <metric|profile> <name>
```

GPU architecture is auto-detected using rocminfo.
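The auto-detection step can be pictured as scanning rocminfo output for the first `gfx*` agent name. The helper below is a hypothetical sketch of that parsing, not Metrix's actual implementation:

```python
# Sketch: extract the first gfx* agent name from rocminfo-style output.
# Hypothetical helper -- Metrix's real detection code may differ.
import re
from typing import Optional

def detect_gfx_arch(rocminfo_output: str) -> Optional[str]:
    """Return the first gfx* agent name found, or None if absent."""
    match = re.search(r"^\s*Name:\s*(gfx\w+)", rocminfo_output, re.MULTILINE)
    return match.group(1) if match else None

# In practice the text would come from running the rocminfo binary;
# here we use a trimmed, rocminfo-like sample:
sample = """
  Name:                    AMD Ryzen (CPU agent)
  Name:                    gfx1201
"""
print(detect_gfx_arch(sample))  # → gfx1201
```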
## Requirements
- Python 3.9+
- ROCm 6.x with rocprofv3
- pandas >= 1.5.0