# Metrix

Clean, human-readable metrics for AMD GPUs. No more cryptic hardware counters.
## Why Metrix?

- **Clean Python API** with modern design
- **Human-readable metrics** instead of raw counters
- **Unit tested** and reliable
- **13 memory metrics**: bandwidth, cache, coalescing, LDS, atomic latency
- **5 compute metrics**: FLOPS, arithmetic intensity (HBM/L2/L1), compute throughput
- **Multi-run profiling**: automatic aggregation with min/max/avg statistics
- **Kernel filtering**: efficient regex filtering at the rocprofv3 level
- **Multiple output formats**: text, JSON, CSV
## Installation
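The source does not state an install command; assuming the package is distributed on PyPI under the same name as the CLI (an assumption), installation would look like:

```shell
# Assumed PyPI package name -- install from source instead if the
# project is not published on PyPI
pip install metrix
```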
## Quick start

```shell
# Profile with all metrics (architecture auto-detected)
metrix ./my_app

# Collect timing only, averaged over 10 replays
metrix --time-only -n 10 ./my_app

# Profile only kernels whose names match "matmul"
metrix --kernel matmul ./my_app

# Collect specific metrics
metrix --metrics memory.l2_hit_rate,memory.coalescing_efficiency ./my_app

# Write results to a JSON file
metrix -o results.json ./my_app
```
## Python API

```python
from metrix import Metrix

# Architecture is auto-detected
profiler = Metrix()
results = profiler.profile("./my_app", num_replays=5)

for kernel in results.kernels:
    print(f"{kernel.name}: {kernel.duration_us.avg:.2f} μs")
    for metric, stats in kernel.metrics.items():
        print(f"{metric}: {stats.avg:.2f}")
```
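The per-metric min/max/avg statistics (as in `stats.avg` above) come from aggregating values across replays. A minimal sketch of that aggregation, using hypothetical names rather than Metrix's actual internals:

```python
from dataclasses import dataclass

@dataclass
class Stats:
    """Min/max/avg summary of one metric across N replays (illustrative)."""
    min: float
    max: float
    avg: float

def aggregate(values: list[float]) -> Stats:
    # One measurement per replay; collapse them into a single Stats record.
    return Stats(min=min(values), max=max(values), avg=sum(values) / len(values))

# e.g. an L2 hit rate measured over three replays
stats = aggregate([91.2, 93.8, 92.5])
print(f"l2_hit_rate: {stats.avg:.2f} (min {stats.min}, max {stats.max})")
```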
## Available metrics
### Memory bandwidth

| Metric | Description |
| --- | --- |
| `memory.hbm_read_bandwidth` | HBM read bandwidth (GB/s) |
| `memory.hbm_write_bandwidth` | HBM write bandwidth (GB/s) |
| `memory.hbm_bandwidth_utilization` | % of peak HBM bandwidth |
| `memory.bytes_transferred_hbm` | Total bytes through HBM |
| `memory.bytes_transferred_l2` | Total bytes through L2 cache |
| `memory.bytes_transferred_l1` | Total bytes through L1 cache |

### Cache

| Metric | Description |
| --- | --- |
| `memory.l1_hit_rate` | L1 cache hit rate (%) |
| `memory.l2_hit_rate` | L2 cache hit rate (%) |
| `memory.l2_bandwidth` | L2 cache bandwidth (GB/s) |
### Memory access patterns

| Metric | Description |
| --- | --- |
| `memory.coalescing_efficiency` | Memory coalescing efficiency (%) |
| `memory.global_load_efficiency` | Global load efficiency (%) |
| `memory.global_store_efficiency` | Global store efficiency (%) |
### Local data share (LDS)

| Metric | Description |
| --- | --- |
| `memory.lds_bank_conflicts` | LDS bank conflicts per instruction |
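To build intuition for this counter: a bank conflict occurs when several lanes of a wavefront access *different* addresses that map to the *same* LDS bank, forcing the access to serialize. An illustrative model (not Metrix code), assuming 32 banks of 4-byte words:

```python
from collections import Counter

NUM_BANKS = 32   # typical LDS bank count (assumption for illustration)
WORD_BYTES = 4   # bank width in bytes

def lds_conflicts(addresses: list[int]) -> int:
    """Extra serialized accesses caused by one wavefront's LDS operation.

    Distinct addresses landing in the same bank serialize, so the access
    takes as many passes as the most-contended bank; duplicates (broadcast
    of the same address) do not conflict.
    """
    banks = Counter((addr // WORD_BYTES) % NUM_BANKS for addr in set(addresses))
    return max(banks.values()) - 1  # 0 means conflict-free

# Stride-1 float access: each lane hits a different bank -> no conflicts
print(lds_conflicts([4 * lane for lane in range(32)]))    # 0
# Stride-32 float access: every lane maps to bank 0 -> fully serialized
print(lds_conflicts([128 * lane for lane in range(32)]))  # 31
```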
### Atomic operations

| Metric | Description |
| --- | --- |
| `memory.atomic_latency` | Atomic operation latency (cycles) |
### Compute

| Metric | Description |
| --- | --- |
| `compute.total_flops` | Total floating-point operations performed |
| `compute.hbm_gflops` | Compute throughput (GFLOPS) |
| `compute.hbm_arithmetic_intensity` | Ratio of FLOPs to HBM bytes (FLOP/byte) |
| `compute.l2_arithmetic_intensity` | Ratio of FLOPs to L2 bytes (FLOP/byte) |
| `compute.l1_arithmetic_intensity` | Ratio of FLOPs to L1 bytes (FLOP/byte) |
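A worked example of how arithmetic intensity ties these metrics together (hypothetical numbers, not Metrix output): a single-precision AXPY (`y = a*x + y`) does 2 FLOPs per element while moving 12 bytes per element through HBM (read `x`, read `y`, write `y`), and a roofline-style bound follows directly:

```python
def arithmetic_intensity(total_flops: float, bytes_transferred: float) -> float:
    """FLOPs per byte moved through a given memory level."""
    return total_flops / bytes_transferred

def attainable_gflops(ai: float, peak_bw_gbs: float, peak_gflops: float) -> float:
    """Roofline bound: min of bandwidth-limited and compute-limited throughput."""
    return min(ai * peak_bw_gbs, peak_gflops)

n = 1 << 20                                 # elements in the AXPY
ai = arithmetic_intensity(2 * n, 12 * n)    # 2 FLOPs vs 12 bytes per element
print(f"{ai:.3f} FLOP/byte")                # ~0.167 -> heavily bandwidth-bound
# With illustrative peaks of 1600 GB/s HBM bandwidth and 40 TFLOP/s compute:
print(f"{attainable_gflops(ai, 1600, 40000):.0f} GFLOPS")
```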
CLI reference
metrix [--version] <command> ...
metrix profile [options] <target>
--profile, -p Metric profile: quick | memory | memory_bandwidth |
memory_cache | compute (default: all metrics if omitted)
--metrics, -m Comma-separated list of metrics
--time-only Only collect timing, no hardware counters
--kernel, -k Filter kernels by name (regex, passed to rocprofv3)
--num-replays, -n Replay the application N times and aggregate (default: 10)
--aggregate Aggregate metrics by kernel name across replays
--top K Show only top K slowest kernels
--output, -o Output file (.json, .csv, .txt)
--timeout SECONDS Profiling timeout in seconds (default: 60)
--log, -l Logging level: debug | info | warning | error
--no-counters Omit raw counter values from output
metrix list <metrics|profiles|devices> [--category CAT]
metrix info <metric|profile> <name>
GPU architecture is auto-detected using rocminfo.
## Architecture

Metrix uses a clean backend design in which each hardware counter name appears exactly once, as a function parameter of the metric that consumes it:

```python
@metric("memory.l2_hit_rate")
def _l2_hit_rate(self, TCC_HIT_sum, TCC_MISS_sum):
    total = TCC_HIT_sum + TCC_MISS_sum
    return (TCC_HIT_sum / total) * 100 if total > 0 else 0.0
```

This eliminates error-prone counter-to-metric mapping dictionaries and keeps the codebase maintainable: adding a new metric or supporting a new GPU architecture is straightforward.
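One way such a decorator could work (a sketch under assumptions, not Metrix's actual implementation) is to read the counter names straight from the function signature with `inspect`, so the registry knows which counters to request from rocprofv3:

```python
import inspect

METRICS = {}  # metric name -> (required counter names, compute function)

def metric(name):
    """Register a metric; its counter dependencies are its parameter names."""
    def wrap(fn):
        params = list(inspect.signature(fn).parameters)
        counters = [p for p in params if p != "self"]
        METRICS[name] = (counters, fn)
        return fn
    return wrap

@metric("memory.l2_hit_rate")
def _l2_hit_rate(self, TCC_HIT_sum, TCC_MISS_sum):
    total = TCC_HIT_sum + TCC_MISS_sum
    return (TCC_HIT_sum / total) * 100 if total > 0 else 0.0

counters, fn = METRICS["memory.l2_hit_rate"]
print(counters)                                   # ['TCC_HIT_sum', 'TCC_MISS_sum']
print(fn(None, TCC_HIT_sum=90, TCC_MISS_sum=10))  # 90.0
```

Because the counter list is derived from the signature, renaming a counter in one place updates both the computation and the set of counters the profiler collects.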
## Requirements

- Python 3.9+
- ROCm 6.x with rocprofv3
- pandas >= 1.5.0