Quick Start
This walkthrough profiles a GPU application with Metrix to get human-readable performance metrics.
Profile your application
CLI
# Profile with all metrics (GPU architecture auto-detected)
metrix ./my_app

# Time only (fast)
metrix --time-only -n 10 ./my_app

# Filter kernels by name
metrix --kernel matmul ./my_app

# Specific metrics
metrix --metrics memory.hbm_bandwidth_utilization,memory.l2_hit_rate ./my_app

Python API
from metrix import Metrix

profiler = Metrix()
results = profiler.profile("./my_app", num_replays=5)

for kernel in results.kernels:
    print(f"{kernel.name}: {kernel.duration_us.avg:.2f} μs")
    for metric, stats in kernel.metrics.items():
        print(f"  {metric}: {stats.avg:.2f}")

Example output
================================================================================
Metrix: all metrics (12 total)
Target: ./examples/01_vector_add/vector_add
================================================================================

────────────────────────────────────────────────────────────────────────────────
Dispatch #1: vector_add(float*, float const*, float const*, int)
────────────────────────────────────────────────────────────────────────────────
Duration: 7.29 - 7.29 μs (avg=7.29)

MEMORY BANDWIDTH:
  Total HBM Bytes Transferred    8400896.00 bytes
  HBM Bandwidth Utilization            1.34 percent
  HBM Read Bandwidth                  35.47 GB/s
  HBM Write Bandwidth                 35.36 GB/s

CACHE PERFORMANCE:
  L1 Cache Hit Rate                   66.67 percent
  L2 Cache Hit Rate                   26.72 percent

Next steps
- Dive deeper into profiling — see Metrix for all available metrics
- Map performance to source lines — see Linex for source-level profiling
- Extract and isolate a kernel — see Kerncap for standalone reproducers
- Inspect GPU execution — see Nexus for HSA packet tracing
- Validate optimizations — see Accordo for correctness checking
- Set up MCP servers — see MCP Setup for LLM integration
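
As a follow-on to the Python API example above, the per-kernel results can be post-processed with ordinary Python. The sketch below is not part of Metrix. `Stats` and `Kernel` are hypothetical stand-ins that model only the attributes used in that example (`name`, `duration_us.avg`, and a `metrics` mapping whose values have `.avg`), and `slowest_kernels` and `low_metric` are illustrative helper names. It also assumes that the keys of `kernel.metrics` match the names passed to `--metrics`, which this page does not confirm. The `vector_add` values come from the example output; the `matmul` values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Stats:
    """Hypothetical stand-in for a statistics object; only .avg is modelled."""
    avg: float


@dataclass
class Kernel:
    """Hypothetical stand-in for one entry of results.kernels."""
    name: str
    duration_us: Stats
    metrics: dict = field(default_factory=dict)


def slowest_kernels(kernels, top=3):
    """Return up to `top` kernels, longest average duration first."""
    return sorted(kernels, key=lambda k: k.duration_us.avg, reverse=True)[:top]


def low_metric(kernels, metric, threshold):
    """Return names of kernels whose average for `metric` is below `threshold`."""
    return [
        k.name
        for k in kernels
        if metric in k.metrics and k.metrics[metric].avg < threshold
    ]


# Sample data: vector_add mirrors the example output, matmul is invented.
sample = [
    Kernel("vector_add", Stats(7.29), {"memory.l2_hit_rate": Stats(26.72)}),
    Kernel("matmul", Stats(120.0), {"memory.l2_hit_rate": Stats(85.0)}),
]

for k in slowest_kernels(sample):
    print(f"{k.name}: {k.duration_us.avg:.2f} us")

# Kernels with an average L2 hit rate under 50 percent
print(low_metric(sample, "memory.l2_hit_rate", 50.0))
```

With a real run, `results.kernels` from `profiler.profile(...)` would take the place of `sample`, provided the objects expose the same attributes as in the Python API example.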