# Linex
Map GPU performance metrics to your source code lines.
## Installation

```bash
pip install -e .
```

## Quick start

```python
from linex import Linex

profiler = Linex()
profiler.profile("./my_app", kernel_filter="my_kernel")

# Show hotspots
for line in profiler.source_lines[:5]:
    print(f"{line.file}:{line.line_number}")
    print(f"  {line.total_cycles:,} cycles ({line.stall_percent:.1f}% stalled)")
```

## What you get
Instruction-level metrics mapped to source lines:
| Metric | Description |
|---|---|
| `latency_cycles` | Total GPU cycles |
| `stall_cycles` | Cycles waiting (memory, dependencies) |
| `idle_cycles` | Unused execution slots |
| `execution_count` | How many times it ran |
| `instruction_address` | Where in GPU memory |
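To make the relationship between these counters concrete, here is a minimal sketch (the numbers are hypothetical, not real profiler output) of how a stall percentage is derived from the raw cycle counts:

```python
# Hypothetical per-instruction counters, illustrating how the metrics relate.
latency_cycles = 1_200  # total GPU cycles attributed to one instruction
stall_cycles = 900      # of those, cycles spent waiting (memory, dependencies)

# The stall fraction expressed as a percentage of total latency
stall_percent = stall_cycles / latency_cycles * 100
print(f"{stall_percent:.1f}% stalled")  # 75.0% stalled
```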
## Compiling with and without `-g`
| Build | `instructions` | `source_lines` | `file` / `line` |
|---|---|---|---|
| With `-g` | Populated (ISA + cycles) | Populated (aggregated by `file:line`) | Real file path and line number |
| Without `-g` | Populated (ISA + cycles) | Empty | `""` and `0` |
- Use `-g` when you want source-line mapping: ISA instructions tied to `file:line`, and `source_lines` aggregated by source line.
- Omit `-g` when you only need assembly-level metrics: you still get every instruction with `isa`, `latency_cycles`, `stall_cycles`, etc.
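As a rough sketch of what "aggregated by `file:line`" means (this is illustrative Python, not Linex's internal code, and the instruction records are made up):

```python
from collections import defaultdict

# Hypothetical per-instruction records as produced with -g
# (file and line populated via debug info).
instructions = [
    {"file": "gemm.hip", "line": 10, "latency_cycles": 400},
    {"file": "gemm.hip", "line": 10, "latency_cycles": 600},
    {"file": "gemm.hip", "line": 12, "latency_cycles": 300},
]

# Sum cycles per (file, line) key -- the shape of a source_lines entry
totals = defaultdict(int)
for inst in instructions:
    totals[(inst["file"], inst["line"])] += inst["latency_cycles"]

print(dict(totals))
```

Without `-g`, every record would carry `("", 0)` as its key, so per-line aggregation is not meaningful.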
## API
### `Linex` class
```python
profiler = Linex(
    target_cu=0,                      # Target compute unit
    shader_engine_mask="0xFFFFFFFF",  # All shader engines
    activity=10,                      # Activity counter polling
)
```

Methods:
- `profile(command, kernel_filter=None)` — run profiling
Properties:
- `source_lines` — `List[SourceLine]` sorted by `total_cycles`
- `instructions` — `List[InstructionData]`
### `SourceLine`
Aggregated metrics for one source code line.
```python
line.file             # Source file path
line.line_number      # Line number
line.total_cycles     # Sum of all instruction cycles
line.stall_cycles     # Cycles spent waiting
line.idle_cycles      # Cycles slot was idle
line.execution_count  # Total executions
line.instructions     # List of ISA instructions
line.stall_percent    # Convenience: stall_cycles / total_cycles * 100
```

### `InstructionData`
Per-ISA-instruction metrics.
```python
inst.isa                  # ISA instruction text
inst.latency_cycles       # Total cycles for this instruction
inst.stall_cycles         # Cycles spent waiting
inst.idle_cycles          # Cycles slot was idle
inst.execution_count      # How many times it ran
inst.instruction_address  # Virtual address in GPU memory
inst.file                 # Parsed from source_location (empty without -g)
inst.line                 # Parsed from source_location (0 without -g)
inst.stall_percent        # Convenience: stall_cycles / latency_cycles * 100
```

## Examples
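The `file` and `line` fields are described as parsed from `source_location`. As a hedged sketch of how such a split could work, assuming a `path:line` string format (the README does not spell the format out, and `parse_source_location` is a hypothetical helper, not part of the Linex API):

```python
from typing import Tuple

def parse_source_location(loc: str) -> Tuple[str, int]:
    """Split a 'path:line' string into (file, line).

    An empty location maps to ("", 0), matching the documented
    defaults when the binary is built without -g.
    """
    if not loc:
        return "", 0
    # rpartition splits on the LAST colon, so paths containing
    # colons (e.g. on Windows) keep their prefix intact.
    path, _, line = loc.rpartition(":")
    return path, int(line)

print(parse_source_location("kernels/gemm.hip:42"))  # ('kernels/gemm.hip', 42)
print(parse_source_location(""))                     # ('', 0)
```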
```python
# Find memory-bound lines
memory_bound = [
    l for l in profiler.source_lines
    if l.stall_percent > 50
]
```
```python
# Find hotspots with high execution count
hotspots = [
    l for l in profiler.source_lines
    if l.execution_count > 10000
]
```
```python
# Instruction-level analysis
for line in profiler.source_lines[:1]:
    for inst in line.instructions:
        print(f"{inst.isa}: {inst.latency_cycles} cycles")
```

## Requirements
- Python >= 3.8
- ROCm 7.0+ with `rocprofv3`