Skip to content

Kerncap

Kerncap profiles a running application, intercepts a target kernel dispatch, captures its complete runtime state (full device memory snapshot, kernarg buffer, HSACO), and generates a standalone reproducer that can replay the kernel in isolation using VA-faithful HSA dispatch.

How it works

1. Profile rocprofv3 --kernel-trace --stats → rank kernels by duration
2. Capture HIP: HSA_TOOLS_LIB=libkerncap.so → intercept target dispatch,
snapshot all tracked device memory + kernarg buffer + HSACO
Triton: Python-level hook on JITFunction.run → capture all tensor,
scalar, and constexpr args; pin autotuner config
3. Find source HIP: __global__ grep + #include tracing
Triton: @triton.jit AST match + import tracing (incl. relative imports)
4. Generate Jinja2 templates → standalone .hip+Makefile or .py reproducer
5. Validate Build, run reproducer, np.allclose against captured reference

Installation

Builds libkerncap.so from source against the host ROCm (requires hipcc, cmake, HSA headers — all present in standard ROCm images).

Terminal window
# From local source
pip install .
# Editable install for development
pip install -e .[dev]

Usage

Each operation is available as both a Python API and a CLI command.

Profile

Rank kernels by total GPU execution time.

from kerncap import Kerncap
kc = Kerncap()
profile = kc.profile(["./my_app", "--args"])
for kernel in profile[:5]:
print(f"{kernel.name}: {kernel.total_duration_ns / 1e6:.1f} ms ({kernel.percentage:.1f}%)")
Terminal window
kerncap profile -- ./my_app --args
kerncap profile --output profile.json -- ./my_app

Extract

Capture a kernel’s full runtime state and generate a standalone reproducer.

# HIP kernel with source (enables recompile workflow)
result = kc.extract(
kernel_name="mul_mat_q",
cmd=["./llama-bench", "-m", "model.gguf", "-p", "512"],
source_dir="./ggml/src",
output="./isolated/mul_mat_q",
defines=["GGML_USE_HIP", "GGML_CUDA_FA_ALL_QUANTS"],
)
print(f"Output: {result.output_dir} has_source: {result.has_source}")
# Triton kernel — language auto-detected from source
result = kc.extract(
kernel_name="flash_attn_fwd",
cmd=["python", "train.py", "--batch-size", "64"],
source_dir="./flash_attn",
output="./isolated/flash_attn_fwd",
)
Terminal window
# HIP with source
kerncap extract mul_mat_q --cmd "..." --source-dir ./ggml/src -D GGML_USE_HIP
# Triton
kerncap extract flash_attn_fwd --cmd "..." --source-dir ./flash_attn
# Capture-only (no source)
kerncap extract mul_mat_q --cmd "..."
# Specific dispatch
kerncap extract gemm_kernel --cmd "..." --dispatch 2

Replay

Replay a captured kernel in isolation.

baseline = kc.replay("./isolated/mul_mat_q")
print(f"Baseline: {baseline.timing_us:.1f} us")
variant = kc.replay("./isolated/mul_mat_q", hsaco="./isolated/mul_mat_q/optimized.hsaco")
print(f"Variant: {variant.timing_us:.1f} us")
print(f"Speedup: {baseline.timing_us / variant.timing_us:.2f}x")
Terminal window
kerncap replay ./isolated/mul_mat_q
kerncap replay ./isolated/mul_mat_q --hsaco optimized.hsaco
kerncap replay ./isolated/mul_mat_q --iterations 100

Validate

Check correctness of a reproducer or variant HSACO.

# Smoke test — confirm baseline replays without error
result = kc.validate("./isolated/mul_mat_q")
print("Passed:", result.passed)
# Correctness check — compare recompiled variant against captured baseline
result = kc.validate("./isolated/mul_mat_q", hsaco="./isolated/mul_mat_q/optimized.hsaco")
print("Passed:", result.passed)
# Triton — compare against captured reference with tolerance
result = kc.validate("./isolated/flash_attn_fwd", tolerance=1e-3, rtol=1e-2)
print("Passed:", result.passed)

HIP vs Triton validation: For HIP kernels, baseline validate is a smoke test only. Pass hsaco to compare a recompiled variant byte-for-byte against the captured baseline. For Triton reproducers, validate compares outputs against captured reference data using np.allclose.

Optimization workflow

When source_dir is provided, extract produces a self-contained project for a tight edit-recompile-validate loop:

kernel_variant.cpp Editable copy of the main kernel source file
deps/ Copies of all #include dependency headers (up to 5 levels)
vfs.yaml Clang VFS overlay — maps local copies over originals at compile time
capture/ VA-faithful memory snapshot, dispatch metadata, baseline HSACO
Makefile make run | make recompile | make run-variant | make validate-variant
import subprocess, os
from kerncap import Kerncap
kc = Kerncap()
# 1. Extract (once)
result = kc.extract("mul_mat_q", cmd=[...], source_dir="./ggml/src", output="./isolated/mul_mat_q")
reproducer_dir = result.output_dir
# 2. Edit kernel_variant.cpp or files in deps/ (do not change the kernel signature)
# 3. Recompile — single kernel, no application rebuild
subprocess.run(["make", "recompile"], cwd=reproducer_dir, check=True)
# 4. Compare baseline vs variant
baseline = kc.replay(reproducer_dir)
variant = kc.replay(reproducer_dir, hsaco=os.path.join(reproducer_dir, "optimized.hsaco"))
print(f"Baseline: {baseline.timing_us:.1f} us Variant: {variant.timing_us:.1f} us")
print(f"Speedup: {baseline.timing_us / variant.timing_us:.2f}x")
# 5. Validate correctness
result = kc.validate(reproducer_dir, hsaco=os.path.join(reproducer_dir, "optimized.hsaco"))
print("Passed:", result.passed)
Terminal window
cd ./isolated/mul_mat_q
make run # replay baseline
# edit kernel_variant.cpp and/or deps/
make recompile # recompile into optimized.hsaco
make run-variant # replay variant
kerncap validate . --hsaco optimized.hsaco # correctness check

Technical details

Embedded device pointers

Kerncap uses VA-faithful replay: all device memory is captured in a full snapshot and restored at the original virtual addresses during replay. Embedded device pointers (e.g. T** in batched BLAS, structs with pointer members) work automatically — no pointer patching or relocation tables needed.

Triton autotuner reproducibility

Triton’s @triton.autotune selects a config by benchmarking (e.g. BLOCK_M=128, num_warps=4). Different configs change FP accumulation order, which can cause large numerical differences in FP16. Kerncap captures the winning config and pins it in the reproducer, bypassing re-tuning entirely.

If validation fails with tight tolerances, use kerncap validate --tolerance <atol> to relax the threshold.

Project structure

src/kerncap.{hip,hpp} HSA tool loaded via HSA_TOOLS_LIB (capture)
src/replay.cpp VA-faithful HSA kernel replay binary (kerncap-replay)
kerncap/ Python package (CLI, profiler, capturer, source finder,
reproducer generator, validator)
kerncap/templates/ Jinja2 templates for HIP and Triton reproducers
tests/ Unit + integration tests