Kerncap
Kerncap profiles a running application, intercepts a target kernel dispatch, captures its complete runtime state (full device memory snapshot, kernarg buffer, HSACO), and generates a standalone reproducer that can replay the kernel in isolation using VA-faithful HSA dispatch.
How it works
1. Profile rocprofv3 --kernel-trace --stats → rank kernels by duration2. Capture HIP: HSA_TOOLS_LIB=libkerncap.so → intercept target dispatch, snapshot all tracked device memory + kernarg buffer + HSACO Triton: Python-level hook on JITFunction.run → capture all tensor, scalar, and constexpr args; pin autotuner config3. Find source HIP: __global__ grep + #include tracing Triton: @triton.jit AST match + import tracing (incl. relative imports)4. Generate Jinja2 templates → standalone .hip+Makefile or .py reproducer5. Validate Build, run reproducer, np.allclose against captured referenceInstallation
Builds libkerncap.so from source against the host ROCm (requires hipcc, cmake, HSA headers — all present in standard ROCm images).
# From local sourcepip install .
# Editable install for developmentpip install -e .[dev]Usage
Each operation is available as both a Python API and a CLI command.
Profile
Rank kernels by total GPU execution time.
from kerncap import Kerncap
kc = Kerncap()profile = kc.profile(["./my_app", "--args"])for kernel in profile[:5]: print(f"{kernel.name}: {kernel.total_duration_ns / 1e6:.1f} ms ({kernel.percentage:.1f}%)")kerncap profile -- ./my_app --argskerncap profile --output profile.json -- ./my_appExtract
Capture a kernel’s full runtime state and generate a standalone reproducer.
# HIP kernel with source (enables recompile workflow)result = kc.extract( kernel_name="mul_mat_q", cmd=["./llama-bench", "-m", "model.gguf", "-p", "512"], source_dir="./ggml/src", output="./isolated/mul_mat_q", defines=["GGML_USE_HIP", "GGML_CUDA_FA_ALL_QUANTS"],)print(f"Output: {result.output_dir} has_source: {result.has_source}")
# Triton kernel — language auto-detected from sourceresult = kc.extract( kernel_name="flash_attn_fwd", cmd=["python", "train.py", "--batch-size", "64"], source_dir="./flash_attn", output="./isolated/flash_attn_fwd",)# HIP with sourcekerncap extract mul_mat_q --cmd "..." --source-dir ./ggml/src -D GGML_USE_HIP
# Tritonkerncap extract flash_attn_fwd --cmd "..." --source-dir ./flash_attn
# Capture-only (no source)kerncap extract mul_mat_q --cmd "..."
# Specific dispatchkerncap extract gemm_kernel --cmd "..." --dispatch 2Replay
Replay a captured kernel in isolation.
baseline = kc.replay("./isolated/mul_mat_q")print(f"Baseline: {baseline.timing_us:.1f} us")
variant = kc.replay("./isolated/mul_mat_q", hsaco="./isolated/mul_mat_q/optimized.hsaco")print(f"Variant: {variant.timing_us:.1f} us")print(f"Speedup: {baseline.timing_us / variant.timing_us:.2f}x")kerncap replay ./isolated/mul_mat_qkerncap replay ./isolated/mul_mat_q --hsaco optimized.hsacokerncap replay ./isolated/mul_mat_q --iterations 100Validate
Check correctness of a reproducer or variant HSACO.
# Smoke test — confirm baseline replays without errorresult = kc.validate("./isolated/mul_mat_q")print("Passed:", result.passed)
# Correctness check — compare recompiled variant against captured baselineresult = kc.validate("./isolated/mul_mat_q", hsaco="./isolated/mul_mat_q/optimized.hsaco")print("Passed:", result.passed)
# Triton — compare against captured reference with toleranceresult = kc.validate("./isolated/flash_attn_fwd", tolerance=1e-3, rtol=1e-2)print("Passed:", result.passed)HIP vs Triton validation: For HIP kernels, baseline
validateis a smoke test only. Passhsacoto compare a recompiled variant byte-for-byte against the captured baseline. For Triton reproducers,validatecompares outputs against captured reference data usingnp.allclose.
Optimization workflow
When source_dir is provided, extract produces a self-contained project for a tight edit-recompile-validate loop:
kernel_variant.cpp Editable copy of the main kernel source filedeps/ Copies of all #include dependency headers (up to 5 levels)vfs.yaml Clang VFS overlay — maps local copies over originals at compile timecapture/ VA-faithful memory snapshot, dispatch metadata, baseline HSACOMakefile make run | make recompile | make run-variant | make validate-variantimport subprocess, osfrom kerncap import Kerncap
kc = Kerncap()
# 1. Extract (once)result = kc.extract("mul_mat_q", cmd=[...], source_dir="./ggml/src", output="./isolated/mul_mat_q")reproducer_dir = result.output_dir
# 2. Edit kernel_variant.cpp or files in deps/ (do not change the kernel signature)
# 3. Recompile — single kernel, no application rebuildsubprocess.run(["make", "recompile"], cwd=reproducer_dir, check=True)
# 4. Compare baseline vs variantbaseline = kc.replay(reproducer_dir)variant = kc.replay(reproducer_dir, hsaco=os.path.join(reproducer_dir, "optimized.hsaco"))print(f"Baseline: {baseline.timing_us:.1f} us Variant: {variant.timing_us:.1f} us")print(f"Speedup: {baseline.timing_us / variant.timing_us:.2f}x")
# 5. Validate correctnessresult = kc.validate(reproducer_dir, hsaco=os.path.join(reproducer_dir, "optimized.hsaco"))print("Passed:", result.passed)cd ./isolated/mul_mat_q
make run # replay baseline# edit kernel_variant.cpp and/or deps/make recompile # recompile into optimized.hsacomake run-variant # replay variantkerncap validate . --hsaco optimized.hsaco # correctness checkTechnical details
Embedded device pointers
Kerncap uses VA-faithful replay: all device memory is captured in a full snapshot and restored at the original virtual addresses during replay. Embedded device pointers (e.g. T** in batched BLAS, structs with pointer members) work automatically — no pointer patching or relocation tables needed.
Triton autotuner reproducibility
Triton’s @triton.autotune selects a config by benchmarking (e.g. BLOCK_M=128, num_warps=4). Different configs change FP accumulation order, which can cause large numerical differences in FP16. Kerncap captures the winning config and pins it in the reproducer, bypassing re-tuning entirely.
If validation fails with tight tolerances, use kerncap validate --tolerance <atol> to relax the threshold.
Project structure
src/kerncap.{hip,hpp} HSA tool loaded via HSA_TOOLS_LIB (capture)src/replay.cpp VA-faithful HSA kernel replay binary (kerncap-replay)kerncap/ Python package (CLI, profiler, capturer, source finder, reproducer generator, validator)kerncap/templates/ Jinja2 templates for HIP and Triton reproducerstests/ Unit + integration tests