Kerncap
Kerncap profiles a running application, intercepts a target kernel dispatch, captures its complete runtime state (full device memory snapshot, kernarg buffer, HSACO), and generates a standalone reproducer that can replay the kernel in isolation using VA-faithful HSA dispatch.
How it works
1. Profile rocprofv3 --kernel-trace --stats → rank kernels by duration2. Capture HIP: LD_PRELOAD=libkerncap.so → intercept target dispatch, snapshot all tracked device memory + kernarg buffer + HSACO Triton: LD_PRELOAD capture with code-object metadata + compile-shim name mapping3. Find source HIP: __global__ grep + #include tracing Triton: @triton.jit AST match + import tracing (incl. relative imports)4. Generate Jinja2 templates → standalone .hip+Makefile or .py reproducer5. Validate Build, run reproducer, np.allclose against captured referenceInstallation
Builds libkerncap.so from source against the host ROCm 7.0+ (requires hipcc, cmake, HSA headers, rocprofiler-sdk — all present in standard ROCm images).
# From local sourcepip install .
# Editable install for developmentpip install -e .[dev]Usage
Each operation is available as both a Python API and a CLI command.
Profile
Rank kernels by total GPU execution time.
from kerncap import Kerncap
kc = Kerncap()profile = kc.profile(["./my_app", "--args"])for kernel in profile[:5]: print(f"{kernel.name}: {kernel.total_duration_ns / 1e6:.1f} ms ({kernel.percentage:.1f}%)")# Profile and print top kernelskerncap profile -- ./my_app --args
# Save profile to JSONkerncap profile --output profile.json -- ./my_appExtract
Capture a kernel’s full runtime state and generate a standalone reproducer.
# HIP kernel with source (enables recompile workflow)result = kc.extract( kernel_name="mul_mat_q", cmd=["./llama-bench", "-m", "model.gguf", "-p", "512"], source_dir="./ggml/src", output="./isolated/mul_mat_q", defines=["GGML_USE_HIP", "GGML_CUDA_FA_ALL_QUANTS"],)print(f"Output: {result.output_dir} has_source: {result.has_source}")
# Triton kernel — language auto-detected from sourceresult = kc.extract( kernel_name="flash_attn_fwd", cmd=["python", "train.py", "--batch-size", "64"], source_dir="./flash_attn", output="./isolated/flash_attn_fwd",)# HIP with sourcekerncap extract mul_mat_q --cmd "..." --source-dir ./ggml/src -D GGML_USE_HIP
# Tritonkerncap extract flash_attn_fwd --cmd "..." --source-dir ./flash_attn
# Capture-only (no source)kerncap extract mul_mat_q --cmd "..."
# Specific dispatchkerncap extract gemm_kernel --cmd "..." --dispatch 2Language detection: Kerncap auto-detects whether a kernel is HIP or Triton from
--source-dircontents. To override, pass--language hipor--language tritonon the CLI (orlanguage="triton"in the Python API).
Triton backend compatibility: The
--triton-backendflag is available for compatibility with the legacy Python capture backend. HSA is the new default, and the Python backend will soon be deprecated.
Replay
Replay a captured kernel in isolation.
# Replay with captured HSACObaseline = kc.replay("./isolated/mul_mat_q")print(f"Baseline: {baseline.timing_us:.1f} us")
# Replay with a variant HSACO and comparevariant = kc.replay("./isolated/mul_mat_q", hsaco="./isolated/mul_mat_q/optimized.hsaco")print(f"Variant: {variant.timing_us:.1f} us")print(f"Speedup: {baseline.timing_us / variant.timing_us:.2f}x")# Replay with captured HSACOkerncap replay ./isolated/mul_mat_q
# Replay with a variant HSACOkerncap replay ./isolated/mul_mat_q --hsaco optimized.hsaco
# Benchmark over multiple iterationskerncap replay ./isolated/mul_mat_q --iterations 100HIP launch mode: If replay conflicts with
rocprofv3(e.g. when profiling the reproducer itself), pass--hip-launchto use the HIP runtime launch path instead of the default HSA dispatch.
Validate
Check correctness of a reproducer or variant HSACO.
# Smoke test — confirm baseline replays without errorresult = kc.validate("./isolated/mul_mat_q")print("Passed:", result.passed)
# Correctness check — compare recompiled variant against captured baselineresult = kc.validate("./isolated/mul_mat_q", hsaco="./isolated/mul_mat_q/optimized.hsaco")print("Passed:", result.passed)
# Triton — compare against captured reference with toleranceresult = kc.validate("./isolated/flash_attn_fwd", tolerance=1e-3, rtol=1e-2)print("Passed:", result.passed)
# Edit loops auto-detect candidate.hsaco / optimized.hsacoresult = kc.validate("./isolated/flash_attn_fwd")print("Passed:", result.passed)# Smoke test — confirm baseline replays without errorkerncap validate ./isolated/mul_mat_q
# Correctness check — compare variant against captured baselinekerncap validate ./isolated/mul_mat_q --hsaco optimized.hsaco
# Triton — compare with relaxed tolerancekerncap validate ./isolated/flash_attn_fwd --tolerance 1e-3 --rtol 1e-2
# Edit loops auto-detect candidate.hsaco / optimized.hsacokerncap validate ./isolated/flash_attn_fwdValidation modes: For VA-faithful captures, baseline
validateis a smoke test only until a rebuilt HSACO is available. Passhsaco, or let kerncap auto-detectcandidate.hsaco/optimized.hsaco, to compare captured vs rebuilt execution byte-for-byte.
kerncap validate <dir> auto-detects rebuilt HSACOs from the edit loop: candidate.hsaco from Triton python3 reproducer.py, or optimized.hsaco from HIP make recompile. In that case validation runs captured vs rebuilt HSACO and reports a byte-exact memory-region summary. The CLI prints a canonical Result: PASS/FAIL (...) footer for replay and validation; per-region PASS lines are hidden unless validation fails or kerncap validate -v is used.
Extract methodology
The extract stage takes a kernel name and a runnable command and produces a fully self-contained reproducer project. Under the hood it runs three sub-stages — capture, find source, generate — each with language-specific paths for HIP and Triton.
Capture
Snapshots the full runtime state of a single kernel dispatch for later replay.
HIP kernels are captured at the HSA level. libkerncap.so (loaded via LD_PRELOAD with rocprofiler-sdk registration) hooks hsa_queue_create to install a packet intercept callback. When the target dispatch arrives, kerncap interposes a completion signal, waits for the kernel to finish, then walks the kernarg buffer. All device memory allocations are tracked via hsa_amd_memory_pool_allocate and hsa_amd_vmem_* hooks. At capture time, a full device memory snapshot is taken — every tracked allocation is D2H copied. The replay binary restores all memory at the original virtual addresses using HSA VMEM APIs, then dispatches the kernel with the captured HSACO. No DWARF metadata or argument parsing needed.
Triton kernels are captured with libkerncap.so loaded via LD_PRELOAD and a lightweight Python compile shim installed through sitecustomize.py. The shim records name_map.json rows that connect Triton’s user-facing function name, signature, constexprs, launch attributes, source snapshot, and HSACO SHA-256 to the captured dispatch. A short-lived run observer records tensor layout metadata so the editable Python reproducer can rebuild accurate torch.as_strided views; the captured kernarg buffer and memory snapshot remain the source of truth.
Find source
Locates the kernel’s source so the reproducer can compile (HIP) or import (Triton) it.
HIP: Searches the source tree for __global__ declarations matching the demangled kernel name, then traces local #include "..." directives recursively (depth 5) to collect all required headers. For projects built with CMake, kerncap uses compile_commands.json to find the translation unit and extract the exact compiler flags needed for the make recompile target.
Note:
compile_commands.jsonis optional — the capture/replay/validate workflow works without it. However, themake recompiletarget (for editing and rebuilding kernels) requires it. Generate it by adding-DCMAKE_EXPORT_COMPILE_COMMANDS=ONwhen configuring your CMake project. If missing, you’ll see: “No compile command found (compile_commands.json missing or has no entry for this file). The ‘make recompile’ target will not be available.” — basic validation still works, but you won’t be able to test kernel modifications.
Triton: Parses Python files under --source-dir with the ast module, matching @triton.jit/@triton.autotune decorators. ImportFrom nodes (including relative imports) are traced to resolve the full dependency set. Kerncap can also fall back to the source snapshots written by the compile shim when --source-dir was omitted or the original file came from a temporary torch.compile/Inductor path.
If source lookup fails, kerncap classifies the failure instead of emitting a generic warning. It distinguishes cases where DWARF source paths exist but --source-dir is wrong, likely Triton kernels were routed as HIP, or the captured code object has no source trail at all, such as Tensile, hand-written assembly, JIT-generated binaries, or third-party HSACO blobs.
Generate
The captured data and located source files are assembled into a standalone project using Jinja2 templates.
HIP kernels produce a VA-faithful replay project using kerncap-replay. The captured HSACO, kernarg buffer, and full device memory snapshot are stored in capture/. make run replays the kernel at the original virtual addresses using HSA VMEM APIs — no kernel source compilation needed.
When --source-dir is provided, kerncap additionally finds the .cu translation unit (via compile_commands.json or reverse-include search) and produces:
kernel_variant.cpp— a copy of the main kernel source file for editingdeps/— copies of all#includedependency headers (traced up to 5 levels deep)vfs.yaml— a Clang Virtual File System overlay that maps all local copies over the originals during recompilation
This enables the optimization workflow below.
| Output | Always | With --source-dir |
|---|---|---|
Captured state (capture/) | Yes | Yes |
Editable source (kernel_variant.cpp) | — | Yes |
Dependency headers (deps/) | — | Yes (when deps exist) |
VFS overlay (vfs.yaml) | — | Yes |
| Makefile | Yes | Yes |
Triton kernels produce a reproducer.py that imports the kernel from the copied source tree, loads tensor arguments from binary dumps, and calls the kernel. The generated project also includes capture/ and reference_output/; editing kernel_variant.py (or the copied source file) and rerunning python3 reproducer.py re-JITs through Triton and writes candidate.hsaco for validation.
Optimization workflow
When source_dir is provided, extract produces a self-contained project for a tight edit-recompile-validate loop:
kernel_variant.cpp Editable copy of the main kernel source filedeps/ Copies of all #include dependency headers (up to 5 levels)vfs.yaml Clang VFS overlay — maps local copies over originals at compile timecapture/ VA-faithful memory snapshot, dispatch metadata, baseline HSACOMakefile make run | make recompile | make run-variant | make validate-variantPrerequisites for
make recompile: The recompilation workflow requirescompile_commands.jsonfrom your project’s build directory. If your project was built with CMake, regenerate with-DCMAKE_EXPORT_COMPILE_COMMANDS=ONto enable this. Without it,make runandkerncap validatestill work (they use the captured HSACO), butmake recompilewon’t be available.
For Triton captures, the edit loop is Python-native rather than Makefile-based:
cd ./isolated/flash_attn_fwd# edit kernel_variant.py or the copied source filepython3 reproducer.py # re-JITs through Triton and writes candidate.hsacokerncap validate . # auto-detects candidate.hsacoimport subprocess, osfrom kerncap import Kerncap
kc = Kerncap()
# 1. Extract (once)result = kc.extract("mul_mat_q", cmd=[...], source_dir="./ggml/src", output="./isolated/mul_mat_q")reproducer_dir = result.output_dir
# 2. Edit kernel_variant.cpp or files in deps/ (do not change the kernel signature)
# 3. Recompile — single kernel, no application rebuildsubprocess.run(["make", "recompile"], cwd=reproducer_dir, check=True)
# 4. Compare baseline vs variantbaseline = kc.replay(reproducer_dir)variant = kc.replay(reproducer_dir, hsaco=os.path.join(reproducer_dir, "optimized.hsaco"))print(f"Baseline: {baseline.timing_us:.1f} us Variant: {variant.timing_us:.1f} us")print(f"Speedup: {baseline.timing_us / variant.timing_us:.2f}x")
# 5. Validate correctnessresult = kc.validate(reproducer_dir, hsaco=os.path.join(reproducer_dir, "optimized.hsaco"))print("Passed:", result.passed)cd ./isolated/mul_mat_q
make run # replay baseline# edit kernel_variant.cpp and/or deps/make recompile # recompile into optimized.hsacomake run-variant # replay variantkerncap validate . # auto-detects optimized.hsaco for correctness checkAI-assisted optimization
The Python API is designed for LLM-driven workflows. A Cursor agent (or any LLM with code execution) can drive the full pipeline — profile, extract, recompile, benchmark, validate — entirely through the Kerncap class without shell scripting. The key inputs to provide are:
- The application command (
cmd) - The source directory (
source_dir) and any preprocessor defines (defines) - The reproducer directory for the edit-recompile-validate loop
capture/dispatch.json is particularly useful context for an LLM: it contains the kernel’s launch configuration (grid/block dims, kernarg layout, GPU architecture) alongside the captured HSACO, giving a complete picture of what the kernel does and how it is launched before any source is read.
Technical details
Embedded device pointers
Kerncap uses VA-faithful replay: all device memory is captured in a full snapshot and restored at the original virtual addresses during replay. Embedded device pointers (e.g. T** in batched BLAS, structs with pointer members) work automatically — no pointer patching or relocation tables needed.
Module variables and constant memory
Some frameworks launch kernels through module-scope variables instead of ordinary dynamic allocations. Kokkos hip_parallel_launch_constant_memory, used by applications such as LAMMPS, SPARTA, and ExaMPM, stores kernel arguments in a __constant__-style HSACO module variable populated by hipMemcpyToSymbol. Kerncap snapshots executable module variables into module_variables.json plus binary blobs and restores only the variables whose executable SHA-256 matches the captured kernel during replay.
Large snapshots
Device-memory snapshots are streamed through a bounded staging buffer instead of copied as one full host allocation per region. The default chunk size is 64 MiB; set KERNCAP_SNAPSHOT_CHUNK_BYTES when debugging captures with very large allocations.
Triton autotuner reproducibility
Triton’s @triton.autotune selects a config by benchmarking (e.g. BLOCK_M=128, num_warps=4). Different configs change FP accumulation order, which can cause large numerical differences in FP16. Kerncap captures the winning config and pins it in the reproducer, bypassing re-tuning entirely.
If validation fails with tight tolerances, use kerncap validate --tolerance <atol> to relax the threshold.
NaN in validation output: Common causes are uninitialized device memory, FP16 overflow, or wrong dtype inference. The validator reports NaN counts per argument and sets
max_errortonan.
Validation targets
- Triton: Flash Attention forward kernel (
ROCm/flash-attention) inrocm/pytorchcontainer - HIP: Composable Kernel GEMM XDL FP16 (
ROCm/composable_kernel) inrocm/composable_kernel:ck_pytorchcontainer - HIP (embedded pointers): Batched vector scale kernel in local ROCm environment, testing T** (double-pointer) arguments via VA-faithful replay
llama.cpp/ggml kernels (template-qualified names like mul_mat_q<(ggml_type)7, 32, true>) are also supported via the -D flag for preprocessor defines.
Project structure
src/kerncap.{hip,hpp} HSA tool loaded via LD_PRELOAD (rocprofiler-sdk registration)src/replay.cpp VA-faithful HSA kernel replay binary (kerncap-replay)src/kernarg_metadata.* AMDGPU code-object kernarg metadata parserkerncap/ Python package (CLI, profiler, capturer, source finder, reproducer generator, validator)kerncap/templates/ Jinja2 templates for HIP and Triton reproducersvendor/ Vendored nlohmann/json headerstests/ Unit + integration tests