End-to-End Workflow
This guide walks through the full IntelliKit workflow: profiling a GPU application, inspecting execution, optimizing a kernel, and validating correctness.
The pipeline
Isolate → Profile → Inspect → Optimize → Validate

- Isolate a kernel with Kerncap
- Profile it with Metrix (hardware counters) and Linex (source-line timing)
- Inspect execution with Nexus (assembly + HIP source)
- Optimize the kernel in isolation
- Validate correctness with Accordo
Step 1: Profile and identify the target kernel
Start with Metrix to find the hot kernels and understand their performance characteristics.
from metrix import Metrix

profiler = Metrix()
results = profiler.profile(
    "./my_app",
    metrics=["memory.hbm_bandwidth_utilization"],
)

for kernel in results.kernels:
    bw = kernel.metrics["memory.hbm_bandwidth_utilization"].avg
    print(f"{kernel.name}: {kernel.duration_us.avg:.2f} μs, BW util: {bw:.1f}%")

Step 2: Inspect what ran on the GPU
Use Nexus to see the assembly and HIP source for each kernel.
from nexus import Nexus

trace = Nexus().run(["./my_app"])
for kernel in trace:
    print(f"{kernel.name}: {len(kernel.assembly)} instructions")

Step 3: Get source-line profiling
Use Linex to map performance to specific source lines. Compile your application with -g for source-line mapping.
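One way to read per-line results is to rank lines by stalled cycles instead of raw cycle counts, so that a busy but healthy line does not crowd out a smaller line that spends most of its time waiting. The sketch below assumes records with `total_cycles` and `stall_percent` fields like the ones printed in this guide's Linex snippet. `LineSample`, `stalled_cycles`, and all numbers are hypothetical stand-ins, not Linex output or API.

```python
from dataclasses import dataclass


@dataclass
class LineSample:
    """Hypothetical stand-in for one profiled source line (not a Linex type)."""
    file: str
    line_number: int
    total_cycles: int
    stall_percent: float

    @property
    def stalled_cycles(self) -> float:
        # Cycles spent stalled = total cycles scaled by the stall percentage.
        return self.total_cycles * self.stall_percent / 100.0


# Made-up samples, for illustration only.
samples = [
    LineSample("kernel.hip", 42, 1_000_000, 20.0),
    LineSample("kernel.hip", 57, 600_000, 90.0),
    LineSample("kernel.hip", 63, 200_000, 50.0),
]

# Rank by stalled cycles, largest first.
ranked = sorted(samples, key=lambda s: s.stalled_cycles, reverse=True)

for s in ranked:
    print(f"{s.file}:{s.line_number}  {s.stalled_cycles:,.0f} stalled cycles")
```

Here line 57 ranks first even though line 42 has more total cycles. The Linex call that produces real per-line data follows.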
from linex import Linex

profiler = Linex()
profiler.profile("./my_app", kernel_filter="target_kernel")

for line in profiler.source_lines[:5]:
    print(f"{line.file}:{line.line_number}")
    print(f"  {line.total_cycles:,} cycles ({line.stall_percent:.1f}% stalled)")

Step 4: Isolate and optimize
Use Kerncap to extract the kernel into a standalone reproducer, then iterate on it.
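When iterating on a reproducer, a single timing per variant can be misleading because individual runs vary. A minimal sketch, assuming you have collected several timings per variant yourself: compare medians instead of single samples. The `speedup` helper and the timings are hypothetical and use only the standard library; nothing here is part of Kerncap.

```python
from statistics import median


def speedup(baseline_us, variant_us):
    """Return median(baseline) / median(variant); above 1.0 means the variant is faster."""
    if not baseline_us or not variant_us:
        raise ValueError("need at least one timing per side")
    return median(baseline_us) / median(variant_us)


# Made-up timings in microseconds, each list containing one outlier run.
baseline_runs = [120.0, 118.0, 131.0, 119.0, 121.0]
variant_runs = [80.0, 79.0, 95.0, 81.0, 80.0]

ratio = speedup(baseline_runs, variant_runs)
print(f"Speedup: {ratio:.2f}x")
```

The median keeps the two outlier runs from skewing the ratio. The Kerncap extract-and-replay flow itself looks like this: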
import os
import subprocess

from kerncap import Kerncap

kc = Kerncap()

# Extract the kernel
result = kc.extract(
    "target_kernel",
    cmd=["./my_app"],
    source_dir="./src",
    output="./isolated/target_kernel",
)
reproducer_dir = result.output_dir

# Edit kernel_variant.cpp, then recompile and benchmark
subprocess.run(["make", "recompile"], cwd=reproducer_dir, check=True)

baseline = kc.replay(reproducer_dir)
variant = kc.replay(reproducer_dir, hsaco=os.path.join(reproducer_dir, "optimized.hsaco"))
print(f"Speedup: {baseline.timing_us / variant.timing_us:.2f}x")

Step 5: Validate correctness
Use Accordo to confirm the optimized kernel still produces correct results.
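For floating-point kernels, "correct" usually means "equal within a tolerance", since reordering arithmetic can change the low bits of a result. The sketch below shows an element-wise absolute-difference check to make that idea concrete. It is not Accordo's implementation; whether Accordo applies an absolute or a relative tolerance is not stated in this guide, and `arrays_match` is a hypothetical helper.

```python
import math


def arrays_match(ref, opt, tolerance=1e-6):
    """Compare two equal-length sequences of floats.

    Returns (ok, worst_index, worst_abs_diff). A NaN on either side fails
    immediately, because NaN never compares within a tolerance.
    """
    if len(ref) != len(opt):
        raise ValueError("arrays differ in length")
    worst_index, worst_diff = -1, 0.0
    for i, (a, b) in enumerate(zip(ref, opt)):
        if math.isnan(a) or math.isnan(b):
            return False, i, math.nan
        diff = abs(a - b)
        if diff > worst_diff:
            worst_index, worst_diff = i, diff
    return worst_diff <= tolerance, worst_index, worst_diff


# Made-up outputs, for illustration only.
ref = [1.0, 2.0, 3.0]
close = [1.0, 2.0000005, 3.0]   # differs by about 5e-7 at index 1
far = [1.0, 2.001, 3.0]         # differs by about 1e-3 at index 1

for label, candidate in (("close", close), ("far", far)):
    ok, idx, diff = arrays_match(ref, candidate)
    print(f"{label}: {'PASS' if ok else 'FAIL'} (worst diff {diff:.1e} at index {idx})")
```

With Accordo, the equivalent check on captured kernel outputs looks like this: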
from accordo import Accordo

validator = Accordo(binary="./my_app", kernel_name="target_kernel")
ref = validator.capture_snapshot(binary="./my_app")
opt = validator.capture_snapshot(binary="./my_app_opt")
result = validator.compare_snapshots(ref, opt, tolerance=1e-6)

if result.is_valid:
    print(f"PASS — {result.num_arrays_validated} arrays matched")
else:
    print(result.summary())

Complete example
Putting it all together:
from metrix import Metrix
from nexus import Nexus
from accordo import Accordo

# 1) Baseline metrics
profiler = Metrix()
baseline = profiler.profile(
    "./app_baseline",
    metrics=["memory.hbm_bandwidth_utilization"],
)
baseline_bw = baseline.kernels[0].metrics["memory.hbm_bandwidth_utilization"].avg

# 2) See what ran on the GPU
trace = Nexus().run(["./app_baseline"])
for kernel in trace:
    print(kernel.name, len(kernel.assembly), "instructions")

# 3) After you optimize — check correctness
validator = Accordo(binary="./app_baseline", kernel_name="my_kernel")
ref = validator.capture_snapshot(binary="./app_baseline")
opt = validator.capture_snapshot(binary="./app_opt")
result = validator.compare_snapshots(ref, opt, tolerance=1e-6)

if result.is_valid:
    opt_results = profiler.profile(
        "./app_opt",
        metrics=["memory.hbm_bandwidth_utilization"],
    )
    opt_bw = opt_results.kernels[0].metrics["memory.hbm_bandwidth_utilization"].avg
    print(f"PASS — {result.num_arrays_validated} arrays matched; BW delta {opt_bw - baseline_bw:.1f}%")
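When you try several variants of the same kernel, the workflow above suggests a simple acceptance rule: keep a variant only if it validates and is faster than the baseline. A minimal sketch of that rule follows, assuming you have already recorded a validity flag and a timing for each variant. `Candidate`, `pick_best`, and the numbers are hypothetical and not part of any IntelliKit tool.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record of one optimized variant's results."""
    name: str
    is_valid: bool      # outcome of the correctness check
    timing_us: float    # measured kernel time in microseconds


def pick_best(baseline_us, candidates):
    """Return the fastest valid candidate that beats the baseline, or None."""
    accepted = [c for c in candidates if c.is_valid and c.timing_us < baseline_us]
    return min(accepted, key=lambda c: c.timing_us, default=None)


# Made-up results, for illustration only.
baseline_us = 120.0
candidates = [
    Candidate("unrolled", True, 90.0),
    Candidate("fast_but_wrong", False, 40.0),  # fastest, but fails validation
    Candidate("tiled", True, 75.0),
    Candidate("slower", True, 130.0),          # valid, but slower than baseline
]

best = pick_best(baseline_us, candidates)
if best is None:
    print("No variant accepted; keep the baseline")
else:
    print(f"Accepted {best.name}: {baseline_us / best.timing_us:.2f}x")
```

Checking validity before comparing timings keeps an incorrect but fast variant from ever being selected.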