End-to-End Workflow

This guide walks through the full IntelliKit workflow: profiling a GPU application, inspecting what executed, isolating and optimizing a kernel, and validating correctness.

The pipeline

Isolate → Profile → Inspect → Optimize → Validate

Isolate a kernel with Kerncap
Profile it with Metrix (hardware counters) and Linex (source-line timing)
Inspect execution with Nexus (assembly + HIP source)
Optimize the kernel in isolation
Validate correctness with Accordo

Step 1: Profile and identify the target kernel

You need a target before you can isolate anything, so start with Metrix to find the hot kernels and understand their performance characteristics.

from metrix import Metrix

profiler = Metrix()
results = profiler.profile(
    "./my_app",
    metrics=["memory.hbm_bandwidth_utilization"],
)

for kernel in results.kernels:
    bw = kernel.metrics["memory.hbm_bandwidth_utilization"].avg
    print(f"{kernel.name}: {kernel.duration_us.avg:.2f} μs, BW util: {bw:.1f}%")
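To pick a target from that output, one option is to rank kernels by average duration. The sketch below is only an illustration: `KernelSummary` and `rank_by_duration` are hypothetical names, not part of Metrix, and the numbers are made up. Only the idea of a kernel name, an average duration, and a bandwidth figure mirrors the loop above.

```python
from dataclasses import dataclass


@dataclass
class KernelSummary:
    """Hypothetical stand-in for one profiled kernel (not a Metrix type)."""
    name: str
    avg_duration_us: float
    bw_util_pct: float


def rank_by_duration(kernels):
    """Return kernels sorted from slowest to fastest average duration."""
    return sorted(kernels, key=lambda k: k.avg_duration_us, reverse=True)


# Illustrative numbers only, not measurements.
samples = [
    KernelSummary("init_buffers", 12.5, 40.0),
    KernelSummary("target_kernel", 830.0, 22.5),
    KernelSummary("reduce_sum", 95.0, 61.0),
]

ranked = rank_by_duration(samples)
hottest = ranked[0]
print(f"Hottest kernel: {hottest.name} ({hottest.avg_duration_us:.1f} us)")
```

A kernel that dominates runtime while showing low bandwidth utilization may be a reasonable first candidate, though the right choice depends on the application.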

Step 2: Inspect what ran on the GPU

Use Nexus to see the assembly and HIP source for each kernel.

from nexus import Nexus

trace = Nexus().run(["./my_app"])
for kernel in trace:
    print(f"{kernel.name}: {len(kernel.assembly)} instructions")
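Beyond counting instructions, a rough instruction mix can hint at whether a kernel is dominated by memory traffic or arithmetic. The sketch below assumes the assembly is available as a list of text lines, which the source does not confirm, and buckets instructions by mnemonic prefix following AMD GCN/CDNA naming conventions. `classify` and `instruction_mix` are hypothetical helpers, and the sample lines are hand-written, not Nexus output.

```python
from collections import Counter


def classify(instruction: str) -> str:
    """Bucket an instruction by its mnemonic prefix (coarse heuristic)."""
    mnemonic = instruction.split()[0]
    if mnemonic.startswith(("global_", "buffer_", "flat_", "ds_")):
        return "memory"
    if mnemonic.startswith("s_waitcnt"):
        return "wait"
    if mnemonic.startswith("v_"):
        return "vector_alu"
    if mnemonic.startswith("s_"):
        return "scalar"
    return "other"


def instruction_mix(assembly):
    """Count instructions per bucket, skipping blank lines."""
    return Counter(classify(line) for line in assembly if line.strip())


# Hand-written sample, assumed format: one instruction per string.
sample_assembly = [
    "global_load_dword v1, v[2:3], off",
    "s_waitcnt vmcnt(0)",
    "v_add_f32 v1, v1, v4",
    "v_mul_f32 v1, v1, v5",
    "global_store_dword v[2:3], v1, off",
    "s_endpgm",
]

mix = instruction_mix(sample_assembly)
for bucket, count in mix.most_common():
    print(f"{bucket}: {count}")
```

The prefix matching is deliberately coarse; it treats LDS (`ds_`) accesses as memory and does not distinguish loads from stores.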

Step 3: Get source-line profiling

Use Linex to map performance to specific source lines. Compile your application with -g so the binary carries the debug information needed for source-line mapping.

from linex import Linex

profiler = Linex()
profiler.profile("./my_app", kernel_filter="target_kernel")

for line in profiler.source_lines[:5]:
    print(f"{line.file}:{line.line_number}")
    print(f"  {line.total_cycles:,} cycles ({line.stall_percent:.1f}% stalled)")
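A line with many cycles but few stalls can matter less than a shorter line that is almost always stalled. One way to weigh the two fields printed above is to estimate stalled cycles as `total_cycles * stall_percent / 100` and rank by that. The sketch below uses a hypothetical `LineSample` stand-in with invented numbers; it is not a Linex type, and the ranking is only one possible heuristic.

```python
from dataclasses import dataclass


@dataclass
class LineSample:
    """Hypothetical stand-in mirroring the fields printed above."""
    file: str
    line_number: int
    total_cycles: int
    stall_percent: float

    @property
    def stalled_cycles(self) -> float:
        """Estimated cycles spent stalled on this line."""
        return self.total_cycles * self.stall_percent / 100.0


def top_stalled(lines, n=3):
    """Return the n lines with the most estimated stalled cycles."""
    return sorted(lines, key=lambda s: s.stalled_cycles, reverse=True)[:n]


# Invented numbers for illustration.
samples = [
    LineSample("kernel.hip", 42, 100_000, 80.0),  # 80,000 stalled
    LineSample("kernel.hip", 57, 400_000, 10.0),  # 40,000 stalled
    LineSample("kernel.hip", 63, 50_000, 90.0),   # 45,000 stalled
]

worst = top_stalled(samples, n=2)
for s in worst:
    print(f"{s.file}:{s.line_number} ~{s.stalled_cycles:,.0f} stalled cycles")
```

Here line 57 has the most total cycles but drops out of the top two once stalls are taken into account.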

Step 4: Isolate and optimize

Use Kerncap to extract the kernel into a standalone reproducer, then iterate on it.

import os
import subprocess

from kerncap import Kerncap

kc = Kerncap()

# Extract the kernel
result = kc.extract(
    "target_kernel",
    cmd=["./my_app"],
    source_dir="./src",
    output="./isolated/target_kernel",
)
reproducer_dir = result.output_dir

# Edit kernel_variant.cpp, then recompile and benchmark
subprocess.run(["make", "recompile"], cwd=reproducer_dir, check=True)

baseline = kc.replay(reproducer_dir)
variant = kc.replay(reproducer_dir, hsaco=os.path.join(reproducer_dir, "optimized.hsaco"))
print(f"Speedup: {baseline.timing_us / variant.timing_us:.2f}x")
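A single replay per version can be skewed by a noisy run. One option is to collect several timings for each version, for example by calling `kc.replay` in a loop and reading `timing_us`, and compare medians. The sketch below works on plain lists of invented timings; `median_speedup` is a hypothetical helper, not a Kerncap function.

```python
import statistics


def median_speedup(baseline_us, variant_us):
    """Ratio of median baseline time to median variant time."""
    if not baseline_us or not variant_us:
        raise ValueError("need at least one timing for each version")
    return statistics.median(baseline_us) / statistics.median(variant_us)


# Invented timings in microseconds; 250.0 simulates one noisy run.
baseline_runs = [102.0, 100.0, 250.0, 101.0, 99.0]
variant_runs = [50.0, 51.0, 49.0, 50.0, 52.0]

speedup = median_speedup(baseline_runs, variant_runs)
print(f"Median speedup: {speedup:.2f}x")
```

With these numbers the median gives 2.02x, whereas comparing means would report roughly 2.59x because of the single outlier.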

Step 5: Validate correctness

Use Accordo to confirm the optimized kernel still produces correct results.

from accordo import Accordo

validator = Accordo(binary="./my_app", kernel_name="target_kernel")
ref = validator.capture_snapshot(binary="./my_app")
opt = validator.capture_snapshot(binary="./my_app_opt")
result = validator.compare_snapshots(ref, opt, tolerance=1e-6)

if result.is_valid:
    print(f"PASS — {result.num_arrays_validated} arrays matched")
else:
    print(result.summary())
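The source does not say how Accordo interprets `tolerance`, so the following is only an illustration of one common meaning: an element-wise absolute difference check. `max_abs_diff` and `within_tolerance` are hypothetical helpers operating on plain Python lists, not Accordo functions.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two sequences."""
    if len(reference) != len(candidate):
        raise ValueError("arrays differ in length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def within_tolerance(reference, candidate, tolerance=1e-6):
    """True if every element differs by at most `tolerance` (absolute)."""
    return max_abs_diff(reference, candidate) <= tolerance


# Invented outputs for illustration.
ref_out = [1.0, 2.0, 3.0]
ok_out = [1.0, 2.0000005, 3.0]   # off by about 5e-7
bad_out = [1.0, 2.001, 3.0]      # off by about 1e-3

print("ok_out matches:", within_tolerance(ref_out, ok_out))
print("bad_out matches:", within_tolerance(ref_out, bad_out))
```

An absolute tolerance like this can be too strict for large-magnitude values and too loose for tiny ones, so a relative check may suit some kernels better.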

Complete example

Putting it all together:

from metrix import Metrix
from nexus import Nexus
from accordo import Accordo

# 1) Baseline metrics
profiler = Metrix()
baseline = profiler.profile(
    "./app_baseline",
    metrics=["memory.hbm_bandwidth_utilization"],
)
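# Assumes the target kernel is the first entry; if the app launches several kernels, select it by name instead.
baseline_bw = baseline.kernels[0].metrics["memory.hbm_bandwidth_utilization"].avg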

# 2) See what ran on the GPU
trace = Nexus().run(["./app_baseline"])
for kernel in trace:
    print(kernel.name, len(kernel.assembly), "instructions")

# 3) After you optimize — check correctness
validator = Accordo(binary="./app_baseline", kernel_name="my_kernel")
ref = validator.capture_snapshot(binary="./app_baseline")
opt = validator.capture_snapshot(binary="./app_opt")
result = validator.compare_snapshots(ref, opt, tolerance=1e-6)

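if result.is_valid:
    # 4) Re-profile the optimized build and compare bandwidth utilization
    opt_results = profiler.profile(
        "./app_opt",
        metrics=["memory.hbm_bandwidth_utilization"],
    )
    opt_bw = opt_results.kernels[0].metrics["memory.hbm_bandwidth_utilization"].avg
    delta = opt_bw - baseline_bw
    print(f"PASS: {result.num_arrays_validated} arrays matched; BW utilization changed by {delta:+.1f} percentage points")
else:
    # Report what differed instead of failing silently
    print(result.summary())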