End-to-End Workflow

This guide walks through the full IntelliKit workflow: profiling a GPU application, inspecting execution, optimizing a kernel, and validating correctness.

The pipeline

Profile → Inspect → Line-profile → Isolate and optimize → Validate
  1. Profile the application with Metrix (hardware counters) to find the hot kernel
  2. Inspect execution with Nexus (assembly + HIP source)
  3. Map time to source lines with Linex (source-line timing)
  4. Isolate the kernel with Kerncap and optimize it in isolation
  5. Validate correctness with Accordo

Step 1: Profile and identify the target kernel

Start with Metrix to find the hot kernels and understand their performance characteristics.

from metrix import Metrix
profiler = Metrix()
results = profiler.profile(
    "./my_app",
    metrics=["memory.hbm_bandwidth_utilization"],
)
for kernel in results.kernels:
    bw = kernel.metrics["memory.hbm_bandwidth_utilization"].avg
    print(f"{kernel.name}: {kernel.duration_us.avg:.2f} μs, BW util: {bw:.1f}%")
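
To pick a target from that output, it can help to rank kernels by time and see which ones leave bandwidth unused. The sketch below works on plain records rather than Metrix objects, so it assumes nothing about the Metrix API beyond the values printed above. `KernelStat`, `rank_hot_kernels`, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelStat:
    """Hypothetical record holding the values printed in Step 1."""
    name: str
    duration_us: float   # average kernel duration in microseconds
    bw_util_pct: float   # average HBM bandwidth utilization in percent


def rank_hot_kernels(stats, top_n=3):
    """Return (stat, share_of_total_time) for the top_n longest kernels."""
    total = sum(s.duration_us for s in stats)
    if total <= 0:
        return []
    ranked = sorted(stats, key=lambda s: s.duration_us, reverse=True)
    return [(s, s.duration_us / total) for s in ranked[:top_n]]


# Illustrative numbers only, not measured results.
stats = [
    KernelStat("init_buffers", 12.0, 35.0),
    KernelStat("target_kernel", 180.0, 22.0),
    KernelStat("reduce_sum", 48.0, 61.0),
]
for stat, share in rank_hot_kernels(stats):
    print(f"{stat.name}: {share:.0%} of GPU time, BW util {stat.bw_util_pct:.1f}%")
```

A kernel that dominates total time while showing low bandwidth utilization is usually a reasonable first candidate for the later steps.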

Step 2: Inspect what ran on the GPU

Use Nexus to see the assembly and HIP source for each kernel.

from nexus import Nexus
trace = Nexus().run(["./my_app"])
for kernel in trace:
    print(f"{kernel.name}: {len(kernel.assembly)} instructions")
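
Raw instruction counts become more useful once they are grouped by kind. As a rough sketch, the helper below buckets AMD GPU mnemonics by prefix (`v_` for vector ALU, `s_` for scalar, `global_`/`buffer_`/`flat_` for vector memory, `ds_` for LDS). It assumes each assembly entry is a text line that starts with the mnemonic, which may not match how Nexus represents instructions. `instruction_mix` and the sample lines are hypothetical.

```python
from collections import Counter

# Mnemonic prefixes mapped to a category. Order matters: first match wins,
# so scalar loads are checked before the generic "s_" bucket.
CATEGORIES = [
    (("global_", "buffer_", "flat_", "scratch_"), "vector memory"),
    (("ds_",), "LDS"),
    (("s_load_", "s_buffer_load_"), "scalar memory"),
    (("v_",), "vector ALU"),
    (("s_",), "scalar/control"),
]


def instruction_mix(lines):
    """Count instructions per category from lines of assembly text."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        mnemonic = parts[0]
        for prefixes, category in CATEGORIES:
            if mnemonic.startswith(prefixes):
                counts[category] += 1
                break
        else:
            counts["other"] += 1
    return counts


# Hand-written sample, not output from a real kernel.
sample = [
    "s_load_dwordx2 s[0:1], s[4:5], 0x0",
    "global_load_dword v1, v[2:3], off",
    "s_waitcnt vmcnt(0)",
    "v_add_f32 v0, v1, v1",
    "global_store_dword v[2:3], v0, off",
    "s_endpgm",
]
for category, count in instruction_mix(sample).most_common():
    print(f"{category}: {count}")
```

A mix dominated by memory instructions relative to ALU work is one hint that a kernel may be bandwidth-bound rather than compute-bound.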

Step 3: Get source-line profiling

Use Linex to map performance to specific source lines. Compile your application with -g so the binary carries the debug information needed for source-line mapping.

from linex import Linex
profiler = Linex()
profiler.profile("./my_app", kernel_filter="target_kernel")
for line in profiler.source_lines[:5]:
    print(f"{line.file}:{line.line_number}")
    print(f" {line.total_cycles:,} cycles ({line.stall_percent:.1f}% stalled)")
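
A high stall percentage on a line that accounts for few cycles matters less than a moderate one on a hot line. One way to weigh the two, sketched below, is to rank lines by estimated stalled cycles. `LineSample` is a hypothetical record mirroring the fields printed above, and the code assumes `stall_percent` is a percentage of that line's own `total_cycles`, which is not confirmed here.

```python
from dataclasses import dataclass


@dataclass
class LineSample:
    """Hypothetical stand-in for one source-line entry."""
    file: str
    line_number: int
    total_cycles: int
    stall_percent: float  # assumed: percent of this line's total_cycles


def stalled_cycles(sample):
    """Estimate cycles spent stalled on this line."""
    return sample.total_cycles * sample.stall_percent / 100.0


def worst_stalls(samples, top_n=3):
    """Return the top_n lines ordered by estimated stalled cycles."""
    return sorted(samples, key=stalled_cycles, reverse=True)[:top_n]


# Illustrative numbers only.
samples = [
    LineSample("kernel.hip", 42, 1_000_000, 80.0),
    LineSample("kernel.hip", 57, 2_000_000, 10.0),
    LineSample("kernel.hip", 63, 500_000, 90.0),
]
for s in worst_stalls(samples):
    print(f"{s.file}:{s.line_number} ~{stalled_cycles(s):,.0f} stalled cycles")
```

In this sample, line 57 has the most total cycles but ranks last, because most of its time is not spent stalled.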

Step 4: Isolate and optimize

Use Kerncap to extract the kernel into a standalone reproducer, then iterate on it.

import os
import subprocess
from kerncap import Kerncap
kc = Kerncap()
# Extract the kernel
result = kc.extract(
    "target_kernel",
    cmd=["./my_app"],
    source_dir="./src",
    output="./isolated/target_kernel",
)
reproducer_dir = result.output_dir
# Edit kernel_variant.cpp, then recompile and benchmark
subprocess.run(["make", "recompile"], cwd=reproducer_dir, check=True)
baseline = kc.replay(reproducer_dir)
variant = kc.replay(reproducer_dir, hsaco=os.path.join(reproducer_dir, "optimized.hsaco"))
print(f"Speedup: {baseline.timing_us / variant.timing_us:.2f}x")
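
A single timing per side can be noisy, so a speedup computed from one pair of runs may not hold up. The sketch below compares medians over several timings instead. It takes plain lists of microsecond values, however you choose to collect them, and does not call Kerncap. `speedup_from_samples` and the numbers are hypothetical.

```python
from statistics import median


def speedup_from_samples(baseline_us, variant_us):
    """Return median(baseline) / median(variant) for two lists of timings."""
    if not baseline_us or not variant_us:
        raise ValueError("need at least one timing per side")
    variant_median = median(variant_us)
    if variant_median <= 0:
        raise ValueError("variant timings must be positive")
    return median(baseline_us) / variant_median


# Illustrative timings in microseconds; 131.0 plays the role of an outlier.
baseline_runs = [102.0, 98.0, 100.0, 131.0]
variant_runs = [80.0, 79.0, 81.0]
print(f"Median speedup: {speedup_from_samples(baseline_runs, variant_runs):.2f}x")
```

The median is used here because it is less sensitive than the mean to an occasional slow run.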

Step 5: Validate correctness

Use Accordo to confirm the optimized kernel still produces correct results.

from accordo import Accordo
validator = Accordo(binary="./my_app", kernel_name="target_kernel")
ref = validator.capture_snapshot(binary="./my_app")
opt = validator.capture_snapshot(binary="./my_app_opt")
result = validator.compare_snapshots(ref, opt, tolerance=1e-6)
if result.is_valid:
    print(f"PASS: {result.num_arrays_validated} arrays matched")
else:
    print(result.summary())
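
What `tolerance=1e-6` permits depends on how the comparison is defined. The sketch below shows one plausible reading, an element-wise absolute-difference check. Whether Accordo applies an absolute, relative, or combined tolerance is not stated here, so treat this as an illustration of the concept and not as Accordo's implementation. `within_tolerance` is a hypothetical helper.

```python
import math


def within_tolerance(reference, candidate, tolerance=1e-6):
    """True if both sequences have equal length and every pair of
    elements differs by at most `tolerance` in absolute value."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=0.0, abs_tol=tolerance)
        for r, c in zip(reference, candidate)
    )


ref_out = [1.0, 2.0, 3.0]
close_out = [1.0, 2.0000005, 3.0]   # off by about 5e-7
far_out = [1.0, 2.00001, 3.0]       # off by about 1e-5

print(within_tolerance(ref_out, close_out))
print(within_tolerance(ref_out, far_out))
```

An absolute tolerance that suits values near 1 can be too strict for large outputs: single-precision floats near 10^6 are spaced about 0.06 apart, so a reordered reduction can legitimately differ by far more than 1e-6.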

Complete example

Putting the profiling, inspection, and validation steps together:

from metrix import Metrix
from nexus import Nexus
from accordo import Accordo
# 1) Baseline metrics
profiler = Metrix()
baseline = profiler.profile(
    "./app_baseline",
    metrics=["memory.hbm_bandwidth_utilization"],
)
# kernels[0] is assumed to be the kernel of interest
baseline_bw = baseline.kernels[0].metrics["memory.hbm_bandwidth_utilization"].avg
# 2) See what ran on the GPU
trace = Nexus().run(["./app_baseline"])
for kernel in trace:
    print(kernel.name, len(kernel.assembly), "instructions")
# 3) After you optimize, check correctness
validator = Accordo(binary="./app_baseline", kernel_name="my_kernel")
ref = validator.capture_snapshot(binary="./app_baseline")
opt = validator.capture_snapshot(binary="./app_opt")
result = validator.compare_snapshots(ref, opt, tolerance=1e-6)
if result.is_valid:
    # Re-profile only once the optimized build is known to be correct
    opt_results = profiler.profile(
        "./app_opt",
        metrics=["memory.hbm_bandwidth_utilization"],
    )
    opt_bw = opt_results.kernels[0].metrics["memory.hbm_bandwidth_utilization"].avg
    print(f"PASS: {result.num_arrays_validated} arrays matched; BW utilization delta {opt_bw - baseline_bw:+.1f} percentage points")
else:
    print(result.summary())