Measuring speedup over baseline kernels#
NPUEval ships with 102 canonical, unoptimized kernels alongside the prompts. These set the baseline performance floor and provide concrete cycle counts, which are ultimately what we want to minimize. In this notebook we will learn how to plot and interpret the relative speedup of generated kernels over these baselines.
Before running this notebook make sure you have run the run_canonical.py
script located in the scripts/ directory of this repository. This ensures that the baseline evaluations are available.
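If you are unsure whether the baselines are in place, a quick sanity check can save a confusing failure later. This is a minimal sketch that assumes the canonical results live in ../results/evaluations/canonical/ and follow the *_wrapper.json naming used later in this notebook; adjust the path if run_canonical.py wrote them elsewhere.
import glob
import os
# Assumed location of the canonical (baseline) evaluation JSONs
canonical_path = "../results/evaluations/canonical/"
baseline_files = glob.glob(os.path.join(canonical_path, "*_wrapper.json"))
print(f"Found {len(baseline_files)} baseline evaluation files")
assert baseline_files, "No baseline evaluations found - run scripts/run_canonical.py first"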
Goals#
Generate speedup plots using previously generated results + canonical solutions
Imports and setup#
As in previous notebooks, we’ll use only two models and a small subset of the dataset for illustration purposes.
import os
import json
import pandas as pd
import math
import matplotlib.pyplot as plt
from npueval import dataset
models = [
    "gpt-4.1",
    "gpt-4o-mini",
]
tests = dataset[:5]
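To confirm exactly which kernels the subset covers, you can list their names. This assumes each dataset entry exposes a kernel_name field, the same field the parsing loop below relies on.
# Print the kernel names in our 5-kernel subset
print([test['kernel_name'] for test in tests])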
Parse results#
We’ll iterate over the same 5 kernels that were generated in the previous notebooks. You can adapt this snippet to point to your own evaluations directory and model list.
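Before aggregating anything, it can help to inspect a single evaluation file. The sketch below loads the result for the first kernel in the subset and prints the fields the loop uses ('result', 'total_cycles' and 'vector_score'); the gpt-4.1 path is only an example and assumes you generated those evaluations in the previous notebooks.
# Peek at one evaluation JSON to see the fields used below
example_path = f"results/evaluations/gpt-4.1/{tests[0]['kernel_name']}_wrapper.json"
with open(example_path, 'r') as f:
    example = json.load(f)
print(example.get('result'), example.get('total_cycles'), example.get('vector_score'))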
# We'll keep a list of results per model config
results = []

# Calculate stats for each model (including pass rate, vectorization score and speedup)
for model in models:
    print(model)

    # Solutions path of previously run model evaluations
    solutions_path = f"results/evaluations/{model}/"

    # Canonical kernel evaluations (make sure to run run_canonical.py before this notebook)
    canonical_path = "../results/evaluations/canonical/"

    # Initialize totals at 0
    num_passes, total_score, total_speedup = 0, 0, 0

    # For each kernel
    for test in tests:
        eval_path = os.path.join(solutions_path, f"{test['kernel_name']}_wrapper.json")
        baseline_path = os.path.join(canonical_path, f"{test['kernel_name']}_wrapper.json")

        # Load test results
        with open(eval_path, 'r') as eval_f:
            eval_data = json.load(eval_f)

        # If the functional test passed, get runtime stats
        if eval_data.get('result') == "Pass":
            total_cycles = eval_data.get('total_cycles')
            total_score += eval_data.get('vector_score', 0)

            # Count number of passed tests
            num_passes += 1

            # Load baseline results
            with open(baseline_path, 'r') as baseline_f:
                baseline_data = json.load(baseline_f)
            baseline_cycles = baseline_data.get('total_cycles')

            # Calculate relative speedup over baseline
            speedup = baseline_cycles / total_cycles

            # Safeguard against NaN
            if not math.isnan(speedup):
                total_speedup += speedup

    # Compute averages over the full test set
    pass_rate = num_passes / len(tests)
    avg_score = total_score / len(tests)
    avg_speedup = total_speedup / len(tests)

    results.append({
        "model": model,
        "avg_score": avg_score,
        "pass_rate": pass_rate,
        "speedup": avg_speedup
    })
# Create DataFrame
df = pd.DataFrame(results)
print(df)
gpt-4.1
gpt-4o-mini
model avg_score pass_rate speedup
0 gpt-4.1 0.011481 0.6 3.997621
1 gpt-4o-mini 0.020455 1.0 1.000000
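The averages above hide which kernels drive them. To get per-kernel numbers, a small variation of the loop records one row per (model, kernel) pair instead of accumulating totals; this sketch assumes the same directory layout and JSON fields as above and skips kernels that didn't pass.
# Per-kernel speedups for passing solutions
rows = []
for model in models:
    for test in tests:
        eval_path = os.path.join(f"results/evaluations/{model}/", f"{test['kernel_name']}_wrapper.json")
        baseline_path = os.path.join("../results/evaluations/canonical/", f"{test['kernel_name']}_wrapper.json")
        with open(eval_path, 'r') as f:
            eval_data = json.load(f)
        if eval_data.get('result') != "Pass":
            continue
        with open(baseline_path, 'r') as f:
            baseline_data = json.load(f)
        rows.append({
            "model": model,
            "kernel": test['kernel_name'],
            "speedup": baseline_data.get('total_cycles') / eval_data.get('total_cycles'),
        })
print(pd.DataFrame(rows))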
Visualize as a bar chart#
We’ll plot the results in the same format as the original blog post.
# Create bar plot from DataFrame
plt.figure(figsize=(8, 6))
plt.bar(df['model'], df['speedup'])
plt.ylabel('Speedup')
plt.xlabel('Model')
plt.title('Model Speedup Comparison')
plt.grid(axis='y', linestyle='--', alpha=0.3)
# Add value labels on bars
for i, v in enumerate(df['speedup']):
    plt.text(i, v + 0.05, f'x{v:.2f}', ha='center', va='bottom', fontweight='bold')
plt.tight_layout()
plt.show()

As expected, the stronger GPT-4.1 model achieves a larger speedup over the baseline than the mini variant, even though it passes fewer of the functional tests in this small subset.
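To make that trade-off visible, you can plot pass rate and speedup side by side. This is a minimal sketch using only the columns already in df, with twin y-axes since the two metrics have different scales.
# Compare pass rate and speedup side by side on twin axes
fig, ax1 = plt.subplots(figsize=(8, 6))
ax2 = ax1.twinx()

x = range(len(df))
ax1.bar([i - 0.2 for i in x], df['pass_rate'], width=0.4)
ax2.bar([i + 0.2 for i in x], df['speedup'], width=0.4, color='tab:orange')

ax1.set_xticks(list(x))
ax1.set_xticklabels(df['model'])
ax1.set_ylabel('Pass rate')
ax2.set_ylabel('Speedup')
ax1.set_title('Pass rate vs speedup')

fig.tight_layout()
plt.show()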
Copyright© 2025 AMD, Inc SPDX-License-Identifier: MIT