The NPUEval dataset#
This notebook shows how to load the NPUEval dataset and generate a single code completion. We will use the NPUEval AIECoder class as a thin wrapper over the OpenAI and Anthropic client APIs. We will also explore how compiler feedback helps the LLM correct itself and resolve hallucinations.
Goals#
Learn how to load and parse the dataset
Generate a single prompt completion (with recompilation attempts)
Parse the response to extract only the generated source code
Understanding the dataset#
NPUEval is stored in the JSON Lines (JSONL) file format and can be found in dataset/npueval.jsonl. We provide a convenient wrapper to load and iterate over the dataset kernels.
from npueval import dataset
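If you prefer, the file can also be read directly with Python's standard json module – a minimal sketch, assuming the dataset/npueval.jsonl path mentioned above:
import json
# Each line of the JSONL file is a standalone JSON object (one kernel per line)
with open('dataset/npueval.jsonl') as f:
    kernels = [json.loads(line) for line in f]
print(kernels[0]['kernel_name'])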
Dataset format#
The dataset consists of 100+ NPU kernel prompts and test vectors.
len(dataset)
102
They’re listed in alphabetical order, and the suffix of each name denotes the primary datatype the kernel operates on.
for kernel in dataset[:5]:
    print(kernel['kernel_name'])
abs_int8
add_offset_int8
add_offset_uint8
argmax_bfloat16
argmax_int32
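Since the suffix encodes the datatype, we can, for example, tally the kernels by type – a quick sketch, assuming the dataset object is iterable like the slice above:
from collections import Counter
# The text after the last underscore in each name is the primary datatype
dtype_counts = Counter(k['kernel_name'].rsplit('_', 1)[-1] for k in dataset)
print(dtype_counts)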
Each entry in the dataset has the following fields:
kernel_name – name of the kernel; matches the name in the function signature.
prompt – the core part of the dataset; this is what will be fed to an LLM to generate a solution.
canonical_solution – an optional, unoptimized baseline solution; these are used to help generate the prompts and to run tests.
program_code – the wrapper code around the kernel call that adds event tracing to help us measure performance.
test_vectors – reference input and output vectors to verify functional correctness of the LLM solution.
We can look up kernels by name using the get_by_name method. This is useful when focusing on a single kernel for further optimization or debugging.
sample = dataset.get_by_name('relu_bfloat16')
sample.keys()
dict_keys(['kernel_name', 'prompt', 'canonical_solution', 'program_code', 'test_vectors', 'tolerances'])
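Before using the reference vectors it's worth inspecting their structure, since the layout isn't documented here – a trivial sketch (treat any assumptions about key names as just that):
# Peek at the stored reference data and tolerances for this kernel
print(type(sample['test_vectors']))
print(sample['tolerances'])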
Prompt structure#
The prompt structure is very simple – there is a docstring that includes the description and buffer shapes of the kernel to be implemented, followed by a function definition.
Plain prompt#
print(sample['prompt'])
/*
This AIE kernel performs a ReLU activation on a bfloat16 input vector of fixed size.
>>> relu_bfloat16([1.765625, 0.400390625, 0.98046875, 2.234375, 1.8671875, -0.9765625, 0.94921875, -0.1513671875])
[1.765625, 0.400390625, 0.98046875, 2.234375, 1.8671875, 0.0, 0.94921875, 0.0]
This kernel should be optimized for the following input/output buffer shapes and parameters:
in_buffer size: 256
out_buffer size: 256
*/
#include <aie_api/aie.hpp>
#include "aie_kernel_utils.h"
void relu_bfloat16(bfloat16 *in_buffer, bfloat16 *out_buffer) {
    // Implementation goes here
}
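For intuition, the expected behaviour can be mirrored in a few lines of Python – a sketch using numpy and ml_dtypes (the Python-side bfloat16 reference mentioned later in this notebook); this helper is illustrative and not part of the dataset:
import numpy as np
import ml_dtypes

def relu_ref(x):
    # Cast to bfloat16 first to mirror the kernel's input precision
    x = np.asarray(x, dtype=ml_dtypes.bfloat16)
    return np.maximum(x, ml_dtypes.bfloat16(0))

print(relu_ref([1.765625, -0.9765625, 0.94921875, -0.1513671875]))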
Prompt + canonical solution#
We can use the canonical solution to complete this implementation, just to see what the full functional kernel source code looks like. Note that the canonical solution does not use any AIE APIs and is not optimized.
print(sample['prompt'][:-2]+sample['canonical_solution'])
/*
This AIE kernel performs a ReLU activation on a bfloat16 input vector of fixed size.
>>> relu_bfloat16([1.765625, 0.400390625, 0.98046875, 2.234375, 1.8671875, -0.9765625, 0.94921875, -0.1513671875])
[1.765625, 0.400390625, 0.98046875, 2.234375, 1.8671875, 0.0, 0.94921875, 0.0]
This kernel should be optimized for the following input/output buffer shapes and parameters:
in_buffer size: 256
out_buffer size: 256
*/
#include <aie_api/aie.hpp>
#include "aie_kernel_utils.h"
void relu_bfloat16(bfloat16 *in_buffer, bfloat16 *out_buffer) {
    // Implementation goes here
    constexpr int32_t num_elements = 256;
    for (uint32_t i = 0; i < num_elements; ++i) {
        out_buffer[i] = in_buffer[i] < 0 ? 0 : in_buffer[i];
    }
}
Prompt + canonical solution + kernel wrapper#
Once we add the wrapper code we can see the full implementation. The wrapper code sets a default rounding mode to match the Python ml_dtypes implementation of bfloat16 and adds event trace markers so we can measure the runtime in cycles post-execution.
print(sample['prompt'][:-2] + sample['canonical_solution'] + sample['program_code'])
/*
This AIE kernel performs a ReLU activation on a bfloat16 input vector of fixed size.
>>> relu_bfloat16([1.765625, 0.400390625, 0.98046875, 2.234375, 1.8671875, -0.9765625, 0.94921875, -0.1513671875])
[1.765625, 0.400390625, 0.98046875, 2.234375, 1.8671875, 0.0, 0.94921875, 0.0]
This kernel should be optimized for the following input/output buffer shapes and parameters:
in_buffer size: 256
out_buffer size: 256
*/
#include <aie_api/aie.hpp>
#include "aie_kernel_utils.h"
void relu_bfloat16(bfloat16 *in_buffer, bfloat16 *out_buffer) {
    // Implementation goes here
    constexpr int32_t num_elements = 256;
    for (uint32_t i = 0; i < num_elements; ++i) {
        out_buffer[i] = in_buffer[i] < 0 ? 0 : in_buffer[i];
    }
}extern "C" {
void relu_bfloat16_wrapper(bfloat16 *in_buffer, bfloat16 *out_buffer) {
    ::aie::set_rounding(aie::rounding_mode::positive_inf);
    event0();
    relu_bfloat16(in_buffer, out_buffer);
    event1();
}
}
Generating completions#
In this section we will explore how to generate a completion using the NPUEval built-in AIECoder agent.
AIECoder - a simple coding agent#
We provide a code generator AIECoder class that acts as a lightweight wrapper around the OpenAI/Anthropic clients and integrates an open-source single-core AIE compiler.
By default it points to the default OpenAI endpoint URL, but you can provide a custom base_url if you are using your own solution, e.g. locally hosted models with vLLM or llama.cpp.
Make sure you have OPENAI_API_KEY or ANTHROPIC_API_KEY set in your environment; if not, you can pass the key as a parameter or set it via os.environ.
from npueval.aiecoder import AIECoder
# import os
# os.environ['OPENAI_API_KEY'] = "sk-..."
# or
# AIECoder(api_key="your api key here")
coder = AIECoder()
print(coder.client)
<openai.OpenAI object at 0x763ce163b8c0>
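For a locally hosted model, construction might look like the following – a sketch in which the endpoint URL and model name are placeholders (base_url and api_key are accepted per the notes above):
# Hypothetical local endpoint, e.g. vLLM or llama.cpp serving an OpenAI-compatible API
# coder = AIECoder(model='my-local-model',
#                  base_url='http://localhost:8000/v1',
#                  api_key='not-needed')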
Some of the important parameters we can pass to the coder are:
Name of the LLM we want to connect to (the default is printed below)
Temperature (defaults to 0)
Number of recompilation attempts (defaults to 1)
print(f"{coder.model=}")
print(f"{coder.temperature=}")
print(f"{coder.attempts=}")
coder.model='gpt-4'
coder.temperature=0.0
coder.attempts=1
System prompt#
The AIECoder system prompt is primarily used to steer the output formatting rather than to help the LLM optimize the kernels – for the baseline case we just want to evaluate the base LLM's capability on this code generation task. We want it to produce a single code block containing only the AIE kernel C++. Without the system prompt the model might explain its reasoning step by step, generating multiple C++ blocks or even bash commands for compilation – this would be difficult for us to parse programmatically.
print(coder.system_prompt)
You are a part of a code generation system for AIE (AI Engines).
* Your job is to write C++ code for a single kernel that will run on an AIE tile.
* Produce only the C++ code for the requested kernel including any required headers and imports.
* Make sure the C++ code is complete and self contained in a single code block.
* Name the function exactly as specified in the request, and output only the kernel (no main(), examples, explanations or extra code).
Generate a single code completion#
We’ll create a coder object that uses gpt-4o and set attempts=2 – this means it will try to generate another completion if the first one fails to compile. You can set this to any number you want, with the context window being the limiting factor. By default the AIECoder class has attempts=1, which means it won’t try to recompile and will just output the first response generated by the model.
sample = dataset.get_by_name('conv1d_int32')
coder = AIECoder(model='gpt-4o', temperature=0.4, attempts=2)
response = coder(sample['prompt'])
Print the response to see the last code block generated by the model:
print(response['response'])
```cpp
#include <cstdint>
void conv1d_int32(int32_t *in_buffer, int32_t *kernel, int32_t *out_buffer, uint32_t stride) {
    const int in_size = 256;
    const int kernel_size = 3;
    const int out_size = 254; // Calculated as (in_size - kernel_size) / stride + 1
    for (int i = 0; i < out_size; i += stride) {
        int32_t sum = 0;
        for (int j = 0; j < kernel_size; ++j) {
            sum += in_buffer[i + j] * kernel[j];
        }
        out_buffer[i] = sum;
    }
}
```
We can also look at the internal message history to see how we arrived at this response:
for msg in response['history']:
    print(msg['role'])
    print(msg['content'])
system
You are a part of a code generation system for AIE (AI Engines).
* Your job is to write C++ code for a single kernel that will run on an AIE tile.
* Produce only the C++ code for the requested kernel including any required headers and imports.
* Make sure the C++ code is complete and self contained in a single code block.
* Name the function exactly as specified in the request, and output only the kernel (no main(), examples, explanations or extra code).
user
/*
A kernel that performs a 1D convolution operation on an input vector with a given kernel and stride as runtime parameters.
>>> conv1d_int32([2, 5, -10, -7, -7, -3, -1, 9], [1, 7, -4], 1)
[77, -37, -31, -44, -24, -46]
This kernel should be optimized for the following input/output buffer shapes and parameters:
in_buffer size: 256
kernel size: 3
out_buffer size: 254
stride: 1
*/
#include <aie_api/aie.hpp>
void conv1d_int32(int32_t *in_buffer, int32_t *kernel, int32_t *out_buffer, uint32_t stride) {
assistant
```cpp
#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>
void conv1d_int32(int32_t *in_buffer, int32_t *kernel, int32_t *out_buffer, uint32_t stride) {
    const int in_size = 256;
    const int kernel_size = 3;
    const int out_size = 254; // Calculated as (in_size - kernel_size) / stride + 1
    for (int i = 0; i < out_size; i += stride) {
        int32_t sum = 0;
        for (int j = 0; j < kernel_size; ++j) {
            sum += in_buffer[i + j] * kernel[j];
        }
        out_buffer[i] = sum;
    }
}
```
user
Compilation failed with:
clang version 19.0.0 (https://github.com/Xilinx/llvm-aie b2a279c1939604e2ee82a651683dd995decc25ee)
Target: aie2-none-unknown-elf
Thread model: posix
InstalledDir: /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/llvm-aie/bin
Build config: +assertions
Found HIP installation: /usr, version 5.7.31921
(in-process)
"/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/llvm-aie/bin/clang++" -cc1 -triple aie2-none-unknown-elf -emit-obj -disable-free -clear-ast-before-backend -main-file-name kernel.cc -mrelocation-model static -mframe-pointer=none -fmath-errno -ffp-contract=on -fno-rounding-math -mconstructor-aliases -fno-use-init-array -mllvm -vectorize-loops=false -mllvm -vectorize-slp=false -mllvm --two-entry-phi-node-folding-threshold=10 -fno-threadsafe-statics -mllvm -mandatory-inlining-before-opt=false -mllvm -basic-aa-full-phi-analysis=true -mllvm -basic-aa-max-lookup-search-depth=10 -mllvm -enable-loop-iter-count-assumptions=true -debugger-tuning=gdb -fdebug-compilation-dir=/host/notebooks -v -fcoverage-compilation-dir=/host/notebooks -resource-dir /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/llvm-aie/lib/clang/19 -D NDEBUG -I /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include -internal-isystem /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/llvm-aie/bin/../include/aie2-none-unknown-elf/c++/v1 -internal-isystem /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/llvm-aie/bin/../include/c++/v1 -include aiev2intrin.h -D__AIENGINE__ -D__AIEARCH__=20 -nostdsysteminc -internal-externc-isystem /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/llvm-aie/bin/../include/aie2-none-unknown-elf -O2 -Wno-parentheses -Wno-attributes -Wno-macro-redefined -std=c++20 -fdeprecated-macro -ferror-limit 19 -fgnuc-version=4.2.1 -fno-implicit-modules -fskip-odr-check-in-gmf -fcxx-exceptions -fexceptions -vectorize-loops -vectorize-slp -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /host/notebooks/output/kernel.o -x c++ /host/notebooks/output/kernel.cc
clang -cc1 version 19.0.0 based upon LLVM 19.0.0 default target x86_64-unknown-linux-gnu
#include "..." search starts here:
#include <...> search starts here:
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/llvm-aie/bin/../include/aie2-none-unknown-elf/c++/v1
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/llvm-aie/bin/../include/c++/v1
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/llvm-aie/bin/../include/aie2-none-unknown-elf
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/llvm-aie/lib/clang/19/include
End of search list.
In file included from /host/notebooks/output/kernel.cc:1:
In file included from /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/aie.hpp:41:
In file included from /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/abs.hpp:63:
In file included from /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/aie2/abs.hpp:10:
In file included from /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/aie2/../broadcast.hpp:13:
In file included from /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/aie2/../../accum.hpp:10:
In file included from /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/aie2/../../aie_types.hpp:19:
In file included from /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/aie2/../../detail/mdspan.hpp:12:
In file included from /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/aie2/../../detail/../iterator.hpp:12:
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/aie2/../../detail/../detail/array_helpers.hpp:750:9: warning: if statement has empty body [-Wempty-body]
750 | REQUIRES_MSG(elems % Elems == 0, "Array size needs to be a multiple of vector size");
| ^
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/../detail/utils.hpp:764:13: note: expanded from macro 'REQUIRES_MSG'
764 | STATIC_ASSERT_CONSTANT_EXPRESSION(a, m); \
| ^
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/../detail/utils.hpp:750:92: note: expanded from macro 'STATIC_ASSERT_CONSTANT_EXPRESSION'
750 | #define STATIC_ASSERT_CONSTANT_EXPRESSION(e, m) do { if (!chess_manifest(e)) chess_error(m); } while (0)
| ^
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/aie2/../../detail/../detail/array_helpers.hpp:750:9: note: put the semicolon on a separate line to silence this warning
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/../detail/utils.hpp:764:13: note: expanded from macro 'REQUIRES_MSG'
764 | STATIC_ASSERT_CONSTANT_EXPRESSION(a, m); \
| ^
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/detail/../detail/utils.hpp:750:92: note: expanded from macro 'STATIC_ASSERT_CONSTANT_EXPRESSION'
750 | #define STATIC_ASSERT_CONSTANT_EXPRESSION(e, m) do { if (!chess_manifest(e)) chess_error(m); } while (0)
| ^
In file included from /host/notebooks/output/kernel.cc:2:
In file included from /IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/aie_adf.hpp:32:
/IRON/mlir-aie/ironenv/lib/python3.12/site-packages/mlir_aie/include/aie_api/adf/stream.hpp:10:10: fatal error: 'adf.h' file not found
10 | #include <adf.h>
| ^~~~~~~
1 warning and 1 error generated.
assistant
```cpp
#include <cstdint>
void conv1d_int32(int32_t *in_buffer, int32_t *kernel, int32_t *out_buffer, uint32_t stride) {
    const int in_size = 256;
    const int kernel_size = 3;
    const int out_size = 254; // Calculated as (in_size - kernel_size) / stride + 1
    for (int i = 0; i < out_size; i += stride) {
        int32_t sum = 0;
        for (int j = 0; j < kernel_size; ++j) {
            sum += in_buffer[i + j] * kernel[j];
        }
        out_buffer[i] = sum;
    }
}
```
It looks like the model hallucinated some APIs that don’t exist (the aie_adf.hpp header pulls in adf.h, which isn’t available); the coder fed the compiler feedback back to the LLM, and the final code block actually compiled!
Extracting the code#
Note that the response still includes some markdown, i.e. it starts with “```cpp” and ends with “```”. You can use the extract_codeblock helper method to extract only the code – this is used internally by the class to send code to the compiler.
print(coder.extract_codeblock(response['response']))
#include <cstdint>
void conv1d_int32(int32_t *in_buffer, int32_t *kernel, int32_t *out_buffer, uint32_t stride) {
    const int in_size = 256;
    const int kernel_size = 3;
    const int out_size = 254; // Calculated as (in_size - kernel_size) / stride + 1
    for (int i = 0; i < out_size; i += stride) {
        int32_t sum = 0;
        for (int j = 0; j < kernel_size; ++j) {
            sum += in_buffer[i + j] * kernel[j];
        }
        out_buffer[i] = sum;
    }
}
Now we have a code block that can easily be written to a C++ source file and used by all sorts of compilers down the line.
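For example, a minimal sketch using pathlib:
from pathlib import Path
# Save the extracted kernel source for downstream compilation
Path('kernel.cc').write_text(coder.extract_codeblock(response['response']))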
Copyright © 2025 AMD, Inc. SPDX-License-Identifier: MIT