Phi-3 Instruct Open Model#

The Phi-3-Mini-4K-Instruct is a 3.8B-parameter model with a 4K context length. It is a dense, decoder-only Transformer fine-tuned with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to align it with human preferences and safety guidelines. The model supports a vocabulary of 32,064 tokens.

The model is designed for general-purpose AI systems and applications that require:

  • memory/compute constrained environments

  • latency bound scenarios and

  • strong reasoning (especially math and logic)

🛠️ Supported Hardware#

This notebook can run on a CPU or a GPU.

✅ AMD Instinct™ Accelerators
✅ AMD Radeon™ RX/PRO Graphics Cards
⚠️ AMD EPYC™ Processors
⚠️ AMD Ryzen™ (AI) Processors

Suggested hardware: AMD Instinct™ Accelerators. This notebook can also run on a CPU, but inference on a CPU will be slow.

🎯 Goals#

  • Show you how to download a model from Hugging Face

  • Run Phi-3 Instruct on an AMD platform

  • Prompt the model and explore system and user role prompts

🚀 Run Phi-3 Instruct on an AMD Platform#

Import the necessary packages

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

Check if GPU is available for acceleration.

Note

Running the model on a GPU is strongly recommended. If your device is cpu, token generation will be slow.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'{device=}')

Download model and tokenizer from Hugging Face

model_id = "microsoft/Phi-3-mini-4k-instruct"

torch.random.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    torch_dtype="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
print(f'Model size: {model.num_parameters() * model.dtype.itemsize / 1024 / 1024:.2f} MB')
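As a sanity check, the printed size can be estimated by hand. The model has roughly 3.8B parameters, and with torch_dtype="auto" the weights typically load in bfloat16 (2 bytes per parameter); treat this as a back-of-the-envelope sketch, since the exact parameter count and dtype come from the checkpoint itself:

```python
# Rough estimate of the model's memory footprint.
# 3.8e9 parameters is an approximation; bfloat16 uses 2 bytes per parameter.
num_params = 3.8e9
bytes_per_param = 2  # bfloat16
size_gb = num_params * bytes_per_param / 1024**3
print(f"Estimated size: {size_gb:.2f} GB")  # roughly 7 GB
```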

Define pipeline and generation arguments. We use the transformers pipeline API to create the model call and pass the user prompt.

We start by creating a pipeline object for the text-generation task, specifying the model and the tokenizer.

generation_args is a helper dictionary that we pass to the pipeline call. It sets parameters such as the maximum number of new tokens, the temperature (how "creative" the model is) and do_sample (when True, the model samples from the output token distribution instead of always picking the most likely token).

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 512,
    "return_full_text": False,
    "temperature": 0.01,
    "do_sample": False,
}
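To build intuition for what temperature and do_sample control, here is a minimal, self-contained sketch using toy next-token scores (not the actual Phi-3 output head): greedy decoding always picks the highest-scoring token, while sampling draws from the temperature-scaled softmax distribution.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy next-token scores

# do_sample=False: greedy decoding, always the most likely token
greedy_token = logits.index(max(logits))

# do_sample=True: draw from the temperature-scaled distribution;
# higher temperature flattens the distribution, making output more varied
random.seed(0)
probs = softmax(logits, temperature=1.5)
sampled_token = random.choices(range(len(logits)), weights=probs)[0]

print(greedy_token, sampled_token)
```

Greedy decoding is deterministic, which is why the tutorial sets do_sample to False and a near-zero temperature for reproducible answers.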

Let’s define a system prompt for our model

system_prompt = {"role": "system", "content": "You are a helpful AI assistant."}

Define a simple prompt asking the model to solve a basic math problem

prompt = [
    system_prompt,
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"}
]
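Under the hood, the pipeline uses the tokenizer's chat template to turn this list of role/content dictionaries into a single string with special tokens. The following is only an illustrative sketch of Phi-3-style formatting; the authoritative template ships with the tokenizer and is applied via tokenizer.apply_chat_template:

```python
def format_phi3_chat(messages):
    """Illustrative Phi-3-style chat formatting. The real template
    comes from tokenizer.apply_chat_template, not this function."""
    text = ""
    for msg in messages:
        text += f"<|{msg['role']}|>\n{msg['content']}<|end|>\n"
    return text + "<|assistant|>\n"  # cue the model to respond

demo = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"},
]
print(format_phi3_chat(demo))
```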

Generate model response

output = pipe(prompt, **generation_args) 
print(f'Prompt:\n {prompt[1]["content"]}\n\nResponse:\n{output[0]["generated_text"]}')

The response is good, but we want the model to respond more concisely. For this we use few-shot prompting, specifically one-shot prompting: in the prompt fed to the model, we provide an example of how we would like the response to look.

  • In the system prompt we define that the model is a helpful assistant

  • Then we provide an example user question and an example of how we would like the model to answer, and finally we include the actual question we want the model to answer.

messages_oneshot = [
    system_prompt,
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"}
]

Generate model response

output = pipe(messages_oneshot, **generation_args) 
print(f'Prompt:\n {messages_oneshot[3]["content"]}\n\nResponse:\n{output[0]["generated_text"]}')

Note how the response from the model is more concise now.

Tip

Exercise for the reader: modify the generation_args configuration, for instance increase the value of temperature (the maximum is 2.0) and set do_sample to True. What is the outcome?
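One way to set up the exercise is to copy generation_args and override the sampling fields. The values below are just a hypothetical starting point:

```python
# Hypothetical variant of generation_args for the exercise:
# a higher temperature plus sampling makes the output less deterministic.
sampling_args = {
    "max_new_tokens": 512,
    "return_full_text": False,
    "temperature": 1.5,   # try values up to 2.0
    "do_sample": True,
}
# output = pipe(messages_oneshot, **sampling_args)
print(sampling_args["temperature"], sampling_args["do_sample"])
```

Re-running the same prompt several times with these settings should produce noticeably different responses each time.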


Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.

SPDX-License-Identifier: MIT