Phi-3 Instruct Open Model#
The Phi-3-Mini-4K-Instruct is a 3.8B-parameter model with a 4K context length. It is a dense, decoder-only Transformer fine-tuned with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to align it with human preferences and safety guidelines. The model uses a vocabulary of 32,064 tokens.
The model has been designed for general-purpose AI systems and applications that require:
memory/compute constrained environments
latency bound scenarios and
strong reasoning (especially math and logic)
🛠️ Supported Hardware#
This notebook can run on a CPU or a GPU.
✅ AMD Instinct™ Accelerators
✅ AMD Radeon™ RX/PRO Graphics Cards
⚠️ AMD EPYC™ Processors
⚠️ AMD Ryzen™ (AI) Processors
Suggested hardware: AMD Instinct™ Accelerators. This notebook can also run on a CPU, but inference on a CPU will be slow.
⚡ Recommended Software Environment#
🎯 Goals#
Show you how to download a model from HuggingFace
Run Phi-3 Instruct on an AMD platform
Prompt the model and explore system and user role prompts
🚀 Run Phi-3 Instruct on an AMD Platform#
Import the necessary packages
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
Check if GPU is available for acceleration.
Note
Running the model on a GPU is strongly recommended. If your device is cpu, token generation will be slow.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'{device=}')
Download model and tokenizer from Hugging Face
model_id = "microsoft/Phi-3-mini-4k-instruct"
torch.random.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map=device,
torch_dtype="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(f'Model size: {model.num_parameters() * model.dtype.itemsize / 1024 / 1024:.2f} MB')
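As a sanity check, the printed size should be close to a back-of-the-envelope estimate: roughly 3.8 billion parameters at 2 bytes each, assuming the model loads in a 16-bit dtype (bf16/fp16) with torch_dtype="auto" (an assumption that depends on your hardware and the checkpoint):

```python
# Rough footprint estimate, assuming 2 bytes per parameter (bf16/fp16).
num_params = 3.8e9
bytes_per_param = 2
size_gb = num_params * bytes_per_param / 1024**3
print(f'Estimated model size: {size_gb:.1f} GB')  # roughly 7.1 GB
```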
Define the pipeline and generation arguments. We use the transformers pipeline API to wrap the model call and pass the user prompt.
We start by creating a pipeline object for the text-generation task, specifying the model and the tokenizer.
generation_args is a helper dictionary that we pass to the pipeline call. It sets parameters such as max_new_tokens (the maximum number of tokens to generate), temperature (higher values make the output more random) and do_sample (if True, the next token is sampled from the output distribution; if False, the most likely token is always chosen).
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
)
generation_args = {
"max_new_tokens": 512,
"return_full_text": False,
"temperature": 0.01,
"do_sample": False,
}
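To make do_sample and temperature concrete, here is a small standalone sketch, independent of the pipeline above (the logits values are made up for illustration). It contrasts greedy decoding with temperature-scaled sampling:

```python
import torch

torch.manual_seed(0)

# Hypothetical logits for a toy vocabulary of 5 tokens.
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])

def next_token(logits, temperature=1.0, do_sample=False):
    """Return the next token id: greedy argmax, or a sample from the
    temperature-scaled softmax distribution."""
    if not do_sample:
        # do_sample=False: always pick the single most likely token.
        return int(torch.argmax(logits))
    # do_sample=True: higher temperature flattens the distribution,
    # making less likely tokens more probable.
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

print(next_token(logits))                                   # always token 0 (greedy)
print(next_token(logits, temperature=2.0, do_sample=True))  # varies between runs
```

This is why the text generation above is deterministic: with do_sample set to False the temperature value has no effect, and the model always emits its most likely continuation.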
Let’s define a system prompt for our model
system_prompt = {"role": "system", "content": "You are a helpful AI assistant."}
Define a prompt asking about a simple math problem
prompt = [
system_prompt,
{"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"}
]
Generate model response
output = pipe(prompt, **generation_args)
print(f'Prompt:\n {prompt[1]["content"]}\n\nResponse:\n{output[0]["generated_text"]}')
The response is good, but we want the model to respond more concisely. For this we use few-shot prompting, in this case one-shot prompting: the prompt fed to the model includes an example of what we would like the response to look like.
In the system prompt we define that the model is a helpful assistant.
Then we provide a user question and an example of how we would like the model to answer it, and finally we include the actual question we want the model to reply to.
messages_oneshot = [
system_prompt,
{"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
{"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
{"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"}
]
Generate model response
output = pipe(messages_oneshot, **generation_args)
print(f'Prompt:\n {messages_oneshot[3]["content"]}\n\nResponse:\n{output[0]["generated_text"]}')
Note how the response from the model is more concise now.
Tip
Exercise for the reader: modify the generation_args configuration, for instance increase the value of temperature and set do_sample to True. What is the outcome?
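One way to set up the exercise is sketched below (the exact output will vary from run to run once sampling is enabled; the 1.5 temperature value is just one choice to experiment with):

```python
# Modified generation arguments for the exercise: enable sampling and
# raise the temperature to make the output more varied.
generation_args_sampling = {
    "max_new_tokens": 512,
    "return_full_text": False,
    "temperature": 1.5,   # higher temperature => more randomness
    "do_sample": True,    # sample instead of greedy decoding
}

# Reuse the pipeline defined above, e.g.:
# output = pipe(messages_oneshot, **generation_args_sampling)
# print(output[0]["generated_text"])
```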
Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.
SPDX-License-Identifier: MIT