OpenAI Whisper - Speech Recognition#
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. We are going to use Whisper for audio transcription.
🛠️ Supported Hardware#
This notebook can run on a CPU or a GPU.
✅ AMD Instinct™ Accelerators
✅ AMD Radeon™ RX/PRO Graphics Cards
✅ AMD EPYC™ Processors
✅ AMD Ryzen™ (AI) Processors
Suggested hardware: AI PC powered by AMD Ryzen™ AI Processors
⚡ Recommended Software Environment#
🎯 Goals#
Show you how to download a model from Hugging Face
Run OpenAI Whisper on an AMD platform
Get OpenAI Whisper to transcribe an audio file
🚀 Run OpenAI Whisper on an AMD Platform#
Import the necessary packages
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
from IPython.display import Audio
Load the model and processor from Hugging Face
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None
print(f'Model size: {model.num_parameters() * model.dtype.itemsize / 1024 / 1024:.2f} MB')
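On an AMD GPU, ROCm builds of PyTorch expose the accelerator through the same `torch.cuda` API, so the usual device check works unchanged. A minimal sketch (the `pick_device` helper is our own, not part of Transformers) for moving the model to the GPU when one is available:

```python
import torch

# Hypothetical helper: on a ROCm-enabled AMD GPU, torch.cuda.is_available()
# returns True, just as it does on NVIDIA hardware.
def pick_device() -> str:
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
```

You can then place the model with `model.to(device)` and move each `input_features` tensor to the same device before calling `model.generate`.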
Let’s load a test audio file
Note
Dataset Download Disclaimer
By executing the next cell, you will initiate the download of the dataset `hf-internal-testing/librispeech_asr_dummy`. Please note that this dataset may include content subject to third-party ownership or licensing restrictions. By proceeding, you acknowledge and agree to the following:
You are solely responsible for reviewing and complying with any applicable terms of use, licenses, or permissions required by the dataset owner.
If explicit permission is required from the original owner or provider, you must obtain that permission before using the dataset for any purpose, including research, analysis, or redistribution.
AMD Inc. is not distributing the dataset and is providing a link solely for your convenience. AMD Inc. does not grant any rights to the dataset and disclaims all liability for misuse or unauthorized access. If you are uncertain about the licensing or permission requirements, please consult the dataset documentation or contact the dataset owner directly.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
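Whisper's feature extractor expects mono audio sampled at 16 kHz. The LibriSpeech samples already meet that requirement, but if you bring your own recordings at a different rate you need to resample first. A minimal linear-interpolation sketch with NumPy (for real use, prefer a proper resampler such as `torchaudio` or `librosa.resample`; this helper is illustrative only):

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, orig_rate: int) -> np.ndarray:
    """Naive linear-interpolation resampling to Whisper's expected 16 kHz."""
    if orig_rate == 16000:
        return audio
    duration = len(audio) / orig_rate
    n_out = int(duration * 16000)
    old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(new_t, old_t, audio).astype(np.float32)

# One second of 8 kHz audio becomes 16000 samples at 16 kHz.
audio_8k = np.zeros(8000, dtype=np.float32)
audio_16k = resample_to_16k(audio_8k, 8000)
```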
We use the processor to convert the raw waveform into the input features (a log-Mel spectrogram) that the model expects
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
print(input_features)
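The printed tensor has a fixed shape regardless of clip length: the processor pads or truncates the audio to 30 seconds and computes 80 Mel bins with a 160-sample hop at 16 kHz, giving `(batch, 80, 3000)` for `whisper-small`. The frame count follows from simple arithmetic:

```python
# 30 s of 16 kHz audio, one spectrogram frame every 160 samples (10 ms hop)
SECONDS, SAMPLE_RATE, HOP_LENGTH = 30, 16000, 160
n_frames = SECONDS * SAMPLE_RATE // HOP_LENGTH  # 3000 frames
n_mels = 80  # Mel bins used by whisper-small
```

This is why short clips are zero-padded: the encoder always sees the same 30-second window.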
Let’s get the model to generate the output tokens, which we can then decode back into text with the processor
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
Compare the transcript with the actual audio
print(transcription)
Audio(data=sample['array'], rate=sample['sampling_rate'])
Let’s try with a different audio file
sample = ds[9]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
Compare the transcript with the actual audio
print(transcription)
Audio(data=sample['array'], rate=sample['sampling_rate'])
Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.
SPDX-License-Identifier: MIT