OpenAI Whisper - Speech Recognition#

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. In this notebook, we use Whisper to transcribe an audio file.

🛠️ Supported Hardware#

This notebook can run on a CPU or on a GPU.

✅ AMD Instinct™ Accelerators
✅ AMD Radeon™ RX/PRO Graphics Cards
✅ AMD EPYC™ Processors
✅ AMD Ryzen™ (AI) Processors

Suggested hardware: AI PC powered by AMD Ryzen™ AI Processors
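
To check which device is available before running the notebook, you can use a short sketch like the following. It assumes PyTorch is installed; on AMD GPUs, the ROCm build of PyTorch exposes the accelerator through the same "cuda" device string used elsewhere, so the same check covers both CPU-only and GPU setups.

```python
import torch

# ROCm builds of PyTorch report AMD GPUs via torch.cuda.is_available(),
# so this selects the GPU when present and falls back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")
```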

🎯 Goals#

  • Show how to download a model from Hugging Face

  • Run OpenAI Whisper on an AMD platform

  • Get OpenAI Whisper to transcribe an audio file

🚀 Run OpenAI Whisper on an AMD Platform#

Import the necessary packages

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
from IPython.display import Audio

Load the model and processor from Hugging Face

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None
print(f'Model size: {model.num_parameters() * model.dtype.itemsize / 1024 / 1024:.2f} MB')
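
As a rough sanity check of the printed size, whisper-small has on the order of 242 million parameters (an approximate figure; the exact count comes from `model.num_parameters()`), each stored as a 4-byte float32 value:

```python
# Approximate size of whisper-small, assuming ~242M float32 parameters.
approx_params = 242_000_000  # assumption, not the exact count
bytes_per_param = 4          # float32
size_mb = approx_params * bytes_per_param / 1024 / 1024
print(f"Approximate model size: {size_mb:.0f} MB")  # on the order of 900 MB
```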

Let’s load a test audio file

Note

Dataset Download Disclaimer

By executing the next cell, you will initiate the download of the dataset `hf-internal-testing/librispeech_asr_dummy`. Please note that this dataset may include content subject to third-party ownership or licensing restrictions. By proceeding, you acknowledge and agree to the following:

  • You are solely responsible for reviewing and complying with any applicable terms of use, licenses, or permissions required by the dataset owner.

  • If explicit permission is required from the original owner or provider, you must obtain that permission before using the dataset for any purpose, including research, analysis, or redistribution.

  • AMD Inc. is not distributing the dataset and is providing a link solely for your convenience. AMD Inc. does not grant any rights to the dataset and disclaims all liability for misuse or unauthorized access. If you are uncertain about the licensing or permission requirements, please consult the dataset documentation or contact the dataset owner directly.

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
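
Each audio sample in the dataset is a dictionary holding the waveform under "array" and its "sampling_rate" (LibriSpeech audio is 16 kHz). The following stand-in (a synthetic tone, not real dataset content) illustrates that structure:

```python
import numpy as np

# A stand-in with the same structure as ds[0]["audio"]: a float waveform
# under "array" plus its "sampling_rate".
sampling_rate = 16_000
t = np.linspace(0, 1, sampling_rate, endpoint=False)
fake_sample = {
    "array": 0.1 * np.sin(2 * np.pi * 440 * t),  # one second of a 440 Hz tone
    "sampling_rate": sampling_rate,
}
print(fake_sample["array"].shape, fake_sample["sampling_rate"])
```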

We use the processor to generate the input features that we will feed to the model

input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
print(input_features)
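
The printed tensor has shape (1, 80, 3000). Assuming Whisper's published front-end, this follows from the audio being padded or truncated to 30 seconds at 16 kHz and converted to a log-mel spectrogram with 80 mel bins and a hop length of 160 samples:

```python
# Derivation of the (80, 3000) feature shape from Whisper's front-end
# parameters (assumed values: 30 s clips, 16 kHz audio, hop length 160).
sampling_rate = 16_000
clip_seconds = 30
hop_length = 160
n_mels = 80
n_frames = clip_seconds * sampling_rate // hop_length
print(n_mels, n_frames)  # 80 3000
```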

Let’s get the model to generate the output tokens, which we can then decode with the processor’s batch_decode function

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
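
Conceptually, batch_decode maps each predicted token id back to a text fragment and, with skip_special_tokens=True, drops control markers such as the start- and end-of-transcript tokens. A toy illustration (the vocabulary and ids here are invented, not Whisper's real tokenizer):

```python
# Toy sketch of id-to-text decoding with special tokens skipped.
toy_vocab = {0: "<|startoftranscript|>", 1: " Hello", 2: " world", 3: "<|endoftext|>"}
special_ids = {0, 3}          # markers to drop, as skip_special_tokens=True does
predicted = [[0, 1, 2, 3]]    # one sequence of token ids
decoded = ["".join(toy_vocab[i] for i in seq if i not in special_ids).strip()
           for seq in predicted]
print(decoded)  # ['Hello world']
```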

Compare the transcript with the actual audio

print(transcription)
Audio(data=sample['array'], rate=sample['sampling_rate'])

Let’s try with a different audio file

sample = ds[9]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

Compare the transcript with the actual audio

print(transcription)
Audio(data=sample['array'], rate=sample['sampling_rate'])

Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.

SPDX-License-Identifier: MIT