DistilBERT for Sentiment Analysis#
DistilBERT is a condensed version of BERT created by Hugging Face:
📉 40% fewer parameters: DistilBERT is a lighter model, offering significant speed and resource advantages.
⚡ 60% faster inference: Ideal for real-time applications.
📊 95% of BERT's performance: Achieves near-parity on benchmarks like GLUE, making it highly efficient for natural language understanding tasks.
🛠️ Supported Hardware#
This notebook can run on a CPU or a GPU.
✅ AMD Instinct™ Accelerators
✅ AMD Radeon™ RX/PRO Graphics Cards
Suggested hardware: AMD Instinct™ Accelerators. This notebook may not run on a CPU if your system does not have enough memory.
⚡ Recommended Software Environment#
🎯 Goals#
Fine-tune DistilBERT, a lightweight transformer model, to perform sentiment analysis on a dataset of movie reviews.
Take advantage of DistilBERTās efficiency to achieve fast, accurate sentiment classification with fewer parameters.
💡 Problem#
The goal is to accurately classify movie reviews into positive and negative sentiments.
We will:
Load and preprocess the dataset, splitting it into training, validation, and test sets.
Use the open-source transformers library from Hugging Face to tokenize text and load the model.
Train DistilBERT and evaluate its performance on unseen data, tracking accuracy on the validation and test sets.
See also
Hugging Face transformers Library Documentation - Explore the open-source library used for NLP model development.
Understanding BERT and DistilBERT: DistilBERT Research Paper - Read the original paper for an in-depth understanding of the model distillation techniques used to create DistilBERT.
Import Packages#
Run the following cell to import all the packages needed to run training and inference with DistilBERT.
import gzip
import shutil
import pandas as pd
import requests
import os
import torch
import torch.nn.functional as F
# Importing the Hugging Face transformers library for handling DistilBERT and related NLP tasks
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification
Preparing the Dataset#
We will download the movie reviews dataset (compressed in .gz format), extract it, and load it into a Pandas DataFrame for further processing.
This dataset will be used to fine-tune our DistilBERT model for sentiment analysis.
url = "https://github.com/rasbt/machine-learning-book/raw/main/ch08/movie_data.csv.gz"
filename = os.path.join('datasets', 'movie_data', url.split("/")[-1])

# Create the target directory (including parents) if it does not exist yet
os.makedirs(os.path.dirname(filename), exist_ok=True)

# Download the compressed dataset
with open(filename, "wb") as f:
    r = requests.get(url)
    f.write(r.content)

# Strip the trailing '.gz' so 'movie_data.csv.gz' becomes 'movie_data.csv'
csv_file = filename[:-len('.gz')]
with gzip.open(filename, 'rb') as f_in:
    with open(csv_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
Load the extracted CSV file into a Pandas DataFrame and display the first three rows.
df = pd.read_csv(csv_file)
df.head(3)
Prepare the Dataset for Training#
We will split the dataset into three parts: training, validation, and test sets, selecting the 'review' texts and the corresponding 'sentiment' labels for each set.
Training set: First 35,000 reviews and labels
Validation set: Next 5,000 reviews and labels
Test set: Remaining reviews and labels
Finally, we print the sizes of each dataset split.
train_texts = df.iloc[:35000]['review'].values
train_labels = df.iloc[:35000]['sentiment'].values
val_texts = df.iloc[35000:40000]['review'].values
val_labels = df.iloc[35000:40000]['sentiment'].values
test_texts = df.iloc[40000:]['review'].values
test_labels = df.iloc[40000:]['sentiment'].values
print(f'Training reviews: {len(train_texts):,}, validation reviews: {len(val_texts):,}, test reviews: {len(test_texts):,}')
Define the device for training#
Set the CPU or GPU for model training (depending on availability) and ensure reproducibility by fixing random seeds.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.backends.cudnn.deterministic = True
torch.manual_seed(123)
print(device)
Tokenize the Reviews#
With the splits ready, we will tokenize the review texts using the DistilBERT tokenizer. The idea is to convert the text data into a format that DistilBERT can understand.
Each review text is encoded into input IDs and an attention mask. truncation=True ensures that sequences longer than the model's maximum input length are truncated, and padding=True pads shorter sequences so that all encoded sequences in the split share the same length. We also move the resulting tensors to the device (CPU or GPU) defined earlier, and do the same for the validation and test datasets.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, return_tensors="pt").to(device)
val_encodings = tokenizer(list(val_texts), truncation=True, padding=True, return_tensors="pt").to(device)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True, return_tensors="pt").to(device)
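To make the effect of truncation=True and padding=True concrete, here is a minimal pure-Python sketch of the idea (this is not the real tokenizer; the token IDs and the max_length=8 value are made up for illustration):

```python
def pad_and_truncate(token_id_lists, max_length=8, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest length."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

ids, mask = pad_and_truncate([[101, 2023, 3185, 102], [101, 2307, 102]])
print(ids)   # [[101, 2023, 3185, 102], [101, 2307, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The real tokenizer additionally handles subword splitting and special tokens, but the padding and masking logic it produces follows this shape.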
Class to encapsulate the encodings and the labels#
In this section, we create a PyTorch dataset class that encapsulates the tokenized encodings and their corresponding labels for the IMDb data. This class will be used to create DataLoader objects for training and evaluation.
__getitem__: retrieve an item at a specific index
__len__: return the length of the dataset
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Copy each encoding tensor for this index; clone().detach() avoids
        # sharing storage (and gradient history) with the original tensors
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
Create the Dataloader objects that will be used in the training loop#
First, we create instances of the IMDbDataset for training, validation, and test datasets. This wraps the encodings and labels into dataset objects that can be easily used with the DataLoader.
Then, we create DataLoader objects for each split. The DataLoader will handle batching, shuffling, and parallel data loading during training and evaluation.
batch_size=16 means that each batch will contain 16 samples. shuffle=True ensures that the data is shuffled every epoch to improve model generalization; this is used for the training set only.
Validation and test sets are not shuffled, as we want to evaluate the model on the same data order each time.
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)
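As a quick sanity check of the batching, the same pattern can be exercised with tiny made-up tensors (the dataset class is repeated here in condensed form so the snippet runs on its own; all shapes and values are illustrative):

```python
import torch

class TinyDataset(torch.utils.data.Dataset):
    """Condensed stand-in for IMDbDataset above, for demonstration only."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Four fake samples with sequence length 6
dummy_encodings = {
    'input_ids': torch.randint(0, 1000, (4, 6)),
    'attention_mask': torch.ones(4, 6, dtype=torch.long),
}
dummy_labels = [0, 1, 1, 0]

dummy_loader = torch.utils.data.DataLoader(TinyDataset(dummy_encodings, dummy_labels),
                                           batch_size=2, shuffle=False)
batch = next(iter(dummy_loader))
print(batch['input_ids'].shape)  # torch.Size([2, 6])
print(batch['labels'])           # tensor([0, 1])
```

Each batch is a dict whose tensors are stacked along the first dimension, which is exactly the shape the training loop below consumes.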
Get a Pretrained DistilBERT Model and Fine-tune It#
In this section, we will load a pretrained DistilBERT model for sequence classification. We will set up the model for training and specify the optimizer. We will then train the model for a defined number of epochs, logging the loss during the training process.
We download the pretrained DistilBERT model from Hugging Face's model hub, specifically the distilbert-base-uncased variant, a smaller, faster version of BERT, and move the model to the specified device (CPU or GPU). To drive the training process, we use the Adam optimizer with a learning rate of 5e-5.
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
optim = torch.optim.Adam(model.parameters(), lr=5e-5)
Model Fine-tuning Loop#
First, we define the number of epochs to train the model, set to 2 (you can increase this value). We also create a list to track the training loss, which we will use to understand how well the model is learning. Finally, we set the model to training mode with model.train(), which enables training-time behaviors such as dropout.
In the training loop, we iterate over the training DataLoader, which provides batches of data. For each batch, we get the input IDs, attention masks, and labels and move them to the specified device (CPU or GPU). We then invoke the model with these inputs and extract the loss and logits from the model's output. The loss is used to perform backpropagation and update the model's weights via the optimizer. Finally, we log the training loss for each batch and print a message every 250 batches to monitor progress.
To conclude, we compute the training time.
import time
epochs = 2
losses = []
model.train()
start_time = time.time()
for epoch in range(epochs):
    for batch_idx, batch in enumerate(train_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass: the model returns the loss when labels are provided
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss, logits = outputs['loss'], outputs['logits']

        # Backward pass
        optim.zero_grad()
        loss.backward()
        optim.step()

        losses.append(loss.item())
        if batch_idx % 250 == 0:
            print(f'Epoch: {epoch+1:02d}/{epochs:02d} | Batch: {batch_idx:04d}/{len(train_loader):04d} | Loss: {loss:.4f}')
train_time = time.time() - start_time
print(f'It took {train_time/60:.2f} minutes to fine-tune the DistilBERT model for {epochs} epochs')
After training, we can visualize the training loss over batches to understand how the model learned. This helps us see whether the model is converging or whether there are issues such as instability or overfitting.
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(losses, label='Training Loss', color='blue')
plt.title('Training Loss Over Time')
plt.xlabel('Batch Number')
plt.ylabel('Loss')
plt.legend()
plt.grid()
plt.show()
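Per-batch loss curves are noisy. A simple moving average, sketched here in plain Python (the window size is an arbitrary choice), often makes the trend easier to read:

```python
def moving_average(values, window=50):
    """Return the running mean of `values` over a sliding window."""
    averaged = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

# Example with a short, made-up loss sequence and a small window
print(moving_average([4.0, 2.0, 3.0, 1.0], window=2))  # [4.0, 3.0, 2.5, 2.0]
```

The smoothed curve could be plotted alongside the raw one, e.g. plt.plot(moving_average(losses)).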
We can now compute the accuracy of the model on the different splits. To do this, we define a function that takes the model, the data loader, and the device.
In the function, we first disable gradient tracking with torch.no_grad(); gradients are not needed for evaluation, so this saves computation. We keep counters for the correctly predicted and total number of examples. We iterate over the batches in data_loader; for each batch, we get the input IDs, attention masks, and labels and move them to the specified device (CPU or GPU). We then invoke the model, extract the logits from its output, take the argmax to obtain predicted_labels, compare them against the actual labels, and count the correct ones. Finally, we print the counts and return the accuracy.
def compute_accuracy(model, data_loader, device):
    """Compute the accuracy of the model on the given data loader.

    Args:
        model: The trained model.
        data_loader: DataLoader for the dataset (train, validation, or test).
        device: The device (CPU or GPU) on which the model and data are loaded.

    Returns:
        float: The accuracy as a percentage.
    """
    model.eval()  # disable dropout so evaluation is deterministic
    with torch.no_grad():
        correct_pred, num_examples = 0, 0
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs['logits']  # Get the logits from the model output
            predicted_labels = torch.argmax(logits, dim=1)

            num_examples += labels.size(0)
            correct_pred += (predicted_labels == labels).sum()
        print(f'{correct_pred=} {num_examples=}')
        return correct_pred.item() / num_examples * 100
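The core of this computation, argmax over logits compared against labels, can be seen in isolation with made-up values (three examples, two classes; the logits below are arbitrary illustrative numbers):

```python
import torch

logits = torch.tensor([[ 2.0, -1.0],   # predicted class 0
                       [-0.5,  1.5],   # predicted class 1
                       [ 0.3,  0.1]])  # predicted class 0
labels = torch.tensor([0, 1, 1])

predicted = torch.argmax(logits, dim=1)  # tensor([0, 1, 0])
accuracy = (predicted == labels).sum().item() / labels.size(0) * 100
print(f'{accuracy:.2f}%')  # 66.67%
```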
Now, we can call this function for the various splits to get the accuracy.
train_accuracy = compute_accuracy(model, train_loader, device)
val_accuracy = compute_accuracy(model, val_loader, device)
test_accuracy = compute_accuracy(model, test_loader, device)
print(f'Training accuracy: {train_accuracy:.2f}%\nValidation accuracy: {val_accuracy:.2f}%\nTest accuracy: {test_accuracy:.2f}%')
To showcase the fine-tuned model we will define a function that takes the index of a review as input and returns the sentiment prediction for that review. First, we make sure the index is within bounds. Then, we tokenize the review text and move it to the device. We then perform the sentiment prediction and return the predicted sentiment label.
from tabulate import tabulate
def sentiment_evaluation(index):
    # Clamp out-of-range indices to the last valid review
    if index >= len(test_texts):
        index = len(test_texts) - 1
    sample_eval = tokenizer(test_texts[index], truncation=True, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**sample_eval).logits  # Get the logits from the model output
    # Return the index of the highest logit value as the predicted sentiment
    return logits.argmax().item()
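The function above returns only the hard label. If a confidence score is also wanted, the logits can be passed through a softmax, using the torch.nn.functional import from the top of the notebook (the logit values below are made up for illustration):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-1.2, 2.3]])  # illustrative values, shape (1, num_labels)
probs = F.softmax(logits, dim=1)      # probabilities that sum to 1 per row
label = probs.argmax(dim=1).item()
confidence = probs[0, label].item()
print(label, round(confidence, 3))    # 1 0.971
```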
Now, we can test the function with a few random examples. With the predicted sentiment labels we create a DataFrame to display the results, which we then render with tabulate.
results = []
for idx in torch.randint(0, 10000, (10,)).tolist():
    sent = sentiment_evaluation(idx)
    actual_label = test_labels[idx]
    results.append({'Index': idx, 'Predicted Sentiment': sent, 'Actual Label': actual_label})

results_df = pd.DataFrame(results)
print(tabulate(results_df, headers='keys', tablefmt='fancy_grid'))
Finally, you can manually enter the index of a review from the table above to see the corresponding sentence. Type 'exit' to leave the loop.
while True:
    user_input = input("Enter the index number from the table above to see the corresponding sentence or type 'exit' to quit: ")
    if user_input.lower() == 'exit':
        print("Exiting the program.")
        break
    try:
        user_input_index = int(user_input)
        if user_input_index in results_df['Index'].values:
            predicted_sentiment_new = sentiment_evaluation(user_input_index)
            actual_label_new = test_labels[user_input_index]
            sentence_new = test_texts[user_input_index]
            print(f'Index: {user_input_index}')
            print(f'Predicted Sentiment: {predicted_sentiment_new}')
            print(f'Actual Label: {actual_label_new}')
            print(f'Sentence: {sentence_new}')
        else:
            print("The entered index is not found in the results.")
    except ValueError:
        print("Invalid input. Please enter a valid index number or type 'exit' to quit.")
Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved. Portions of this file consist of AI-generated content.
SPDX-License-Identifier: MIT