DistilBERT for Sentiment Analysis#
DistilBERT is a condensed version of BERT created by Hugging Face:
📉 40% fewer parameters: DistilBERT is a lighter model, offering significant speed and resource advantages.
⚡ 60% faster inference: Ideal for real-time applications.
📊 95% of BERT's performance: Achieves near-parity on benchmarks like GLUE, making it highly efficient for natural language understanding tasks.
🛠️ Supported Hardware#
This notebook can run on a CPU or a GPU.
✅ AMD Instinct™ Accelerators
✅ AMD Radeon™ RX/PRO Graphics Cards
Suggested hardware: AMD Instinct™ Accelerators. This notebook may not run on a CPU if your system does not have enough memory.
⚡ Recommended Software Environment#
🎯 Goals#
Fine-tune DistilBERT, a lightweight transformer model, to perform sentiment analysis on a dataset of movie reviews.
Take advantage of DistilBERTās efficiency to achieve fast, accurate sentiment classification with fewer parameters.
💡 Problem#
The goal is to accurately classify movie reviews into positive and negative sentiments.
We will:
Load and preprocess the dataset, splitting it into training, validation, and test sets.
Use the open-source transformers library from Hugging Face to tokenize text and load the model.
Train DistilBERT and evaluate its performance on unseen data, tracking accuracy on the validation and test sets.
See also
Hugging Face transformers Library Documentation - Explore the open-source library used for NLP model development.
Understanding BERT and DistilBERT: DistilBERT Research Paper - Read the original paper for an in-depth understanding of the model distillation techniques used to create DistilBERT.
Import Packages#
Run the following cell to import all the packages needed to run training and inference with DistilBERT.
import gzip
import shutil
import pandas as pd
import requests
import os
import torch
import torch.nn.functional as F
# Importing the Hugging Face transformers library for handling DistilBERT and related NLP tasks
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification
Preparing the Dataset#
We will download the movie reviews dataset (compressed in .gz format), extract it, and load it into a Pandas DataFrame for further processing.
This dataset will be used to fine-tune our DistilBERT model for sentiment analysis.
url = "https://github.com/rasbt/machine-learning-book/raw/main/ch08/movie_data.csv.gz"
filename = os.path.join('datasets', 'movie_data', url.split("/")[-1])

# Create the target directory (including parents) if it does not exist yet
os.makedirs(os.path.dirname(filename), exist_ok=True)

# Download the compressed dataset
with open(filename, "wb") as f:
    r = requests.get(url)
    f.write(r.content)

# Strip the trailing '.gz' so 'movie_data.csv.gz' becomes 'movie_data.csv'
csv_file = filename[:-len('.gz')]
with gzip.open(filename, 'rb') as f_in:
    with open(csv_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
Load the extracted CSV file into a Pandas DataFrame and display the first three rows.
df = pd.read_csv(csv_file)
df.head(3)
Prepare the Dataset for Training#
We will split the dataset into three parts: training, validation, and test sets, selecting the 'review' texts and the corresponding 'sentiment' labels for each set.
Training set: First 35,000 reviews and labels
Validation set: Next 5,000 reviews and labels
Test set: Remaining reviews and labels
Finally, we print the sizes of each dataset split.
train_texts = df.iloc[:35000]['review'].values
train_labels = df.iloc[:35000]['sentiment'].values
val_texts = df.iloc[35000:40000]['review'].values
val_labels = df.iloc[35000:40000]['sentiment'].values
test_texts = df.iloc[40000:]['review'].values
test_labels = df.iloc[40000:]['sentiment'].values
print(f'Training reviews: {len(train_texts):,}, validation reviews: {len(val_texts):,}, test reviews: {len(test_texts):,}')
Define the device for training#
Set the CPU or GPU for model training (depending on availability) and ensure reproducibility by fixing random seeds.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.backends.cudnn.deterministic = True
torch.manual_seed(123)
print(device)
Tokenize the Reviews#
With the splits ready, we will tokenize the review texts using the DistilBERT tokenizer. The idea is to convert the text data into a format that DistilBERT can understand.
Each review text is encoded into input IDs and an attention mask. truncation=True ensures that sequences longer than the model's maximum input length are truncated, and padding=True pads shorter sequences so that all encoded sequences in the split share the same length. We also move the resulting tensors to the device (CPU or GPU) defined earlier, and do the same for the validation and test datasets.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, return_tensors="pt").to(device)
val_encodings = tokenizer(list(val_texts), truncation=True, padding=True, return_tensors="pt").to(device)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True, return_tensors="pt").to(device)
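To make the effect of truncation=True and padding=True concrete, here is a minimal pure-Python sketch of the idea (this is not the real tokenizer; the token IDs and the max_length=8 value are made up for illustration):

```python
def pad_and_truncate(token_id_lists, max_length=8, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest length."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

ids, mask = pad_and_truncate([[101, 2023, 3185, 102], [101, 2307, 102]])
print(ids)   # [[101, 2023, 3185, 102], [101, 2307, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The real tokenizer additionally handles subword splitting and special tokens, but the padding and masking logic it produces follows this shape.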
Class to encapsulate the encodings and the labels#
In this section, we create a PyTorch dataset class that encapsulates the tokenized encodings and their corresponding labels for the IMDb data. This class will be used to create DataLoader objects for training and evaluation.
__getitem__: retrieve an item at a specific index
__len__: return the length of the dataset
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Copy each encoding tensor for this index; clone().detach() avoids
        # sharing storage (and gradient history) with the original tensors
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
Create the Dataloader objects that will be used in the training loop#
First, we create instances of the IMDbDataset for training, validation, and test datasets. This wraps the encodings and labels into dataset objects that can be easily used with the DataLoader.
Then, we create DataLoader objects for each split. The DataLoader will handle batching, shuffling, and parallel data loading during training and evaluation.
batch_size=16 means that each batch will contain 16 samples. shuffle=True ensures that the data is shuffled every epoch to improve model generalization; this is used for the training set only.
Validation and test sets are not shuffled, as we want to evaluate the model on the same data order each time.
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)
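As a quick sanity check of the batching, the same pattern can be exercised with tiny made-up tensors (the dataset class is repeated here in condensed form so the snippet runs on its own; all shapes and values are illustrative):

```python
import torch

class TinyDataset(torch.utils.data.Dataset):
    """Condensed stand-in for IMDbDataset above, for demonstration only."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Four fake samples with sequence length 6
dummy_encodings = {
    'input_ids': torch.randint(0, 1000, (4, 6)),
    'attention_mask': torch.ones(4, 6, dtype=torch.long),
}
dummy_labels = [0, 1, 1, 0]

dummy_loader = torch.utils.data.DataLoader(TinyDataset(dummy_encodings, dummy_labels),
                                           batch_size=2, shuffle=False)
batch = next(iter(dummy_loader))
print(batch['input_ids'].shape)  # torch.Size([2, 6])
print(batch['labels'])           # tensor([0, 1])
```

Each batch is a dict whose tensors are stacked along the first dimension, which is exactly the shape the training loop below consumes.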
Get a Pretrained DistilBERT Model and Fine-tune It#
In this section, we will load a pretrained DistilBERT model for sequence classification. We will set up the model for training and specify the optimizer. We will then train the model for a defined number of epochs, logging the loss during the training process.
We download the pretrained DistilBERT model from Hugging Face's model hub, specifically the distilbert-base-uncased variant, a smaller, faster version of BERT, and move the model to the specified device (CPU or GPU). To drive the training process, we use the Adam optimizer with a learning rate of 5e-5.
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
optim = torch.optim.Adam(model.parameters(), lr=5e-5)
Model Fine-tuning Loop#
First, we define the number of epochs to train the model, set to 2 (you can increase this value). We also create a list to track the training loss, which we will use to understand how well the model is learning. Finally, we set the model to training mode with model.train(), which enables training-time behaviors such as dropout.
In the training loop, we iterate over the training DataLoader, which provides batches of data. For each batch, we get the input IDs, attention masks, and labels and move them to the specified device (CPU or GPU). We then invoke the model with these inputs and extract the loss and logits from the model's output. The loss is used to perform backpropagation and update the model's weights via the optimizer. Finally, we log the training loss for each batch and print a message every 250 batches to monitor progress.
To conclude, we compute the training time.
import time
epochs = 2
losses = []
model.train()
start_time = time.time()
for epoch in range(epochs):
    for batch_idx, batch in enumerate(train_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass: the model returns the loss when labels are provided
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss, logits = outputs['loss'], outputs['logits']

        # Backward pass
        optim.zero_grad()
        loss.backward()
        optim.step()

        losses.append(loss.item())
        if batch_idx % 250 == 0:
            print(f'Epoch: {epoch+1:02d}/{epochs:02d} | Batch: {batch_idx:04d}/{len(train_loader):04d} | Loss: {loss:.4f}')
train_time = time.time() - start_time
print(f'It took {train_time/60:.2f} minutes to fine-tune the DistilBERT model for {epochs} epochs')
After training, we can visualize the training loss over batches to understand how the model learned. This helps us see whether the model is converging or whether there are issues such as instability or overfitting.
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(losses, label='Training Loss', color='blue')
plt.title('Training Loss Over Time')
plt.xlabel('Batch Number')
plt.ylabel('Loss')
plt.legend()
plt.grid()
plt.show()
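Per-batch loss curves are noisy. A simple moving average, sketched here in plain Python (the window size is an arbitrary choice), often makes the trend easier to read:

```python
def moving_average(values, window=50):
    """Return the running mean of `values` over a sliding window."""
    averaged = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

# Example with a short, made-up loss sequence and a small window
print(moving_average([4.0, 2.0, 3.0, 1.0], window=2))  # [4.0, 3.0, 2.5, 2.0]
```

The smoothed curve could be plotted alongside the raw one, e.g. plt.plot(moving_average(losses)).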
We can now compute the accuracy of the model on the different splits. To do this, we define a function that takes the model, the data loader, and the device.
In the function, we first disable gradient tracking with torch.no_grad(); gradients are not needed for evaluation, so this saves computation. We keep counters for the correctly predicted and total number of examples. We iterate over the batches in data_loader; for each batch, we get the input IDs, attention masks, and labels and move them to the specified device (CPU or GPU). We then invoke the model, extract the logits from its output, take the argmax to obtain predicted_labels, compare them against the actual labels, and count the correct ones. Finally, we print the counts and return the accuracy.
def compute_accuracy(model, data_loader, device):
    """Compute the accuracy of the model on the given data loader.

    Args:
        model: The trained model.
        data_loader: DataLoader for the dataset (train, validation, or test).
        device: The device (CPU or GPU) on which the model and data are loaded.

    Returns:
        float: The accuracy as a percentage.
    """
    model.eval()  # disable dropout so evaluation is deterministic
    with torch.no_grad():
        correct_pred, num_examples = 0, 0
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs['logits']  # Get the logits from the model output
            predicted_labels = torch.argmax(logits, dim=1)

            num_examples += labels.size(0)
            correct_pred += (predicted_labels == labels).sum()
        print(f'{correct_pred=} {num_examples=}')
        return correct_pred.item() / num_examples * 100
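The core of this computation, argmax over logits compared against labels, can be seen in isolation with made-up values (three examples, two classes; the logits below are arbitrary illustrative numbers):

```python
import torch

logits = torch.tensor([[ 2.0, -1.0],   # predicted class 0
                       [-0.5,  1.5],   # predicted class 1
                       [ 0.3,  0.1]])  # predicted class 0
labels = torch.tensor([0, 1, 1])

predicted = torch.argmax(logits, dim=1)  # tensor([0, 1, 0])
accuracy = (predicted == labels).sum().item() / labels.size(0) * 100
print(f'{accuracy:.2f}%')  # 66.67%
```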
Now, we can call this function for the various splits to get the accuracy.
train_accuracy = compute_accuracy(model, train_loader, device)
val_accuracy = compute_accuracy(model, val_loader, device)
test_accuracy = compute_accuracy(model, test_loader, device)
print(f'Training accuracy: {train_accuracy:.2f}%\nValidation accuracy: {val_accuracy:.2f}%\nTest accuracy: {test_accuracy:.2f}%')
To showcase the fine-tuned model we will define a function that takes the index of a review as input and returns the sentiment prediction for that review. First, we make sure the index is within bounds. Then, we tokenize the review text and move it to the device. We then perform the sentiment prediction and return the predicted sentiment label.
from tabulate import tabulate
def sentiment_evaluation(index):
    # Clamp out-of-range indices to the last valid review
    if index >= len(test_texts):
        index = len(test_texts) - 1
    sample_eval = tokenizer(test_texts[index], truncation=True, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**sample_eval).logits  # Get the logits from the model output
    # Return the index of the highest logit value as the predicted sentiment
    return logits.argmax().item()
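The function above returns only the hard label. If a confidence score is also wanted, the logits can be passed through a softmax, using the torch.nn.functional import from the top of the notebook (the logit values below are made up for illustration):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-1.2, 2.3]])  # illustrative values, shape (1, num_labels)
probs = F.softmax(logits, dim=1)      # probabilities that sum to 1 per row
label = probs.argmax(dim=1).item()
confidence = probs[0, label].item()
print(label, round(confidence, 3))    # 1 0.971
```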
Now, we can test the function with a few random examples. With the predicted sentiment labels we create a DataFrame to display the results, which we then render with tabulate.
results = []
for idx in torch.randint(0, 10000, (10,)).tolist():
    sent = sentiment_evaluation(idx)
    actual_label = test_labels[idx]
    results.append({'Index': idx, 'Predicted Sentiment': sent, 'Actual Label': actual_label})

results_df = pd.DataFrame(results)
print(tabulate(results_df, headers='keys', tablefmt='fancy_grid'))
Finally, you can manually enter the index of a review from the table above to see the corresponding sentence. Type 'exit' to leave the loop.
while True:
    user_input = input("Enter the index number from the table above to see the corresponding sentence or type 'exit' to quit: ")
    if user_input.lower() == 'exit':
        print("Exiting the program.")
        break
    try:
        user_input_index = int(user_input)
        if user_input_index in results_df['Index'].values:
            predicted_sentiment_new = sentiment_evaluation(user_input_index)
            actual_label_new = test_labels[user_input_index]
            sentence_new = test_texts[user_input_index]
            print(f'Index: {user_input_index}')
            print(f'Predicted Sentiment: {predicted_sentiment_new}')
            print(f'Actual Label: {actual_label_new}')
            print(f'Sentence: {sentence_new}')
        else:
            print("The entered index is not found in the results.")
    except ValueError:
        print("Invalid input. Please enter a valid index number or type 'exit' to quit.")
Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved. Portions of this file consist of AI-generated content.
SPDX-License-Identifier: MIT