Sentiment Analysis using BERT in Python

In this article, we'll learn sentiment analysis using the pre-trained BERT model. You should have intermediate knowledge of Python, some exposure to PyTorch, and basic knowledge of deep learning. We will use the SMILE Twitter dataset for the sentiment analysis. Read about the dataset and download it from this link.

The original BERT implementation was written in TensorFlow. In this article we use the PyTorch implementation from the Hugging Face transformers library.

WHAT IS BERT?

  • BERT stands for Bidirectional Encoder Representations from Transformers.
  • BERT is a widely used machine learning model in NLP. It is a large-scale, transformer-based language model that can be fine-tuned for a variety of tasks. You can read about BERT in the original paper here – BERT
  • If you want to try BERT, use the BERT FineTuning notebook hosted on Colab.
  • You can also look at the BERT language model code in modeling.py in the GitHub repo. Implementations of this model exist in TensorFlow, PyTorch, and MXNet.

NOTE:- USE GOOGLE COLAB AND CHANGE RUNTIME TYPE TO GPU.

LOADING AND PREPROCESSING DATA

Below is the code where we are importing all the necessary Python libraries.

#Import required libraries

import os
import random

import numpy as np
import pandas as pd
import torch
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from torch.optim import AdamW  # optimizer used in the set-up section
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer
from transformers import BertForSequenceClassification
from transformers import get_linear_schedule_with_warmup  # scheduler used in the set-up section

#Loading data from Google drive 
from google.colab import drive 
drive.mount('/content/drive') 


os.chdir("ENTER LOCATION WHERE DATASET IS.") # EXAMPLE: /content/drive/My Drive/Sentiment_analysis_using_BERT 

df = pd.read_csv("smileannotationsfinal.csv", names = ['id', 'text', 'category'])

df.set_index('id', inplace = True)
df.head()

OUTPUT:-

                                                                  text         category
                id		
611857364396965889	@aandraous @britishmuseum @AndrewsAntonio Merc...	nocode
614484565059596288	Dorian Gray with Rainbow Scarf #LoveWins (from...	happy
614746522043973632	@SelectShowcase @Tate_StIves ... Replace with ...	happy
614877582664835073	@Sofabsports thank you for following me back. ...	happy
611932373039644672	@britishmuseum @TudorHistory What a beautiful ...	happy

Check how many rows there are for each value in the category column.

df.category.value_counts()

OUTPUT:-

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|angry               2
sad|disgust             2
sad|disgust|angry       1
Name: category, dtype: int64

Here, you can see that some rows carry several labels at once, such as happy|surprise. We remove those rows, along with the rows labelled nocode, because they are not useful for this task.

df = df[~df.category.str.contains(r'\|')] # for removing values containing '|' (raw string, because '|' is a regex operator and must be escaped)

df = df[df.category != 'nocode']  #for removing nocode

df.category.value_counts()

OUTPUT:-

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64
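
The filter above depends on escaping the pipe character, because str.contains treats its argument as a regular expression by default. The sketch below shows the difference using only the standard re module; the sample category strings are hypothetical and just imitate the style of the dataset.

```python
import re

# Hypothetical sample of category strings in the style of the dataset.
categories = ["happy", "happy|surprise", "nocode", "sad|disgust|angry", "angry"]

# Unescaped, "|" means alternation between two empty patterns, so it matches
# the empty string and therefore every category.
matches_unescaped = [c for c in categories if re.search("|", c)]

# Escaped, r"\|" matches only a literal pipe character.
multi_label = [c for c in categories if re.search(r"\|", c)]

# Keep single-label rows that are not "nocode", mirroring the two filters above.
kept = [c for c in categories if not re.search(r"\|", c) and c != "nocode"]

print(matches_unescaped == categories)
print(multi_label)
print(kept)
```

Without the backslash, every row would match and the filter would remove the whole dataset.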

Our data is clean and has 6 different classes. Now, create a dictionary that maps each class to an integer, and add a new column with those integer labels to the dataset.

possible_labels = df.category.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

df['label'] = df.category.replace(label_dict)
df.head()

OUTPUT:-

                                                                    text     category  label
id			
614484565059596288	Dorian Gray with Rainbow Scarf #LoveWins (from...	happy	0
614746522043973632	@SelectShowcase @Tate_StIves ... Replace with ...	happy	0
614877582664835073	@Sofabsports thank you for following me back. ...	happy	0
611932373039644672	@britishmuseum @TudorHistory What a beautiful ...	happy	0
611570404268883969	@NationalGallery @ThePoldarkian I have always ...	happy	0
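
The same mapping can be written as a dictionary comprehension, and inverting it turns predicted integers back into class names. This is a minimal sketch in plain Python. The class order is an assumption, inferred from the order in which the per-class results are printed at the end of this article.

```python
# Assumed first-appearance order of the classes, mirroring df.category.unique().
possible_labels = ["happy", "not-relevant", "angry", "disgust", "sad", "surprise"]

# Class name -> integer label.
label_dict = {label: index for index, label in enumerate(possible_labels)}

# Integer label -> class name, useful for reading predictions.
label_dict_inverse = {index: label for label, index in label_dict.items()}

# Hypothetical predicted label ids.
predicted_ids = [0, 2, 4]
predicted_names = [label_dict_inverse[i] for i in predicted_ids]

print(label_dict)
print(predicted_names)
```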

SPLITTING THE DATASET

X_train, X_val, y_train, y_val = train_test_split(df.index.values, df.label.values, test_size=0.15, random_state=17, stratify=df.label.values) 

df['data_type'] = ['not_set']*df.shape[0]   #CREATING A NEW COLUMN IN DATASET AND SETTING ALL VALUES TO 'not_set' 

df.loc[X_train, 'data_type'] ='train' #CHECKING AND SETTING data_type TO TRAIN 
df.loc[X_val, 'data_type'] = 'val' #CHECKING AND SETTING data_type TO VAL


df.groupby(['category', 'label', 'data_type']).count() #TO CHECK WHICH CATEGORY DATA IS IN WHICH data_type
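
Passing stratify to train_test_split keeps the class proportions roughly the same in the training and validation sets, which matters here because some classes have very few rows. The sketch below illustrates the idea in plain Python. It is not scikit-learn's implementation: the function name stratified_split, the rounding rule, and the sample labels are all assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_indices, val_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption of this sketch: every class gets at least one validation row.
        n_val = max(1, round(len(indices) * test_size))
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)

# Hypothetical imbalanced labels: 40 rows of class 0 and 20 rows of class 1.
labels = [0] * 40 + [1] * 20
train_idx, val_idx = stratified_split(labels, test_size=0.15, seed=17)
val_counts = {c: sum(1 for i in val_idx if labels[i] == c) for c in (0, 1)}

print(len(train_idx), len(val_idx), val_counts)
```

Each class contributes about 15% of its own rows to the validation set, so a rare class is not left out of validation by chance.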

 


LOADING TOKENIZER AND ENCODING DATA

There are 9 different pre-trained models under BERT. These models are released under the same license as the source code (Apache 2.0). We'll use the BERT-Base, Uncased model, which has 12 layers, a hidden size of 768, 12 attention heads, and 110M parameters.

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

#ENCODING DATA
# padding='max_length' with truncation=True replaces the deprecated pad_to_max_length=True
encoded_data_train = tokenizer.batch_encode_plus(df[df.data_type=='train'].text.values.tolist(),
                                                add_special_tokens=True,
                                                return_attention_mask=True,
                                                padding='max_length',
                                                truncation=True,
                                                max_length=256,
                                                return_tensors='pt'
                                                )
encoded_data_val = tokenizer.batch_encode_plus(df[df.data_type=='val'].text.values.tolist(),
                                                add_special_tokens=True,
                                                return_attention_mask=True,
                                                padding='max_length',
                                                truncation=True,
                                                max_length=256,
                                                return_tensors='pt'
                                                )

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)

dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)
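
To see what the padding and the attention mask do, here is a simplified sketch in plain Python. The helper pad_and_mask and the token ids are hypothetical. The real tokenizer is more careful when it truncates (for example, it keeps the closing special token), so treat this only as an illustration of the shapes involved.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad token_ids to max_length and build the attention mask."""
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)      # 1 marks a real token
    padding = max_length - len(ids)
    ids = ids + [pad_id] * padding       # pad on the right
    attention_mask = attention_mask + [0] * padding  # 0 marks padding
    return ids, attention_mask

# Hypothetical token ids for one short tweet, padded to a length of 8.
input_ids, attention_mask = pad_and_mask([101, 7592, 2088, 102], max_length=8)

print(input_ids)
print(attention_mask)
```

Every row ends up with the same length, so the rows can be stacked into one tensor, and the mask tells the model which positions to ignore.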

SET-UP PRE-TRAINED BERT MODEL, OPTIMIZERS, AND CREATING DATA LOADERS

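#SETTING DEVICE (GPU IF AVAILABLE, OTHERWISE CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

#SETTING MODEL
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_dict), output_attentions=False, output_hidden_states=False)
model.to(device)

#CREATING DATA LOADERS
dataloader_train = DataLoader(dataset_train, sampler=RandomSampler(dataset_train), batch_size=32)

# validation data does not need shuffling, so a sequential sampler is enough
dataloader_val = DataLoader(dataset_val, sampler=SequentialSampler(dataset_val), batch_size=32)

#SETTING OPTIMIZER AND SCHEDULER
# AdamW comes from torch.optim and get_linear_schedule_with_warmup from transformers
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)

epochs = 10

scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(dataloader_train)*epochs)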
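
The scheduler changes the learning rate at every training step. As far as we understand it, a linear schedule with warmup multiplies the base learning rate by a factor that rises from 0 to 1 during warmup and then falls linearly to 0. The sketch below computes that factor in plain Python. The function name linear_schedule_factor and the step count of 100 are assumptions for illustration, not part of the library.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup phase: grow linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: shrink linearly from 1 to 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 100  # hypothetical; the article uses len(dataloader_train) * epochs

factors = [linear_schedule_factor(s, 0, total_steps) for s in (0, 50, 100)]
for step, factor in zip((0, 50, 100), factors):
    print(step, base_lr * factor)
```

With num_warmup_steps=0, as in this article, the rate starts at its full value and decays steadily to zero by the last step.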

DEFINING PERFORMANCE METRICS

#FUNCTION TO CALCULATE F1 SCORE
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds,axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')

#FUNCTION FOR CALCULATING ACCURACY PER CLASS
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v:k for k,v in label_dict.items()}
    
    preds_flat = np.argmax(preds,axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(label_dict_inverse[label])
        print("accuracy ", len(y_preds[y_preds==label])/len(y_true))

#FUNCTION FOR MODEL EVALUATION
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
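
The weighted F1 score used above is the average of the per-class F1 scores, where each class is weighted by its number of true examples. The sketch below computes it by hand in plain Python, to show what average='weighted' means. The function weighted_f1 and the small label lists are hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true
    return total / len(y_true)

# Hypothetical labels: four examples of class 0 and two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = weighted_f1(y_true, y_pred)

print(score)
```

Because the weights follow the class sizes, a large class such as happy dominates the score, which is why the per-class check later in the article is still useful.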

TRAINING MODEL

for epoch in tqdm(range(1, epochs+1)):
    model.train()
    
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc ='Epoch {:1d}'.format(epoch),
                        leave=False,
                       disable=False
                       )
    for batch in progress_bar:
        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = { 'input_ids' : batch[0],
                 'attention_mask' : batch[1],
                 'labels' : batch[2]
                 }
        outputs = model(**inputs)
        
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # loss is already averaged over the batch; len(batch) is the number of tensors in the tuple (3), not the batch size
        progress_bar.set_postfix({'training_loss':'{:.3f}'.format(loss.item())})
    
    # SAVE A CHECKPOINT AND PRINT THE METRICS AFTER EACH EPOCH.
    torch.save(model.state_dict(), f'BERT_ft_epoch{epoch}.model')
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 score (weighted): {val_f1}')  

OUTPUT:-

Epoch 1
Training loss: 0.4362200193107128
Validation loss: 0.5310626711164202
F1 score (weighted): 0.7817748883919411

Epoch 2
Training loss: 0.37984432056546213
Validation loss: 0.5510868259838649
F1 score (weighted): 0.7992659080726165

Epoch 3
Training loss: 0.31827573496848344
Validation loss: 0.5192299996103559
F1 score (weighted): 0.8218786445243844

Epoch 4
Training loss: 0.2796195125207305
Validation loss: 0.5170626959630421
F1 score (weighted): 0.8421060567168592

Epoch 5
Training loss: 0.2608465846627951
Validation loss: 0.5412411860057286
F1 score (weighted): 0.8309753043502524

Epoch 6
Training loss: 0.2367106368765235
Validation loss: 0.5324068495205471
F1 score (weighted): 0.8447206599506367

Epoch 7
Training loss: 0.22585051795467734
Validation loss: 0.5261891910008022
F1 score (weighted): 0.8595924339296801

Epoch 8
Training loss: 0.23260785304009915
Validation loss: 0.5234861246177128
F1 score (weighted): 0.8595924339296801

Epoch 9
Training loss: 0.228465342707932
Validation loss: 0.5233924218586513
F1 score (weighted): 0.8595924339296801

Epoch 10
Training loss: 0.22103000041097404
Validation loss: 0.5254445331437247
F1 score (weighted): 0.8595924339296801
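
The training loop calls clip_grad_norm_ with a maximum norm of 1.0 before each optimizer step. The idea is to rescale all gradients together whenever their combined norm gets too large, so one bad batch cannot cause a huge update. The sketch below shows the idea on a flat list of numbers. The function clip_by_global_norm is hypothetical and only approximates what the PyTorch utility does on real tensors.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale grads down so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + 1e-6)  # small constant avoids division by zero
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm

# Hypothetical gradient with norm 5.0, clipped to a maximum norm of 1.0.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print(norm_before, norm_after)
```

The direction of the gradient stays the same; only its length shrinks. Gradients that are already below the limit are left untouched.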

So, our model is ready. You can see the training loss, validation loss, and F1 score after each epoch above. Let's use the evaluate and accuracy_per_class functions to see how accurate our model is on each class.

_, predictions, true_val = evaluate(dataloader_val)  # evaluate returns three values; _ discards the first one (loss_val_avg), which we don't need here

accuracy_per_class(predictions, true_val)

OUTPUT:-

happy
accuracy  0.9532163742690059
not-relevant
accuracy  0.625
angry
accuracy  0.7777777777777778
disgust
accuracy  0.0
sad
accuracy  0.8
surprise
accuracy  0.4
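
Note that the number printed for each class is the share of that class's true examples that were predicted correctly, which is the same thing as per-class recall. It is also very sensitive to class size: disgust has only 6 rows in total, so roughly one of them lands in the validation set, and a single miss gives 0.0. The sketch below reproduces that effect in plain Python, with hypothetical labels and a hypothetical function name.

```python
def per_class_accuracy(y_true, y_pred):
    """For each true class, return (correct predictions, number of examples)."""
    result = {}
    for label in sorted(set(y_true)):
        preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
        correct = sum(1 for p in preds_for_label if p == label)
        result[label] = (correct, len(preds_for_label))
    return result

# Hypothetical validation set: class 1 has a single example, and it is missed.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 1, 0]
counts = per_class_accuracy(y_true, y_pred)

for label, (correct, total) in counts.items():
    print(label, correct / total, f"({correct}/{total})")
```

Printing the counts next to the ratio makes it easier to tell a weak class from a class that simply has too few validation examples.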

So, here we come to the end of this article. We have learned how to load a pre-trained model and fine-tune it for our own classification task.

Thank you..!!


One response to “Sentiment Analysis using BERT in Python”

  1. Kyle says:

    hi, there are some missing variables in your code.

    batch = tuple(b.to(device) for b in batch)
    inputs = { ‘input_ids’ : batch[0],
    ‘attention_mask’ : batch[1],
    ‘labels’ : batch[2]
    }
    outputs = model(**inputs)

    no variable device. and ‘int’ object has no attribute ‘size’

    how can i fix this?

    thanks!
