Sentiment Analysis using BERT in Python

In this article, we'll learn sentiment analysis using the pre-trained BERT model. For this you need intermediate knowledge of Python, a little exposure to PyTorch, and basic knowledge of deep learning. We will be using the SMILE Twitter dataset for the sentiment analysis. Read about the dataset and download it from this link.
Note that the original BERT model was released with a TensorFlow implementation; in this article we will use the PyTorch implementation provided by the Hugging Face transformers library.
WHAT IS BERT?
- BERT stands for Bidirectional Encoder Representations from Transformers.
- BERT is a widely used model in the NLP space. It is a large-scale transformer-based language model that can be fine-tuned for a variety of tasks. You can read about BERT in the original paper here – BERT.
- If you want to try BERT, you can do so through the BERT FineTuning notebook hosted on Colab (or see the short sketch right after this list for an even quicker way to experiment).
- You can also look at the BERT language model code available in modeling.py in the GitHub repo. Implementations of this model exist in TensorFlow, PyTorch, and MXNet.
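If you just want to get a quick feel for what BERT does before fine-tuning it, here is a minimal sketch using the transformers fill-mask pipeline (this snippet is only an illustration and is not part of the fine-tuning workflow below):

from transformers import pipeline

# Load bert-base-uncased as a masked language model and let it fill in the blank
unmasker = pipeline('fill-mask', model='bert-base-uncased')
for prediction in unmasker("Sentiment analysis with BERT is [MASK]."):
    print(prediction['token_str'], round(prediction['score'], 3))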
NOTE:- USE GOOGLE COLAB AND CHANGE RUNTIME TYPE TO GPU.
LOADING AND PREPROCESSING DATA
Below is the code where we are importing all the necessary Python libraries.
# Import required libraries
import torch
import numpy as np
import os
import random
import pandas as pd
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from transformers import BertTokenizer, BertForSequenceClassification
# AdamW and the scheduler are used in the optimizer setup later; in newer transformers
# versions AdamW can instead be imported from torch.optim
from transformers import AdamW, get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Loading data from Google Drive
from google.colab import drive
drive.mount('/content/drive')
os.chdir("ENTER LOCATION WHERE DATASET IS.")  # EXAMPLE: /content/drive/My Drive/Sentiment_analysis_using_BERT

df = pd.read_csv("smileannotationsfinal.csv", names=['id', 'text', 'category'])
df.set_index('id', inplace=True)
df.head()
OUTPUT:-
                                                                  text category
id
611857364396965889  @aandraous @britishmuseum @AndrewsAntonio Merc...   nocode
614484565059596288  Dorian Gray with Rainbow Scarf #LoveWins (from...    happy
614746522043973632  @SelectShowcase @Tate_StIves ... Replace with ...    happy
614877582664835073  @Sofabsports thank you for following me back. ...    happy
611932373039644672  @britishmuseum @TudorHistory What a beautiful ...    happy
Now check how many tweets fall into each value of the category column.
df.category.value_counts()
OUTPUT:-
nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|angry               2
sad|disgust             2
sad|disgust|angry       1
Name: category, dtype: int64
Here you can observe that some rows carry multiple labels, such as happy|surprise. We have to remove those, as well as the rows labelled nocode, because they are not useful for us.
df = df[~df.category.str.contains(r'\|')]  # remove rows whose category contains '|'
df = df[df.category != 'nocode']           # remove rows labelled 'nocode'
df.category.value_counts()
OUTPUT:-
happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64
Our data is now clean and has 6 different classes. Next, create a dictionary that maps each class to an integer, and add a new column with these integer labels to the dataset.
possible_labels = df.category.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

df['label'] = df.category.replace(label_dict)
df.head()
OUTPUT:-
                                                                  text category  label
id
614484565059596288  Dorian Gray with Rainbow Scarf #LoveWins (from...    happy      0
614746522043973632  @SelectShowcase @Tate_StIves ... Replace with ...    happy      0
614877582664835073  @Sofabsports thank you for following me back. ...    happy      0
611932373039644672  @britishmuseum @TudorHistory What a beautiful ...    happy      0
611570404268883969  @NationalGallery @ThePoldarkian I have always ...    happy      0
SPLITTING THE DATASET
X_train, X_val, y_train, y_val = train_test_split(df.index.values,
                                                  df.label.values,
                                                  test_size=0.15,
                                                  random_state=17,
                                                  stratify=df.label.values)

df['data_type'] = ['not_set'] * df.shape[0]   # create a new column and set all values to 'not_set'
df.loc[X_train, 'data_type'] = 'train'        # mark the training rows
df.loc[X_val, 'data_type'] = 'val'            # mark the validation rows

df.groupby(['category', 'label', 'data_type']).count()   # check how each category is split across train and val

LOADING TOKENIZER AND ENCODING DATA
There are 9 different pre-trained models under BERT, released under the same license as the source code (Apache 2.0). We'll use BERT-Base, Uncased, which has 12 layers, a hidden size of 768, 12 attention heads, and 110M parameters.
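You can verify these numbers directly from the model's configuration. A small sketch, assuming the transformers library can download the bert-base-uncased config:

from transformers import BertConfig

config = BertConfig.from_pretrained('bert-base-uncased')
print(config.num_hidden_layers)      # 12 layers
print(config.hidden_size)            # 768 hidden units
print(config.num_attention_heads)    # 12 attention heads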
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Encoding data
encoded_data_train = tokenizer.batch_encode_plus(df[df.data_type == 'train'].text.values,
                                                 add_special_tokens=True,
                                                 return_attention_mask=True,
                                                 pad_to_max_length=True,
                                                 max_length=256,
                                                 return_tensors='pt')

encoded_data_val = tokenizer.batch_encode_plus(df[df.data_type == 'val'].text.values,
                                               add_special_tokens=True,
                                               return_attention_mask=True,
                                               pad_to_max_length=True,
                                               max_length=256,
                                               return_tensors='pt')

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type == 'train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type == 'val'].label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)
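To sanity-check the encoding, you can look at the tensor shapes and decode one training example back into text. A quick sketch, using only the variables created above:

# Optional: inspect the tokenizer output
print(input_ids_train.shape)        # (number of training tweets, 256)
print(attention_masks_train.shape)  # same shape as input_ids_train

# Decode the first encoded tweet; [PAD] tokens fill the remaining positions up to max_length
print(tokenizer.decode(input_ids_train[0]))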
SET-UP PRE-TRAINED BERT MODEL, OPTIMIZERS, AND CREATING DATA LOADERS
# Setting up the model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=6,
                                                      output_attentions=False,
                                                      output_hidden_states=False)

# Move the model to the GPU if one is available; the training and evaluation loops below expect `device` to exist
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Creating data loaders
dataloader_train = DataLoader(dataset_train, sampler=RandomSampler(dataset_train), batch_size=32)
dataloader_val = DataLoader(dataset_val, sampler=RandomSampler(dataset_val), batch_size=32)

# Setting the optimizer and learning-rate scheduler
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)
epochs = 10
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train) * epochs)
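As a quick check that everything is wired up, you can print the number of model parameters and the total number of optimization steps the scheduler will see. A small sketch using only the objects defined above:

# Optional sanity checks on the setup
num_params = sum(p.numel() for p in model.parameters())
print(f'Model parameters: {num_params:,}')                        # roughly 110M for BERT-Base
print(f'Training batches per epoch: {len(dataloader_train)}')
print(f'Total optimization steps: {len(dataloader_train) * epochs}')
print(f'Using device: {device}')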
DEFINING PERFORMANCE METRICS
# Function to calculate the weighted F1 score
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

# Function for calculating accuracy per class
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}

    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat == label]
        y_true = labels_flat[labels_flat == label]
        print(label_dict_inverse[label])
        print("accuracy ", len(y_preds[y_preds == label]) / len(y_true))

# Function for model evaluation
def evaluate(dataloader_val):
    model.eval()
    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader_val:
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2]}

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total / len(dataloader_val)
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals
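Before training, you can verify the metric helper on a tiny dummy example. This is a sketch with made-up logits and labels, not real model output:

# Dummy logits for 3 examples over 6 classes, and their true labels (made-up values)
dummy_preds = np.array([[2.0, 0.1, 0.1, 0.1, 0.1, 0.1],   # predicted class 0 (correct)
                        [0.1, 0.1, 3.0, 0.1, 0.1, 0.1],   # predicted class 2 (correct)
                        [0.1, 0.1, 2.5, 0.1, 0.1, 0.1]])  # predicted class 2, but true label is 0
dummy_labels = np.array([0, 2, 0])

print(f1_score_func(dummy_preds, dummy_labels))  # weighted F1 over the 3 examples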
TRAINING MODEL
for epoch in tqdm(range(1, epochs + 1)):
    model.train()
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train,
                        desc='Epoch {:1d}'.format(epoch),
                        leave=False,
                        disable=False)

    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2]}

        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item() / len(batch))})

    # This section of code just saves a checkpoint and prints metrics after each epoch.
    torch.save(model.state_dict(), f'BERT_ft_epoch{epoch}.model')
    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total / len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')

    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 score (weighted): {val_f1}')
OUTPUT:-
Epoch 1
Training loss: 0.4362200193107128
Validation loss: 0.5310626711164202
F1 score (weighted): 0.7817748883919411

Epoch 2
Training loss: 0.37984432056546213
Validation loss: 0.5510868259838649
F1 score (weighted): 0.7992659080726165

Epoch 3
Training loss: 0.31827573496848344
Validation loss: 0.5192299996103559
F1 score (weighted): 0.8218786445243844

Epoch 4
Training loss: 0.2796195125207305
Validation loss: 0.5170626959630421
F1 score (weighted): 0.8421060567168592

Epoch 5
Training loss: 0.2608465846627951
Validation loss: 0.5412411860057286
F1 score (weighted): 0.8309753043502524

Epoch 6
Training loss: 0.2367106368765235
Validation loss: 0.5324068495205471
F1 score (weighted): 0.8447206599506367

Epoch 7
Training loss: 0.22585051795467734
Validation loss: 0.5261891910008022
F1 score (weighted): 0.8595924339296801

Epoch 8
Training loss: 0.23260785304009915
Validation loss: 0.5234861246177128
F1 score (weighted): 0.8595924339296801

Epoch 9
Training loss: 0.228465342707932
Validation loss: 0.5233924218586513
F1 score (weighted): 0.8595924339296801

Epoch 10
Training loss: 0.22103000041097404
Validation loss: 0.5254445331437247
F1 score (weighted): 0.8595924339296801
So, here our model is ready. You can observe the training loss, validation loss, and F1 score after the 10th epoch. Let's use the evaluate and accuracy_per_class functions to see how accurate our model is for each class.
# The underscore discards the first return value (loss_val_avg), since evaluate returns three values
# and we only need the predictions and true labels here
_, predictions, true_val = evaluate(dataloader_val)
accuracy_per_class(predictions, true_val)
OUTPUT:-
happy
accuracy  0.9532163742690059
not-relevant
accuracy  0.625
angry
accuracy  0.7777777777777778
disgust
accuracy  0.0
sad
accuracy  0.8
surprise
accuracy  0.4
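Since the training loop saves a checkpoint after every epoch, you can also reload a specific checkpoint before evaluating. A minimal sketch, assuming the file BERT_ft_epoch10.model saved by the loop above is present in the working directory:

# Reload the weights saved after epoch 10 and evaluate that checkpoint
model.load_state_dict(torch.load('BERT_ft_epoch10.model', map_location=device))
_, predictions, true_val = evaluate(dataloader_val)
accuracy_per_class(predictions, true_val)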
So, here we come to the end of this article. We have learned how to load a pre-trained BERT model and fine-tune it for our own sentiment analysis task.
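If you want to use the fine-tuned model on a new tweet, the sketch below shows one way to do it. The example text and the use of label_dict to map the prediction back to a class name are assumptions for illustration, not part of the original workflow:

# Predict the sentiment of a single new tweet (illustrative example text)
text = "What a wonderful exhibition, I loved every minute of it!"

encoded = tokenizer.encode_plus(text,
                                add_special_tokens=True,
                                return_attention_mask=True,
                                return_tensors='pt')
input_ids = encoded['input_ids'].to(device)
attention_mask = encoded['attention_mask'].to(device)

model.eval()
with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=attention_mask)[0]

label_dict_inverse = {v: k for k, v in label_dict.items()}
print(label_dict_inverse[logits.argmax(dim=1).item()])   # e.g. 'happy'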
Thank you..!!