Financial Sentiment Analysis using BERT in Python

In this tutorial, we will learn how BERT helps classify finance-domain text as positive, negative, or neutral.

What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a language representation model published by Google AI Language researchers. It created a stir in the machine learning community by delivering state-of-the-art results on a range of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.

The Transformer encoder architecture is used by the BERT family of models to process each token of input text in the context of all tokens before and after it, thus the name: Bidirectional Encoder Representations from Transformers.

Typically, BERT models are trained on a huge corpus of text before being fine-tuned for specific tasks.

What is the BERT Tokenizer?

BERT uses a tokenizer known as WordPiece. It works by either keeping a word whole (one word becomes one token) or splitting it into word pieces (one word becomes several tokens). This is handy when a word appears in many inflected or derived forms, as the sketch below shows.
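Here is a minimal sketch of WordPiece tokenization using the Hugging Face transformers tokenizer that we also load later in this tutorial; the example words are only illustrations, and the exact pieces depend on the vocabulary:

from transformers import BertTokenizer

# load the pre-trained WordPiece tokenizer (the same checkpoint used later)
tokenizer=BertTokenizer.from_pretrained('bert-base-cased')

# a common word usually stays as a single token
print(tokenizer.tokenize('profit'))
# a rare word is broken into word pieces, with continuations marked by '##'
print(tokenizer.tokenize('overcapitalization'))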

BERT is pre-trained with two training tasks:

  1. Masked Language Modeling: randomly masked tokens in the input are predicted from the surrounding context.
  2. Next Sentence Prediction: the model determines whether the second sentence naturally follows the first sentence.

BERT Special Tokens:

  • [PAD]: used to pad shorter sentences so that every sequence in a batch has the same length.
  • [CLS]: short for "classification". It is added at the beginning of every sentence; its final hidden state acts as a summary of the whole sequence, which classification heads use as input.
  • [SEP]: added at the end of a sentence. It helps the model recognize where one input ends and the next begins within the same sequence.
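As a quick sketch of these tokens in practice (again using the bert-base-cased tokenizer loaded later; the sentence is just an illustration), we can encode a short sentence with padding and convert the ids back to tokens:

from transformers import BertTokenizer

tokenizer=BertTokenizer.from_pretrained('bert-base-cased')
enc=tokenizer.encode_plus('Stocks rallied today.',max_length=10,
                          padding='max_length',truncation=True)
# the token list starts with [CLS], closes the sentence with [SEP],
# and is padded with [PAD] up to max_length
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))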

Sentiment Analysis

This notebook trains a financial sentiment analysis model that classifies each sentence as positive, negative, or neutral based on its text.

Dataset Link: Financial Sentiment

Content of Dataset

  • Sentence: Contains sentences related to the financial domain.
  • Sentiment: Contains sentiments like positive, negative, or neutral.

To solve this problem we will:

  • Import all the required libraries.
  • Load the dataset.
  • Load a pre-trained BERT model from the Hugging Face transformers library.
  • Construct a model by combining BERT and a classifier.
  • Train the model, fine-tuning BERT as part of the process.
  • Save the model and use it to categorize sentences.

Setup

First of all, we import all the libraries required to solve the financial sentiment problem.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
from nltk.util import ngrams
from collections import Counter
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM,Bidirectional,Dropout,Dense,Embedding
import re,string,unicodedata
from gensim.models import Word2Vec,KeyedVectors
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import train_test_split
import texthero as hero
from texthero import preprocessing as pr
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

Load Dataset

Here we load our dataset using pandas and drop the index column 'Unnamed: 0'.

df=pd.read_csv('/content/drive/MyDrive/Data/Financial Sentiment Analysis/data.csv')
print(df.shape)
del df['Unnamed: 0']
df.head()

Output

Check for NaN values in Dataset

Let’s check the percentage of NaN values in each column of the dataset.

df.isnull().sum()/len(df)*100

Output

Sentence 0.0 
Sentiment 0.0 
dtype: float64

Check for Unique Values

Let’s take a look at the unique values in the Sentiment column.

df['Sentiment'].unique()

Output

array(['positive', 'negative', 'neutral'], dtype=object)

Visualization of Sentiments

Let’s visualize the distribution of sentiments using the seaborn library.

y=df['Sentiment']
sns.countplot(x=y)

Output

Check a Sample Sentence

Let’s look at one of the sentences in our dataset.

df['Sentence'][0]

Output

The GeoSolutions technology will leverage Benefon 's GPS solutions by providing Location Based Search Technology , a Communities Platform , location relevant multimedia content and a new and powerful commercial model .


Text Preprocessing

Let’s do some text preprocessing on the sentences with the help of the texthero library.

Here, we will remove the following from the sentences:

  • Punctuation.
  • Brackets like (), {}, [].
  • Diacritics.
  • HTML tags like <a>, <p>.
  • Extra whitespace.
  • Stopwords like "for", "an", etc.
  • URLs starting with http.
  • Digits, whether standalone like 1 and 6 or embedded in words like 131ams.

Code

# custom texthero cleaning pipeline
cust_pipes=[pr.fillna,pr.lowercase,pr.remove_punctuation,pr.remove_diacritics,pr.remove_urls,
            pr.remove_brackets,pr.remove_html_tags,pr.remove_stopwords,pr.remove_whitespace
            ]
df['Sentence']=df['Sentence'].pipe(hero.clean,cust_pipes)
# remove digits, even when they appear inside words (only_blocks=False)
df['Sentence']=hero.preprocessing.remove_digits(df['Sentence'], only_blocks=False)
df.head()

Output

Remove Words

Here we remove all words that are two characters long or shorter (the regex \b\w{1,2}\b matches words of length 1 or 2).

Code

def rem_words(text):
  # drop words that are only 1 or 2 characters long
  text=re.sub(r'\b\w{1,2}\b','',text)
  return text
df['Sentence']=df['Sentence'].apply(rem_words)

Tokenization

Here we split each sentence string into a list of tokens such as 'solutions' and 'technology'.

df['Sentence']=hero.tokenize(df['Sentence'])
df.head()

Output

Lemmatization

Lemmatization uses a vocabulary and morphological analysis of words to remove inflectional endings and return the base or dictionary form of a word, known as the lemma.

Here we are applying lemmatization to our text.

Code

wc =WordNetLemmatizer()
nltk.download('omw-1.4')
def lemma(text):
  words=[wc.lemmatize(i) for i in text]
  return words
df['Sentence']=df['Sentence'].apply(lemma)
df.head()

Join Text

Finally, we join the tokens we split earlier back into a single string with ' '.join.

def combine(text):
  com=' '.join(text)
  return com
df['Sentence']=df['Sentence'].apply(combine)
df.head()

Output

Change Labels

Here we map the sentiment labels to integers: neutral to 0, positive to 1, and negative to 2.

Code

label={'neutral':0,'positive':1,'negative':2}
df['Sentiment'] = df['Sentiment'].apply(lambda x: label[x])
df.head()

Output

As you can see, the sentiment labels have been changed from text to integers.

BERT Tokenizer

Model Inputs

  • input_ids: often the only required input to the model. They are token indices, numerical representations of the tokens that make up the input sequences.
  • attention_mask: an optional input used when batching sequences together. It tells the model which tokens should be attended to (real tokens, marked 1) and which should be ignored (padding, marked 0).
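As a small illustration (a sketch using the bert-base-cased tokenizer; the sentence is arbitrary), padding a short sentence makes the role of the attention mask visible:

from transformers import BertTokenizer

tokenizer=BertTokenizer.from_pretrained('bert-base-cased')
enc=tokenizer.encode_plus('Profit rose sharply.',max_length=8,
                          padding='max_length',truncation=True)
print(enc['input_ids'])       # token indices, padded with 0s up to max_length
print(enc['attention_mask'])  # 1 for real tokens, 0 for the [PAD] positions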

Let’s create arrays to hold the input_ids and attention_mask for every sentence.

Code

max_len=256
x_input_id=np.zeros((len(df),max_len))
x_attn_mask=np.zeros((len(df),max_len))
x_input_id.shape,x_attn_mask.shape

Output

((5842, 256), (5842, 256))

5842 – number of samples (rows); 256 – maximum sequence length (columns).


Let’s create a function that tokenizes all the sentences by calling encode_plus with the following parameters:

  • max_length (256): the maximum length of each sequence.
  • add_special_tokens (True): special tokens such as [CLS], [SEP], and [PAD] are added.
  • truncation (True): sentences longer than max_length are truncated.
  • padding ('max_length'): shorter sentences are padded up to max_length.
  • return_tensors ('tf'): the encodings are returned as TensorFlow tensors.

The function returns the filled input_ids and attention_mask arrays.

We then apply this function to the x_input_id and x_attn_mask arrays we created earlier.

from transformers import BertTokenizer
tokenizer=BertTokenizer.from_pretrained('bert-base-cased')
def data(df,ids,masks,tokenizer):
  # encode every sentence and fill the pre-allocated arrays row by row
  for i,text in enumerate(df['Sentence']):
    token=tokenizer.encode_plus(text,max_length=max_len,truncation=True,padding='max_length',
                                add_special_tokens=True,return_tensors='tf')
    ids[i, :]=token.input_ids
    masks[i, :]=token.attention_mask
  return ids,masks

x_input_id,x_attn_mask=data(df,x_input_id,x_attn_mask,tokenizer)

One Hot Encoding

Let’s apply one-hot encoding to the target vector: each integer label becomes a vector with a 1 in the position of its class and 0s elsewhere. For example, the label 1 (positive) becomes [0., 1., 0.].

Code

labels=np.zeros((len(df),3))
# put a 1 in the column corresponding to each row's class label
labels[np.arange(len(df)),df['Sentiment'].values]=1
labels

Output

array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       ...,
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])


Defining Model

We have created a very simple fine-tuned model which includes:

  • the pre-trained BERT model
  • an input_ids input layer
  • an attention_mask input layer
  • the BERT layer
  • a Dense layer
  • a final output layer

We then check the summary of the model using model.summary().

Code

from transformers import TFBertModel

bert=TFBertModel.from_pretrained('bert-base-cased')
input_ids=tf.keras.layers.Input(shape=(256,),name='input_ids',dtype='int32')
attention_masks=tf.keras.layers.Input(shape=(256,),name='attention_mask',dtype='int32')
# index [1] selects the pooled output of the [CLS] token
bert_layer=bert.bert(input_ids,attention_mask=attention_masks)[1]
denses=tf.keras.layers.Dense(512,activation='relu',name='dense_layer')(bert_layer)
output_layer=tf.keras.layers.Dense(3,activation='softmax',name='output_layer')(denses)
model=tf.keras.Model(inputs=[input_ids,attention_masks],outputs=output_layer)
model.summary()

Output

__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_ids (InputLayer)         [(None, 256)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 256)]        0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  108310272   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 256,                                               
                                 768),                                                            
                                 pooler_output=(Non                                               
                                e, 768),                                                          
                                 past_key_values=No                                               
                                ne, hidden_states=N                                               
                                one, attentions=Non                                               
                                e, cross_attentions                                               
                                =None)                                                            
                                                                                                  
 dense_layer (Dense)            (None, 512)          393728      ['bert[0][1]']                   
                                                                                                  
 output_layer (Dense)           (None, 3)            1539        ['dense_layer[0][0]']            
                                                                                                  
==================================================================================================
Total params: 108,705,539
Trainable params: 108,705,539
Non-trainable params: 0

Let’s take a look at the model’s structure.

tf.keras.utils.plot_model(model)

Train the BERT Model

Here we define the training parameters:

  • Optimizer: Adam with a learning rate of 1e-5.
  • Loss: CategoricalCrossentropy.
  • Metric: categorical accuracy.
  • Epochs: 5 (the number of passes over the training data).
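The fit call in the code below uses train and val datasets that are not constructed in the listing. As a minimal sketch, one way to build them is with tf.data; the 80/20 train/validation split and the batch size of 16 are assumptions, although a batch size of 16 with an 80% split is consistent with the 292 steps per epoch shown in the output below.

# build a tf.data pipeline from the encoded inputs and one-hot labels
dataset=tf.data.Dataset.from_tensor_slices((x_input_id,x_attn_mask,labels))

def map_func(input_ids,masks,labels):
  # the model expects a dict keyed by the names of its Input layers
  return {'input_ids':input_ids,'attention_mask':masks},labels

dataset=dataset.map(map_func)
dataset=dataset.shuffle(10000).batch(16,drop_remainder=True)

# roughly 80% of the batches for training, the rest for validation
train_size=int((len(df)/16)*0.8)
train=dataset.take(train_size)
val=dataset.skip(train_size)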

Code

optim=tf.keras.optimizers.Adam(learning_rate=1e-5,decay=1e-6)
loss=tf.keras.losses.CategoricalCrossentropy()
acc=tf.keras.metrics.CategoricalAccuracy('accuracy')
model.compile(optimizer=optim,loss=loss,metrics=[acc])
hist=model.fit(train,validation_data=val,epochs=5)

Output

Epoch 1/5
292/292 [==============================] - 268s 882ms/step - loss: 0.8007 - accuracy: 0.6383 - val_loss: 0.5455 - val_accuracy: 0.7654
Epoch 2/5
292/292 [==============================] - 257s 878ms/step - loss: 0.5333 - accuracy: 0.7568 - val_loss: 0.3816 - val_accuracy: 0.8339
Epoch 3/5
292/292 [==============================] - 256s 877ms/step - loss: 0.4292 - accuracy: 0.8044 - val_loss: 0.3283 - val_accuracy: 0.8433
Epoch 4/5
292/292 [==============================] - 256s 877ms/step - loss: 0.3263 - accuracy: 0.8440 - val_loss: 0.2264 - val_accuracy: 0.8776
Epoch 5/5
292/292 [==============================] - 256s 878ms/step - loss: 0.2821 - accuracy: 0.8658 - val_loss: 0.2385 - val_accuracy: 0.8682

The output shows that by the 5th epoch the training accuracy is around 0.87, the validation accuracy is also around 0.87, and the training loss is still decreasing.

Visualize Train & Test Results

Let’s plot the training and validation accuracy, and the training and validation loss, for every epoch.

history_dict = hist.history
print(history_dict.keys())

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)
fig = plt.figure(figsize=(10, 6))
fig.tight_layout()

plt.subplot(2, 1, 1)
# r is for "solid red line"
plt.plot(epochs, loss, 'r', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
# plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(epochs, acc, 'r', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

Output
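The two plots show the training and validation loss falling and the accuracy rising across the epochs.

Finally, as promised in the outline, we can save the fine-tuned model and use it to classify new sentences. The snippet below is a minimal sketch; the save path, the example sentence, and the predict_sentiment helper are illustrative and not part of the original notebook.

# save the fine-tuned model (the path is a placeholder)
model.save('financial_sentiment_bert')

def predict_sentiment(sentence):
  # encode a single sentence exactly as during training
  token=tokenizer.encode_plus(sentence,max_length=max_len,truncation=True,
                              padding='max_length',add_special_tokens=True,
                              return_tensors='tf')
  probs=model.predict({'input_ids':token.input_ids,
                       'attention_mask':token.attention_mask})
  # 0 = neutral, 1 = positive, 2 = negative (the mapping defined earlier)
  sentiments=['neutral','positive','negative']
  return sentiments[np.argmax(probs)]

print(predict_sentiment('The company reported record quarterly profits.'))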

