Financial Sentiment Analysis using Bert in Python
In this tutorial, we will learn how BERT helps in classifying whether text related to the finance domain is positive, negative, or neutral.
What is Bert?
BERT (Bidirectional Encoder Representations from Transformers) is a language representation model published by researchers at Google AI Language. It created a stir in the machine learning field by delivering state-of-the-art results on a range of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.
The Transformer encoder architecture is used by the BERT family of models to process each token of input text in the context of all tokens before and after it, thus the name: Bidirectional Encoder Representations from Transformers.
Typically, BERT models are trained on a huge corpus of text before being fine-tuned for specific tasks.
What is Bert Tokenizer?
BERT employs a tokenizer known as WordPiece. It works by either keeping words whole (one word becomes one token) or splitting them into word pieces (one word can be broken down into several tokens).
This can be handy, for example, when a word appears in many different forms or is rare, since shared word pieces let the model handle words it has never seen whole.
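As a quick example (a small sketch, not from the original post, using the Hugging Face tokenizer that is also loaded later in this tutorial; the exact pieces depend on the checkpoint’s vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# A common word is usually kept whole as a single token
print(tokenizer.tokenize("technology"))
# A rare word is typically split into word pieces, with continuation pieces prefixed by '##'
print(tokenizer.tokenize("GeoSolutions"))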
BERT embeddings are trained with two training tasks:
- Classification Task: to determine which category the input sentence should fall into
- Next Sentence Prediction Task: to determine if the second sentence naturally follows the first sentence.
Bert Tokens:
- [PAD]: This token is used to pad sentences to a uniform length.
- [CLS]: CLS means classification. It is added at the beginning of the sentence because the model needs a single input representation that reflects the meaning of the complete sentence, so this extra tag is added and used for classification.
- [SEP]: This token is added at the end of a sentence. It helps the model understand where one input ends and the next begins within the same sequence (see the short example below).
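To see these special tokens in action, here is a small sketch (not part of the original post) that encodes a short sentence padded to a fixed length and converts the resulting ids back to tokens:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

enc = tokenizer.encode_plus("Shares rose sharply",
                            max_length=8,
                            padding='max_length',
                            truncation=True,
                            add_special_tokens=True)

# [CLS] at the start, [SEP] after the sentence, [PAD] filling the rest of the sequence
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))
# The attention mask is 1 for real tokens and 0 for the padding tokens
print(enc['attention_mask'])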
Sentiment Analysis
This notebook trains a financial sentiment analysis model to classify sentences as positive, negative, or neutral, based on the text of the sentence.
Dataset Link: Financial Sentiment
Content of Dataset
- Sentence: Contains sentences related to the financial domain.
- Sentiment: Contains sentiments like positive, negative, or neutral.
To solve this problem we will:
- Import all the required libraries to solve NLP problems.
- Load the Dataset.
- Load a BERT model from Tensorflow Hub.
- Construct a model by combining BERT and a classifier.
- Train your model, including BERT as part of the process.
- Save your model and use it to categorize sentences.
Setup
First of all, we will import all the libraries required to solve the financial sentiment problem.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
from nltk.util import ngrams
from collections import Counter
from tensorflow.keras.preprocessing.text import Tokenizer, one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dropout, Dense, Embedding
import re, string, unicodedata
from gensim.models import Word2Vec, KeyedVectors
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
import texthero as hero
from texthero import preprocessing as pr
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')
Load Dataset
Here we are loading our dataset using pandas, and we also remove the ‘Unnamed: 0’ column.
df = pd.read_csv('/content/drive/MyDrive/Data/Financial Sentiment Analysis/data.csv')
print(df.shape)
del df['Unnamed: 0']
df.head()
Output
Check for NaN values in Dataset
Let’s check for NaN values in the dataset.
df.isnull().sum()/len(df)*100
Output
Sentence     0.0
Sentiment    0.0
dtype: float64
Check for Unique Values
Let’s take a look at the unique values in the Sentiment column.
df['Sentiment'].unique()
Output
array(['positive', 'negative', 'neutral'], dtype=object)
Visualization of Sentiments
Let’s visualize our Sentiments using the seaborn library.
y = df['Sentiment']
sns.countplot(y)
Output
Check Random Sentence
Let’s look at one of the sentences in our dataset.
df['Sentence'][0]
Output
The GeoSolutions technology will leverage Benefon 's GPS solutions by providing Location Based Search Technology , a Communities Platform , location relevant multimedia content and a new and powerful commercial model .
Text Preprocessing
Let’s do some text preprocessing on the sentences with the help of the texthero library.
Here, we will remove the following from the sentences:
- Punctuation.
- Brackets like (), {}, [].
- Diacritics.
- HTML tags like <a>, <p>.
- Whitespace like ‘ ‘.
- Stopwords like for, an, etc.
- URLs starting with http.
- Digits like 1 or 6, including digits inside words like 131ams.
Code
cust_pipes = [pr.fillna, pr.lowercase, pr.remove_punctuation, pr.remove_diacritics, pr.remove_urls,
              pr.remove_brackets, pr.remove_html_tags, pr.remove_stopwords, pr.remove_whitespace]
df['Sentence'] = df['Sentence'].pipe(hero.clean, cust_pipes)
df['Sentence'] = hero.preprocessing.remove_digits(df['Sentence'], only_blocks=False)
df.head()
Output
Remove Words
Here we are removing all words that are two characters long or shorter.
Code
import re

def rem_words(text):
    # Drop words of one or two characters
    text = re.sub(r'\b\w{1,2}\b', '', text)
    return text

df['Sentence'] = df['Sentence'].apply(rem_words)
Tokenization
Here we will split each sentence string into a list of tokens such as ‘solutions’ and ‘technology’.
df['Sentence'] = hero.tokenize(df['Sentence'])
df.head()
Output
Lemmatization
Lemmatization refers to using a vocabulary and the morphological analysis of words to remove inflectional endings and return the base or dictionary form of a word, known as the lemma.
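As a quick illustration (a small sketch, not from the original post) of what the WordNet lemmatizer does to individual words:

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# Plural nouns are reduced to their singular (dictionary) form
print(lemmatizer.lemmatize('companies'))            # company
print(lemmatizer.lemmatize('solutions'))            # solution
# Verbs need the part-of-speech hint to be reduced
print(lemmatizer.lemmatize('providing', pos='v'))   # provide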
Here we are applying lemmatization to our text.
Code
wc = WordNetLemmatizer()
nltk.download('omw-1.4')

def lemma(text):
    words = [wc.lemmatize(i) for i in text]
    return words

df['Sentence'] = df['Sentence'].apply(lemma)
df.head()
Join Text
Finally, we join the tokens we split earlier back into a single string with ‘ ‘.join.
def combine(text):
    com = ' '.join(text)
    return com

df['Sentence'] = df['Sentence'].apply(combine)
df.head()
Output
Change Labels
Here we are mapping our sentiment labels, i.e. neutral, positive, and negative, to the integers 0, 1, and 2.
Code
label = {'neutral': 0, 'positive': 1, 'negative': 2}
df['Sentiment'] = df['Sentiment'].apply(lambda x: label[x])
df.head()
Output
As you can see, our sentiment labels have been changed from text to integers.
Bert Tokenizer
Model Inputs
- input_ids: Input ids are often the only required input for the model. They are token indices: numerical representations of the tokens that make up the sequences the model uses as input.
- attention_mask: The attention mask is an optional argument used when batching sequences together. It tells the model which tokens should be attended to and which (e.g. padding) should not.
Let’s create arrays for the input_ids and the attention_mask.
Code
max_len = 256
x_input_id = np.zeros((len(df), max_len))
x_attn_mask = np.zeros((len(df), max_len))
x_input_id.shape, x_attn_mask.shape
Output
((5842, 256), (5842, 256))
Here 5842 is the number of samples and 256 is the number of columns (the maximum sequence length).
Let’s create a function that tokenizes all the sentences using encode_plus with the following parameters:
- max_length (256): the maximum length of a sentence.
- add_special_tokens (True): special tokens such as [CLS], [SEP], and [PAD] are added.
- truncation (True): truncate each sentence to the maximum length.
- padding ('max_length'): pad each sentence up to the maximum length.
- return_tensors ('tf'): return TensorFlow tensors.
The function returns input_ids and attention_mask, and we apply it to the arrays we created earlier.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

def data(df, ids, masks, tokenizer):
    for i, text in enumerate(df['Sentence']):
        token = tokenizer.encode_plus(text, max_length=max_len, truncation=True, padding='max_length',
                                      add_special_tokens=True, return_tensors='tf')
        ids[i, :] = token.input_ids
        masks[i, :] = token.attention_mask
    return ids, masks

x_input_id, x_attn_mask = data(df, x_input_id, x_attn_mask, tokenizer)
One Hot Encoding
Let’s apply one-hot encoding to the target labels.
One-hot encoding represents each label as a vector of length three that has a 1 in the position of its class and 0 everywhere else.
Code
labels = np.zeros((len(df), 3))
labels[np.arange(len(df)), df['Sentiment'].values] = 1
labels
Output
array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       ...,
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])
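The original post never shows how the train and val datasets passed to model.fit further below are built. Here is a minimal sketch of one possible way, assuming an 80/20 train/validation split and a batch size of 16 (with 5,842 samples this gives roughly the 292 training batches seen in the training log):

# Pack the encoded inputs and one-hot labels into a tf.data pipeline
def map_func(input_ids, masks, labels):
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

dataset = tf.data.Dataset.from_tensor_slices((x_input_id, x_attn_mask, labels))
dataset = dataset.map(map_func)
dataset = dataset.shuffle(10000).batch(16, drop_remainder=True)

# Hypothetical 80/20 split into training and validation batches
train_size = int((len(df) // 16) * 0.8)
train = dataset.take(train_size)
val = dataset.skip(train_size)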
Defining Model
We have created a very simple fine-tuned model which includes:
- the BERT model
- the input_ids input
- the attention_mask input
- the BERT layer
- a Dense layer
- the final output layer
Then we check the summary of the model using model.summary().
Code
from transformers import TFBertModel

bert = TFBertModel.from_pretrained('bert-base-cased')

input_ids = tf.keras.layers.Input(shape=(256,), name='input_ids', dtype='int32')
attention_masks = tf.keras.layers.Input(shape=(256,), name='attention_mask', dtype='int32')

# Use the pooled [CLS] output (index 1) of the BERT layer
bert_layer = bert.bert(input_ids, attention_mask=attention_masks)[1]
denses = tf.keras.layers.Dense(512, activation='relu', name='dense_layer')(bert_layer)
output_layer = tf.keras.layers.Dense(3, activation='softmax', name='output_layer')(denses)

model = tf.keras.Model(inputs=[input_ids, attention_masks], outputs=output_layer)
model.summary()
Output
__________________________________________________________________________________________________
 Layer (type)                   Output Shape                    Param #     Connected to
==================================================================================================
 input_ids (InputLayer)         [(None, 256)]                   0           []
 attention_mask (InputLayer)    [(None, 256)]                   0           []
 bert (TFBertMainLayer)         TFBaseModelOutputWithPooling    108310272   ['input_ids[0][0]',
                                AndCrossAttentions(                          'attention_mask[0][0]']
                                last_hidden_state=(None, 256, 768),
                                pooler_output=(None, 768),
                                past_key_values=None,
                                hidden_states=None,
                                attentions=None,
                                cross_attentions=None)
 dense_layer (Dense)            (None, 512)                     393728      ['bert[0][1]']
 output_layer (Dense)           (None, 3)                       1539        ['dense_layer[0][0]']
==================================================================================================
Total params: 108,705,539
Trainable params: 108,705,539
Non-trainable params: 0
__________________________________________________________________________________________________
Let’s take a look at the model’s structure.
tf.keras.utils.plot_model(model)

Train Bert Model
Here we define the parameters we will use:
- Optimizer: Adam.
- Loss: CategoricalCrossentropy.
- Metrics: Accuracy (CategoricalAccuracy).
- Epochs: 5 (number of passes over the training data).
The train and val arguments passed to model.fit are the tf.data datasets built in the sketch after the one-hot encoding step.
Code
optim = tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-6)
loss = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')

model.compile(optimizer=optim, loss=loss, metrics=[acc])
hist = model.fit(train, validation_data=val, epochs=5)
Output
Epoch 1/5
292/292 [==============================] - 268s 882ms/step - loss: 0.8007 - accuracy: 0.6383 - val_loss: 0.5455 - val_accuracy: 0.7654
Epoch 2/5
292/292 [==============================] - 257s 878ms/step - loss: 0.5333 - accuracy: 0.7568 - val_loss: 0.3816 - val_accuracy: 0.8339
Epoch 3/5
292/292 [==============================] - 256s 877ms/step - loss: 0.4292 - accuracy: 0.8044 - val_loss: 0.3283 - val_accuracy: 0.8433
Epoch 4/5
292/292 [==============================] - 256s 877ms/step - loss: 0.3263 - accuracy: 0.8440 - val_loss: 0.2264 - val_accuracy: 0.8776
Epoch 5/5
292/292 [==============================] - 256s 878ms/step - loss: 0.2821 - accuracy: 0.8658 - val_loss: 0.2385 - val_accuracy: 0.8682
The output shows that by the 5th epoch the training accuracy is around 0.87 and the validation accuracy is also around 0.87, while the training loss continues to decrease.
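The classification_report imported in the setup is never used in the original post. As an optional check (a sketch that assumes the val dataset from the split shown earlier), we can look at per-class precision and recall on the validation batches:

# Collect predictions and true labels from the validation batches
y_true, y_pred = [], []
for batch_inputs, batch_labels in val:
    probs = model.predict(batch_inputs, verbose=0)
    y_pred.extend(np.argmax(probs, axis=1))
    y_true.extend(np.argmax(batch_labels.numpy(), axis=1))

# Class order must match the label mapping used earlier (0=neutral, 1=positive, 2=negative)
print(classification_report(y_true, y_pred, target_names=['neutral', 'positive', 'negative']))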
Visualize Train & Test Results
Let’s visualize train and validation accuracy & train and validation loss for every epoch.
history_dict = hist.history
print(history_dict.keys())

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)
fig = plt.figure(figsize=(10, 6))
fig.tight_layout()

plt.subplot(2, 1, 1)
# r is for "solid red line"
plt.plot(epochs, loss, 'r', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
# plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(epochs, acc, 'r', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
Output
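The plan at the start of this tutorial ends with saving the model and using it to categorize sentences, but the original post stops at the plots. Here is a minimal sketch of that last step; the file name and the predict_sentiment helper are hypothetical choices, not from the original post:

# Save the trained weights (hypothetical file name); the whole model could also be
# exported with model.save('sentiment_model') in TensorFlow's SavedModel format.
model.save_weights('financial_sentiment_bert_weights.h5')

# Classify a new sentence using the same tokenizer settings used for training
def predict_sentiment(text):
    token = tokenizer.encode_plus(text, max_length=max_len, truncation=True, padding='max_length',
                                  add_special_tokens=True, return_tensors='tf')
    probs = model.predict({'input_ids': token.input_ids,
                           'attention_mask': token.attention_mask}, verbose=0)
    classes = ['neutral', 'positive', 'negative']   # must match the label mapping used earlier
    return classes[int(np.argmax(probs))]

print(predict_sentiment("The company reported a sharp drop in quarterly profit."))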