Sarcasm Detection with GloVe Embeddings using TensorFlow
In this tutorial, we will learn how to use GloVe embeddings with an LSTM model to determine whether a news headline is sarcastic or not, using the TensorFlow deep learning library in Python.
You will see the step-by-step code below.
What is GloVe?
GloVe is an unsupervised learning technique that generates word vector representations. Training is based on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations highlight interesting linear substructures of the word vector space.
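As a quick, optional illustration of what these vectors look like, here is a minimal sketch (not part of the original tutorial) that loads the same glove.twitter.27B.200d.txt file used later in this article and compares two pairs of words with cosine similarity; the specific words are only assumed to be present in the vocabulary.

import numpy as np

# Load the pre-trained GloVe vectors (path assumed; the same file is used later in this tutorial)
glove = {}
with open("glove.twitter.27B.200d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

def cosine(a, b):
    # Cosine similarity between two word vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words tend to score higher than unrelated ones
print(cosine(glove["king"], glove["queen"]))
print(cosine(glove["king"], glove["banana"]))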
What is LSTM?
LSTMs are specifically designed to avoid the long-term dependency problem. Remembering information over long stretches of a sequence is practically their default behaviour rather than something they struggle to learn.
Like standard RNNs, LSTMs have a chain-like structure, but the repeating module is different: instead of a single neural network layer, it contains four layers (the gates), which interact in a specific way.
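To make the "four interacting layers" concrete, here is a minimal NumPy sketch of a single LSTM step. This is not the Keras implementation used later in the tutorial; the weight matrices are random placeholders and the bias terms are omitted for brevity.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

hidden, inputs = 100, 40
rng = np.random.default_rng(0)

# Placeholder weights for the forget, input, candidate and output layers (learned in practice)
Wf, Wi, Wc, Wo = (rng.normal(size=(hidden, hidden + inputs)) for _ in range(4))

x_t = rng.normal(size=inputs)      # current input
h_prev = np.zeros(hidden)          # previous hidden state
c_prev = np.zeros(hidden)          # previous cell state
z = np.concatenate([h_prev, x_t])

f = sigmoid(Wf @ z)                # forget gate: what to discard from the cell state
i = sigmoid(Wi @ z)                # input gate: what new information to store
c_hat = np.tanh(Wc @ z)            # candidate cell state
o = sigmoid(Wo @ z)                # output gate: what to expose as the hidden state

c_t = f * c_prev + i * c_hat       # updated cell state
h_t = o * np.tanh(c_t)             # new hidden state
print(h_t.shape)                   # (100,)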
Download the Dataset: Sarcasm Detection
Content of the dataset (an illustrative record is shown after this list):
- is_sarcastic: 1 if the record is sarcastic, otherwise 0.
- headline: the headline of the news article.
- article_link: link to the original news article.
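The file is stored in JSON Lines format (one JSON object per line), which is why we pass lines=True to pd.read_json below. A record looks roughly like this; the field values here are invented purely for illustration:

{"is_sarcastic": 1, "headline": "example sarcastic headline goes here", "article_link": "https://example.com/article"}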
Import all the required Libraries
First of all, let’s import all the required Python libraries in order to solve a classification problem:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from gensim.models import Word2Vec, KeyedVectors

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

from bs4 import BeautifulSoup
from collections import Counter
from wordcloud import WordCloud, STOPWORDS

import re, string, unicodedata
from string import punctuation

import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dropout, Dense
from tensorflow.keras.preprocessing.text import Tokenizer, one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
Reading File
Let’s read the JSON file into a data frame with the help of the Pandas library:
df = pd.read_json(r'D:/Sentiment Analysis/sarcasam.json', lines=True)
df.head()
Output
Check Distribution of sarcastic values
Let’s check the frequency distribution of the target values (1 and 0).
sns.countplot(df['is_sarcastic'])

y = df['is_sarcastic']

# Proportion of each class (these could be used to derive class weights)
class_1 = (len(y) - len(y[y == 0])) / len(y)
class_2 = (len(y) - len(y[y == 1])) / len(y)
print(class_1, class_2)
Output
0.476396799329117 0.523603200670883
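These two proportions are only printed here and are not used again in this tutorial. If you wanted to compensate for the mild class imbalance, one option (an assumption on my part, not part of the original code) would be to pass them as per-class weights when fitting the model later:

# Hypothetical use of the computed proportions as class weights:
# each class is weighted by the other class's frequency, so the rarer class counts slightly more.
class_weight = {0: class_1, 1: class_2}
# model.fit(x_train, y_train, class_weight=class_weight, ...)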
Text Preprocessing
In text pre-processing we make the text lowercase and remove text in square brackets, links, HTML tags, punctuation, newlines, and words containing numbers.
We apply this pre-processing to the headline column of the DataFrame. See the Python code below:
stop = stopwords.words('english')
wc = WordNetLemmatizer()

def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links,
    remove HTML tags, remove punctuation, and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# Applying the cleaning function to the headline column
df['headline'] = df['headline'].apply(lambda x: clean_text(x))
df.head(3)
Output
Let’s check the first row after text preprocessing:
df['headline'][0]
Output
'thirtysomething scientists unveil doomsday clock of hair loss'
Removing Stopwords from Text
Text may contain stop words such as ‘the’, ‘is’, ‘are’, ‘to’, and ‘and’.
NLTK (Natural Language Toolkit) ships stop word lists for many languages; the English list contains well over a hundred common words.
Here we remove the stopwords from the text:
def remove_stop(text):
    final_text = []
    for i in text.split():
        if i.strip().lower() not in stop:
            final_text.append(i.strip())
    return " ".join(final_text)

# Applying the stopword-removal function to the headline column
df['headline'] = df['headline'].apply(lambda x: remove_stop(x))
df.head(3)
Output
Let’s check the same row after stopword removal:
df['headline'][0]
Output
'thirtysomething scientists unveil doomsday clock hair loss'
Most Frequent Words
Let’s analyze the most frequent words in our text with the help of Counter.
def counter_wrd(text):
    cnt = Counter()
    for i in text.values:
        for word in i.split():
            cnt[word] += 1
    return cnt

text = df.headline

# Frequency of each word
counter = counter_wrd(text)
counter.most_common(10)
Output
[('new', 1677), ('trump', 1389), ('man', 1373), ('report', 604), ('us', 601), ('one', 555), ('woman', 505), ('area', 494), ('says', 485), ('day', 475)]
Visualize Sarcastic Words in headlines
Let’s visualize words from sarcastic headlines (is_sarcastic = 1) with the help of a word cloud.
plt.figure(figsize=(10, 20))
wc = WordCloud(max_words=2000, width=1600, height=800, stopwords=STOPWORDS).generate(" ".join(df[df.is_sarcastic == 1].headline))
plt.imshow(wc, interpolation='bilinear')
Output
Visualize Non-Sarcastic Words
Let’s visualize words from non-sarcastic headlines (is_sarcastic = 0) with the help of a word cloud.
plt.figure(figsize=(10, 20))
wc = WordCloud(max_words=2000, width=1600, height=800, stopwords=STOPWORDS).generate(" ".join(df[df.is_sarcastic == 0].headline))
plt.imshow(wc, interpolation='bilinear')
Output
Separating Words
Let’s split each headline into a list of words using split():
words = []
for i in df.headline:
    words.append(i.split())
words[:2]
Output
[['thirtysomething', 'scientists', 'unveil', 'doomsday', 'clock', 'hair', 'loss'], ['dem', 'rep', 'totally', 'nails', 'congress', 'falling', 'short', 'gender', 'racial', 'equality']]
Check Vocab Size
Let’s find out the vocabulary size by training a quick Word2Vec model on the tokenised headlines.
model = Word2Vec(sentences=words, min_count=1, vector_size=100, window=5)

# Vocabulary size
len(model.wv)
Output
28630
Tokenizer – Convert text into numbers
Let’s convert the words into a numerical representation with Keras’s Tokenizer.
tokenizer = Tokenizer(num_words=35000)
tokenizer.fit_on_texts(words)
tokenized_ind = tokenizer.texts_to_sequences(words)
tokenized_ind
Output
Apply Padding
Let’s apply padding so that every sequence has the same length (20 tokens):
sentlen = 20
emdoc = pad_sequences(tokenized_ind, padding='post', maxlen=sentlen)
print(emdoc)
Output
[[15080   238  2943 ...     0     0     0]
 [ 7083  1590   609 ...     0     0     0]
 [  766 10857 15081 ...     0     0     0]
 ...
 [  475  3026   218 ...     0     0     0]
 [ 1688  1153  3090 ...     0     0     0]
 [  123  3059   173 ...     0     0     0]]
Creation of the LSTM Model
Let’s create an LSTM model by defining an Embedding layer, an LSTM layer, and an output layer.
After that, check the summary of the model.
# Adding 1 because of the reserved 0 index.
# The Embedding layer creates one extra vector for "unknown" or padded (0) words; this vector is filled with zeros.
vocab_size = len(tokenizer.word_index) + 1   # 28630 + 1
embedding_vector_features = 40

model = Sequential()
model.add(Embedding(vocab_size, embedding_vector_features, input_length=sentlen))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))    # binary classification output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Output
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 20, 40) 1145240 lstm (LSTM) (None, 100) 56400 dense (Dense) (None, 1) 101 ================================================================= Total params: 1,201,741 Trainable params: 1,201,741 Non-trainable params: 0 _________________________________________________________________ None
Split the Data into Train and Test
Let’s separate X and y (the independent and dependent variables) and split the data into train and test sets.
X = np.array(emdoc)
y = np.array(df['is_sarcastic'])

x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=42)
Train LSTM Model
For training, define the batch size and the number of epochs you want.
# Finally, training
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=64)
Output
Epoch 1/5
300/300 [==============================] - 8s 21ms/step - loss: 0.5173 - accuracy: 0.7259 - val_loss: 0.4302 - val_accuracy: 0.8022
Epoch 2/5
300/300 [==============================] - 6s 20ms/step - loss: 0.2565 - accuracy: 0.8987 - val_loss: 0.4518 - val_accuracy: 0.8031
Epoch 3/5
300/300 [==============================] - 6s 20ms/step - loss: 0.1401 - accuracy: 0.9494 - val_loss: 0.5587 - val_accuracy: 0.7896
Epoch 4/5
300/300 [==============================] - 6s 20ms/step - loss: 0.0846 - accuracy: 0.9708 - val_loss: 0.7461 - val_accuracy: 0.7823
Epoch 5/5
300/300 [==============================] - 6s 20ms/step - loss: 0.0597 - accuracy: 0.9801 - val_loss: 0.7406 - val_accuracy: 0.7808
<keras.callbacks.History at 0x2457e3358e0>
GloVe Embedding
Let’s extract the GloVe word embeddings from the downloaded .txt file.
GloVe file link: 200d
embed_dict = {}
with open(r"D:\Sentiment Analysis\glove.twitter.27B.200d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        val = line.split()
        word = val[0]
        vector = np.asarray(val[1:], dtype='float32')
        embed_dict[word] = vector
# The with-block closes the file automatically
Output
{'<user>': array([ 3.1553e-01,  5.3765e-01,  1.0177e-01,  3.2553e-02,  3.7980e-03,
         1.5364e-02, -2.0344e-01,  3.3294e-01, -2.0886e-01,  1.0061e-01,
         3.0976e-01,  5.0015e-01,  3.2018e-01,  1.3537e-01,  8.7039e-03,
         1.9110e-01,  2.4668e-01, -6.0752e-02, -4.3623e-01,  1.9302e-02,
         ...
(output truncated: the dictionary maps each GloVe token to its 200-dimensional vector)
Create the Embedding Matrix using GloVe Embeddings
Let’s create an embedding matrix that maps each word in our tokenizer’s vocabulary to its GloVe vector.
word_index = tokenizer.word_index
num_words = len(word_index) + 1
EMBEDDING_DIM = 200

embed_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i < num_words:
        emb_vec = embed_dict.get(word)
        # Words not found in the embedding dictionary stay as all zeros
        if emb_vec is not None:
            embed_matrix[i] = emb_vec
Train the LSTM with GloVe Embeddings
Let’s plug the GloVe embedding matrix into the LSTM model.
We use the embedding matrix to initialise the (non-trainable) Embedding layer.
from tensorflow.keras.initializers import Constant

batch_size = 128
epochs = 2
embed_size = 200

# Defining the neural network
model = Sequential()

# Non-trainable embedding layer initialised with the GloVe matrix
model.add(Embedding(vocab_size,
                    output_dim=embed_size,
                    embeddings_initializer=Constant(embed_matrix),
                    input_length=sentlen,
                    trainable=False))

# LSTM
model.add(LSTM(100, dropout=0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Output
Model: "sequential_6" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_6 (Embedding) (None, 20, 200) 5726200 lstm_6 (LSTM) (None, 100) 120400 dense_6 (Dense) (None, 1) 101 ================================================================= Total params: 5,846,701 Trainable params: 120,501 Non-trainable params: 5,726,200 _________________________________________________________________ None
Fit the Model
Let’s fit the LSTM model we just created.
Define the batch size and the number of epochs you want.
hist = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=batch_size)
Output
Epoch 1/5
150/150 [==============================] - 5s 27ms/step - loss: 0.5782 - accuracy: 0.6982 - val_loss: 0.5027 - val_accuracy: 0.7564
Epoch 2/5
150/150 [==============================] - 4s 25ms/step - loss: 0.5143 - accuracy: 0.7444 - val_loss: 0.4497 - val_accuracy: 0.7898
Epoch 3/5
150/150 [==============================] - 4s 26ms/step - loss: 0.4859 - accuracy: 0.7675 - val_loss: 0.4528 - val_accuracy: 0.7795
Epoch 4/5
150/150 [==============================] - 4s 26ms/step - loss: 0.4666 - accuracy: 0.7771 - val_loss: 0.4214 - val_accuracy: 0.7999
Epoch 5/5
150/150 [==============================] - 4s 27ms/step - loss: 0.4471 - accuracy: 0.7873 - val_loss: 0.4104 - val_accuracy: 0.8128
Compare Train & Test Accuracy
Let’s compare the model’s accuracy on the training and testing data.
print("Accuracy of the model on Training Data is - " , model.evaluate(x_train,y_train)[1]*100) print("Accuracy of the model on Testing Data is - " , model.evaluate(x_val,y_val)[1]*100)
Output
600/600 [==============================] - 2s 4ms/step - loss: 0.3649 - accuracy: 0.8369
Accuracy of the model on Training Data is -  83.68624448776245
296/296 [==============================] - 1s 4ms/step - loss: 0.4104 - accuracy: 0.8128
Accuracy of the model on Testing Data is -  81.2811017036438
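As an optional final step (not part of the original tutorial), here is a minimal sketch of how the trained model and fitted tokenizer could be used to score a new headline. The example headline is made up, and the same clean_text, remove_stop, and pad_sequences steps are reused so the input matches the training format.

# Hypothetical inference sketch, reusing the objects defined above
new_headline = "scientists discover coffee now officially a food group"   # made-up example
processed = remove_stop(clean_text(new_headline))
seq = tokenizer.texts_to_sequences([processed.split()])
padded = pad_sequences(seq, padding='post', maxlen=sentlen)
prob = model.predict(padded)[0][0]
print("Sarcastic" if prob >= 0.5 else "Not sarcastic", prob)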