Sarcasm Detection with GloVe Embeddings using TensorFlow

In this tutorial, we will learn how to use GloVe embeddings in an LSTM model to determine whether a particular news headline is sarcastic, using the TensorFlow deep learning framework in Python.

You will see the step-by-step code…

What is GloVe?

GloVe is an unsupervised learning algorithm that generates vector representations of words. Training is based on aggregated global word–word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
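To get a feel for these linear substructures, here is a minimal, optional sketch using gensim’s downloader (this assumes internet access and that the pre-trained "glove-twitter-25" model from the gensim-data catalogue is available; it is separate from the pipeline built below):

import gensim.downloader as api

# Download a small pre-trained GloVe model (fetches roughly 100 MB the first time)
glove_small = api.load("glove-twitter-25")

# The classic analogy: vector('king') - vector('man') + vector('woman')
# With well-trained vectors, words like 'queen' should rank near the top.
print(glove_small.most_similar(positive=["king", "woman"], negative=["man"], topn=3))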

What is LSTM?

LSTMs are specifically designed to avoid the long-term dependency problem. They don’t have to work hard to remember information for long periods; it’s practically their default behaviour.

Like standard RNNs, LSTMs have a chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four (the forget, input, and output gates plus the candidate cell state), each interacting in a very particular way.
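As a quick sanity check on that "four layers" claim, here is a minimal sketch (the unit count and dummy input shape are arbitrary choices for illustration): an LSTM layer carries four sets of weights, one per gate, which shows up directly in its parameter count.

import tensorflow as tf

layer = tf.keras.layers.LSTM(units=8)
x = tf.random.normal((2, 5, 3))   # batch of 2 sequences, 5 time steps, 3 features per step
y = layer(x)
print(y.shape)                    # (2, 8): one 8-dimensional output per sequence
# 4 gates, each with input weights, recurrent weights and a bias:
# params = 4 * (3*8 + 8*8 + 8) = 384
print(layer.count_params())       # 384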

Download the Dataset: Sarcasm Detection

Content of Dataset:

  • is_sarcastic: 1 if the record is sarcastic, otherwise 0.
  • headline: the headline of the news article.
  • article_link: link to the original news article.

Import all the Required Libraries

First of all, let’s import all the required Python libraries in order to solve a classification problem:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import Word2Vec,KeyedVectors
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.util import ngrams
from bs4 import BeautifulSoup
from collections import Counter
from wordcloud import WordCloud,STOPWORDS
from nltk.stem import WordNetLemmatizer
import re,string,unicodedata
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import train_test_split
from string import punctuation
import tensorflow as tf
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM,Bidirectional,Dropout
from tensorflow.keras.layers import Dense

Reading File

Let’s read the JSON file into a DataFrame with the help of the pandas library:

df=pd.read_json(r'D:/Sentiment Analysis/sarcasam.json',lines=True)
df.head()

Output

Check the Distribution of Sarcastic Values

Let’s check the frequency distribution of the labels 1 and 0.

sns.countplot(x=df['is_sarcastic'])
y = df['is_sarcastic']
# Class proportions, which can also serve as class weights
class_1 = (len(y) - len(y[y == 0])) / len(y)  # fraction of sarcastic (1) headlines
class_2 = (len(y) - len(y[y == 1])) / len(y)  # fraction of non-sarcastic (0) headlines
print(class_1, class_2)

Output

0.476396799329117 0.523603200670883
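The two classes are only mildly imbalanced, and the models below are trained without class weights. If you did want to compensate for the imbalance, one common convention is to weight each class by the other class’s frequency so that the minority class gets the larger weight; Keras accepts such weights through the class_weight argument of fit. A hedged sketch:

class_weights = {0: class_1, 1: class_2}  # class 1 (the slight minority) gets the larger weight
# Later, this could be passed to training:  model.fit(..., class_weight=class_weights)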

Text Preprocessing

In text pre-processing we make the text lowercase and remove text in square brackets, links, HTML tags, punctuation, newlines, and words containing numbers.

We apply this pre-processing to the headline column of the DataFrame. See the Python code below:

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links,
    remove HTML tags, remove punctuation, remove newlines and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text


# Applying the cleaning function to the headline column
df['headline'] = df['headline'].apply(lambda x: clean_text(x))
df.head(3)

Output

Let’s check the first row after text preprocessing:

df['headline'][0]

Output

'thirtysomething scientists unveil doomsday clock of hair loss'

 

Removing Stopwords from Text

Text may contain stop words such as ‘the’, ‘is’, ‘are’, ‘to’, ‘and’, and so on.

By default, NLTK (Natural Language Toolkit) ships with a predefined list of English stop words (well over a hundred of them).
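You can inspect the list we loaded earlier to see exactly how many words it contains and what they look like:

print(len(stop))   # size of NLTK's English stop word list
print(stop[:10])   # the first few entries, e.g. 'i', 'me', 'my', ...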

Here we remove the stop words from the text:

def remove_stop(text):
    final_text = []
    for i in text.split():
        if i.strip().lower() not in stop:
            final_text.append(i.strip())
    return " ".join(final_text)
# Applying the stop-word removal to the headline column
df['headline'] = df['headline'].apply(lambda x: remove_stop(x))
df.head(3)

Output

Let’s check the same row after stop-word removal:

df['headline'][0]

Output

'thirtysomething scientists unveil doomsday clock hair loss'

Most Frequent Words

Let’s analyze the most frequent words in our text with the help of Counter.

def counter_wrd(text):
    cnt=Counter()
    for i in text.values:
        for word in i.split():
            cnt[word]+=1
    return cnt

text=df.headline
#Frequency
counter=counter_wrd(text)
counter.most_common(10)

Output

[('new', 1677),
 ('trump', 1389),
 ('man', 1373),
 ('report', 604),
 ('us', 601),
 ('one', 555),
 ('woman', 505),
 ('area', 494),
 ('says', 485),
 ('day', 475)]

Visualize Sarcastic Words in Headlines

Let’s visualize words from the sarcastic headlines, i.e. label 1, with the help of a word cloud.

plt.figure(figsize=(10,20))
wc=WordCloud(max_words=2000,width=1600,height=800,stopwords=STOPWORDS).generate(" ".join(df[df.is_sarcastic==1].headline))
plt.imshow(wc , interpolation = 'bilinear')

Output

Visualize Non-Sarcastic Words

Let’s visualize words from the non-sarcastic headlines, i.e. label 0, in the same way.

plt.figure(figsize=(10,20))
wc=WordCloud(max_words=2000,width=1600,height=800,stopwords=STOPWORDS).generate(" ".join(df[df.is_sarcastic==0].headline))
plt.imshow(wc,interpolation='bilinear')

Output

Separating Words

Let’s split each headline into a list of words using split():

words=[]
for i in df.headline:
    words.append(i.split())
words[:2]

Output

[['thirtysomething',
  'scientists',
  'unveil',
  'doomsday',
  'clock',
  'hair',
  'loss'],
 ['dem',
  'rep',
  'totally',
  'nails',
  'congress',
  'falling',
  'short',
  'gender',
  'racial',
  'equality']]

Check Vocab Size

Let’s find the vocabulary size by training a Word2Vec model on the tokenized headlines:

model=Word2Vec(sentences=words,min_count=1,vector_size=100,window=5)
#vocab size
len(model.wv)

Output

28630
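Since we now have a trained Word2Vec model anyway, we can optionally peek at what it learned. The query word 'trump' is simply one of the frequent words found above, and the neighbours you get will vary from run to run:

# Nearest neighbours of a frequent word in the freshly trained Word2Vec space
print(model.wv.most_similar('trump', topn=5))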

Tokenizer – Convert text into numbers

Let’s convert the words into their numerical (integer index) representation using the Keras Tokenizer:

tokenizer = Tokenizer(num_words=35000)
tokenizer.fit_on_texts(words)
tokenized_ind = tokenizer.texts_to_sequences(words)
tokenized_ind

Output

Apply Padding

Let’s pad the sequences so that every sentence ends up with the same length:

sentlen=20
emdoc=pad_sequences(tokenized_ind,padding='post',maxlen=sentlen)
print(emdoc)

Output

[[15080   238  2943 ...     0     0     0]
 [ 7083  1590   609 ...     0     0     0]
 [  766 10857 15081 ...     0     0     0]
 ...
 [  475  3026   218 ...     0     0     0]
 [ 1688  1153  3090 ...     0     0     0]
 [  123  3059   173 ...     0     0     0]]

Creating the LSTM Model

Let’s create an LSTM model by defining an embedding layer, an LSTM layer, and an output layer.

After that, check the summary of the LSTM model.

# Adding 1 because the Keras Tokenizer reserves index 0 for padding,
# so the Embedding layer needs one extra row for that index.
vocab_size = len(tokenizer.word_index) + 1  # 28630 + 1
embedding_vector_features = 40
model = Sequential()
model.add(Embedding(vocab_size, embedding_vector_features, input_length=sentlen))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))  # single sigmoid unit for binary classification
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Output

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 20, 40)            1145240   
                                                                 
 lstm (LSTM)                 (None, 100)               56400     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
=================================================================
Total params: 1,201,741
Trainable params: 1,201,741
Non-trainable params: 0
_________________________________________________________________
None
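The parameter counts line up with the layer sizes: the Embedding layer holds (28,630 + 1) × 40 = 1,145,240 weights, the LSTM layer holds 4 × (40 × 100 + 100 × 100 + 100) = 56,400 weights (four gates, each with input weights, recurrent weights, and a bias), and the Dense layer holds 100 × 1 + 1 = 101.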

Split the Data into Train and Test

Let’s split the data into train and test sets, and separate X and y, i.e. the independent and dependent variables.

X=np.array(emdoc)
y=df['is_sarcastic']
y=np.array(y)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=42)

Train LSTM Model

For training, define the batch size and the number of epochs you want.

### Finally Training
model.fit(x_train,y_train,validation_data=(x_val,y_val),epochs=5,batch_size=64)

Output

Epoch 1/5
300/300 [==============================] - 8s 21ms/step - loss: 0.5173 - accuracy: 0.7259 - val_loss: 0.4302 - val_accuracy: 0.8022
Epoch 2/5
300/300 [==============================] - 6s 20ms/step - loss: 0.2565 - accuracy: 0.8987 - val_loss: 0.4518 - val_accuracy: 0.8031
Epoch 3/5
300/300 [==============================] - 6s 20ms/step - loss: 0.1401 - accuracy: 0.9494 - val_loss: 0.5587 - val_accuracy: 0.7896
Epoch 4/5
300/300 [==============================] - 6s 20ms/step - loss: 0.0846 - accuracy: 0.9708 - val_loss: 0.7461 - val_accuracy: 0.7823
Epoch 5/5
300/300 [==============================] - 6s 20ms/step - loss: 0.0597 - accuracy: 0.9801 - val_loss: 0.7406 - val_accuracy: 0.7808
<keras.callbacks.History at 0x2457e3358e0>

GloVe Embeddings

Let’s load the GloVe word embeddings from the .txt file.

GloVe file link: 200d

# Map each word to its 200-dimensional GloVe vector
embed_dict = {}
with open(r"D:\Sentiment Analysis\glove.twitter.27B.200d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        val = line.split()
        word = val[0]
        vector = np.asarray(val[1:], dtype='float32')
        embed_dict[word] = vector

Output

{'<user>': array([ 3.1553e-01,  5.3765e-01,  1.0177e-01,  3.2553e-02,  3.7980e-03,
         1.5364e-02, -2.0344e-01,  3.3294e-01, -2.0886e-01,  1.0061e-01,
         3.0976e-01,  5.0015e-01,  3.2018e-01,  1.3537e-01,  8.7039e-03,
         1.9110e-01,  2.4668e-01, -6.0752e-02, -4.3623e-01,  1.9302e-02,
         5.9972e-01,  1.3444e-01,  1.2801e-02, -5.4052e-01,  2.7387e-01,
        -1.1820e+00, -2.7677e-01,  1.1279e-01,  4.6596e-01, -9.0685e-02,
         2.4253e-01,  1.5654e-01, -2.3618e-01,  5.7694e-01,  1.7563e-01,
        -1.9690e-02,  1.8295e-02,  3.7569e-01, -4.1984e-01,  2.2613e-01,
        -2.0438e-01, -7.6249e-02,  4.0356e-01,  6.1582e-01, -1.0064e-01,
         2.3318e-01,  2.2808e-01,  3.4576e-01, -1.4627e-01, -1.9880e-01,
         3.3232e-02, -8.4885e-01, -2.5684e-01,  2.6369e-01,  2.9562e-01,

 

Create the Embedding Matrix Using GloVe Embeddings

Let’s create an embedding matrix whose rows hold the GloVe vectors, and hence the semantic information, for the words in our vocabulary.

word_index=tokenizer.word_index
num_words=len(word_index)+1
EMBEDDING_DIM=200
embed_matrix=np.zeros((len(word_index)+1,EMBEDDING_DIM))
for word,i in word_index.items():
    if i < num_words:
        emb_vec=embed_dict.get(word)
        if emb_vec is not None:  # words not found in the GloVe dictionary remain all-zero rows
            embed_matrix[i]=emb_vec

Train the LSTM with GloVe Embeddings

Let’s plug the embedding matrix, i.e. the GloVe vectors, into the LSTM.

We use the embedding matrix to initialize the Embedding layer of the LSTM model and keep it frozen (non-trainable).

from tensorflow.keras.initializers import Constant
batch_size = 128
epochs = 2
embed_size = 200
#Defining Neural Network
model = Sequential()
#Non-trainable embedding layer
model.add(Embedding(vocab_size, output_dim=embed_size, embeddings_initializer=Constant(embed_matrix), input_length=sentlen, trainable=False))
#LSTM 
model.add(LSTM(100, dropout = 0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Output

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_6 (Embedding)     (None, 20, 200)           5726200   
                                                                 
 lstm_6 (LSTM)               (None, 100)               120400    
                                                                 
 dense_6 (Dense)             (None, 1)                 101       
                                                                 
=================================================================
Total params: 5,846,701
Trainable params: 120,501
Non-trainable params: 5,726,200
_________________________________________________________________
None

Fit the Model

Let’s fit the LSTM model which we created above.

Define your batch size and the number of epochs you want:

hist=model.fit(x_train,y_train,validation_data=(x_val,y_val),epochs=5,batch_size=batch_size)

Output

Epoch 1/5
150/150 [==============================] - 5s 27ms/step - loss: 0.5782 - accuracy: 0.6982 - val_loss: 0.5027 - val_accuracy: 0.7564
Epoch 2/5
150/150 [==============================] - 4s 25ms/step - loss: 0.5143 - accuracy: 0.7444 - val_loss: 0.4497 - val_accuracy: 0.7898
Epoch 3/5
150/150 [==============================] - 4s 26ms/step - loss: 0.4859 - accuracy: 0.7675 - val_loss: 0.4528 - val_accuracy: 0.7795
Epoch 4/5
150/150 [==============================] - 4s 26ms/step - loss: 0.4666 - accuracy: 0.7771 - val_loss: 0.4214 - val_accuracy: 0.7999
Epoch 5/5
150/150 [==============================] - 4s 27ms/step - loss: 0.4471 - accuracy: 0.7873 - val_loss: 0.4104 - val_accuracy: 0.8128

Compare Train & Test Accuracy

Let’s compare the model’s accuracy on the training and testing data.

print("Accuracy of the model on Training Data is - " , model.evaluate(x_train,y_train)[1]*100)
print("Accuracy of the model on Testing Data is - " , model.evaluate(x_val,y_val)[1]*100)

Output

600/600 [==============================] - 2s 4ms/step - loss: 0.3649 - accuracy: 0.8369
Accuracy of the model on Training Data is -  83.68624448776245
296/296 [==============================] - 1s 4ms/step - loss: 0.4104 - accuracy: 0.8128
Accuracy of the model on Testing Data is -  81.2811017036438
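As a final sanity check, here is a minimal sketch of how the trained model could score a new headline. The example headline is made up; the sketch reuses the clean_text, remove_stop, tokenizer, sentlen and model objects defined above:

sample = "scientists confirm coffee is just bean water"   # made-up example headline
sample = remove_stop(clean_text(sample))
seq = pad_sequences(tokenizer.texts_to_sequences([sample]), padding='post', maxlen=sentlen)
prob = model.predict(seq)[0][0]
print("sarcastic" if prob > 0.5 else "not sarcastic", prob)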

 
