Fake News Classifier using LSTM In Python

In this article, we are going to discuss building a fake news classifier. For this task, we will use an LSTM (Long Short-Term Memory) network, because these networks are great at dealing with long-term dependencies. The classifier will output 0 (Fake News) or 1 (Real News). In a world full of information, some of it quite misleading, it is essential to know what is authentic. So let's get straight to it.

You can get the dataset from https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

ABOUT LSTM

LSTM (Long Short-Term Memory) networks are mainly used when we need to deal with sequential data. A simple vanilla neural network has no memory state, but when we are dealing with sequences it becomes important to incorporate information from previous timestamps as well. An LSTM has a memory state and can handle long sequences of information, which makes it very useful here.

DATA PREPARATION

IMPORTING LIBRARIES

!pip install plotly
!pip install --upgrade nbformat
!pip install nltk
!pip install spacy # spaCy is an open-source software library for advanced natural language processing
!pip install WordCloud
!pip install gensim # Gensim is an open-source library for unsupervised topic modeling and natural language processing
import nltk
nltk.download('punkt')

import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import nltk
import re
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
# import keras
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding, Input, LSTM, Conv1D, MaxPool1D, Bidirectional
from tensorflow.keras.models import Model

That is a lot of libraries, but we will use each of them and learn what they do. Libraries like spaCy, Gensim, and NLTK are essential in NLP (Natural Language Processing).

DATA ANALYSIS

Let’s start by reading the data from the files.

df_true = pd.read_csv("True.csv")
df_fake = pd.read_csv("Fake.csv")

We add a target column, isfake, to label real and fake news distinctly.

df_true['isfake'] = 1
df_true.head()

OUTPUT

df_fake['isfake'] = 0
df_fake.head()

OUTPUT

 

To make our task easier, we merge the two data frames containing real and fake news.

df = pd.concat([df_true, df_fake]).reset_index(drop = True)
df

OUTPUT
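It is also worth checking that the two classes are reasonably balanced before training. This quick check is an addition on top of the article's code, not part of the original walkthrough.

print(df.isfake.value_counts())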

We will make some changes to the data by dropping the date column and combining the title and text columns.

df.drop(columns = ['date'], inplace = True)
df['original'] = df['title'] + ' ' + df['text']
df['original'][0]

OUTPUT

'As U.S. budget fight looms, Republicans flip their fiscal script WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support education, scientific research, infrastructure, public health and environmental protection. “The (Trump) administration has already been willing to say: ‘We’re going to increase non-defense discretionary spending ... by about 7 percent,’” Meadows, chairman of the small but influential House Freedom Caucus, said on the program.

DATA CLEANING

Now comes the crucial task of data cleaning, where we make the data suitable for training.

nltk.download("stopwords")

Stop words are words that the model does not consider useful for training, for example "a", "an", and "the".

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

We can extend the stop word list by adding extra words of our own. After this, we will remove the stop words from our data.

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # keep only tokens longer than 3 characters that are not stop words
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3 and token not in stop_words:
            result.append(token)

    return result
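Before applying the function to the whole data frame, here is a quick sanity check on a made-up sentence (the sentence and the expected output are purely illustrative):

preprocess("The markets rallied sharply after the announcement")
# expected to return something like ['markets', 'rallied', 'sharply', 'announcement']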

Now let’s apply the above function to our data.

df['clean'] = df['original'].apply(preprocess)

We will see that the data is cleaned and stop words are removed.

print(df['clean'][0])

OUTPUT

['budget', 'fight', 'looms', 'republicans', 'flip', 'fiscal', 'script', 'washington', 'reuters', 'head', 'conservative', 'republican', 'faction', 'congress', 'voted', 'month', 'huge', 'expansion', 'national', 'debt', 'cuts', 'called', 'fiscal', 'conservative', 'sunday', 'urged', 'budget', 'restraint', 'keeping', 'sharp', 'pivot', 'republicans', 'representative', 'mark', 'meadows', 'speaking', 'face', 'nation', 'drew', 'hard', 'line', 'federal', 'spending', 'lawmakers', 'bracing', 'battle', 'january', 'return', 'holidays', 'wednesday', 'lawmakers', 'begin', 'trying', 'pass', 'federal', 'budget', 'fight', 'likely', 'linked', 'issues', 'immigration', 'policy', 'november', 'congressional', 'election', 'campaigns', 'approach', 'republicans', 'seek', 'control', 'congress', 'president', 'donald', 'trump', 'republicans', 'want', 'budget', 'increase', 'military', 'spending', 'democrats', 'want', 'proportional', 'increases', 'defense', 'discretionary', 'spending', 'programs', 'support', 'education', 'scientific', 'research', 'infrastructure', 'public', 'health', 'environmental', 'protection', 'trump', 'administration', 'willing', 'going', 'increase', 'defense', 'discretionary', 'spending', 'percent', 'meadows', 'chairman', 'small', 'influential', 'house', ]

Let’s now collect all the words we are dealing with into a single list and count the unique words; this vocabulary size will be needed later when we tokenize.

list_of_words = []
for i in df.clean:
    for j in i:
        list_of_words.append(j)

total_words = len(list(set(list_of_words)))  # number of unique words (vocabulary size)
list_of_words

OUTPUT

['budget',
 'fight',
 'looms',
 'republicans',
 'flip',
 'fiscal',
 'script',
 'washington',
 'reuters',
 'head',
 'conservative',
 'republican',
 'faction',
 'congress',....]
Next, we join the cleaned tokens back into a single string and add it to our data frame as a new column, clean_joined.

df['clean_joined'] = df['clean'].apply(lambda x: " ".join(x))
df

OUTPUT

VISUALISE THE DATA

We will use a powerful visualisation technique called a word cloud, which is widely used in NLP. A word cloud visualises the most frequently used words: the larger a word appears, the more often it occurs in the corpus.

plt.figure(figsize = (15,15)) 
wc = WordCloud(max_words = 1800 , width = 1500 , height = 700 , stopwords = stop_words).generate(" ".join(df[df.isfake == 1].clean_joined))
plt.imshow(wc, interpolation = 'bilinear')

OUTPUT
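In the same way, we can generate a word cloud for the fake news subset (isfake == 0) and compare it with the one above for real news (isfake == 1); the code simply mirrors the previous cell.

plt.figure(figsize = (15,15))
wc = WordCloud(max_words = 1800 , width = 1500 , height = 700 , stopwords = stop_words).generate(" ".join(df[df.isfake == 0].clean_joined))
plt.imshow(wc, interpolation = 'bilinear')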

TOKENIZATION AND PADDING

Now we will convert the sentences into tokens. We will use a tokenizer to map each word to an integer so the model can train on it; the classifier cannot learn from raw text, which is why tokenizing is important.

Before tokenizing, we are going to split the data into training and test set using scikit-learn.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.clean_joined, df.isfake, test_size = 0.2)

Now, let's tokenize using the Keras Tokenizer.

tokenizer = Tokenizer(num_words = total_words)  # Keras tokenizer limited to our vocabulary size
tokenizer.fit_on_texts(x_train)
train_sequences = tokenizer.texts_to_sequences(x_train)
test_sequences = tokenizer.texts_to_sequences(x_test)
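To get a feel for what the tokenizer has learned, we can peek at its word-to-index mapping and at one encoded document; this inspection step is an extra, not part of the original code.

print("Vocabulary size learned by the tokenizer:", len(tokenizer.word_index))
print("First 10 ids of the first training document:", train_sequences[0][:10])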

Let’s now explore padding. Our training and test sentences vary in length, so we fix a maximum length: longer sequences are truncated to that length and shorter ones are stretched by padding them with zeros.

padded_train = pad_sequences(train_sequences, maxlen = 40, padding = 'post', truncating = 'post')
padded_test = pad_sequences(test_sequences, maxlen = 40, padding = 'post', truncating = 'post')
for i, doc in enumerate(padded_train[:2]):
    print("The padded encoding for document", i+1, " is : ", doc)

OUTPUT

The padded encoding for document 1  is :  [ 2365   558   332  2311  2716    42   972    27 11043   950   513   120
   258    57    30   558   332  6402   972     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
The padded encoding for document 2  is :  [   49   183     5  3537   231    75  4423    20   877   694   751  4037
    16    12    20   278   316   694   751   838    38   204 23342   844
  1023   694 49568  9060     4  4423   348  4631   352    98    45    20
   521   694   751   355]

As we can see, zeros are appended to the shorter sequences to stretch them to the fixed length of 40.
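A toy example (with made-up numbers) makes the behaviour of pad_sequences clear:

toy = [[5, 2, 9], [7, 1, 4, 3, 8, 6, 2]]
print(pad_sequences(toy, maxlen = 5, padding = 'post', truncating = 'post'))
# [[5 2 9 0 0]   <- the shorter sequence is padded with zeros at the end
#  [7 1 4 3 8]]  <- the longer sequence is truncated to 5 tokens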

BUILDING AND TRAINING THE MODEL

model = Sequential()
model.add(Embedding(total_words, output_dim = 128))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(128, activation = 'relu'))
model.add(Dense(1,activation= 'sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

We are now going to explore the layers:

Embedding Layer: It transforms each word (integer token) into a dense vector. It is also useful for capturing relationships between similar words.

Bidirectional LSTM: Observing the model's performance, we will find that it is better with a bidirectional LSTM. This is because the sequence is processed in both directions, so every position can use context from both earlier and later time steps.

Sigmoid Activation: This is used because ours is a binary classification problem, so a single output neuron with a sigmoid activation produces a probability between 0 and 1.
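As a sanity check on the summary shown below, we can reproduce the parameter counts by hand. The vocabulary size of 108,704 used here is inferred from the reported embedding parameters (13,914,112 / 128); it is not printed anywhere in the article itself.

vocab_size, embed_dim, lstm_units = 108704, 128, 128
print(vocab_size * embed_dim)                                    # 13,914,112 embedding weights
print(2 * 4 * ((embed_dim + lstm_units + 1) * lstm_units))       # 263,168 bidirectional LSTM weights
print((2 * lstm_units) * 128 + 128)                              # 32,896 first dense layer
print(128 * 1 + 1)                                               # 129 output layer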

OUTPUT

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, None, 128)         13914112  
_________________________________________________________________
bidirectional_3 (Bidirection (None, 256)               263168    
_________________________________________________________________
dense_6 (Dense)              (None, 128)               32896     
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 129       
=================================================================
Total params: 14,210,305
Trainable params: 14,210,305
Non-trainable params: 0
_________________________________________________________________

Before training, we convert the labels to a NumPy array.

y_train = np.asarray(y_train)

In the next step, we are going to train the model.

model.fit(padded_train, y_train, batch_size = 64, validation_split = 0.1, epochs = 2)

Notice that we use only 2 epochs; that is how quickly the bidirectional LSTM converges on this dataset. The validation_split argument holds out 10% of the training data so we can check whether the model is overfitting.

OUTPUT

Train on 32326 samples, validate on 3592 samples
Epoch 1/2
32326/32326 [==============================] - 321s 10ms/sample - loss: 0.0421 - acc: 0.9815 - val_loss: 0.0073 - val_acc: 0.9992
Epoch 2/2
32326/32326 [==============================] - 316s 10ms/sample - loss: 0.0016 - acc: 0.9997 - val_loss: 0.0096 - val_acc: 0.9981
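As an optional extra (not in the original walkthrough), storing the return value of model.fit in a variable, say history, lets us plot the learning curves and confirm visually that the model is not overfitting; the snippet below would replace the fit call above.

history = model.fit(padded_train, y_train, batch_size = 64, validation_split = 0.1, epochs = 2)

# plot training vs. validation accuracy per epoch
plt.plot(history.history['acc'], label = 'training accuracy')
plt.plot(history.history['val_acc'], label = 'validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()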

We get a validation accuracy of about 99.8%, which means the model should make very reliable predictions.

pred = model.predict(padded_test)
prediction = []
for i in range(len(pred)):
    if pred[i].item() > 0.5:
        prediction.append(1)
    else:
        prediction.append(0)

We then build a list of predictions: if the model's output for an article is greater than 0.5 it is classified as real news (1), otherwise it is classified as fake (0).
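Finally, to put a single number on how well these thresholded predictions match the held-out labels, we can compute the test accuracy with scikit-learn; this evaluation step is a small addition on top of the article's code.

from sklearn.metrics import accuracy_score

print("Test accuracy:", accuracy_score(list(y_test), prediction))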

Thanks for reading! You can also check out

Building A Movie Review Classifier Using Tensorflow And Keras

 
