Fake News Classifier using LSTM In Python
In this article, we are going to build a fake news classifier using an LSTM (Long Short-Term Memory) network. We will use an LSTM because these networks are good at handling long-term dependencies in text. The classifier will output 0 for fake news and 1 for real news. In a world full of information, some of which can be quite misleading, it is essential to know what is authentic. So let's get straight to it.
You can get the dataset from https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
ABOUT LSTM
LSTM (Long Short-Term Memory) networks are mainly used when we need to deal with sequential data. A simple vanilla neural network has no memory state, but when we work with sequences it becomes important to incorporate information from previous timesteps as well. An LSTM maintains a memory state, which makes it well suited to long sequences of text.
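As a minimal illustration (not part of the classifier itself), the snippet below feeds a random sequence through a Keras LSTM layer; the layer reads the sequence timestep by timestep and returns a single vector that summarises the whole sequence.

import numpy as np
import tensorflow as tf

# One sample with 10 timesteps and 8 features per timestep
sequence = np.random.rand(1, 10, 8).astype("float32")
lstm = tf.keras.layers.LSTM(16)      # 16 memory units
summary_vector = lstm(sequence)
print(summary_vector.shape)          # (1, 16) - one summary vector per sample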
DATA PREPARATION
IMPORTING LIBRARIES
!pip install plotly
!pip install --upgrade nbformat
!pip install nltk
!pip install spacy # spaCy is an open-source software library for advanced natural language processing
!pip install WordCloud
!pip install gensim # Gensim is an open-source library for unsupervised topic modeling and natural language processing

import nltk
nltk.download('punkt')

import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from wordcloud import WordCloud, STOPWORDS
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

# Keras (via TensorFlow)
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Flatten, Embedding, Input, LSTM, Conv1D, MaxPool1D, Bidirectional
That is a lot of libraries, but we will use each of them and learn what it does along the way. Libraries like spaCy, gensim, and NLTK are very important in NLP (Natural Language Processing).
DATA ANALYSIS
Let’s start by reading the data from the files.
df_true = pd.read_csv("True.csv")
df_fake = pd.read_csv("Fake.csv")
Adding a target column to classify fake news and real news distinctly.
df_true['isfake'] = 1
df_true.head()
OUTPUT
df_fake['isfake'] = 0
df_fake.head()
OUTPUT
To make our task easier, we merge the two data frames containing fake and real news.
df = pd.concat([df_true, df_fake]).reset_index(drop = True)
df
OUTPUT
We will make some changes to the data by dropping the date column and combining the title and text columns.
df.drop(columns = ['date'], inplace = True)
df['original'] = df['title'] + ' ' + df['text']
df['original'][0]
OUTPUT
'As U.S. budget fight looms, Republicans flip their fiscal script WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support education, scientific research, infrastructure, public health and environmental protection. “The (Trump) administration has already been willing to say: ‘We’re going to increase non-defense discretionary spending ... by about 7 percent,’” Meadows, chairman of the small but influential House Freedom Caucus, said on the program.
DATA CLEANING
Now comes the crucial task of data cleaning, where we prepare the text so that it is suitable for training.
nltk.download("stopwords")
Stop words are words that carry little useful information for training the model. Some examples are "a", "an", and "the".
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
We can extend the stop word list with additional words of our own, as done above. After this, we will remove the stop words from our data.
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3 and token not in stop_words:
            result.append(token)
    return result
Now let’s apply the above function to our data.
df['clean'] = df['original'].apply(preprocess)
We will see that the data is cleaned and stop words are removed.
print(df['clean'][0])
OUTPUT
['budget', 'fight', 'looms', 'republicans', 'flip', 'fiscal', 'script', 'washington', 'reuters', 'head', 'conservative', 'republican', 'faction', 'congress', 'voted', 'month', 'huge', 'expansion', 'national', 'debt', 'cuts', 'called', 'fiscal', 'conservative', 'sunday', 'urged', 'budget', 'restraint', 'keeping', 'sharp', 'pivot', 'republicans', 'representative', 'mark', 'meadows', 'speaking', 'face', 'nation', 'drew', 'hard', 'line', 'federal', 'spending', 'lawmakers', 'bracing', 'battle', 'january', 'return', 'holidays', 'wednesday', 'lawmakers', 'begin', 'trying', 'pass', 'federal', 'budget', 'fight', 'likely', 'linked', 'issues', 'immigration', 'policy', 'november', 'congressional', 'election', 'campaigns', 'approach', 'republicans', 'seek', 'control', 'congress', 'president', 'donald', 'trump', 'republicans', 'want', 'budget', 'increase', 'military', 'spending', 'democrats', 'want', 'proportional', 'increases', 'defense', 'discretionary', 'spending', 'programs', 'support', 'education', 'scientific', 'research', 'infrastructure', 'public', 'health', 'environmental', 'protection', 'trump', 'administration', 'willing', 'going', 'increase', 'defense', 'discretionary', 'spending', 'percent', 'meadows', 'chairman', 'small', 'influential', 'house', ]
Let’s now see the number of words we are dealing with.
list_of_words = []
for i in df.clean:
    for j in i:
        list_of_words.append(j)
list_of_words
OUTPUT
['budget', 'fight', 'looms', 'republicans', 'flip', 'fiscal', 'script', 'washington', 'reuters', 'head', 'conservative', 'republican', 'faction', 'congress',....]
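The tokenizer and the embedding layer later in the article need the total vocabulary size (referred to as total_words below). Assuming we follow the usual approach of counting the unique cleaned tokens, it can be computed like this:

# Total number of unique words in the cleaned corpus (vocabulary size)
total_words = len(set(list_of_words))
print(total_words)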
Next, we will join the cleaned tokens back into strings and add them to our data frame.
df['clean_joined'] = df['clean'].apply(lambda x: " ".join(x))
df
OUTPUT
VISUALISE THE DATA
We will use a powerful visualisation technique called a word cloud, which is commonly used in NLP to visualise the most frequent words in a corpus. The larger a word appears, the more often it is used.
plt.figure(figsize = (15,15))
wc = WordCloud(max_words = 1800, width = 1500, height = 700, stopwords = stop_words).generate(" ".join(df[df.isfake == 1].clean_joined))
plt.imshow(wc, interpolation = 'bilinear')
OUTPUT
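The same snippet can be repeated for the other class by filtering on isfake == 0, which makes it easy to compare the vocabulary of fake and real articles side by side:

# Word cloud for the articles labelled as fake (isfake == 0)
plt.figure(figsize = (15,15))
wc = WordCloud(max_words = 1800, width = 1500, height = 700, stopwords = stop_words).generate(" ".join(df[df.isfake == 0].clean_joined))
plt.imshow(wc, interpolation = 'bilinear')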
TOKENIZATION AND PADDING
Now we will convert the sentences into tokens. Using a tokenizer, each word is mapped to an integer so that it can be fed to the model. Since the classifier cannot train on raw text, this step is essential.
Before tokenizing, we are going to split the data into training and test set using scikit-learn.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df.clean_joined, df.isfake, test_size = 0.2)
Also, Let’s Tokenise now.
tokenizer = Tokenizer(num_words = total_words)
tokenizer.fit_on_texts(x_train)
train_sequences = tokenizer.texts_to_sequences(x_train)
test_sequences = tokenizer.texts_to_sequences(x_test)
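As a quick sanity check (not part of the original pipeline), we can encode a short sample sentence to see what the fitted tokenizer produces; the exact integers depend on the word index learned from the training data:

sample = ["republicans want budget increase"]
print(tokenizer.texts_to_sequences(sample))
# e.g. [[12, 87, 5, 43]] - each known word is replaced by its integer index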
Let's now look at padding. Our training and testing sequences vary in length, so we specify a maximum length: longer sequences are truncated to that length, and shorter ones are padded with zeros until they reach it.
padded_train = pad_sequences(train_sequences, maxlen = 40, padding = 'post', truncating = 'post')
padded_test = pad_sequences(test_sequences, maxlen = 40, padding = 'post', truncating = 'post')
for i, doc in enumerate(padded_train[:2]):
    print("The padded encoding for document", i+1, " is : ", doc)
OUTPUT
The padded encoding for document 1 is : [ 2365 558 332 2311 2716 42 972 27 11043 950 513 120 258 57 30 558 332 6402 972 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
The padded encoding for document 2 is : [ 49 183 5 3537 231 75 4423 20 877 694 751 4037 16 12 20 278 316 694 751 838 38 204 23342 844 1023 694 49568 9060 4 4423 348 4631 352 98 45 20 521 694 751 355]
As we can see, zeros are appended to pad the shorter sequences up to the maximum length of 40.
BUILDING AND TRAINING THE MODEL
model = Sequential()
model.add(Embedding(total_words, output_dim = 128))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(128, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['acc'])
model.summary()
We are now going to explore the layers:
Embedding Layer: It transforms the integer word indices into dense vectors and helps capture relationships between similar words (a short demonstration follows after the model summary below).
Bidirectional LSTM: After observing the model's performance, we will find that it does better with a bidirectional LSTM, because the layer processes the sequence in both directions and can use context from both earlier and later timesteps.
Sigmoid Activation: This is used because we have a binary classification problem, so a single output neuron with a sigmoid activation produces a probability between 0 and 1.
OUTPUT
Model: "sequential_3" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_3 (Embedding) (None, None, 128) 13914112 _________________________________________________________________ bidirectional_3 (Bidirection (None, 256) 263168 _________________________________________________________________ dense_6 (Dense) (None, 128) 32896 _________________________________________________________________ dense_7 (Dense) (None, 1) 129 ================================================================= Total params: 14,210,305 Trainable params: 14,210,305 Non-trainable params: 0 _____________________________
y_train = np.asarray(y_train)
In the next step, we are going to train the model.
model.fit(padded_train, y_train, batch_size = 64, validation_split = 0.1, epochs = 2)
Notice that we are training for only 2 epochs, which shows how powerful the bidirectional LSTM is. The validation_split argument holds out part of the training data so we can check whether we are overfitting.
OUTPUT
Train on 32326 samples, validate on 3592 samples
Epoch 1/2
32326/32326 [==============================] - 321s 10ms/sample - loss: 0.0421 - acc: 0.9815 - val_loss: 0.0073 - val_acc: 0.9992
Epoch 2/2
32326/32326 [==============================] - 316s 10ms/sample - loss: 0.0016 - acc: 0.9997 - val_loss: 0.0096 - val_acc: 0.9981
We get a validation accuracy of about 99.8%, which suggests the model should make very good predictions.
pred = model.predict(padded_test)
prediction = []
for i in range(len(pred)):
    if pred[i].item() > 0.5:
        prediction.append(1)
    else:
        prediction.append(0)
We then build a list of predictions: if the model outputs a value greater than 0.5 the article is classified as real news (1), otherwise it is classified as fake (0).
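To quantify how well these predictions match the held-out labels, we can compare them against y_test; a minimal sketch using scikit-learn's accuracy_score (not shown above) would look like this:

from sklearn.metrics import accuracy_score

# Compare the thresholded predictions with the true test labels
accuracy = accuracy_score(list(y_test), prediction)
print("Model accuracy on the test set:", accuracy)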
Thanks for reading! You can also check out
Building A Movie Review Classifier Using Tensorflow And Keras