News headline category prediction

Welcome Reader,

Introduction

We live in a world of data, and categorizing things becomes more and more important as the data keeps growing. So, in this article, we will predict the category a news headline belongs to, for example, sports news, tech news, etc.

Getting Data

We will use a custom dataset that I prepared by web-scraping news headlines along with their categories. In this article, we will not go into details like how the web-scraping is done. You can download the dataset from here and then place it in your working directory.

Figure-1: Glimpse of Dataset

So, in the given dataset there are 3 types of news headlines: (1) world, (2) tech, and (3) sports, with 7000 headlines of each type. That makes 21,000 headlines in total. Quite a sizeable dataset, isn't it?

Import Required Libraries

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   # needed for the distribution plots below

As the name itself suggests, Tokenizer is used to split the text data into smaller segments/tokens and also makes the text ready for the deep neural network. pad_sequences will be used to ensure all sequences returned by the Tokenizer are of the same length. This matters because simple neural networks like the one we are going to use accept only inputs of the same length. The remaining imports are used frequently in neural network modeling and need no explanation.
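To make this concrete, here is a tiny standalone sketch showing what Tokenizer and pad_sequences produce (the sentences are made up purely for illustration):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Made-up sentences, just to illustrate tokenization and padding
sentences = ['India wins the world cup', 'New phone launched today']

toy_tokenizer = Tokenizer(num_words=50, oov_token='<oov>')
toy_tokenizer.fit_on_texts(sentences)               # builds the word -> integer vocabulary
sequences = toy_tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=6, padding='post')

print(toy_tokenizer.word_index)   # e.g. {'<oov>': 1, 'india': 2, ...}
print(padded)                     # every row now has exactly 6 integers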

Train-Test Split

news_dataset = pd.read_csv('newsfile.csv')
dataset = news_dataset.iloc[:5000, :]   # we will train on only these 5000 headlines

training_data, testing_data = train_test_split(dataset, test_size=0.3)  # 70% training data

# plotting distribution of each news_category in training & testing data
plt.plot(training_data['news_category'].value_counts())
plt.plot(testing_data['news_category'].value_counts())
plt.show()


Note that we are using only 5000 headlines from our dataset. As we can see in the plot, the data is divided almost uniformly across categories in both the training and testing datasets.
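If you prefer exact numbers to a plot, a quick value_counts() check on the same columns confirms the balance:

# Proportion of each category in the training and testing splits
print(training_data['news_category'].value_counts(normalize=True))
print(testing_data['news_category'].value_counts(normalize=True))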

Tokenization of text data

def tokenization_(training_headings, testing_headings, max_length=20, vocab_size=5000):
    # Tokenization and padding
    tokenizer = Tokenizer(num_words=vocab_size, oov_token='<oov>')
    tokenizer.fit_on_texts(training_headings)
    word_index = tokenizer.word_index

    training_sequences = tokenizer.texts_to_sequences(training_headings)
    training_padded = pad_sequences(training_sequences, padding='post', maxlen=max_length, truncating='post')

    testing_sequences = tokenizer.texts_to_sequences(testing_headings)
    testing_padded = pad_sequences(testing_sequences, padding='post', maxlen=max_length, truncating='post')

    return tokenizer, training_padded, testing_padded

This function is quite straightforward: it takes the training and testing headlines and returns the padded sequences associated with them. You may refer to this article on ValueML to understand more about the tokenization process.

tokenizer,X_train,X_test = tokenization_(training_data['news_headline'],
                                         testing_data['news_headline'])

labels = {'sports': [0, 1, 0], 'tech': [1, 0, 0], 'world': [0, 0, 1]}
Y_train = np.array([labels[y] for y in training_data['news_category']])
Y_test = np.array([labels[y] for y in testing_data['news_category']])

Here, we are separating the news headlines and their one-hot encoded labels into different arrays, as they will be used separately by the model for training and testing purposes.
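As a quick sanity check, you can verify that each category maps to a one-hot vector of length 3 and that the label arrays line up with the number of headlines (shapes shown assuming the 70/30 split of 5000 rows above):

print(labels['tech'])     # [1, 0, 0]
print(Y_train.shape)      # (3500, 3) -> 3500 training headlines, 3 classes
print(Y_test.shape)       # (1500, 3)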


Build neural network

def build_model(n, vocab_size, embedding_size):   # n = length of each input vector

    # Sequential model
    model = tf.keras.models.Sequential()

    # Implementing word-embeddings
    model.add(tf.keras.layers.Embedding(vocab_size,
              embedding_size, input_length=n))

    # Average a headline's word vectors into one fixed-size vector
    model.add(tf.keras.layers.GlobalAveragePooling1D())

    # Output layer
    model.add(tf.keras.layers.Dense(3, activation='softmax'))

    # Compile the model
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])

    # Model summary
    print(model.summary())

    return model

As you can see, I am using very few layers here. But wait, few layers don't mean our model is not good enough. Before training the model, let's first see what this neural network is doing. The very first layer is an embedding layer. It will try to create a vector for each word present in our vocabulary such that words with similar meanings have vectors pointing in nearly the same direction. For example, if we have two words, say 'criminal' and 'gangster', then these words are likely to end up pointing in nearly the same direction. The GlobalAveragePooling1D layer then averages a headline's word vectors into a single fixed-size vector. The last layer is the output layer, which contains three neurons; each one of them gives the probability that a news headline is of type "world", "tech" or "sports".
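Once the model has been trained (next step), you can actually peek at these learned word vectors. Here is a small sketch; 'criminal' and 'gangster' are just the example words from above, so substitute any two words that really occur in your vocabulary:

# The embedding matrix has shape (vocab_size, embedding_size)
embeddings = model.layers[0].get_weights()[0]

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1 means
    # the vectors point in nearly the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

w1, w2 = 'criminal', 'gangster'   # example words; their word_index entries
                                  # must be below vocab_size
v1 = embeddings[tokenizer.word_index[w1]]
v2 = embeddings[tokenizer.word_index[w2]]
print(cosine_similarity(v1, v2))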


# Build the model: n matches max_length=20 and vocab_size matches the
# tokenizer above; embedding_size=16 is just a typical small choice
model = build_model(n=20, vocab_size=5000, embedding_size=16)

epochs = 25
history = model.fit(X_train, Y_train,
                    validation_data=(X_test, Y_test),
                    epochs=epochs)

Thus, we have understood what each layer is doing. So, now let's see how the model performs after training.

Model accuracy on training data : 0.97
Model accuracy on validation data : 0.94
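Since fit() returns a History object, we can also plot how these numbers evolved per epoch (a quick sketch using matplotlib, which we imported as plt earlier):

# Training vs. validation accuracy per epoch
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()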

Okay, we got a pretty good result. In addition, we will now check how our model performs on some unseen data: the remaining part of the dataset that we haven't used for training.

remaining_data = pd.concat([news_dataset, dataset]).drop_duplicates(keep=False, inplace=False)

headlines_new = remaining_data['news_headline']
labels_new = np.array([labels[x] for x in remaining_data['news_category']])

headlines_new = tokenizer.texts_to_sequences(headlines_new)
# pad the same way as during training (post-padding)
headlines_new = pad_sequences(headlines_new, maxlen=20, padding='post', truncating='post')

model.evaluate(headlines_new, labels_new)

Here, we first concatenate our full dataset (news_dataset) with the subset we used earlier (dataset). This duplicates all the rows that appear in our training subset. The drop_duplicates() method with keep=False then removes every copy of those duplicated rows, and as a result we get exactly the rows the model has not seen yet.
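This concat-then-drop_duplicates pattern is a handy pandas trick for an "anti-join". A tiny standalone sketch with made-up rows shows why it works:

import pandas as pd

full = pd.DataFrame({'headline': ['a', 'b', 'c', 'd']})
used = full.iloc[:2]   # pretend rows 'a' and 'b' were used for training

# Used rows appear twice after concat; keep=False drops every copy
# of a duplicated row, leaving only the unseen rows 'c' and 'd'
remaining = pd.concat([full, used]).drop_duplicates(keep=False)
print(remaining)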

Finally, we evaluate our model on this data and get the following result:

441/441 [==============================] - 0s 918us/step - loss: 0.2327 - accuracy: 0.9377
[0.23267975449562073, 0.9376904964447021]

This is pretty good. Before we stop, let's also see the model in action on a single brand-new headline, as sketched below.
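Here, the headline text is made up, and the category order simply mirrors the positions of the ones in our labels dictionary:

# Classify one brand-new headline (the headline text is made up)
new_headline = ['India beats Australia in the final test match']

seq = tokenizer.texts_to_sequences(new_headline)
seq = pad_sequences(seq, maxlen=20, padding='post', truncating='post')

probs = model.predict(seq)[0]             # softmax probabilities for the 3 classes
categories = ['tech', 'sports', 'world']  # index order matches the one-hot vectors
print(categories[np.argmax(probs)])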

Conclusion

In conclusion, we learned how to build a multi-class classification neural network to predict the category of news headlines. I highly recommend that the reader go one step further and extend this model to more categories by collecting more data. Till then, bye-bye.

Thank you for reading the article.  Stay tuned for more such articles on ValueML.

For any query/doubt regarding this article, write in the comments below!
