News headline category prediction
Welcome Reader,
Introduction
We live in a world of data, and categorizing things becomes more important as we collect more and more of it. So, in this article, we will predict the category a news headline belongs to, for example, sports news, tech news, etc.
Getting Data
We will use a custom dataset that I prepared by web-scraping news headlines along with their categories. In this article, we will not go into details of how the web scraping was done. You can download the dataset from here and then place it in your working directory.
Figure-1: Glimpse of Dataset
So, in the given dataset there are 3 types of news headlines: <1> world <2> tech <3> sports, with 7000 headlines of each type. That is quite a large dataset, isn't it?
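If you want to verify this class balance yourself after downloading the file, a quick pandas check does the job (the column name news_category matches what we use later in this article):

import pandas as pd

news_dataset = pd.read_csv('newsfile.csv')
print(news_dataset['news_category'].value_counts())
# Expected, per the description above:
# world     7000
# tech      7000
# sports    7000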
Import Required Libraries
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
As the name itself suggests, Tokenizer is used to split the text data into smaller segments/tokens and also makes the text ready for the deep neural network. pad_sequences will be used to ensure all sequences returned by the Tokenizer are of the same length. This is because simple neural networks like the one we are going to use accept only inputs of the same length. The remaining imports are used frequently in neural network modeling, so they need no explanation.
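To see these two helpers in action before we touch the real data, here is a tiny self-contained example (the sentences are made up for the demo):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

demo = ['india wins the world cup', 'new phone launched today']
tok = Tokenizer(num_words=100, oov_token='<oov>')
tok.fit_on_texts(demo)                 # build the word -> integer vocabulary
seqs = tok.texts_to_sequences(demo)    # each headline becomes a list of integers
padded = pad_sequences(seqs, padding='post', maxlen=6)
print(padded)                          # both rows now have length 6, zero-padded at the end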
Train-Test Split
news_dataset = pd.read_csv('newsfile.csv')
dataset = news_dataset.iloc[:5000, :]    # use only the first 5000 headlines
training_data, testing_data = train_test_split(dataset, test_size=0.3)  # 70% training data

# plotting the distribution of each news_category in the training & testing data
plt.plot(training_data['news_category'].value_counts())
plt.plot(testing_data['news_category'].value_counts())
plt.show()
Note that we are using only 5000 headlines from our dataset. As we can see in the plot, the data is almost uniformly divided among the categories in both the training and testing datasets.
Tokenization of text data
def tokenization_(training_headings, testing_headings, max_length=20, vocab_size=5000):
    # Tokenization and padding
    tokenizer = Tokenizer(num_words=vocab_size, oov_token='<oov>')
    tokenizer.fit_on_texts(training_headings)
    word_index = tokenizer.word_index

    training_sequences = tokenizer.texts_to_sequences(training_headings)
    training_padded = pad_sequences(training_sequences, padding='post',
                                    maxlen=max_length, truncating='post')

    testing_sequences = tokenizer.texts_to_sequences(testing_headings)
    testing_padded = pad_sequences(testing_sequences, padding='post',
                                   maxlen=max_length, truncating='post')

    return tokenizer, training_padded, testing_padded
This function is quite straightforward. It takes the training and testing headlines and returns the padded sequences associated with them. You may refer to this article on ValueML to understand more about the tokenization process.
tokenizer, X_train, X_test = tokenization_(training_data['news_headline'],
                                           testing_data['news_headline'])

labels = {'sports': [0, 1, 0], 'tech': [1, 0, 0], 'world': [0, 0, 1]}
Y_train = np.array([labels[y] for y in training_data['news_category']])
Y_test = np.array([labels[y] for y in testing_data['news_category']])
Here, we are separating the news headlines and their labels into different arrays, as the model consumes them separately during training and testing. Each category is encoded as a one-hot vector.
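To make the encoding concrete (the shapes assume the 5000-headline subset and the 70/30 split above):

print(labels['tech'])    # [1, 0, 0]
print(Y_train.shape)     # (3500, 3): 70% of 5000 headlines, 3 classes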
Build neural network
def build_model(n, vocab_size, embedding_size):
    # n = length of each input vector
    model = tf.keras.models.Sequential()

    # Word-embedding layer
    model.add(tf.keras.layers.Embedding(vocab_size, embedding_size, input_length=n))
    model.add(tf.keras.layers.GlobalAveragePooling1D())

    # Output layer
    model.add(tf.keras.layers.Dense(3, activation='softmax'))

    # Compile the model
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])

    # Model summary
    print(model.summary())
    return model
As you can see, I am using only 2 layers here. But wait, a small number of layers doesn't mean our model is not good enough. Before training the model, let's first understand what this neural network is doing. The very first layer is an embedding layer. It tries to learn a vector for each word in our vocabulary such that words with similar meanings have vectors pointing in nearly the same direction. For example, if we take two words, say 'criminal' and 'gangster', their vectors are likely to end up pointing in nearly the same direction. The GlobalAveragePooling1D layer then averages the word vectors of a headline into a single fixed-length vector. The last layer is the output layer, which contains three neurons; each one outputs the probability that the news headline is of type "world", "tech" or "sports".
# Build the model; n must match max_length from tokenization_, and an
# embedding size of 16 is a typical choice for a small model like this
model = build_model(n=20, vocab_size=5000, embedding_size=16)

epochs = 25
history = model.fit(X_train, Y_train,
                    validation_data=(X_test, Y_test),
                    epochs=epochs)
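As a side note, once training has finished you can loosely probe the earlier claim about embedding directions by comparing the cosine similarity of two learned word vectors. This is only a rough sketch: the two words are guesses at what the scraped headlines contain, and it only works for words whose index is below vocab_size.

# Grab the learned embedding matrix: shape (vocab_size, embedding_size)
embedding_weights = model.layers[0].get_weights()[0]

def cosine_similarity(word1, word2):
    v1 = embedding_weights[tokenizer.word_index[word1]]
    v2 = embedding_weights[tokenizer.word_index[word2]]
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine_similarity('cricket', 'football'))   # same-category pair
print(cosine_similarity('cricket', 'iphone'))     # likely lower: cross-category pair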
Now that we understand what each layer is doing, let's see the output of this model on training.
Model accuracy on training data   : 0.97
Model accuracy on validation data : 0.94
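These numbers come straight from the history object returned by fit(); a minimal way to print the final-epoch values yourself:

print('Model accuracy on training data   :',
      round(history.history['accuracy'][-1], 2))
print('Model accuracy on validation data :',
      round(history.history['val_accuracy'][-1], 2))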
Okay, we got a pretty good result. Now let's also check how our model performs on unseen data: the remaining part of the dataset that we haven't used for training or validation.
remaining_data = pd.concat([news_dataset, dataset]).drop_duplicates(keep=False, inplace=False)

headlines_new = remaining_data['news_headline']
labels_new = np.array([labels[x] for x in remaining_data['news_category']])

headlines_new = tokenizer.texts_to_sequences(headlines_new)
headlines_new = pad_sequences(headlines_new, padding='post', maxlen=20, truncating='post')

model.evaluate(headlines_new, labels_new)
Here, we first concatenate the full dataset (news_dataset) with the 5000-row subset we used earlier (dataset). This duplicates every row that appears in the subset, so the drop_duplicates() call with keep=False removes them all, leaving exactly the rows the model has not seen yet.
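If this concat-then-drop trick is new to you, here is a toy illustration of how it isolates the unused rows:

full = pd.DataFrame({'id': [1, 2, 3, 4]})
used = pd.DataFrame({'id': [1, 2]})
rest = pd.concat([full, used]).drop_duplicates(keep=False)
print(rest)   # only rows 3 and 4 survive -- the ones not in `used`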
Finally, we evaluate our model on this data and get the following result:
441/441 [==============================] - 0s 918us/step - loss: 0.2327 - accuracy: 0.9377
[0.23267975449562073, 0.9376904964447021]
This is pretty good. So we can happily stop here.
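Before wrapping up, here is a quick sanity check on a single brand-new headline (the headline text is made up; the category order follows the labels dictionary defined earlier):

sample = ['india beats australia in the final test match']   # made-up headline
seq = tokenizer.texts_to_sequences(sample)
seq = pad_sequences(seq, padding='post', maxlen=20, truncating='post')

probs = model.predict(seq)[0]
categories = ['tech', 'sports', 'world']   # index of the 1 in each one-hot label vector
print(categories[np.argmax(probs)])        # hopefully 'sports'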
Conclusion
In conclusion, we learned how to build a multi-class classification neural network to predict the category of news headlines. I highly recommend that you go one step further and extend this model to more categories by collecting more data. Till then, bye-bye.
Thank you for reading the article. Stay tuned for more such articles on ValueML.
For any query/doubt regarding this article, write in the comments below!