Prediction of possibility of bookings using TensorFlow
Hey all! In this post we are going to tackle a machine learning classification problem: “Prediction of possibility of bookings using TensorFlow”. The model predicts either 1 or 0: 1 if the booking will be canceled and 0 if it will not.
Let’s get started!
Data Pre-processing
First, let us import the necessary libraries and load the data. The dataset is available at https://www.kaggle.com/jessemostipak/hotel-booking-demand
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import matplotlib.pyplot as plt
import seaborn as sns

# Adjust the path to wherever the downloaded CSV is stored.
df = pd.read_csv("C:/Users/username/Downloads/hotel_bookings.csv")
df.columns
Output:
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date'], dtype='object')
Here we have 32 columns. Our target is the “is_canceled” column. Some columns contain numbers and some contain strings. Our primary task is to convert the string columns to numbers and then remove the columns that are least correlated with the target.
First, let us do some data cleaning. When we check for null values in all the columns, more than 90% of the values in the “company” column are NaN, so let us drop that column. Also, what we actually want is the number of people; whether a booking is canceled should not depend much on the age group of each person. So let us combine the adults, children and babies columns into one single column.
Let us also drop the “arrival_date_week_number” column, which adds little because we already have the date of arrival.
# Total party size; the age split is not kept.
df['adults'] = df['adults'] + df['children'] + df['babies']
df.rename(columns={"adults": "people_count"}, inplace=True)
df = df.drop(columns=["children", "babies", "company", "arrival_date_week_number"])
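The null-value check that motivates dropping “company” can be sketched as follows. This is a minimal illustration on a small made-up frame (the `toy` data, its values and the 0.5 cut-off are assumptions, not taken from the dataset); `isna().mean()` gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the bookings table.
toy = pd.DataFrame({
    "adults": [2, 1, 2, 3],
    "children": [0, np.nan, 1, 0],
    "company": [np.nan, np.nan, np.nan, 40.0],
})

# Fraction of missing values in each column, largest first.
null_fraction = toy.isna().mean().sort_values(ascending=False)
print(null_fraction)

# Columns that are mostly empty are candidates for dropping
# (the 0.5 threshold is an arbitrary choice for this sketch).
mostly_empty = null_fraction[null_fraction > 0.5].index.tolist()
print(mostly_empty)
```

On the real data the same idea would be `df.isna().mean()`, run before any columns are dropped.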
Let us further see the correlation between the hotel booking cancellation and other column values.
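# numeric_only=True limits the correlation to numeric columns
# (pandas 2.0+ raises an error on string columns otherwise).
corr_df = df.corr(numeric_only=True)
sns.heatmap(corr_df)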
Output: [heatmap of the correlations between the numeric columns]
Here the correlation covers only the columns that hold numbers. To reach a proper conclusion about the correlation, we also need to consider the columns that hold strings. So let’s see how to encode the string categories in those columns as numbers.
Data preparation and correlation
In order to convert the strings to numbers, we need to know the categories present in the columns. Let us do that.
print(pd.Categorical(df['hotel']), end="\n\n")
print(pd.Categorical(df['meal']), end="\n\n")
print(pd.Categorical(df['reserved_room_type']), end="\n\n")
print(pd.Categorical(df['assigned_room_type']), end="\n\n")
print(pd.Categorical(df['deposit_type']), end="\n\n")
print(pd.Categorical(df['customer_type']), end="\n\n")
print(pd.Categorical(df['reservation_status']))
Output:
[Resort Hotel, Resort Hotel, Resort Hotel, Resort Hotel, Resort Hotel, ..., City Hotel, City Hotel, City Hotel, City Hotel, City Hotel]
Length: 119390
Categories (2, object): [City Hotel, Resort Hotel]

[BB, BB, BB, BB, BB, ..., BB, BB, BB, BB, HB]
Length: 119390
Categories (5, object): [BB, FB, HB, SC, Undefined]

[C, C, A, A, A, ..., A, E, D, A, A]
Length: 119390
Categories (10, object): [A, B, C, D, ..., G, H, L, P]

[C, C, C, A, A, ..., A, E, D, A, A]
Length: 119390
Categories (12, object): [A, B, C, D, ..., I, K, L, P]

[No Deposit, No Deposit, No Deposit, No Deposit, No Deposit, ..., No Deposit, No Deposit, No Deposit, No Deposit, No Deposit]
Length: 119390
Categories (3, object): [No Deposit, Non Refund, Refundable]

[Transient, Transient, Transient, Transient, Transient, ..., Transient, Transient, Transient, Transient, Transient]
Length: 119390
Categories (4, object): [Contract, Group, Transient, Transient-Party]

[Check-Out, Check-Out, Check-Out, Check-Out, Check-Out, ..., Check-Out, Check-Out, Check-Out, Check-Out, Check-Out]
Length: 119390
Categories (3, object): [Canceled, Check-Out, No-Show]
For each column passed to “pd.Categorical”, we get the distinct categories in that column along with their total number. Let us now assign a number to each category.
# Mark the column as categorical, then swap each category for its integer code.
df['hotel'] = df['hotel'].astype('category')
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
Explanation:
The above chunk of code is a pythonic way of assigning the values 0 to (N-1) to the N different categories in a column. To know which category holds which number, we can refer to the output of the previous block of code: the numbers from 0 to (N-1) are assigned in the same order in which the categories are listed there.
In the same way, let us repeat these lines of code to encode all the other columns: “meal”, “reserved_room_type”, “assigned_room_type”, “deposit_type”, “customer_type”, “reservation_status”, “market_segment”, “distribution_channel”. Only the column name in the first line of the code block needs to change.
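Repeating the block for every column can also be written as a loop. The sketch below is one possible way to do it on a small made-up frame (the `toy` data and the `code_maps` helper are assumptions for illustration); it also records which category received which code, so the mapping does not have to be read off the printed output.

```python
import pandas as pd

# Hypothetical sample; the real columns have many more rows.
toy = pd.DataFrame({
    "meal": ["BB", "HB", "SC", "BB", "FB"],
    "deposit_type": ["No Deposit", "Refundable", "No Deposit",
                     "Non Refund", "No Deposit"],
})

code_maps = {}
for column in ["meal", "deposit_type"]:
    as_category = toy[column].astype("category")
    # String categories are sorted, so code i corresponds to categories[i].
    code_maps[column] = dict(enumerate(as_category.cat.categories))
    toy[column] = as_category.cat.codes

print(code_maps["meal"])
print(toy)
```

For the real data the list in the `for` statement would hold the column names mentioned above.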
Now we have 28 columns, all holding numbers except the “arrival_date_month” column. We cannot encode it the same way as the other columns, because the category codes follow alphabetical order, whereas we want the months numbered in calendar order from January to December. So let us devise a different method as follows:
# Map each month name to its calendar number (as integers, not strings).
month_numbers = {'January': 1, 'February': 2, 'March': 3, 'April': 4,
                 'May': 5, 'June': 6, 'July': 7, 'August': 8,
                 'September': 9, 'October': 10, 'November': 11, 'December': 12}
df['arrival_date_month'] = df['arrival_date_month'].map(month_numbers)
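The month lookup does not have to be typed by hand. As a possible alternative, the standard-library `calendar` module can generate it; this sketch assumes the default English locale, so that the generated names match the ones in the dataset, and the sample `months` values are made up.

```python
import calendar
import pandas as pd

# calendar.month_name[1..12] holds the month names; index 0 is an empty string.
# Assumption: default (English) locale, so the names match the dataset.
month_to_number = {
    name: number
    for number, name in enumerate(calendar.month_name)
    if name
}

months = pd.Series(["July", "January", "December"])  # hypothetical sample values
print(months.map(month_to_number).tolist())
```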
We are done converting the string values into numbers according to their categories. Now let us have a look at the correlation.
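# Some columns (such as "country") still hold strings, so restrict to numeric columns.
corr_df = df.corr(numeric_only=True)
sns.heatmap(corr_df)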
Output: [heatmap of the correlations after encoding the string columns]
The correlation map differs slightly from the one we saw before encoding the string columns. We can also print “corr_df” to see the exact correlation values among all the columns.
Here we can notice that the columns “stays_in_weekend_nights”, “agent” and “reservation_status_date” have very little correlation with whether a booking is canceled. So let us drop those columns along with the “country” column. We will also drop the rows that have null values.
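Rather than reading the values off the heatmap, the correlations with the target can be ranked directly. The following is a minimal sketch on made-up numbers (the `toy` frame and its values are assumptions); it sorts the features by the absolute value of their correlation with “is_canceled”.

```python
import pandas as pd

# Hypothetical data with the target and three numeric features.
toy = pd.DataFrame({
    "is_canceled": [0, 0, 1, 1, 0, 1],
    "lead_time": [5, 10, 200, 150, 20, 300],
    "booking_changes": [2, 1, 0, 0, 3, 0],
    "adr": [100, 90, 95, 105, 100, 98],
})

corr_df = toy.corr()

# Correlation of every feature with the target, strongest first.
target_corr = corr_df["is_canceled"].drop("is_canceled")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Features at the bottom of such a ranking are the ones to consider dropping.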
sample = df.drop(columns=["country", "stays_in_weekend_nights", "agent", "reservation_status_date"])
sample = sample.dropna()
sample.columns
Output:
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_day_of_month', 'stays_in_week_nights', 'people_count', 'meal', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status'], dtype='object')
Training and validation
x = sample[['hotel', 'lead_time', 'arrival_date_year', 'arrival_date_month',
            'stays_in_week_nights', 'people_count', 'meal', 'market_segment',
            'distribution_channel', 'is_repeated_guest', 'previous_cancellations',
            'previous_bookings_not_canceled', 'reserved_room_type',
            'assigned_room_type', 'booking_changes', 'deposit_type',
            'days_in_waiting_list', 'customer_type', 'adr',
            'required_car_parking_spaces', 'total_of_special_requests',
            'reservation_status']]
y = sample['is_canceled']
x = x.astype(int)
y = y.astype(int)
print(x.shape)
print(y.shape)
Output:
(119386, 22)
(119386,)
Let us split the dataset into the train and test set.
from sklearn.model_selection import train_test_split

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
Let us build and compile the model.
def get_model():
    # 22 input features -> 10 units -> 5 units -> 1 output value
    model = Sequential([
        Dense(10, input_shape=(22,), activation='relu'),
        Dense(5, activation='relu'),
        Dense(1)
    ])
    model.compile(
        loss='mse',
        optimizer='adam',
        metrics=["accuracy"])
    return model

model = get_model()
model.summary()
Output:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 10)                230
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 55
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 6
=================================================================
Total params: 291
Trainable params: 291
Non-trainable params: 0
_________________________________________________________________
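The parameter counts in the summary can be checked by hand: a dense layer has one weight per input for each unit, plus one bias per unit. The helper `dense_params` below is a made-up name used only for this check.

```python
def dense_params(n_inputs, n_units):
    # Each unit has one weight per input plus one bias.
    return n_inputs * n_units + n_units

# 22 input features, then layers of 10, 5 and 1 units.
layer_sizes = [22, 10, 5, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [230, 55, 6]
print(sum(per_layer))  # 291
```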
Here comes the important part: training the model. Let us use early stopping to reduce the risk of overfitting.
from tensorflow.keras.callbacks import EarlyStopping

# Stop as soon as the validation loss fails to improve for one epoch.
early_stopping = EarlyStopping(monitor='val_loss', patience=1)

model = get_model()
history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=5,
    callbacks=[early_stopping]
)
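To make the effect of “patience” concrete, here is a simplified pure-Python sketch of the stopping rule. It is not the Keras implementation (it ignores options such as `min_delta` and `restore_best_weights`), and the function name `epochs_run` and the loss values are made up for illustration.

```python
def epochs_run(val_losses, patience=1):
    """Return how many epochs run before a simplified early-stopping rule halts."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # training stops after this epoch
    return len(val_losses)

# Hypothetical validation losses for five epochs.
val_losses = [0.30, 0.25, 0.26, 0.20, 0.19]
print(epochs_run(val_losses, patience=1))
print(epochs_run(val_losses, patience=2))
```

With a patience of 1, a single epoch without improvement ends training, even if later epochs would have improved again; a larger patience tolerates such a dip.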
Let us see the accuracy of the model.
loss, accuracy = model.evaluate(X_test, y_test)
print("Loss is : ", loss)
print("Accuracy is : ", accuracy*100)
Output:
Loss is : 0.1827656775712967
Accuracy is : 88.30367922782898
We have achieved an accuracy of about 88%, which can be considered good.
Testing the prediction with a random input
Let us take the feature values of the row at position 200 in the test set as an example.
example = pd.DataFrame(X_test.iloc[200, :])
example
Now let us reshape the data and then predict the value of the “is_canceled” column.
# The model expects a 2-D float array of shape (rows, 22).
example = np.asarray(example).astype(np.float32)
reshaped_sample = example.reshape(1, 22)
result = model.predict(reshaped_sample)
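We need the result to be either 0 or 1 rather than a decimal value, so we round it to the nearest whole number.
# Round with NumPy (the built-in round() is meant for scalars) and flatten to integer labels.
result = np.round(result).astype(int).ravel()
print(result)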
Output:
[0]
The predicted value is 0. Let us look at the ground truth.
ground_label = list(y_test)
ground_label[200]
Output:
0
The prediction matches the ground truth.
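The same comparison can be made for many rows at once. The sketch below uses made-up raw outputs and labels (both arrays are assumptions, standing in for `model.predict(X_test)` and `y_test`), and it assumes a 0.5 threshold for turning a raw output into a 0 or 1 label.

```python
import numpy as np

# Hypothetical raw model outputs (one column) and the matching true labels.
raw_predictions = np.array([[0.08], [0.91], [0.47], [0.62], [-0.03]])
true_labels = np.array([0, 1, 1, 1, 0])

# Threshold at 0.5 to obtain 0/1 labels, then compare with the ground truth.
predicted_labels = (raw_predictions.ravel() >= 0.5).astype(int)
accuracy = (predicted_labels == true_labels).mean()

print(predicted_labels)
print(accuracy)
```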
We have built a classification model for the “Prediction of possibility of bookings” using TensorFlow.
THANK YOU