Prediction of possibility of bookings using TensorFlow

Hey all! In this post we are going to tackle a machine learning classification problem: predicting whether a hotel booking will be canceled, using TensorFlow. The prediction will be 1 if the booking will be canceled and 0 if it will not.

Let’s get started!

Data Pre-processing

First, let us import the necessary libraries and load the data. The dataset is available at https://www.kaggle.com/jessemostipak/hotel-booking-demand

import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv("C:/Users/username/Downloads/hotel_bookings.csv")
df.columns

Output:

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

We have 32 columns. Our target is the “is_canceled” column. Some columns contain numbers and others contain strings. Our primary task is to convert the string columns to numbers and then remove the least relevant columns to clean the data.

 

First, let us do some data cleaning. When we check for null values in all the columns, we find that more than 90% of the values in the “company” column are NaN, so let us drop that column. Also, what we actually want is the number of people; whether a booking is canceled will not depend much on the age group of each person. So let us combine the “adults”, “children” and “babies” columns into one single column.

Let us also drop the “arrival_date_week_number” column, which adds little because we already have the date of arrival.

df['adults'] = df['adults'] + df['children'] + df['babies']
df.rename(columns={"adults": "people_count"},inplace = True)
df = df.drop(columns = ["children", "babies", "company", "arrival_date_week_number"])
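
The null-value check mentioned above is not shown in the post. A minimal sketch of how it could be done, using a small made-up DataFrame (`toy` and its values are purely illustrative, not the hotel data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the hotel data.
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "children": [0.0, np.nan, 1.0, 2.0],
    "adults": [2, 2, 1, 2],
})

# Fraction of missing values per column, largest first.
null_fraction = toy.isna().mean().sort_values(ascending=False)
print(null_fraction)

# Columns where more than half of the values are missing.
mostly_empty = null_fraction[null_fraction > 0.5].index.tolist()
print(mostly_empty)
```

Running the same two lines on `df` would list the share of NaN values for every column of the real dataset.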

 

Let us further see the correlation between the hotel booking cancellation and other column values.

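# numeric_only=True keeps only the numeric columns; recent pandas versions raise an error otherwise
corr_df = df.corr(numeric_only=True)
sns.heatmap(corr_df)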

Output:

[Figure: heatmap of the correlation between the hotel booking cancellation and other column values]

The heatmap only covers the columns that contain numbers. To reach a proper conclusion about the correlation, we also need to consider the columns that contain strings. So let’s see how to encode the string categories in each column as numbers.

Data preparation and correlation

In order to convert the strings to numbers, we need to know the categories present in the columns. Let us do that.

print(pd.Categorical(df['hotel']),end="\n\n")
print(pd.Categorical(df['meal']),end="\n\n")
print(pd.Categorical(df['reserved_room_type']),end="\n\n")
print(pd.Categorical(df['assigned_room_type']),end="\n\n")
print(pd.Categorical(df['deposit_type']),end="\n\n")
print(pd.Categorical(df['customer_type']),end="\n\n")
print(pd.Categorical(df['reservation_status']))

Output:

[Resort Hotel, Resort Hotel, Resort Hotel, Resort Hotel, Resort Hotel, ..., City Hotel, City Hotel, City Hotel, City Hotel, City Hotel]
Length: 119390
Categories (2, object): [City Hotel, Resort Hotel]

[BB, BB, BB, BB, BB, ..., BB, BB, BB, BB, HB]
Length: 119390
Categories (5, object): [BB, FB, HB, SC, Undefined]

[C, C, A, A, A, ..., A, E, D, A, A]
Length: 119390
Categories (10, object): [A, B, C, D, ..., G, H, L, P]

[C, C, C, A, A, ..., A, E, D, A, A]
Length: 119390
Categories (12, object): [A, B, C, D, ..., I, K, L, P]

[No Deposit, No Deposit, No Deposit, No Deposit, No Deposit, ..., No Deposit, No Deposit, No Deposit, No Deposit, No Deposit]
Length: 119390
Categories (3, object): [No Deposit, Non Refund, Refundable]

[Transient, Transient, Transient, Transient, Transient, ..., Transient, Transient, Transient, Transient, Transient]
Length: 119390
Categories (4, object): [Contract, Group, Transient, Transient-Party]

[Check-Out, Check-Out, Check-Out, Check-Out, Check-Out, ..., Check-Out, Check-Out, Check-Out, Check-Out, Check-Out]
Length: 119390
Categories (3, object): [Canceled, Check-Out, No-Show]

For each column passed to “pd.Categorical”, we get the distinct categories in that column along with the total number of categories. Let us now assign a different number to each category.

 

df['hotel'] = df['hotel'].astype('category')
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

Explanation:

The above code assigns the integer codes 0 to (N-1) to the N different categories in a column. To know which category gets which number, refer to the output of the previous block: the codes are assigned in the same order in which the categories are listed there.

Let us repeat the same lines of code to encode the other columns: “meal”, “reserved_room_type”, “assigned_room_type”, “deposit_type”, “customer_type”, “reservation_status”, “market_segment” and “distribution_channel”. We only have to change the column name in the first line of the code block.
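
Rather than editing and rerunning the block by hand, the same encoding could be applied in a loop. The sketch below uses a small made-up DataFrame (`toy` and `code_maps` are illustrative names, not part of the original code) and also records which category received which code:

```python
import pandas as pd

# Hypothetical toy frame with two of the string columns.
toy = pd.DataFrame({
    "meal": ["BB", "HB", "BB", "SC"],
    "deposit_type": ["No Deposit", "Refundable", "No Deposit", "Non Refund"],
})

code_maps = {}
for col in ["meal", "deposit_type"]:
    as_category = toy[col].astype("category")
    # Categories are sorted, so enumerate() yields the code of each category.
    code_maps[col] = dict(enumerate(as_category.cat.categories))
    toy[col] = as_category.cat.codes

print(code_maps["meal"])
print(toy)
```

Keeping `code_maps` around makes it possible to translate the numbers back into category names later.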

 

Now we have 28 columns. Apart from “country” and “reservation_status_date”, which we will drop shortly, the only column that still holds strings is “arrival_date_month”. We cannot encode it in the same way as the other columns, because the category codes follow alphabetical order (April would become 0), whereas we want the months numbered from January to December. So let us devise a different method as follows:

# Map month names to integers 1-12 so the column is numeric and keeps calendar order
month_numbers = {'January': 1, 'February': 2, 'March': 3, 'April': 4,
                 'May': 5, 'June': 6, 'July': 7, 'August': 8,
                 'September': 9, 'October': 10, 'November': 11, 'December': 12}
df['arrival_date_month'] = df['arrival_date_month'].map(month_numbers)

We are done converting the string values into numbers according to their categories. Now let us have another look at the correlation.

# numeric_only=True skips the columns that still hold strings
corr_df = df.corr(numeric_only=True)
sns.heatmap(corr_df)

Output:

[Figure: correlation heatmap after encoding the string columns]

The correlation map differs slightly from the one we saw before encoding the string columns. We can also print “corr_df” to see the exact correlation values among all the columns.

Here we can notice that the columns “stays_in_weekend_nights”, “agent” and “reservation_status_date” have very little correlation with whether the booking is canceled. So let us drop those columns along with the “country” column. We will also drop the rows that have null values.
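
To read exact values rather than colours, one could pull the target's column out of the correlation matrix and sort it by absolute value. A small sketch on made-up numbers (`toy` is hypothetical and far smaller than the real data):

```python
import pandas as pd

# Hypothetical toy frame with the target and three numeric features.
toy = pd.DataFrame({
    "is_canceled": [0, 1, 0, 1, 1, 0],
    "lead_time": [5, 200, 10, 150, 300, 20],
    "booking_changes": [1, 0, 2, 0, 0, 1],
    "adr": [80.0, 95.0, 120.0, 60.0, 100.0, 90.0],
})

corr_with_target = (
    toy.corr()["is_canceled"]
    .drop("is_canceled")                                   # the target correlates 1.0 with itself
    .sort_values(key=lambda s: s.abs(), ascending=False)   # strongest relationship first
)
print(corr_with_target)
```

Features near the bottom of such a list are the natural candidates for dropping.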

sample = df.drop(columns = ["country", "stays_in_weekend_nights","agent","reservation_status_date"])
sample = sample.dropna()
sample.columns

Output:

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_day_of_month',
       'stays_in_week_nights', 'people_count', 'meal', 'market_segment',
       'distribution_channel', 'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type',
       'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status'],
      dtype='object')

 

Training and validation

x = sample[['hotel', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'stays_in_week_nights', 'people_count', 'meal',
       'market_segment', 'distribution_channel', 'is_repeated_guest',
       'previous_cancellations', 'previous_bookings_not_canceled',
       'reserved_room_type', 'assigned_room_type', 'booking_changes',
       'deposit_type', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status']]

y = sample['is_canceled']

x = x.astype(int)
y = y.astype(int)

print(x.shape)
print(y.shape)

Output:

(119386, 22)
(119386,)

 

Let us split the dataset into the train and test set.

from sklearn.model_selection import train_test_split         
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
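
As a quick sanity check on the split sizes, the arithmetic can be done by hand. This sketch assumes that scikit-learn rounds the test share up to a whole row and gives the remainder to the training set, which is my understanding of its behaviour for a fractional `test_size`:

```python
import math

n_samples = 119386   # rows in x, from the shape printed above
test_size = 0.25

n_test = math.ceil(test_size * n_samples)   # assumed: test share is rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 89539 29847
```

Comparing these numbers with `X_train.shape` and `X_test.shape` confirms that no rows were lost in the split.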

 

Let us build and compile the model.

def get_model():
    
    model = Sequential([
        Dense(10, input_shape = (22,), activation = 'relu'),
        Dense(5, activation = 'relu'),
        Dense(1)
    ])

    model.compile(
        loss='mse',
        optimizer='adam',
        metrics=['accuracy'])
    
    return model

model = get_model()

model.summary()

Output:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 10)                230       
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 55        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 6         
=================================================================
Total params: 291
Trainable params: 291
Non-trainable params: 0
_________________________________________________________________
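
The parameter counts in the summary can be checked by hand: a Dense layer with n inputs and m units has n × m weights plus m biases. A quick sketch of that arithmetic (`dense_params` is a helper written for this illustration, not a Keras function):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [22, 10, 5, 1]  # input features, then units per Dense layer
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [230, 55, 6]
print(sum(counts))  # 291
```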

 

Here comes the important part: training the model. Let us use early stopping, which halts training once the validation loss stops improving, to reduce the risk of overfitting.

from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience = 1)

model = get_model()


history = model.fit(
    X_train, y_train,
    validation_data = (X_test, y_test),
    epochs = 5,
    callbacks = [early_stopping]
)
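
To make the effect of `patience` concrete, here is a simplified, framework-free sketch of the stopping rule. It is an illustration only, not Keras's actual implementation (it ignores options such as `min_delta`), and the loss values are made up:

```python
def epochs_run(val_losses, patience=1):
    """Return how many epochs run before training stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Validation loss improved: remember it and reset the counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop after `patience` epochs without improvement
    return len(val_losses)

made_up = [0.25, 0.21, 0.19, 0.20, 0.18]
print(epochs_run(made_up, patience=1))  # 4
print(epochs_run(made_up, patience=2))  # 5
```

With `patience=1`, a single epoch without improvement ends training, so a larger value may be worth trying if the validation loss is noisy.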

 

Let us see the accuracy of the model.

loss, accuracy = model.evaluate(X_test, y_test)
print("Loss is : ",loss)
print("Accuracy is : ",accuracy*100)

Output:

Loss is :  0.1827656775712967
Accuracy is :  88.30367922782898

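We have achieved an accuracy of about 88% on the test set. Keep in mind that the “reservation_status” feature records the final outcome of each booking (Canceled, Check-Out or No-Show), so it is closely tied to the target and makes the task easier than it would be with only the information available at booking time.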

 

Testing the prediction with a random input

Let us take the feature values of the row at position 200 in the test set as an example.

example = pd.DataFrame(X_test.iloc[200,:])
example

 

Now let us convert the data into the shape the model expects and then predict the value of the “is_canceled” column.

example= np.asarray(example).astype(np.float32)
reshaped_sample = example.reshape(1,22)
result = model.predict(reshaped_sample)

 

We need the result to be either 0 or 1, not a decimal value, so we round it to the nearest whole number.

# The built-in round() does not accept a NumPy array, so use np.round instead
result = np.round(result).astype(int).ravel()
print(result)

Output:

[0]
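
Because the last layer has no activation, the raw output is not guaranteed to lie between 0 and 1, and rounding could in principle give values such as -1 or 2. One way to guard against that is to threshold at 0.5 instead of rounding. A sketch with made-up outputs (`raw_outputs` is hypothetical, not taken from the trained model):

```python
import numpy as np

# Hypothetical raw model outputs, including values outside [0, 1].
raw_outputs = np.array([[-0.7], [0.12], [0.5], [0.93], [1.8]], dtype=np.float32)

# Plain rounding can leave the {0, 1} label set.
rounded = np.round(raw_outputs).astype(int).ravel()

# Thresholding always yields 0 or 1.
thresholded = (raw_outputs >= 0.5).astype(int).ravel()

print(rounded)
print(thresholded)
```

For outputs that already fall between 0 and 1, the two approaches agree except exactly at 0.5, which `np.round` rounds down to 0.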

 

The predicted value is 0. Let us look at the ground truth.

ground_label = list(y_test)
ground_label[200]

Output:

0

 

The prediction matches the ground truth.

We have built a classification model that predicts whether a booking will be canceled, using TensorFlow.

 

THANK YOU
