Boosted trees using Estimators in TensorFlow | Python

Hey fellow learner! Today let’s learn about Boosted Trees and how to implement them using Estimators in TensorFlow with Python programming.

Introduction to Boosted Trees

Boosted Trees are one of the most common and effective methods for regression and classification. Boosting is an ensembling technique that combines the predictions of many tree models: it is an additive model whose final prediction is a sum of the decisions from a sequence of base trees, each one trained to correct the errors of the trees before it.
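
To make the additive idea concrete, here is a minimal sketch (purely illustrative; the TensorFlow estimator handles all of this internally) of how a boosted ensemble forms its prediction:

# Minimal sketch of an additive (boosted) ensemble -- illustrative only.
# Each tree is fit to the errors of the trees before it, and the final
# score is a learning-rate-weighted sum of all tree outputs.
def boosted_predict(trees, x, learning_rate=0.1):
    score = 0.0
    for tree in trees:                    # trees were built sequentially
        score += learning_rate * tree(x)  # each tree nudges the score
    return score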

Now let’s move on to the implementation. The overall workflow is shown in the flowchart below:

[Flowchart: boosted trees workflow]

Implementation of Boosted Trees

1. Importing Modules

We import several modules such as pandas, NumPy, matplotlib, and TensorFlow:

import numpy as np
import pandas as pd
from IPython.display import clear_output
import matplotlib.pyplot as plt
import tensorflow as tf
tf.random.set_seed(123)
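
Note: the tf.estimator API used throughout this tutorial is deprecated in recent TensorFlow releases (TensorFlow Decision Forests is the recommended replacement), so it is worth confirming your version first:

# Check the installed TensorFlow version; the tf.estimator API below
# works in TF 2.x but is deprecated in recent releases
print(tf.__version__)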

2. Data Loading and Data Visualization

We read two CSV files: one for training and one for evaluation. For each, we pop the 'survived' column to use as the output (y) and keep the remaining columns as the input features (x). We then visualize the training data.

# Loading Data
train_data = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
eval_data = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')
out_train = train_data.pop('survived')
out_eval = eval_data.pop('survived')

#Data Visualization
plt.figure(figsize=(10,10))

plt.subplot(2,2,1)
train_data.n_siblings_spouses.hist(bins=20,facecolor = '#1ABC9C', edgecolor='#F39C12',linewidth=1.5,density=True,histtype='stepfilled')
plt.title("Maximum people don't have children")
plt.ylabel('count')
plt.xlabel('no of siblings')
plt.axvspan(0,0.5, color='#EC7063',alpha=0.5)

plt.subplot(2,2,2)
train_data.embark_town.hist(bins=20,facecolor = '#1ABC9C', edgecolor='#F39C12',linewidth=1.5,density=True,histtype='stepfilled')
plt.xlabel('town')
plt.ylabel('density')
plt.title("Most people embarked from Southampton")
plt.axvspan(-0.03,0.2, color='#EC7063',alpha=0.5)

plt.subplot(2,2,3)
train_data.sex.hist(bins=20,facecolor = '#1ABC9C', edgecolor='#F39C12',linewidth=1.5,density=True,histtype='stepfilled')
plt.xlabel('sex')
plt.ylabel('density')
plt.title("Most passengers are male")
plt.axvspan(-0.03,0.1, color='#EC7063',alpha=0.5)

plt.subplot(2,2,4)
train_data.fare.hist(bins=20,facecolor = '#1ABC9C', edgecolor='#F39C12',linewidth=1.5,density=True,histtype='stepfilled')
plt.xlabel('fare')
plt.ylabel('density')
plt.title("Most people bought cheap tickets")
plt.axvspan(0,100, color='#EC7063',alpha=0.5)

plt.tight_layout()
plt.show()

plt.figure(figsize=(10,10))
train_data.age.hist(bins=20,facecolor = '#1ABC9C', edgecolor='#F39C12',linewidth=1.5,density=True,histtype='stepfilled')
plt.title("Maximum people in the age group 20-30")
plt.axvspan(20,30, color='#EC7063',alpha=0.5)
plt.xlabel('age')
plt.ylabel('density')
plt.show()

The results of the Data Visualization are as follows:

[Plot 1: histograms of siblings/spouses, embark town, sex, and fare; Plot 2: age distribution]

3. Categorizing columns

We separate the columns based on the type of values they hold, because categorical and numeric features need different feature-column handling in the model.

CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck','embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']
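
A quick, optional sanity check (not part of the original pipeline) is to inspect the dtypes and unique values, since object-typed columns are the natural candidates for categorical handling:

# Optional sanity check: object-typed columns are categorical candidates
print(train_data.dtypes)
print(train_data['class'].unique())  # e.g. ['Third', 'First', 'Second']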

4. Building, Training, and Evaluating the Model

4.1: Building the Model

This step implements concepts like one-hot encoding, normalization, and bucketization using TensorFlow feature columns.

# Helper: one-hot encode a categorical column from its vocabulary
def one_hot_cat_column(feature_name, vocab):
    return tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocab))

feature_columns = []
# Categorical columns: one-hot encode using each column's unique values
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = train_data[feature_name].unique()
    feature_columns.append(one_hot_cat_column(feature_name, vocabulary))
# Numeric columns: pass through as float features
for feature_name in NUMERIC_COLUMNS:
    feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

# Demonstrate one-hot encoding on a single example row
example = dict(train_data.head(1))
class_fc = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list('class', ('First', 'Second', 'Third')))
print('Feature value: "{}"'.format(example['class'].iloc[0]))
print('One-hot encoded: ', tf.keras.layers.DenseFeatures([class_fc])(example).numpy())

# Transform the example with the full set of feature columns
tf.keras.layers.DenseFeatures(feature_columns)(example).numpy()

The results of the code above are as follows:

Feature value: "Third"
One-hot encoded:  [[0. 0. 1.]]

array([[22.  ,  1.  ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ,  0.  ,  0.  ,
         0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ,  0.  ,  0.  ,  0.  ,
         7.25,  1.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ,
         0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ,  0.  ]], dtype=float32)
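
The demo above only shows one-hot encoding. The same feature-column API also supports bucketization, i.e. discretizing a numeric feature into ranges. Here is a small illustrative sketch (not used in the model below, and the boundaries are arbitrary):

# Illustrative bucketization sketch (not part of the model below):
# discretize 'age' into ranges, which are then one-hot encoded
age_buckets = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('age'), boundaries=[18, 30, 45, 60])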

The code for building the input functions that feed data to the estimator is as follows:

NUM_EXAMPLES = len(out_train)

def make_input_fn(X, y, n_epochs=None, shuffle=True):
    def input_fn():
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        if shuffle:
            dataset = dataset.shuffle(NUM_EXAMPLES)
        dataset = dataset.repeat(n_epochs)
        # Boosted trees use the entire dataset per layer,
        # so feed it as a single batch
        dataset = dataset.batch(NUM_EXAMPLES)
        return dataset
    return input_fn

train_input_fn = make_input_fn(train_data, out_train)
eval_input_fn = make_input_fn(eval_data, out_eval, shuffle=False, n_epochs=1)
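
To see what the input function actually feeds the estimator, one can pull a single batch (an optional check, not part of the original flow):

# Optional: inspect one batch produced by the input function
features, labels = next(iter(train_input_fn()))
print(list(features.keys()))   # feature names fed to the estimator
print(labels.numpy()[:5])      # first few survival labels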

4.2: Training and Evaluating the Model

The flowchart below shows the training and evaluation process:

[Flowchart: training and evaluation]

First, we train a linear estimator to establish a baseline:

linear_est = tf.estimator.LinearClassifier(feature_columns)
linear_est.train(train_input_fn, max_steps=100)
# Keep the result under its own name so the boosted-trees
# evaluation below does not overwrite it
linear_result = linear_est.evaluate(eval_input_fn)
clear_output()
print(pd.Series(linear_result))

The output is as follows:

accuracy                  0.765152
accuracy_baseline         0.625000
auc                       0.832844
auc_precision_recall      0.789631
average_loss              0.478908
label/mean                0.375000
loss                      0.478908
precision                 0.703297
prediction/mean           0.350790
recall                    0.646465
global_step             100.000000
dtype: float64

Then we train the Boosted Trees estimator and evaluate the results:

# The input function feeds the whole training set as one batch,
# so one batch per layer is used
n_batches = 1
est = tf.estimator.BoostedTreesClassifier(feature_columns, n_batches_per_layer=n_batches)
est.train(train_input_fn, max_steps=100)
result = est.evaluate(eval_input_fn)
clear_output()
print(pd.Series(result))

The output of the second evaluation is as follows:

accuracy                  0.833333
accuracy_baseline         0.625000
auc                       0.874931
auc_precision_recall      0.859920
average_loss              0.405004
label/mean                0.375000
loss                      0.405004
precision                 0.795699
prediction/mean           0.383333
recall                    0.747475
global_step             100.000000
dtype: float64
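
Because the linear evaluation was kept under its own name (linear_result), the two models can be compared side by side (a small illustrative addition):

# Compare the baseline and the boosted-trees model on key metrics
comparison = pd.DataFrame({'linear': pd.Series(linear_result),
                           'boosted_trees': pd.Series(result)})
print(comparison.loc[['accuracy', 'auc']])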

5. Making Final Predictions and Visualizing Them

The code below makes the final predictions on the eval dataset and visualizes the predicted probabilities:

plt.figure(figsize=(10,10))
pred_dicts = list(est.predict(eval_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])
probs.plot(kind='hist', bins=20, title='predicted probabilities',facecolor = '#F1C40F', edgecolor='#21618C',linewidth=1.5,histtype='stepfilled')
plt.show()

The resulting histogram of predicted probabilities is shown below:

[Plot: histogram of predicted probabilities]
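
If hard class labels are needed instead of probabilities, thresholding at 0.5 is the usual choice (an illustrative addition):

# Convert probabilities to hard class predictions at a 0.5 threshold
pred_classes = (probs > 0.5).astype(int)
print('Predicted survivors:', pred_classes.sum(), 'out of', len(pred_classes))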

6. Receiver Operating Characteristic (ROC) Plot

We plot the ROC curve to see the trade-off between the true positive rate and the false positive rate:

from sklearn.metrics import roc_curve
plt.figure(figsize=(10,10))
fpr, tpr, _ = roc_curve(out_eval, probs)
plt.plot(fpr, tpr,color="red")
plt.title('ROC curve')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.xlim(0,)
plt.ylim(0,)
plt.show()

The final ROC plot is shown below:

[Plot: ROC curve]
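
As a quick cross-check (an optional addition), the AUC can be computed directly with scikit-learn; it should closely match the auc metric the estimator reported above:

# Cross-check the estimator's reported AUC
from sklearn.metrics import roc_auc_score
print('AUC:', roc_auc_score(out_eval, probs))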

Conclusion

You have successfully implemented Boosted Trees using Estimators in TensorFlow.

Congratulations! Thank you for reading!

Keep reading to learn more!

You can find the code here.
