Boosted trees using Estimators in TensorFlow | Python
Hey fellow learner! Today let’s learn about Boosted Trees and how to implement them using Estimators in TensorFlow with Python.
Introduction to Boosted Trees
Boosted Trees are among the most common and effective methods for regression and classification. Boosting is an ensembling technique that combines the predictions of multiple tree models: it builds an additive model that makes predictions by combining decisions from a sequence of base models.
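To make the additive idea concrete, here is a minimal conceptual sketch (plain Python, not the TensorFlow implementation; the trees list and its .predict method are hypothetical):

# Conceptual sketch: a boosted ensemble predicts by summing the
# scaled contributions of its base trees.
def boosted_predict(trees, x, learning_rate=0.1):
    # Each tree corrects the errors of the ones before it;
    # the learning rate damps each tree's contribution.
    return sum(learning_rate * tree.predict(x) for tree in trees)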
Now let’s move on to the implementation. The flowchart below summarizes the steps:
Implementation of Boosted Trees
1. Importing Modules
We import several modules such as pandas, NumPy, matplotlib, and TensorFlow:
import numpy as np
import pandas as pd
from IPython.display import clear_output
import matplotlib.pyplot as plt
import tensorflow as tf

tf.random.set_seed(123)
2. Data Loading and Data Visualization
We read two CSV files: one for training and one for evaluation. For each, we separate the input features (x) from the output label (y). We also visualize the data.
# Loading Data
train_data = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
eval_data = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')

# Separate the target label ('survived') from the input features
out_train = train_data.pop('survived')
out_eval = eval_data.pop('survived')

# Data Visualization
plt.figure(figsize=(10, 10))

plt.subplot(2, 2, 1)
train_data.n_siblings_spouses.hist(bins=20, facecolor='#1ABC9C', edgecolor='#F39C12',
                                   linewidth=1.5, density=True, histtype='stepfilled')
plt.title("Most people don't have siblings/spouses aboard")
plt.ylabel('count')
plt.xlabel('no of siblings')
plt.axvspan(0, 0.5, color='#EC7063', alpha=0.5)

plt.subplot(2, 2, 2)
train_data.embark_town.hist(bins=20, facecolor='#1ABC9C', edgecolor='#F39C12',
                            linewidth=1.5, density=True, histtype='stepfilled')
plt.xlabel('town')
plt.ylabel('count')
plt.title("Most people are from Southampton")
plt.axvspan(-0.03, 0.2, color='#EC7063', alpha=0.5)

plt.subplot(2, 2, 3)
train_data.sex.hist(bins=20, facecolor='#1ABC9C', edgecolor='#F39C12',
                    linewidth=1.5, density=True, histtype='stepfilled')
plt.xlabel('sex')
plt.ylabel('count')
plt.title("Most people are male")
plt.axvspan(-0.03, 0.1, color='#EC7063', alpha=0.5)

plt.subplot(2, 2, 4)
train_data.fare.hist(bins=20, facecolor='#1ABC9C', edgecolor='#F39C12',
                     linewidth=1.5, density=True, histtype='stepfilled')
plt.xlabel('fare')
plt.ylabel('count')
plt.title("Most people bought cheap tickets")
plt.axvspan(0, 100, color='#EC7063', alpha=0.5)

plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 10))
train_data.age.hist(bins=20, facecolor='#1ABC9C', edgecolor='#F39C12',
                    linewidth=1.5, density=True, histtype='stepfilled')
plt.title("Most people are in the age group 20-30")
plt.axvspan(20, 30, color='#EC7063', alpha=0.5)
plt.xlabel('age')
plt.ylabel('count')
plt.show()
The results of the Data Visualization are as follows:
3. Categorizing columns
We separate the columns based on the type of values they hold, treating categorical and numeric features differently. Encoding each type appropriately makes the model more effective.
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class',
                       'deck', 'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']
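If you are unsure which list a column belongs in, inspecting the data is a quick sanity check (a small pandas addition, not part of the original pipeline). Note that low-cardinality integer columns such as n_siblings_spouses and parch are deliberately treated as categorical here:

# Inspect dtypes and cardinality to decide how each column should be encoded.
print(train_data.dtypes)     # object columns are natural categoricals
print(train_data.nunique())  # few unique values also suggests a categorical column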
4. Building, Training, and Evaluating the Model
4.1 : Building the Model
This includes implementing concepts like one-hot encoding, normalization, and bucketization.
def one_hot_cat_column(feature_name, vocab):
    # Wrap a vocabulary-list categorical column in an indicator (one-hot) column
    return tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocab))

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = train_data[feature_name].unique()
    feature_columns.append(one_hot_cat_column(feature_name, vocabulary))
for feature_name in NUMERIC_COLUMNS:
    feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

# Demonstrate one-hot encoding on a single example row
example = dict(train_data.head(1))
class_fc = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list('class', ('First', 'Second', 'Third')))
print('Feature value: "{}"'.format(example['class'].iloc[0]))
print('One-hot encoded: ', tf.keras.layers.DenseFeatures([class_fc])(example).numpy())
tf.keras.layers.DenseFeatures(feature_columns)(example).numpy()
The results of the code above are as follows:
Feature value: "Third"
One-hot encoded:  [[0. 0. 1.]]
array([[22.  ,  1.  ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ,  0.  ,  0.  ,
         0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ,  0.  ,  0.  ,  0.  ,
         7.25,  1.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ,
         0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ,  0.  ]], dtype=float32)
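The snippet above demonstrates one-hot encoding; bucketization, mentioned earlier, can be wired in the same way. A minimal sketch, with illustrative boundaries that are not from the original code:

# Hypothetical example: bucketize the continuous 'age' column so the
# trees can split on discrete age ranges.
age_col = tf.feature_column.numeric_column('age', dtype=tf.float32)
age_buckets = tf.feature_column.bucketized_column(
    age_col, boundaries=[12, 18, 25, 35, 50, 65])
# age_buckets could then be appended to feature_columns.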
The code for further processing after one-hot encoding is as follows:
NUM_EXAMPLES = len(out_train)

def make_input_fn(X, y, n_epochs=None, shuffle=True):
    def input_fn():
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        if shuffle:
            dataset = dataset.shuffle(NUM_EXAMPLES)
        # For training, cycle through the data as often as needed (n_epochs=None)
        dataset = dataset.repeat(n_epochs)
        # The dataset is small enough to use as a single batch
        dataset = dataset.batch(NUM_EXAMPLES)
        return dataset
    return input_fn

train_input_fn = make_input_fn(train_data, out_train)
eval_input_fn = make_input_fn(eval_data, out_eval, shuffle=False, n_epochs=1)
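To see what an input function actually yields, you can pull a single batch from the dataset it builds (a quick eager-mode check, not part of the training pipeline):

# Fetch one batch from the dataset built by the input function.
features, labels = next(iter(train_input_fn()))
print(list(features.keys()))  # feature names in the batch
print(labels[:5])             # first few 'survived' labels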
4.2: Training and Evaluating the model
The flowchart below outlines how we train and evaluate the models:
First, we train a linear classifier as a baseline:
linear_est = tf.estimator.LinearClassifier(feature_columns)
linear_est.train(train_input_fn, max_steps=100)
result = linear_est.evaluate(eval_input_fn)
clear_output()
print(pd.Series(result))
The output is as follows:
accuracy                0.765152
accuracy_baseline       0.625000
auc                     0.832844
auc_precision_recall    0.789631
average_loss            0.478908
label/mean              0.375000
loss                    0.478908
precision               0.703297
prediction/mean         0.350790
recall                  0.646465
global_step           100.000000
dtype: float64
Then we train the boosted trees estimator and evaluate the results:
n_batches = 1
est = tf.estimator.BoostedTreesClassifier(feature_columns, n_batches_per_layer=n_batches)
est.train(train_input_fn, max_steps=100)
result = est.evaluate(eval_input_fn)
clear_output()
print(pd.Series(result))
The output of the second evaluation is as follows:
accuracy                0.833333
accuracy_baseline       0.625000
auc                     0.874931
auc_precision_recall    0.859920
average_loss            0.405004
label/mean              0.375000
loss                    0.405004
precision               0.795699
prediction/mean         0.383333
recall                  0.747475
global_step           100.000000
dtype: float64
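The classifier above relies on default hyperparameters. BoostedTreesClassifier also exposes knobs such as n_trees, max_depth, and learning_rate; the values below are purely illustrative, so tune them on your own data:

# Illustrative hyperparameters -- not tuned values from this article.
est_tuned = tf.estimator.BoostedTreesClassifier(
    feature_columns,
    n_batches_per_layer=n_batches,
    n_trees=50,         # number of trees in the ensemble
    max_depth=4,        # depth of each individual tree
    learning_rate=0.1)  # shrinkage applied to each tree's contribution
est_tuned.train(train_input_fn, max_steps=100)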
5. Making Final Predictions and Visualizing Them
The code below makes the final predictions and visualizes the predicted probabilities:
plt.figure(figsize=(10, 10))
pred_dicts = list(est.predict(eval_input_fn))
# Probability of the positive class ('survived' == 1) for each passenger
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])
probs.plot(kind='hist', bins=20, title='predicted probabilities', facecolor='#F1C40F',
           edgecolor='#21618C', linewidth=1.5, histtype='stepfilled')
plt.show()
The final results of the code are shown below:
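Besides probabilities, each prediction dict returned by est.predict() also carries hard class labels, which is handy if you want 0/1 predictions rather than scores (a small addition to the original code):

# 'class_ids' holds the predicted class (0 = did not survive, 1 = survived).
pred_classes = pd.Series([pred['class_ids'][0] for pred in pred_dicts])
print(pred_classes.value_counts())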
6. Receiver operating characteristic (ROC) Plot
We plot the ROC curve to learn more about our results:
from sklearn.metrics import roc_curve

plt.figure(figsize=(10, 10))
fpr, tpr, _ = roc_curve(out_eval, probs)
plt.plot(fpr, tpr, color='red')
plt.title('ROC curve')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.xlim(0,)
plt.ylim(0,)
plt.show()
The final ROC plot is shown below:
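To summarize the curve in a single number, you can also compute the area under it with scikit-learn (a small addition, not in the original code); it should roughly match the 'auc' value reported by est.evaluate() above:

from sklearn.metrics import roc_auc_score

# Area under the ROC curve: 1.0 is perfect, 0.5 is random guessing.
print('AUC:', roc_auc_score(out_eval, probs))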
Conclusion
You have successfully implemented Boosted Trees using Estimators in TensorFlow.
Congratulations! Thank you for reading!
Keep reading to learn more!
You can find the code here.
Also Read:
- Learn Classification of clothing images using TensorFlow in Python
- Logistic Regression in TensorFlow
- Phishing website detection using auto-encoders in Keras using Python
- Load CSV Data From URL in TensorFlow