Ensemble technique using TensorFlow and scikit-learn
Welcome folks! In this blog we are going to learn how to implement an ensemble technique using TensorFlow and scikit-learn. We will predict stock price values using an average ensemble of the predictions from a TensorFlow regression model and a scikit-learn linear regression model.
Data Preprocessing
Let us import the necessary libraries and packages along with the train and test datasets. The datasets are available at https://github.com/MadhumithaSrini/Stock-prediction-dataset
Consider the train data for training the model and treat the test data as the unseen data on which the model should finally predict.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("C:/Users/madhumitha/Downloads/TrainDatasetStockPrice.csv")
df_test = pd.read_csv("C:/Users/madhumitha/Downloads/TestDatasetStockPrice.csv")
df.columns
Output:
Index(['Stock Index', 'Index', 'Industry', 'VWAP', 'General Index', 'NAV', 'P/E Ratio', 'Volumes Traded', 'Inventory Turnover', 'Covid Impact (Beta)', 'Tracking Error', 'Dollar Exchange Rate', 'Put-Call Ratio', 'P/B Ratio', 'Stock Price'], dtype='object')
Let us check out for the null values in the dataset.
df.isnull().sum()
Output:
Stock Index               0
Index                     0
Industry                  0
VWAP                     38
General Index            62
NAV                      61
P/E Ratio               234
Volumes Traded          268
Inventory Turnover      399
Covid Impact (Beta)     376
Tracking Error           71
Dollar Exchange Rate     77
Put-Call Ratio           85
P/B Ratio                25
Stock Price               0
dtype: int64
Now we need to impute the null values. So let us see the correlation between the features. The heatmap is as follows:
corr_df = df.corr()
corr_df
sns.heatmap(corr_df)
Output:
We see that there is not much correlation between the features. So we will impute the values of the columns that are comparatively more correlated with the target variable “Stock Price” using the mean of the respective columns, and drop the remaining rows with null values. Let us also clean the test data.
df["Inventory Turnover"].fillna(df["Inventory Turnover"].mean(), inplace=True) df["Covid Impact (Beta)"].fillna(df["Covid Impact (Beta)"].mean(), inplace=True) df["P/E Ratio"].fillna(df["P/E Ratio"].mean(), inplace=True) df["NAV"].fillna(df["NAV"].mean(), inplace=True) #final cleaning of training data df= df.dropna() df = df.replace([np.inf, -np.inf], np.nan) df = df.dropna() #cleaning test data df_test = df_test.dropna() df_test = df_test.replace([np.inf, -np.inf], np.nan) df_test = df_test.dropna()
We also don’t need the “Stock Index” and “Index” columns, because these identifiers would not contribute much to the prediction.
df = df.drop(columns = ['Stock Index', 'Index'])
We can notice that the “Industry” column is the only one containing strings, so we need to convert these strings to numbers. Let us see what the categories are and then replace them with integer codes from 0 to N-1, where N is the number of categories.
print(pd.Categorical(df['Industry']))
df['Industry'] = df['Industry'].astype('category')
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

# Doing the same for the test set
print(pd.Categorical(df_test['Industry']))
df_test['Industry'] = df_test['Industry'].astype('category')
cat_columns = df_test.select_dtypes(['category']).columns
df_test[cat_columns] = df_test[cat_columns].apply(lambda x: x.cat.codes)
Output:
[Real Estate, Materials, Materials, Healthcare, Materials, ..., Materials, Healthcare, Materials, Materials, Materials]
Length: 6934
Categories (5, object): [Energy, Healthcare, Information Tech, Materials, Real Estate]

[Materials, Energy, Information Tech, Healthcare, Materials, ..., Healthcare, Information Tech, Energy, Healthcare, Information Tech]
Length: 2415
Categories (5, object): [Energy, Healthcare, Information Tech, Materials, Real Estate]
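For reference, a small sketch (assuming it is run before the conversion above, since the strings are replaced in place) can capture which integer code corresponds to which industry; the industry_mapping name is just an illustrative helper:

# Hypothetical helper: capture the code-to-category mapping before the
# strings are replaced, so the numeric codes stay interpretable later.
industry_cat = df['Industry'].astype('category')
industry_mapping = dict(enumerate(industry_cat.cat.categories))
print(industry_mapping)
# Expected to print something like:
# {0: 'Energy', 1: 'Healthcare', 2: 'Information Tech', 3: 'Materials', 4: 'Real Estate'}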
Let us now normalise the values. This step usually improves model accuracy because the features then vary within a small range, which makes it easier for the model to learn. Let us normalise the values in the train and test sets.
df_stats = df.describe()
df_stats = df_stats.transpose()  # put the mean, std, etc. of each column into separate columns instead of rows

df_norm = (df - df.mean()) / df.std()  # normalising the values and creating a new DataFrame
Let us now take the cleaned test set and normalise its values as well.
# featurisation of test data, eliminating the unwanted columns
x_actual_test = df_test[['Industry', 'VWAP', 'General Index', 'NAV', 'P/E Ratio',
                         'Volumes Traded', 'Inventory Turnover', 'Covid Impact (Beta)',
                         'Tracking Error', 'Dollar Exchange Rate', 'Put-Call Ratio', 'P/B Ratio']]

x_actual_stats = x_actual_test.describe()
x_actual_stats = x_actual_stats.transpose()
x_actual_stats

x_actual_norm = (x_actual_test - x_actual_test.mean()) / x_actual_test.std()
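One thing worth noting: the snippet above scales the test features with the test set’s own mean and standard deviation. A common alternative, shown here only as a sketch (not part of the original pipeline; feature_cols and x_actual_norm_alt are illustrative names), is to reuse the training statistics so both sets share the same scale:

# Sketch: normalise the test features with the training set's mean and std
# instead of the test set's own statistics, so both share the same scale.
feature_cols = x_actual_test.columns
x_actual_norm_alt = (x_actual_test - df[feature_cols].mean()) / df[feature_cols].std()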
Now let us define the “x” and “y” variables for training the model.
x = df_norm[['Industry', 'VWAP', 'General Index', 'NAV', 'P/E Ratio',
             'Volumes Traded', 'Inventory Turnover', 'Covid Impact (Beta)',
             'Tracking Error', 'Dollar Exchange Rate', 'Put-Call Ratio', 'P/B Ratio']]
y = df_norm['Stock Price']

print(x.shape)
print(y.shape)
Output:
(6934, 12)
(6934,)
Let us do the train_test_split using scikit-learn.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
Building the TensorFlow model
Let us import the necessary packages and then build the model.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback  # for early stopping

def build_model():
    model = keras.Sequential([
        layers.Dense(10, activation='sigmoid', input_shape=(12,)),
        layers.Dense(5, activation='sigmoid'),
        layers.Dense(1, activation='relu')
    ])
    model.compile(loss='mae', optimizer='adam', metrics=['mae'])
    return model
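Note that the EarlyStopping and LambdaCallback imports above are not actually used in the training run below. As a hedged sketch, early stopping could be wired in through the callbacks argument of model.fit:

# Sketch: stop training once the validation loss stops improving for 5 epochs
# and roll back to the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# history = model.fit(X_train, y_train,
#                     validation_data=(X_test, y_test),
#                     epochs=50,
#                     callbacks=[early_stop])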
Now let us train the model.
model = build_model()
preds_on_untrained = model.predict(X_test)  # predictions of the untrained model, kept for comparison

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=50,
)
Epoch 1/50 174/174 [==============================] - 1s 4ms/step - loss: 0.8329 - mae: 0.8329 - val_loss: 0.8157 - val_mae: 0.8157 Epoch 2/50 174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157 Epoch 3/50 174/174 [==============================] - 0s 3ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157 Epoch 4/50 174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157 Epoch 5/50 174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157 Epoch 6/50 174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157 Epoch 7/50 174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157 Epoch 8/50 174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157 Epoch 9/50 174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157 Epoch 10/50 174/174 [==============================] - 0s 2ms/step - loss: 0.8127 - mae: 0.8127 - val_loss: 0.7685 - val_mae: 0.7685 Epoch 11/50 174/174 [==============================] - 0s 2ms/step - loss: 0.6815 - mae: 0.6815 - val_loss: 0.5909 - val_mae: 0.5909 Epoch 12/50 174/174 [==============================] - 0s 2ms/step - loss: 0.5335 - mae: 0.5335 - val_loss: 0.5049 - val_mae: 0.5049 Epoch 13/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4887 - mae: 0.4887 - val_loss: 0.4864 - val_mae: 0.4864 Epoch 14/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4778 - mae: 0.4778 - val_loss: 0.4778 - val_mae: 0.4778 Epoch 15/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4728 - mae: 0.4728 - val_loss: 0.4729 - val_mae: 0.4729 Epoch 16/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4692 - mae: 0.4692 - val_loss: 0.4694 - val_mae: 0.4694 Epoch 17/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4661 - mae: 0.4661 - val_loss: 0.4669 - val_mae: 0.4669 Epoch 18/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4638 - mae: 0.4638 - val_loss: 0.4645 - val_mae: 0.4645 Epoch 19/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4615 - mae: 0.4615 - val_loss: 0.4624 - val_mae: 0.4624 Epoch 20/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4593 - mae: 0.4593 - val_loss: 0.4622 - val_mae: 0.4622 Epoch 21/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4577 - mae: 0.4577 - val_loss: 0.4589 - val_mae: 0.4589 Epoch 22/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4559 - mae: 0.4559 - val_loss: 0.4574 - val_mae: 0.4574 Epoch 23/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4537 - mae: 0.4537 - val_loss: 0.4560 - val_mae: 0.4560 Epoch 24/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4516 - mae: 0.4516 - val_loss: 0.4531 - val_mae: 0.4531 Epoch 25/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4493 - mae: 0.4493 - val_loss: 0.4506 - val_mae: 0.4506 Epoch 26/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4470 - mae: 0.4470 - val_loss: 0.4483 - val_mae: 0.4483 Epoch 27/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4445 - mae: 0.4445 - 
val_loss: 0.4460 - val_mae: 0.4460 Epoch 28/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4416 - mae: 0.4416 - val_loss: 0.4425 - val_mae: 0.4425 Epoch 29/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4391 - mae: 0.4391 - val_loss: 0.4395 - val_mae: 0.4395 Epoch 30/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4367 - mae: 0.4367 - val_loss: 0.4371 - val_mae: 0.4371 Epoch 31/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4347 - mae: 0.4347 - val_loss: 0.4348 - val_mae: 0.4348 Epoch 32/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4322 - mae: 0.4322 - val_loss: 0.4327 - val_mae: 0.4327 Epoch 33/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4306 - mae: 0.4306 - val_loss: 0.4311 - val_mae: 0.4311 Epoch 34/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4294 - mae: 0.4294 - val_loss: 0.4302 - val_mae: 0.4302 Epoch 35/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4287 - mae: 0.4287 - val_loss: 0.4300 - val_mae: 0.4300 Epoch 36/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4275 - mae: 0.4275 - val_loss: 0.4280 - val_mae: 0.4280 Epoch 37/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4265 - mae: 0.4265 - val_loss: 0.4270 - val_mae: 0.4270 Epoch 38/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4260 - mae: 0.4260 - val_loss: 0.4263 - val_mae: 0.4263 Epoch 39/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4252 - mae: 0.4252 - val_loss: 0.4251 - val_mae: 0.4251 Epoch 40/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4245 - mae: 0.4245 - val_loss: 0.4249 - val_mae: 0.4249 Epoch 41/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4239 - mae: 0.4239 - val_loss: 0.4249 - val_mae: 0.4249 Epoch 42/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4234 - mae: 0.4234 - val_loss: 0.4236 - val_mae: 0.4236 Epoch 43/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4228 - mae: 0.4228 - val_loss: 0.4231 - val_mae: 0.4231 Epoch 44/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4225 - mae: 0.4225 - val_loss: 0.4228 - val_mae: 0.4228 Epoch 45/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4217 - mae: 0.4217 - val_loss: 0.4223 - val_mae: 0.4223 Epoch 46/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4214 - mae: 0.4214 - val_loss: 0.4218 - val_mae: 0.4218 Epoch 47/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4209 - mae: 0.4209 - val_loss: 0.4212 - val_mae: 0.4212 Epoch 48/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4205 - mae: 0.4205 - val_loss: 0.4219 - val_mae: 0.4219 Epoch 49/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4202 - mae: 0.4202 - val_loss: 0.4210 - val_mae: 0.4210 Epoch 50/50 174/174 [==============================] - 0s 2ms/step - loss: 0.4199 - mae: 0.4199 - val_loss: 0.4211 - val_mae: 0.4211
Let us see the loss value:
loss = model.evaluate(X_test, y_test)
print("Loss is : ", loss)
Output:
44/44 [==============================] - 0s 1ms/step - loss: 0.4211 - mae: 0.4211
Loss is :  [0.4211407005786896, 0.4211407005786896]
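Since the model was trained on normalised targets, this MAE of roughly 0.42 is in standard-deviation units. As a rough sketch (the mae_norm name is just illustrative), it can be converted back to the original price scale by multiplying with the standard deviation of the target:

# Rough sketch: the reported MAE is in normalised (standard-deviation) units,
# so multiplying by the target's std gives an approximate error in price units.
mae_norm = loss[1]  # model.evaluate returned [loss, mae]
print("Approximate MAE in price units:", mae_norm * df['Stock Price'].std())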
Let us plot the validation and training loss.
def plot_history(history):
    loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' not in s]
    val_loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' in s]
    acc_list = [s for s in history.history.keys() if 'acc' in s and 'val' not in s]
    val_acc_list = [s for s in history.history.keys() if 'acc' in s and 'val' in s]

    if len(loss_list) == 0:
        print('Loss is missing in history')
        return

    ## As loss always exists
    epochs = range(1, len(history.history[loss_list[0]]) + 1)

    ## Loss
    plt.figure(1)
    for l in loss_list:
        plt.plot(epochs, history.history[l], 'b',
                 label='Training loss (' + str(format(history.history[l][-1], '.5f')) + ')')
    for l in val_loss_list:
        plt.plot(epochs, history.history[l], 'g',
                 label='Validation loss (' + str(format(history.history[l][-1], '.5f')) + ')')

    plt.title('Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()

plot_history(history)
Output:
We will now look at how well the model has captured the values. We will also plot the predictions of the trained model and compare them with those of an untrained model.
def plot_predictions(preds, y_test):
    plt.figure(figsize=(8, 8))
    plt.plot(preds, y_test, 'ro')
    plt.xlabel('Preds')
    plt.ylabel('Labels')
    plt.xlim([-0.5, 0.5])
    plt.ylim([-0.5, 0.5])
    plt.plot([-0.5, 0.5], [-0.5, 0.5], 'b--')
    plt.show()
    return

def compare_predictions(preds1, preds2, y_test):
    plt.figure(figsize=(8, 8))
    plt.plot(preds1, y_test, 'ro', label='Untrained Model')
    plt.plot(preds2, y_test, 'go', label='Trained Model')
    plt.xlabel('Preds')
    plt.ylabel('Labels')

    y_min = min(min(y_test), min(preds1), min(preds2))
    y_max = max(max(y_test), max(preds1), max(preds2))

    plt.xlim([y_min, y_max])
    plt.ylim([y_min, y_max])
    plt.plot([y_min, y_max], [y_min, y_max], 'b--')
    plt.legend()
    plt.show()
    return
On calling the function to plot the values, we have,
preds_on_trained = model.predict(X_test)
compare_predictions(preds_on_untrained, preds_on_trained, y_test)
Output:
The important part, however, is predicting the values for the actual test set. Let us move on to that.
ans_tf = model.predict(x_actual_norm)
ans_tf
Output:
array([[0.5942301],
       [0.       ],
       [2.3680222],
       ...,
       [0.       ],
       [0.       ],
       [1.7813871]], dtype=float32)
Here we see that the predicted values are still in normalised form, so we need to convert them back to the original scale. Let us write a function for that.
y_mean = df['Stock Price'].mean()
y_std = df['Stock Price'].std()

def convert_label_value(pred):
    return int(pred * y_std + y_mean)
We have just done the opposite of what we did while normalizing the values. Let us call the function.
tf_ans = []
for i in range(len(ans_tf)):
    temp = convert_label_value(ans_tf[i])
    tf_ans.append(temp)

tf_ans[:10]
Output:
[820, 569, 1569, 829, 752, 1414, 569, 569, 1744, 569]
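The same de-normalisation can also be done in one vectorised step; here is a small sketch of an equivalent alternative to the loop above (tf_ans_vectorised is an illustrative name):

# Sketch: flatten the (N, 1) prediction array, de-normalise, truncate to int.
tf_ans_vectorised = (ans_tf.flatten() * y_std + y_mean).astype(int).tolist()
tf_ans_vectorised[:10]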
Now that we have obtained the predictions from TensorFlow, let us move on to predicting with scikit-learn.
Building the Scikit-learn model for ensembling with the TensorFlow model
Let us import the necessary libraries.
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
Let us use the values from the DataFrame as they are, without normalising, because this already gives good results.
x = df[['Industry', 'VWAP', 'General Index', 'NAV', 'P/E Ratio',
        'Volumes Traded', 'Inventory Turnover', 'Covid Impact (Beta)',
        'Tracking Error', 'Dollar Exchange Rate', 'Put-Call Ratio', 'P/B Ratio']]
y = df['Stock Price']

print(x.shape)
print(y.shape)
Output:
(6934, 12)
(6934,)
We will now build the model and make predictions.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

model_linear = LinearRegression()
model_linear.fit(x_train, y_train)

y_pred = model_linear.predict(x_test)
print("R^2 :", r2_score(y_test, y_pred))
Output:
R^2 : 0.8850773853455763
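The mean_squared_error import above is not actually used, so, as a small sketch, we could also report RMSE and MAE on the same hold-out split:

# Sketch: additional error metrics on the same hold-out split.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = metrics.mean_absolute_error(y_test, y_pred)
print("RMSE:", rmse)
print("MAE :", mae)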
Let us now use “yellowbrick” to visualise the predictions and compare them with the ground truth.
from yellowbrick.regressor import PredictionError, ResidualsPlot

visualizer = PredictionError(model_linear)
visualizer.fit(x_train, y_train)    # Fit the training data to the visualizer
visualizer.score(x_test, y_test)    # Evaluate the model on the validation data
visualizer.poof();
Output:
The linear regression prediction on the actual test set is as follows:
y_linear_actual_pred = model_linear.predict(x_actual_test)
y_linear_actual_pred
Output:
array([ 862.47645798, 414.79836089, 1309.44416101, ..., 186.62937809, 587.26500536, 1162.33573545])
Average ensemble technique
Both the TensorFlow and Scikit-learn models are now ready, so let us move on to ensembling.
In order to achieve better accuracy, there are ensemble techniques such as averaging, weighted averaging, boosting, etc. (a sketch of a weighted variant appears after the final predictions below).
We will now apply the average ensemble technique to the TensorFlow and Scikit-learn model predictions. It simply means taking, for each test sample, the average of the predictions from both models.
final_pred = (y_linear_actual_pred + tf_ans) / 2
final_pred
The final predictions after ensembling are as follows:
array([ 841.23822899, 491.89918045, 1439.22208051, ..., 377.81468905, 578.13250268, 1242.16786772])
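As mentioned earlier, a weighted average is another option: instead of weighting both models equally, the model with the better validation score can be given a larger weight. Here is a hedged sketch (the 0.6/0.4 split is arbitrary, not tuned):

# Sketch: weighted-average ensemble; these weights are arbitrary and would
# normally be chosen based on validation performance.
w_sklearn, w_tf = 0.6, 0.4
weighted_pred = w_sklearn * y_linear_actual_pred + w_tf * np.array(tf_ans)
weighted_pred[:5]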
Cool! We have learned how to approach a regression problem using TensorFlow and scikit-learn and achieve better results with a simple average ensembling method.
Thank you