Ensemble technique using TensorFlow and Scikit-learn

Welcome folks! In this blog we are going to learn how to implement an ensemble technique using TensorFlow and Scikit-learn. We will predict stock prices using an average ensemble of the predictions from a TensorFlow regression model and a Scikit-learn linear regression model.

Data Preprocessing

Let us import the necessary libraries and packages, along with the train and test datasets. The datasets are available at https://github.com/MadhumithaSrini/Stock-prediction-dataset

Consider the train data for training the model and treat the test data as the unseen data on which the model should finally predict.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("C:/Users/madhumitha/Downloads/TrainDatasetStockPrice.csv")

df_test = pd.read_csv("C:/Users/madhumitha/Downloads/TestDatasetStockPrice.csv")
df.columns

Output:

Index(['Stock Index', 'Index', 'Industry', 'VWAP', 'General Index', 'NAV',
       'P/E Ratio', 'Volumes Traded', 'Inventory Turnover',
       'Covid Impact (Beta)', 'Tracking Error', 'Dollar Exchange Rate',
       'Put-Call Ratio', 'P/B Ratio', 'Stock Price'],
      dtype='object')

 

Let us check for null values in the dataset.

df.isnull().sum()

Output:

Stock Index               0
Index                     0
Industry                  0
VWAP                     38
General Index            62
NAV                      61
P/E Ratio               234
Volumes Traded          268
Inventory Turnover      399
Covid Impact (Beta)     376
Tracking Error           71
Dollar Exchange Rate     77
Put-Call Ratio           85
P/B Ratio                25
Stock Price               0
dtype: int64

 

Now we need to handle the null values. Let us first look at the correlation between the features; the heatmap is as follows:

corr_df=df.corr()
corr_df
sns.heatmap(corr_df)

Output:

(Heatmap of the pairwise correlations between the features.)

We see that there is not much correlation between the features. So we will mean-impute those columns which are comparatively more correlated with the output variable “Stock Price”, and drop the remaining rows that still contain null values. Let us also clean the test data.
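One quick way to check which features correlate most with the target is to sort the “Stock Price” column of the correlation matrix we computed above:

corr_df['Stock Price'].sort_values(ascending=False)   # correlation of each column with the target

With that in mind, we impute the chosen columns with their means and clean up the rest: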

df["Inventory Turnover"].fillna(df["Inventory Turnover"].mean(), inplace=True)
df["Covid Impact (Beta)"].fillna(df["Covid Impact (Beta)"].mean(), inplace=True)
df["P/E Ratio"].fillna(df["P/E Ratio"].mean(), inplace=True)
df["NAV"].fillna(df["NAV"].mean(), inplace=True)


#final cleaning of training data

df= df.dropna()
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()


#cleaning test data

df_test = df_test.dropna()
df_test = df_test.replace([np.inf, -np.inf], np.nan)
df_test = df_test.dropna()

 

Also, we don’t need the “Stock Index” and “Index” columns, because these identifiers would not contribute much to the prediction.

df = df.drop(columns = ['Stock Index', 'Index'])

 

We can notice that the “Industry” column is the only one containing strings, so we need to convert them to numbers. Let us see what the categories are and then replace each one with an integer code from 0 to (N-1), where N is the number of categories.

print(pd.Categorical(df['Industry']))

df['Industry'] = df['Industry'].astype('category')
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)


#Doing the same for the test set

print(pd.Categorical(df_test['Industry']))

df_test['Industry'] = df_test['Industry'].astype('category')
cat_columns = df_test.select_dtypes(['category']).columns
df_test[cat_columns] = df_test[cat_columns].apply(lambda x: x.cat.codes)

Output:

[Real Estate, Materials, Materials, Healthcare, Materials, ..., Materials, Healthcare, Materials, Materials, Materials]
Length: 6934
Categories (5, object): [Energy, Healthcare, Information Tech, Materials, Real Estate]



[Materials, Energy, Information Tech, Healthcare, Materials, ..., Healthcare, Information Tech, Energy, Healthcare, Information Tech]
Length: 2415
Categories (5, object): [Energy, Healthcare, Information Tech, Materials, Real Estate]
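Since both DataFrames contain the same five industries in the same (alphabetical) category order, the integer codes from cat.codes line up between train and test. If that were not guaranteed, a safer option would be an explicit mapping shared by both frames; a small illustrative sketch (the dictionary below simply mirrors the categories printed above):

# Illustrative alternative to cat.codes: a fixed mapping shared by train and test
industry_map = {'Energy': 0, 'Healthcare': 1, 'Information Tech': 2,
                'Materials': 3, 'Real Estate': 4}
# (this would be applied to the raw string column, i.e. instead of the cat.codes step above)
# df['Industry'] = df['Industry'].map(industry_map)
# df_test['Industry'] = df_test['Industry'].map(industry_map)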

 

Let us now normalise the values. This step generally improves model accuracy because all the features end up varying within a small, comparable range, which makes it easier for the model to learn. We will normalise the values in both the train and test sets.

df_stats = df.describe()  
df_stats= df_stats.transpose()   # making the mean, std etc.. values of all columns to be in separate columns instead of having them as rows

df_norm = (df - df.mean()) / df.std() #normalising the values and creating a new DataFrame

 

Let us now take the feature columns from the cleaned test set and normalise those values as well.

#featurisation of test data and eliminating the unwanted columns

x_actual_test = df_test[['Industry', 'VWAP', 'General Index', 'NAV',
       'P/E Ratio', 'Volumes Traded', 'Inventory Turnover',
       'Covid Impact (Beta)', 'Tracking Error', 'Dollar Exchange Rate',
       'Put-Call Ratio', 'P/B Ratio']]


x_actual_stats = x_actual_test.describe()
x_actual_stats= x_actual_stats.transpose()
x_actual_stats

x_actual_norm = (x_actual_test - x_actual_test.mean()) / x_actual_test.std()
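Note that here the test features are scaled with their own mean and standard deviation. A common alternative is to reuse the training-set statistics so that both sets are on exactly the same scale; a minimal sketch of that variant:

# Alternative: normalise the test features with the training-set statistics
feature_cols = x_actual_test.columns
x_actual_norm = (x_actual_test - df[feature_cols].mean()) / df[feature_cols].std()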

 

Now let us define the “x” and “y” variables for training the model.

x = df_norm[['Industry', 'VWAP', 'General Index', 'NAV',
       'P/E Ratio', 'Volumes Traded', 'Inventory Turnover',
       'Covid Impact (Beta)', 'Tracking Error', 'Dollar Exchange Rate',
       'Put-Call Ratio', 'P/B Ratio']]

y = df_norm['Stock Price']

print(x.shape)
print(y.shape)

Output:

(6934, 12)
(6934,)

 

Let us split the data using scikit-learn's train_test_split.

from sklearn.model_selection import train_test_split         
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state= 1)

 

Building the TensorFlow model

Let us import the necessary packages and then build the model.

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

from tensorflow.keras.layers import Dense, Dropout                      
from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback    #for early stopping

def build_model():
    model = keras.Sequential([
        layers.Dense(10, activation='sigmoid', input_shape=(12,)),
        layers.Dense(5, activation='sigmoid'),
        layers.Dense(1, activation='relu')
    ])

    model.compile(loss = 'mae',
                  optimizer = 'adam',
                  metrics = ['mae'])
    return model

 

Now let us train the model.

model = build_model()

preds_on_untrained = model.predict(X_test)  # to just compare an untrained model's performance

history = model.fit(
    X_train, y_train,
    validation_data = (X_test, y_test),
    epochs = 50,
)
Output:

Epoch 1/50
174/174 [==============================] - 1s 4ms/step - loss: 0.8329 - mae: 0.8329 - val_loss: 0.8157 - val_mae: 0.8157
Epoch 2/50
174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157
Epoch 3/50
174/174 [==============================] - 0s 3ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157
Epoch 4/50
174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157
Epoch 5/50
174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157
Epoch 6/50
174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157
Epoch 7/50
174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157
Epoch 8/50
174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157
Epoch 9/50
174/174 [==============================] - 0s 2ms/step - loss: 0.8189 - mae: 0.8189 - val_loss: 0.8157 - val_mae: 0.8157
Epoch 10/50
174/174 [==============================] - 0s 2ms/step - loss: 0.8127 - mae: 0.8127 - val_loss: 0.7685 - val_mae: 0.7685
Epoch 11/50
174/174 [==============================] - 0s 2ms/step - loss: 0.6815 - mae: 0.6815 - val_loss: 0.5909 - val_mae: 0.5909
Epoch 12/50
174/174 [==============================] - 0s 2ms/step - loss: 0.5335 - mae: 0.5335 - val_loss: 0.5049 - val_mae: 0.5049
Epoch 13/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4887 - mae: 0.4887 - val_loss: 0.4864 - val_mae: 0.4864
Epoch 14/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4778 - mae: 0.4778 - val_loss: 0.4778 - val_mae: 0.4778
Epoch 15/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4728 - mae: 0.4728 - val_loss: 0.4729 - val_mae: 0.4729
Epoch 16/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4692 - mae: 0.4692 - val_loss: 0.4694 - val_mae: 0.4694
Epoch 17/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4661 - mae: 0.4661 - val_loss: 0.4669 - val_mae: 0.4669
Epoch 18/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4638 - mae: 0.4638 - val_loss: 0.4645 - val_mae: 0.4645
Epoch 19/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4615 - mae: 0.4615 - val_loss: 0.4624 - val_mae: 0.4624
Epoch 20/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4593 - mae: 0.4593 - val_loss: 0.4622 - val_mae: 0.4622
Epoch 21/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4577 - mae: 0.4577 - val_loss: 0.4589 - val_mae: 0.4589
Epoch 22/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4559 - mae: 0.4559 - val_loss: 0.4574 - val_mae: 0.4574
Epoch 23/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4537 - mae: 0.4537 - val_loss: 0.4560 - val_mae: 0.4560
Epoch 24/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4516 - mae: 0.4516 - val_loss: 0.4531 - val_mae: 0.4531
Epoch 25/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4493 - mae: 0.4493 - val_loss: 0.4506 - val_mae: 0.4506
Epoch 26/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4470 - mae: 0.4470 - val_loss: 0.4483 - val_mae: 0.4483
Epoch 27/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4445 - mae: 0.4445 - val_loss: 0.4460 - val_mae: 0.4460
Epoch 28/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4416 - mae: 0.4416 - val_loss: 0.4425 - val_mae: 0.4425
Epoch 29/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4391 - mae: 0.4391 - val_loss: 0.4395 - val_mae: 0.4395
Epoch 30/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4367 - mae: 0.4367 - val_loss: 0.4371 - val_mae: 0.4371
Epoch 31/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4347 - mae: 0.4347 - val_loss: 0.4348 - val_mae: 0.4348
Epoch 32/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4322 - mae: 0.4322 - val_loss: 0.4327 - val_mae: 0.4327
Epoch 33/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4306 - mae: 0.4306 - val_loss: 0.4311 - val_mae: 0.4311
Epoch 34/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4294 - mae: 0.4294 - val_loss: 0.4302 - val_mae: 0.4302
Epoch 35/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4287 - mae: 0.4287 - val_loss: 0.4300 - val_mae: 0.4300
Epoch 36/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4275 - mae: 0.4275 - val_loss: 0.4280 - val_mae: 0.4280
Epoch 37/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4265 - mae: 0.4265 - val_loss: 0.4270 - val_mae: 0.4270
Epoch 38/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4260 - mae: 0.4260 - val_loss: 0.4263 - val_mae: 0.4263
Epoch 39/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4252 - mae: 0.4252 - val_loss: 0.4251 - val_mae: 0.4251
Epoch 40/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4245 - mae: 0.4245 - val_loss: 0.4249 - val_mae: 0.4249
Epoch 41/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4239 - mae: 0.4239 - val_loss: 0.4249 - val_mae: 0.4249
Epoch 42/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4234 - mae: 0.4234 - val_loss: 0.4236 - val_mae: 0.4236
Epoch 43/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4228 - mae: 0.4228 - val_loss: 0.4231 - val_mae: 0.4231
Epoch 44/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4225 - mae: 0.4225 - val_loss: 0.4228 - val_mae: 0.4228
Epoch 45/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4217 - mae: 0.4217 - val_loss: 0.4223 - val_mae: 0.4223
Epoch 46/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4214 - mae: 0.4214 - val_loss: 0.4218 - val_mae: 0.4218
Epoch 47/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4209 - mae: 0.4209 - val_loss: 0.4212 - val_mae: 0.4212
Epoch 48/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4205 - mae: 0.4205 - val_loss: 0.4219 - val_mae: 0.4219
Epoch 49/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4202 - mae: 0.4202 - val_loss: 0.4210 - val_mae: 0.4210
Epoch 50/50
174/174 [==============================] - 0s 2ms/step - loss: 0.4199 - mae: 0.4199 - val_loss: 0.4211 - val_mae: 0.4211
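We imported EarlyStopping earlier but did not use it in this run. As an optional variation, it can be passed as a callback so that training stops automatically once the validation loss stops improving; a minimal sketch:

# Optional variation: stop when val_loss has not improved for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    validation_data = (X_test, y_test),
    epochs = 50,
    callbacks = [early_stop],
)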

 

Let us see the loss value:

loss = model.evaluate(X_test, y_test)
print("Loss is : ",loss)

Output:

44/44 [==============================] - 0s 1ms/step - loss: 0.4211 - mae: 0.4211
Loss is :  [0.4211407005786896, 0.4211407005786896]

 

Let us plot the validation and training loss.

def plot_history(history):
    loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' not in s]
    val_loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' in s]
    acc_list = [s for s in history.history.keys() if 'acc' in s and 'val' not in s]
    val_acc_list = [s for s in history.history.keys() if 'acc' in s and 'val' in s]
    
    if len(loss_list) == 0:
        print('Loss is missing in history')
        return 
    
    ## As loss always exists
    epochs = range(1,len(history.history[loss_list[0]]) + 1)
    
    ## Loss
    plt.figure(1)
    for l in loss_list:
        plt.plot(epochs, history.history[l], 'b', label='Training loss (' + str(str(format(history.history[l][-1],'.5f'))+')'))
    for l in val_loss_list:
        plt.plot(epochs, history.history[l], 'g', label='Validation loss (' + str(str(format(history.history[l][-1],'.5f'))+')'))
    
    plt.title('Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    
plot_history(history)

Output:

(Plot of the training and validation loss over the epochs.)

 

We will now look at how well the model has captured the values. We will also plot the predictions of the trained model and compare them with those of the untrained model.

def plot_predictions(preds, y_test):
    plt.figure(figsize=(8, 8))
    plt.plot(preds, y_test, 'ro')
    plt.xlabel('Preds')
    plt.ylabel('Labels')
    plt.xlim([-0.5, 0.5])
    plt.ylim([-0.5, 0.5])
    plt.plot([-0.5, 0.5], [-0.5, 0.5], 'b--')
    plt.show()
    return

def compare_predictions(preds1, preds2, y_test):
    plt.figure(figsize=(8, 8))
    plt.plot(preds1, y_test, 'ro', label='Untrained Model')
    plt.plot(preds2, y_test, 'go', label='Trained Model')
    plt.xlabel('Preds')
    plt.ylabel('Labels')
    
    y_min = min(min(y_test), min(preds1), min(preds2))
    y_max = max(max(y_test), max(preds1), max(preds2))
    
    plt.xlim([y_min, y_max])
    plt.ylim([y_min, y_max])
    plt.plot([y_min, y_max], [y_min, y_max], 'b--')
    plt.legend()
    plt.show()
    return

 

Calling the function to plot the values, we get:

preds_on_trained = model.predict(X_test)
compare_predictions(preds_on_untrained, preds_on_trained, y_test)

Output:

(Scatter plot comparing the untrained and trained model predictions against the labels.)

The important part, though, is predicting the values for the actual test set. Let us move on to that.

ans_tf = model.predict(x_actual_norm)
ans_tf

Output:

array([[0.5942301],
       [0.       ],
       [2.3680222],
       ...,
       [0.       ],
       [0.       ],
       [1.7813871]], dtype=float32)

 

Here we see that the predictions are still in normalised form (the zeros arise because the final ReLU layer clips negative normalised values). We need to convert them back to the original scale, so let us write a function for that.

y_mean = df['Stock Price'].mean()
y_std = df['Stock Price'].std()

def convert_label_value(pred):
    return int(pred * y_std + y_mean)

This is simply the inverse of the normalisation we applied earlier. Let us call the function.

tf_ans = []
for i in range(len(ans_tf)):
    temp = convert_label_value(ans_tf[i])
    tf_ans.append(temp)

tf_ans[:10]

Output:

[820, 569, 1569, 829, 752, 1414, 569, 569, 1744, 569]
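Equivalently, the loop above can be collapsed into a single vectorised step (a small sketch producing the same truncated-to-int values):

# Vectorised equivalent of the de-normalisation loop above
tf_ans = (ans_tf.flatten() * y_std + y_mean).astype(int).tolist()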

 

Now we have obtained the predictions from TensorFlow. Let us move on to predicting with Scikit-learn.

 

Building the Scikit-learn model to ensemble with the TensorFlow model

Let us import the necessary libraries.

from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

 

Let us use the values from the DataFrame as they are, without normalising, because this already gives good results.

x = df[['Industry','VWAP', 'General Index', 'NAV',
       'P/E Ratio', 'Volumes Traded', 'Inventory Turnover',
       'Covid Impact (Beta)', 'Tracking Error', 'Dollar Exchange Rate',
       'Put-Call Ratio', 'P/B Ratio']]

y = df['Stock Price']

print(x.shape)
print(y.shape)

Output:

(6934, 12)
(6934,)

 

We will now build the model and make predictions.

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25,random_state = 1)

model_linear = LinearRegression()
model_linear.fit(x_train,y_train)

y_pred = model_linear.predict(x_test)

print("R^2 :" , r2_score(y_test,y_pred))

Output:

R^2 : 0.8850773853455763
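Since mean_squared_error was imported above, we can also report the MSE and RMSE for the linear model; a short optional check:

# Optional extra metrics for the linear regression model
mse = mean_squared_error(y_test, y_pred)
print("MSE  :", mse)
print("RMSE :", np.sqrt(mse))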

 

Let us now use “yellowbrick” to visualise the predictions and compare them with the ground truth.

from yellowbrick.regressor import PredictionError, ResidualsPlot
visualizer = PredictionError(model_linear)

visualizer.fit(x_train, y_train)  # Fit the training data to the visualizer
visualizer.score(x_test, y_test)  # Evaluate the model on the validation data
visualizer.poof();   # note: newer yellowbrick versions use visualizer.show() instead of poof()

Output:

(Yellowbrick prediction error plot comparing predicted and actual values.)

 

 

The linear regression prediction on the actual test set is as follows:

y_linear_actual_pred = model_linear.predict(x_actual_test)
y_linear_actual_pred

Output:

array([ 862.47645798,  414.79836089, 1309.44416101, ...,  186.62937809,
        587.26500536, 1162.33573545])

Average ensemble technique

Both the TensorFlow and Scikit-learn models are now ready, so let us move on to ensembling.

To achieve better accuracy, there are ensemble techniques such as averaging, weighted averaging, boosting, etc.

We will now apply the average ensemble technique to the TensorFlow and Scikit-learn model predictions. It is nothing but taking, for each test example, the average of the two models' predicted values.

final_pred=(y_linear_actual_pred + tf_ans)/2
final_pred

 

The final predictions after ensembling are as follows:

array([ 841.23822899,  491.89918045, 1439.22208051, ...,  377.81468905,
        578.13250268, 1242.16786772])
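As mentioned earlier, a weighted average is a natural extension of this idea: if one model is expected to be more reliable, it can be given a larger weight. A purely illustrative sketch (the 0.7/0.3 weights below are arbitrary, not tuned):

# Illustrative weighted ensemble; the weights are arbitrary and only for demonstration
final_pred_weighted = 0.7 * y_linear_actual_pred + 0.3 * np.array(tf_ans)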

 

Cool! We have learned how to approach a regression problem using TensorFlow and Scikit-learn and how to achieve better results using a simple average ensembling method.

 

Thank you
