Predicting life expectancy using TensorFlow
Hey all! In this tutorial, we will predict life expectancy using TensorFlow. We will pre-process the data, do some data analysis, and then predict a person's average life expectancy with a TensorFlow regression model, given the other factors that influence it.
Python program to predict life expectancy using TensorFlow
First, let us import the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Scikit-learn and Keras pieces used later in the tutorial
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
Now let us read the file into a DataFrame. The dataset is available at: https://www.kaggle.com/kumarajarshi/life-expectancy-who
df = pd.read_csv("C:/Users/madhumitha/Downloads/LifeExpectancyData.csv")
df.head()
Let us also see how many null values there are in this dataset.
df.isnull().sum()
Output:
Country                           0
Year                              0
Status                            0
LifeExpectancy                   10
AdultMortality                   10
InfantDeaths                      0
Alcohol                         194
PercentageExpenditure             0
HepatitisB                      553
Measles                           0
BMI                              34
underfivedeaths                   0
Polio                            19
TotalExpenditure                226
Diphtheria                       19
HIV/AIDS                          0
GDP                             448
Population                      652
thinness1-19years                34
thinness5-9years                 34
IncomeCompositionOfResources    167
Schooling                       163
dtype: int64
We have many null values in this dataset that need to be imputed or dropped. Dropping the rows with null values is not a good option because it leads to a loss of data, so we will impute the values instead.
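To see how costly dropping would be, one can look at the share of null values per column (a quick check we've added here; not part of the original tutorial):

# Percentage of null values per column, sorted; shows why dropping
# rows outright would discard a large share of the data
print((df.isnull().mean() * 100).sort_values(ascending=False).round(1))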
Let’s impute the null values in certain columns with the corresponding column’s mean value.
df["LifeExpectancy"].fillna(df["LifeExpectancy"].mean(), inplace=True) df["AdultMortality"].fillna(df["AdultMortality"].mean(), inplace=True) df.dropna(subset = ["HepatitisB"], inplace=True)
print(df['AdultMortality'].mean())
sns.set(style="whitegrid")
sns.boxplot(x=df['AdultMortality'])
In the same way, box plots can be plotted for the other columns to easily spot the outliers, as in the sketch below.
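For instance, a small loop like the following draws a box plot for each of a few numeric columns (a sketch; the exact column list here is our choice, not from the original):

# Box plots for a handful of numeric columns to inspect outliers visually;
# adjust the column list to whatever you want to examine
for col in ['Population', 'HepatitisB', 'HIV/AIDS', 'TotalExpenditure']:
    plt.figure()
    sns.boxplot(x=df[col])
    plt.title(col)
    plt.show()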
Let us also remove the identified outliers in the dataset in order to improve the accuracy of the model.
df = df.loc[df["Population"] < 1093859294]
df = df.loc[df['HepatitisB'] >= 30]
df = df.loc[df['HIV/AIDS'] < 40]
df = df.loc[df['TotalExpenditure'] < 15]
df = df.loc[df['AdultMortality'] < 700]
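It is worth checking how many rows these filters discard; a quick sanity check:

# See how many rows remain after the outlier filters
print(df.shape)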
Imputing the null values
Further, let us do some more data cleaning. Here we impute the remaining null values with the respective column means, and drop the rows where "Alcohol" is null.
df["Population"].fillna(df["Population"].mean(), inplace=True) df.dropna(subset = ["Alcohol"], inplace=True) df["GDP"].fillna(df["GDP"].mean(), inplace=True) df["BMI"].fillna(df["BMI"].mean(), inplace=True) df["TotalExpenditure"].fillna(df["TotalExpenditure"].mean(), inplace=True) df["Polio"].fillna(df["Polio"].mean(), inplace=True) df["Diphtheria"].fillna(df["Diphtheria"].mean(), inplace=True) df.drop('Year',axis=1)
For better accuracy, let us use a different method for the "Schooling" column: imputing with the categorical mean.
def impute_schooling(cols):
    s = cols['Schooling']
    l = cols['LifeExpectancy']
    if pd.isnull(s):
        # Replace a missing Schooling value with the mean Schooling
        # of the row's life-expectancy band
        if l <= 40:
            return 8.0
        elif 40 < l <= 44:
            return 7.5
        elif 44 < l <= 50:   # closed at 50 so no value falls through
            return 8.1
        elif 50 < l <= 60:
            return 8.2
        elif 60 < l <= 70:
            return 10.5
        elif 70 < l <= 80:
            return 13.4
        elif l > 80:
            return 16.5
    else:
        return s

df['Schooling'] = df[['Schooling', 'LifeExpectancy']].apply(impute_schooling, axis=1)
Here we have imputed the "Schooling" column based on its mean value within bands of the "LifeExpectancy" column: each null is replaced according to the row's life-expectancy range. We chose "LifeExpectancy" to drive the imputation because the two variables are strongly correlated, as the heatmap shows.
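The hard-coded bucket values above were presumably derived from the data itself; before running the imputation, one could compute them with pd.cut and groupby (our reconstruction, not part of the original tutorial):

# Derive the per-band mean Schooling values used in impute_schooling;
# the bin edges mirror the if/elif chain above
bins = [0, 40, 44, 50, 60, 70, 80, 100]
buckets = pd.cut(df['LifeExpectancy'], bins=bins)
print(df.groupby(buckets)['Schooling'].mean())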
Let us have a look at the heatmap also!
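The heatmap can be produced with something like this (a minimal sketch; figure size and color map are our choices, and numeric_only requires pandas 1.5+):

# Correlation heatmap of the numeric columns; LifeExpectancy and
# Schooling show a strong positive correlation, which motivates
# the imputation strategy above
plt.figure(figsize=(14, 10))
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm')
plt.show()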
Now let us drop the very few remaining rows which still have null values in some columns.
df = df.dropna()
df.isnull().sum()
We have now arrived at a cleaned dataset.
Next, we will convert the strings in the "Status" column to either 0 or 1, since that column has only two categories ("Developing" and "Developed").
for ind in df.index:
    if df['Status'][ind] == 'Developed':
        df.loc[ind, 'Status'] = 1   # .loc avoids chained-assignment issues
    else:
        df.loc[ind, 'Status'] = 0
# The column is still object dtype after the loop; cast it so the
# arithmetic in the normalization step below works
df['Status'] = df['Status'].astype(int)
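Equivalently, the whole loop can be replaced by a single vectorized line (use one or the other, not both):

# Vectorized equivalent of the loop above: True/False -> 1/0
df['Status'] = (df['Status'] == 'Developed').astype(int)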
Let us drop the "Country" and "LifeExpectancy" columns to build the feature set x, and keep only the "LifeExpectancy" column as the target y.
sample = df.drop(['Country', 'LifeExpectancy'], axis=1)
x = sample[['Year', 'Status', 'AdultMortality', 'InfantDeaths', 'Alcohol',
            'PercentageExpenditure', 'HepatitisB', 'Measles', 'BMI',
            'underfivedeaths', 'Polio', 'TotalExpenditure', 'Diphtheria',
            'HIV/AIDS', 'GDP', 'Population', 'thinness1-19years',
            'thinness5-9years', 'IncomeCompositionOfResources', 'Schooling']]
y = df['LifeExpectancy']
print(x.shape)
print(y.shape)
To get a better idea of the shapes of x and y, let us look at the output.
(1497, 20)
(1497,)
Building the model
Let us normalize the values to get better accuracy.
# z-score normalization of the features and the target
sample_norm = (sample - sample.mean()) / sample.std()
y_norm = (y - y.mean()) / y.std()
Let us also write a function to convert predicted values back from the normalized form.
y_mean = df['LifeExpectancy'].mean()
y_std = df['LifeExpectancy'].std()

def convert_label_value(pred):
    # Undo the z-score normalization: pred * std + mean
    return int(pred * y_std + y_mean)
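As a quick sanity check (example values of our own, not from the original), a normalized prediction of 0 should map back to roughly the mean life expectancy:

# A normalized value of 0 corresponds to the column mean,
# and +1 corresponds to one standard deviation above it
print(convert_label_value(0))   # ~ int(y_mean)
print(convert_label_value(1))   # ~ int(y_mean + y_std)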
Let us have a look at what the x and y values look like after normalization.
X = sample_norm.iloc[:, :]
X.head()
Output:
   Year      Status     AdultMortality  InfantDeaths  Alcohol   PercentageExpenditure  HepatitisB  Measles    BMI        underfivedeaths  Polio      TotalExpenditure  Diphtheria  HIV/AIDS   GDP        Population  thinness1-19years  thinness5-9years  IncomeCompositionOfResources  Schooling
0  1.738179  -0.428108  0.832587        0.417127      -1.11651  -0.351539              -1.464045   -0.082418  -0.995570  0.407639         -3.933121  0.948385          -1.297232   -0.323326  -0.432582  0.784003    2.838166           2.825139          -0.864523                     -0.756515
1  1.490459  -0.428108  0.900689        0.440653      -1.11651  -0.350274              -1.673835   -0.154376  -1.021040  0.433666         -1.356216  0.957084          -1.475067   -0.323326  -0.430132  -0.369733   2.906474           2.870293          -0.881089                     -0.792789
2  1.242738  -0.428108  0.875151        0.464178      -1.11651  -0.350446              -1.533975   -0.161116  -1.046510  0.459692         -1.157992  0.935335          -1.356511   -0.323326  -0.428491  0.714769    2.952012           2.915447          -0.914221                     -0.829063
3  0.995018  -0.428108  0.909202        0.499467      -1.11651  -0.347647              -1.324185   0.095086   -1.071980  0.494393         -0.910213  1.104980          -1.178676   -0.323326  -0.425198  -0.253376   2.997551           2.983178          -0.952874                     -0.865337
4  0.747298  -0.428108  0.934740        0.522992      -1.11651  -0.387717              -1.254255   0.119652   -1.092356  0.529095         -0.860657  0.822238          -1.119398   -0.323326  -0.477446  -0.278183   3.065858           3.028332          -1.002572                     -0.974158
Y = y_norm
Y.head()
The “y” value will be as follows:
0   -0.548800
1   -1.152274
2   -1.152274
3   -1.199605
4   -1.235103
Name: LifeExpectancy, dtype: float64
Now, we will split the data into train and test sets and check their shapes.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25,
                                                    shuffle=True, random_state=0)
X_train = np.asarray(X_train).astype(np.float32)
X_test = np.asarray(X_test).astype(np.float32)
y_train = np.asarray(y_train).astype(np.float32)
y_test = np.asarray(y_test).astype(np.float32)
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)
Output:
X_train shape:  (1122, 20)
y_train shape:  (1122,)
X_test shape:  (375, 20)
y_test shape:  (375,)
Now let us build the model inside a get_model() function and call it.
def get_model():
    model = Sequential([
        Dense(15, input_shape=(20,), activation='relu'),
        Dense(10, activation='relu'),
        #Dense(5, activation='relu'),
        Dense(1)
    ])
    model.compile(
        loss='mse',
        optimizer='adam',
    )
    return model

model = get_model()
model.summary()
Output:
Model: "sequential_122" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_431 (Dense) (None, 15) 315 _________________________________________________________________ dense_432 (Dense) (None, 10) 160 _________________________________________________________________ dense_433 (Dense) (None, 1) 11 ================================================================= Total params: 486 Trainable params: 486 Non-trainable params: 0
Now we will fit the model. We train for just 15 epochs, with early stopping, so that the model does not overfit.
early_stopping = EarlyStopping(monitor='val_loss', patience=5)

model = get_model()

# Predictions from the untrained model, kept for comparison later
preds_on_untrained = model.predict(X_test)

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=15,
    callbacks=[early_stopping]
)
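Once training finishes, a single test-set MSE can be obtained with model.evaluate (not shown in the original, but standard Keras):

# Mean squared error on the (normalized) test set
test_loss = model.evaluate(X_test, y_test, verbose=0)
print('Test MSE:', test_loss)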
Let us write a function to plot the model history:
def plot_history(history):
    loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' not in s]
    val_loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' in s]

    if len(loss_list) == 0:
        print('Loss is missing in history')
        return

    # Loss always exists, so use it to recover the number of epochs
    epochs = range(1, len(history.history[loss_list[0]]) + 1)

    plt.figure(1)
    for l in loss_list:
        plt.plot(epochs, history.history[l], 'b',
                 label='Training loss (' + format(history.history[l][-1], '.5f') + ')')
    for l in val_loss_list:
        plt.plot(epochs, history.history[l], 'g',
                 label='Validation loss (' + format(history.history[l][-1], '.5f') + ')')
    plt.title('Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
Keras records the training and validation loss for every epoch in the History object returned by fit(); the function above reads those values and plots them.
For a better understanding of how the model is fitted, let us see how the training and validation loss decrease as the epochs increase.
plot_history(history)
The graph looks like the following:

[Figure: model history — training and validation loss decreasing over the epochs]
But how can we know that our model is better than an untrained model? The answer: by plotting the untrained model's predictions against our trained model's predictions.
def plot_predictions(preds, y_test):
    plt.figure(figsize=(8, 8))
    plt.plot(preds, y_test, 'ro')
    plt.xlabel('Preds')
    plt.ylabel('Labels')
    plt.xlim([-0.5, 0.5])
    plt.ylim([-0.5, 0.5])
    plt.plot([-0.5, 0.5], [-0.5, 0.5], 'b--')
    plt.show()

def compare_predictions(preds1, preds2, y_test):
    plt.figure(figsize=(8, 8))
    plt.plot(preds1, y_test, 'ro', label='Untrained Model')
    plt.plot(preds2, y_test, 'go', label='Trained Model')
    plt.xlabel('Preds')
    plt.ylabel('Labels')
    y_min = min(min(y_test), min(preds1), min(preds2))
    y_max = max(max(y_test), max(preds1), max(preds2))
    plt.xlim([y_min, y_max])
    plt.ylim([y_min, y_max])
    plt.plot([y_min, y_max], [y_min, y_max], 'b--')
    plt.legend()
    plt.show()
We have defined the functions to plot the predictions of an untrained model and a trained model against the ground truth (y_test).
preds_on_trained = model.predict(X_test)
compare_predictions(preds_on_untrained, preds_on_trained, y_test)
Output:
The green points represent the trained model’s predictions and the red points represent the untrained model’s predictions.
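To express the improvement in years rather than normalized units, one can un-normalize both sets of predictions and compare mean absolute errors (a sketch using the conversion logic from earlier; not part of the original tutorial):

# Un-normalize predictions and labels back to years, then compare
# the mean absolute error of the untrained vs. trained model
trained_years = preds_on_trained.flatten() * y_std + y_mean
untrained_years = preds_on_untrained.flatten() * y_std + y_mean
true_years = y_test * y_std + y_mean
print('Untrained MAE (years):', np.mean(np.abs(untrained_years - true_years)))
print('Trained MAE (years):  ', np.mean(np.abs(trained_years - true_years)))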
Prediction
Now let us take a random row from the dataset and give its other column values (after normalization) as input to get a life expectancy prediction from the model.
example = sample.iloc[27, :]
# Normalize with the column-wise statistics of the full feature set,
# not the row's own mean/std, so the scaling matches the training data
example_norm = (example - sample.mean()) / sample.std()
Let us reshape that row's values and run the prediction.
example_norm = np.asarray(example_norm).astype(np.float32)
value = example_norm.reshape(1, 20)
print(value.shape)
final_value = model.predict(value)
final_value
Output:
array([[0.21137413]], dtype=float32)
Now let us convert the value back to its un-normalized form.
ans = [convert_label_value(final_value[0][0])]
ans
Output:
[71]
Let us print the 27th row and see what its life expectancy value is.
df.loc[27,["LifeExpectancy"]]
Output:
LifeExpectancy    73
Name: 27, dtype: object
Our prediction is close! Thus we have predicted a person's average life expectancy, given the other influencing factors, using TensorFlow regression.