Predicting Life Expectancy using TensorFlow

Hey all! In this tutorial, we will predict life expectancy using TensorFlow. We will pre-process the data, do some data analysis, and then use a TensorFlow regression model to predict the average life expectancy of a person, given the other factors that influence it.

Starting the Python program to predict life expectancy using TensorFlow

First, let us import the necessary libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
warnings.filterwarnings('ignore')
%matplotlib inline

 

Now let us read the file and put it into a DataFrame. The dataset can be downloaded from Kaggle: https://www.kaggle.com/kumarajarshi/life-expectancy-who

df = pd.read_csv("C:/Users/madhumitha/Downloads/LifeExpectancyData.csv")
df.head()

 

Let us also see how many null values are there in this dataset.

df.isnull().sum()

Output:

Country                           0
Year                              0
Status                            0
LifeExpectancy                   10
AdultMortality                   10
InfantDeaths                      0
Alcohol                         194
PercentageExpenditure             0
HepatitisB                      553
Measles                           0
BMI                              34
underfivedeaths                   0
Polio                            19
TotalExpenditure                226
Diphtheria                       19
HIV/AIDS                          0
GDP                             448
Population                      652
thinness1-19years                34
thinness5-9years                 34
IncomeCompositionOfResources    167
Schooling                       163
dtype: int64

We have many null values in the dataset that need to be either imputed or dropped. Dropping every row with a null value is not a good option because it leads to a loss of data, so we will try a few methods to impute the values instead.
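Before deciding which columns to impute and which rows to drop, it can help to look at the share of missing values per column. A quick sketch using the DataFrame loaded above:

missing_share = df.isnull().mean().sort_values(ascending=False)  # fraction of nulls per column
print(missing_share[missing_share > 0])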

 

Let’s impute the null values in certain columns with the corresponding column’s mean value, and drop the rows where “HepatitisB” is null.

df["LifeExpectancy"].fillna(df["LifeExpectancy"].mean(), inplace=True)
df["AdultMortality"].fillna(df["AdultMortality"].mean(), inplace=True)
df.dropna(subset = ["HepatitisB"], inplace=True)

 

print(df['AdultMortality'].mean())
sns.set(style="whitegrid")
sns.boxplot(x=df['AdultMortality'])

Box plot of the adult mortality rate

 

In the same way, box plots can be plotted for the other columns to easily spot the outliers and remove them from the dataset, as sketched below.
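For instance, the columns we filter in the next step can be inspected with a small loop (a sketch; the column names are the ones used in this dataset):

for col in ['Population', 'HepatitisB', 'HIV/AIDS', 'TotalExpenditure']:
    plt.figure()
    sns.boxplot(x=df[col])   # box plot to spot outliers in each column
    plt.show()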

 

Let us also remove the identified outliers from the dataset in order to improve the accuracy of the model.

df = df.loc[df['Population'] < 1093859294]
df = df.loc[df['HepatitisB'] >= 30]
df = df.loc[df['HIV/AIDS'] < 40]
df = df.loc[df['TotalExpenditure'] < 15]
df = df.loc[df['AdultMortality'] < 700]

 

Imputing the null values

Further, let us do some more data cleaning. Here we are going to impute the null values in the remaining numeric columns with the respective column means, and drop the rows where “Alcohol” is null.

df["Population"].fillna(df["Population"].mean(), inplace=True)
df.dropna(subset = ["Alcohol"], inplace=True)
df["GDP"].fillna(df["GDP"].mean(), inplace=True)
df["BMI"].fillna(df["BMI"].mean(), inplace=True)
df["TotalExpenditure"].fillna(df["TotalExpenditure"].mean(), inplace=True)
df["Polio"].fillna(df["Polio"].mean(), inplace=True)
df["Diphtheria"].fillna(df["Diphtheria"].mean(), inplace=True)

 

For better accuracy, let us adopt a different method for the “Schooling” column: imputing each null value with the mean Schooling of rows in the same life-expectancy range (a group-wise mean imputation).

def impute_schooling(cols):
    s = cols['Schooling']
    l = cols['LifeExpectancy']
    if pd.isnull(s):
        # approximate mean Schooling for each LifeExpectancy band
        if l <= 40:
            return 8.0
        elif 40 < l <= 44:
            return 7.5
        elif 44 < l <= 50:
            return 8.1
        elif 50 < l <= 60:
            return 8.2
        elif 60 < l <= 70:
            return 10.5
        elif 70 < l <= 80:
            return 13.4
        else:
            return 16.5
    else:
        return s

df['Schooling'] = df[['Schooling','LifeExpectancy']].apply(impute_schooling, axis=1)

 

Here we have taken the “Schooling” column and imputed it based on its mean value within each life-expectancy range: we simply substitute each null value depending on the corresponding value in the “LifeExpectancy” column. We chose “LifeExpectancy” to guide the imputation because the two columns are strongly correlated, as the heatmap shows.
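The hard-coded cut-off values above approximate the mean Schooling within each life-expectancy band. As a sketch of how such band means can be computed (the bin edges here are an assumption for illustration):

bands = pd.cut(df['LifeExpectancy'], bins=[0, 40, 44, 50, 60, 70, 80, 100])
print(df.groupby(bands)['Schooling'].mean())   # mean Schooling per LifeExpectancy band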

Let us have a look at the heatmap also!

Heatmap to show the correlation of the features

 

Now let us drop the very few remaining rows which still have null values in some columns.

df=df.dropna()
df.isnull().sum()

Now we have arrived at a cleaned dataset.

 

We will now convert the strings in the “Status” column into either 0 or 1, because there are only two categories (Developed and Developing) in that column.

for ind in df.index:
    if df.loc[ind, 'Status'] == 'Developed':
        df.loc[ind, 'Status'] = 1
    else:
        df.loc[ind, 'Status'] = 0
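Instead of the loop, the same mapping can be written as one vectorized line (use one or the other, not both):

df['Status'] = df['Status'].map({'Developed': 1, 'Developing': 0})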

 

Let us drop the “Country” and “LifeExpectancy” columns to form the feature set x, and keep only the “LifeExpectancy” column as the target y.

sample = df.drop(['Country','LifeExpectancy'], axis = 1)
x = sample[['Year','Status', 'AdultMortality', 'InfantDeaths', 'Alcohol',
       'PercentageExpenditure', 'HepatitisB', 'Measles', 'BMI',
       'underfivedeaths', 'Polio', 'TotalExpenditure', 'Diphtheria',
       'HIV/AIDS', 'GDP', 'Population', 'thinness1-19years',
       'thinness5-9years', 'IncomeCompositionOfResources', 'Schooling']]
y = df['LifeExpectancy']
print(x.shape)
print(y.shape)

To get a better idea of the shapes of x and y, let us look at the output.

(1497, 20)
(1497,)

 

Building the model

Let us normalize the values to get better accuracy.

sample_norm = (sample - sample.mean()) / sample.std()
y_norm = (y - y.mean()) / y.std()

Let us also write a function that can convert the values after prediction back from normalized form.

y_mean = df['LifeExpectancy'].mean()
y_std = df['LifeExpectancy'].std()

def convert_label_value(pred):
    return int(pred * y_std + y_mean)
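As a quick sanity check on this function: a normalized prediction of 0 should map straight back to the dataset mean.

print(convert_label_value(0.0))   # int(y_mean), roughly the average life expectancy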

 

Let us have a look at what the x and y values look like after normalization.

X = sample_norm.iloc[:,:]
X.head()

Output:

    Year	Status	AdultMortality	InfantDeaths	Alcohol	PercentageExpenditure	HepatitisB	Measles	BMI	underfivedeaths	Polio	TotalExpenditure	Diphtheria	HIV/AIDS	GDP	Population	thinness1-19years	thinness5-9years	IncomeCompositionOfResources	Schooling
0	1.738179	-0.428108	0.832587	0.417127	-1.11651	-0.351539	-1.464045	-0.082418	-0.995570	0.407639	-3.933121	0.948385	-1.297232	-0.323326	-0.432582	0.784003	2.838166	2.825139	-0.864523	-0.756515
1	1.490459	-0.428108	0.900689	0.440653	-1.11651	-0.350274	-1.673835	-0.154376	-1.021040	0.433666	-1.356216	0.957084	-1.475067	-0.323326	-0.430132	-0.369733	2.906474	2.870293	-0.881089	-0.792789
2	1.242738	-0.428108	0.875151	0.464178	-1.11651	-0.350446	-1.533975	-0.161116	-1.046510	0.459692	-1.157992	0.935335	-1.356511	-0.323326	-0.428491	0.714769	2.952012	2.915447	-0.914221	-0.829063
3	0.995018	-0.428108	0.909202	0.499467	-1.11651	-0.347647	-1.324185	0.095086	-1.071980	0.494393	-0.910213	1.104980	-1.178676	-0.323326	-0.425198	-0.253376	2.997551	2.983178	-0.952874	-0.865337
4	0.747298	-0.428108	0.934740	0.522992	-1.11651	-0.387717	-1.254255	0.119652	-1.092356	0.529095	-0.860657	0.822238	-1.119398	-0.323326	-0.477446	-0.278183	3.065858	3.028332	-1.002572	-0.974158

 

Y = y_norm
Y.head()

The “y” value will be as follows:

0   -0.548800
1   -1.152274
2   -1.152274
3   -1.199605
4   -1.235103
Name: LifeExpectancy, dtype: float64

 

Now, we will split the data into train and test sets and look at their shapes.

X_arr = np.asarray(X).astype(np.float32)
Y_arr = np.asarray(Y).astype(np.float32)

X_train, X_test, y_train, y_test = train_test_split(X_arr, Y_arr, test_size = 0.25, shuffle = True, random_state=0)

print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)

Output:

X_train shape:  (1122, 20)
y_train shape:  (1122,)
X_test shape:  (375, 20)
y_test shape:  (375,)

 

Now let us build the model inside a get_model() function and call it.

def get_model():
    
    model = Sequential([
        Dense(15, input_shape = (20,), activation = 'relu'),
        Dense(10, activation = 'relu'),
        #Dense(5, activation = 'relu'),
        Dense(1)
    ])

    model.compile(
        loss='mse',
        optimizer='adam',
    )
    
    return model

model = get_model()
model.summary()

Output:

Model: "sequential_122"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_431 (Dense)            (None, 15)                315       
_________________________________________________________________
dense_432 (Dense)            (None, 10)                160       
_________________________________________________________________
dense_433 (Dense)            (None, 1)                 11        
=================================================================
Total params: 486
Trainable params: 486
Non-trainable params: 0

 

Now we will fit the model. Let us use just 15 epochs, together with early stopping, so that the model does not overfit.

early_stopping = EarlyStopping(monitor='val_loss', patience = 5)

model = get_model()

preds_on_untrained = model.predict(X_test)

history = model.fit(
    X_train, y_train,
    validation_data = (X_test, y_test),
    epochs = 15,
    #validation_steps=1,
    callbacks = [early_stopping]
)

 

Let us write a function to plot the model history:

def plot_history(history):
    loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' not in s]
    val_loss_list = [s for s in history.history.keys() if 'loss' in s and 'val' in s]
    acc_list = [s for s in history.history.keys() if 'acc' in s and 'val' not in s]
    val_acc_list = [s for s in history.history.keys() if 'acc' in s and 'val' in s]
    
    if len(loss_list) == 0:
        print('Loss is missing in history')
        return 
    
    ## As loss always exists
    epochs = range(1,len(history.history[loss_list[0]]) + 1)
    
    ## Loss
    plt.figure(1)
    for l in loss_list:
        plt.plot(epochs, history.history[l], 'b',
                 label='Training loss ({:.5f})'.format(history.history[l][-1]))
    for l in val_loss_list:
        plt.plot(epochs, history.history[l], 'g',
                 label='Validation loss ({:.5f})'.format(history.history[l][-1]))
    
    plt.title('Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()

The function reads the training and validation loss recorded for each epoch from the history object and plots them against the epoch number.

 

To better understand how the model fits, let us see how the training and validation loss decrease as the number of epochs increases.

plot_history(history)

The graph looks like the following:

Training loss vs. validation loss (model history)

 

But how can we know that our model is better than an untrained model? The answer: by plotting the untrained model’s predictions and our trained model’s predictions side by side.

 

def plot_predictions(preds, y_test):
    plt.figure(figsize=(8, 8))
    plt.plot(preds, y_test, 'ro')
    plt.xlabel('Preds')
    plt.ylabel('Labels')
    plt.xlim([-0.5, 0.5])
    plt.ylim([-0.5, 0.5])
    plt.plot([-0.5, 0.5], [-0.5, 0.5], 'b--')
    plt.show()
    return

def compare_predictions(preds1, preds2, y_test):
    plt.figure(figsize=(8, 8))
    plt.plot(preds1, y_test, 'ro', label='Untrained Model')
    plt.plot(preds2, y_test, 'go', label='Trained Model')
    plt.xlabel('Preds')
    plt.ylabel('Labels')
    
    y_min = min(min(y_test), min(preds1), min(preds2))
    y_max = max(max(y_test), max(preds1), max(preds2))
    
    plt.xlim([y_min, y_max])
    plt.ylim([y_min, y_max])
    plt.plot([y_min, y_max], [y_min, y_max], 'b--')
    plt.legend()
    plt.show()
    return

We have defined the functions to plot the predictions of an untrained model and a trained model against the ground truth (y_test).

 

preds_on_trained = model.predict(X_test)
compare_predictions(preds_on_untrained, preds_on_trained, y_test)

Output:

Untrained model vs. trained model predictions

The green points represent the trained model’s predictions and the red points represent the untrained model’s predictions.

Prediction

Now let us take a random row from the dataset and give the other column values (after normalization) as input to get a life-expectancy prediction from the model.

example = sample.iloc[27, :]
# normalize the row with the same column statistics used for the training data
example_norm = (example - sample.mean()) / sample.std()

 

Let us reshape that row’s values into the (1, 20) shape the model expects.

example_norm = np.asarray(example_norm).astype(np.float32)
value = example_norm.reshape(1, 20)
final_value = model.predict(value)
final_value

Output:

array([[0.21137413]], dtype=float32)

 

Now let us convert the value back into its unnormalized form.

ans = [convert_label_value(final_value[0][0])]
ans

Output:

[71]

 

Let us print the 27th row and see what its actual life expectancy is.

df.loc[27,["LifeExpectancy"]]

Output:

LifeExpectancy    73
Name: 27, dtype: object

 

Our prediction is close! Thus we have predicted the average life expectancy of a person, given the other factors that influence it, using TensorFlow regression.
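To wrap up, the whole prediction step can be bundled into one small helper. This is a minimal sketch built from the objects defined above (model, sample, convert_label_value); the name predict_life_expectancy is just for illustration:

def predict_life_expectancy(row):
    # normalize a raw feature row with the training-set column statistics
    row_norm = (row - sample.mean()) / sample.std()
    row_norm = np.asarray(row_norm).astype(np.float32).reshape(1, -1)
    # predict in normalized space, then convert back to years
    pred = model.predict(row_norm)
    return convert_label_value(pred[0][0])

# example: predict for the 27th row of the feature table
print(predict_life_expectancy(sample.iloc[27]))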
