Loan Prediction using Machine Learning in Python

Hello readers! In today’s tutorial, we will learn to build a loan prediction model using machine learning techniques in Python.

Introduction to Loan Prediction using machine learning

Technology has been growing rapidly in almost every sector to make human life more comfortable. The banking sector has also seen tremendous growth, moving from manual, paper-based work to digitalizing every process and transaction. Along with these enhancements, many people are applying for bank loans. However, a bank has limited capital and can lend it only to selected applicants, and it faces losses if an applicant fails to repay the loan amount.

As a solution, we will develop a model that helps predict whether a loan should be approved, using the applicant’s previous history and background. We will use the Logistic Regression model for the predictions and check its accuracy.

The main aim of today’s tutorial is to predict whether a loan will be approved for an applicant based on previous data. So, let’s begin this journey.

Loan Prediction Model steps

  • Step 1 – Importing the libraries.
  • Step 2 – Importing the dataset.
  • Step 3 – Data Cleaning.
  • Step 4 – Exploratory data Analysis.
  • Step 5 – Classification Model Preparation.
  • Step 6 – Splitting dataset into training and testing.
  • Step 7 – Training the model.
  • Step 8 – Model evaluation.

Dataset used for Loan Prediction

The dataset was downloaded from Kaggle. It includes a total of 614 records and 13 attributes.

The 13 attributes are: Loan ID; Gender; Married; Dependents; Education; Self Employed; Applicant Income; Co-applicant Income; Loan Amount; Loan Amount Term; Credit History; Property Area; Loan Status.

Link for the dataset: Click here.

Now, let’s proceed with the code.

Step 1 – Importing the required Python libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Step 2 – Importing the dataset

Here’s the link for the dataset: Click here. You can download the dataset or use the link to import it directly for the model.

df = pd.read_csv('train.csv') 
df.head()

OUTPUT:

Imported dataset for loan prediction model.

Now that our dataset is imported successfully, let’s understand the dataset better.

df.info()
df.shape

OUTPUT: (614, 13)

Detailed Information of the dataset imported for loan prediction.

Step 3 – Data Cleaning

For the model to work properly, the dataset should be clean and accurate, so we begin by cleaning and pre-processing it. First, we check for missing values in the dataset.

df.isnull().sum()

OUTPUT:

Missing values in the dataset.

Now that we are aware of the missing values in our dataset, we need to deal with them. We can do so in two possible ways:

  1. By dropping all the rows that contain missing values.
  2. By replacing the missing values with the mean, the median, or other alternative values.
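The two options above can be sketched on a small, made-up DataFrame. The values below are invented for illustration and are not taken from the Kaggle file; only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Made-up toy data (not the Kaggle file), with a few missing values.
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill the gaps with a summary statistic of each column.
filled = toy.copy()
filled["LoanAmount"] = filled["LoanAmount"].fillna(filled["LoanAmount"].mean())
filled["Credit_History"] = filled["Credit_History"].fillna(
    filled["Credit_History"].median()
)

print("rows before:", len(toy))
print("rows after dropping:", len(dropped))
print("rows after filling:", len(filled))
print(filled)
```

Dropping is simpler but loses rows, while filling keeps every row at the cost of introducing estimated values.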

We start by filling the missing values in “LoanAmount” and “Credit_History” with their mean and median, respectively. We do this because these two attributes have many missing values. Dropping that many rows would reduce the size of the dataset and could eventually affect the accuracy of the model.

df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].median())

Let’s confirm that no missing values remain in “LoanAmount” and “Credit_History”.

df.isnull().sum()

OUTPUT:

Confirming missing values in "LoanAmount" & "Credit_history"

As the output shows, the missing values in “LoanAmount” and “Credit_History” were replaced successfully with the mean and median, respectively. Their missing-value counts dropped from 22 and 50 to zero.

Now, let’s drop all the rows that still contain missing values and check for missing values one final time.

df.dropna(inplace=True)
df.isnull().sum()

OUTPUT:

Final missing value check in the dataset

Here, we have dropped all the rows with missing values, because scikit-learn’s LogisticRegression does not accept missing values and needs complete records to work. Now, let’s check the final shape of the dataset.

df.shape

OUTPUT : (542, 13)

Step 4 – Exploratory Data Analysis

In this step, we compare how various parameters relate to getting the loan approved, and we visualize the results for better understanding.

plt.figure(figsize = (100, 50))
sns.set(font_scale = 5)
# Pass column names together with data=df; recent seaborn versions do not
# accept a Series as the first positional argument of countplot.
plt.subplot(331)
sns.countplot(x='Gender', hue='Loan_Status', data=df)

plt.subplot(332)
sns.countplot(x='Married', hue='Loan_Status', data=df)

plt.subplot(333)
sns.countplot(x='Education', hue='Loan_Status', data=df)

plt.subplot(334)
sns.countplot(x='Self_Employed', hue='Loan_Status', data=df)

plt.subplot(335)
sns.countplot(x='Property_Area', hue='Loan_Status', data=df)

OUTPUT:

From the visualizations, we can understand our dataset more precisely. For example, in the first plot we can observe that more male applicants than female applicants had their loans approved. Similarly, we compared other parameters with the loan status (i.e., Y or N), which gives us insights into the dataset.
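A count plot shows raw counts, so a larger group can look "more approved" simply because it has more applicants. As a hedged sketch, `pd.crosstab` with `normalize="index"` turns the counts into approval rates per group. The toy data below is made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Made-up toy data: four male and two female applicants.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Loan_Status": ["Y", "Y", "N", "N", "Y", "N"],
})

# Raw counts per group, the same information a count plot displays.
counts = pd.crosstab(toy["Gender"], toy["Loan_Status"])

# Row-normalized table: the share of Y and N within each gender.
rates = pd.crosstab(toy["Gender"], toy["Loan_Status"], normalize="index")

print(counts)
print(rates)
```

In this toy example, males have more approvals in absolute terms, yet the approval rate is the same for both groups. The same two calls could be applied to the tutorial's `df` before the columns are encoded.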

Now, let’s convert the categorical values into numerical form and display the value counts. We do this because the model requires numerical input.

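# Encode the target: Y -> 1, N -> 0. Assigning the result back is more
# reliable than calling replace(..., inplace=True) on a selected column.
df['Loan_Status'] = df['Loan_Status'].map({'Y':1,'N':0})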
df['Loan_Status'].value_counts()
df.Gender=df.Gender.map({'Male':1,'Female':0})
df['Gender'].value_counts()
df.Married=df.Married.map({'Yes':1,'No':0})
df['Married'].value_counts()
df.Dependents=df.Dependents.map({'0':0,'1':1,'2':2,'3+':3})
df['Dependents'].value_counts()
df.Education=df.Education.map({'Graduate':1,'Not Graduate':0})
df['Education'].value_counts()
df.Self_Employed=df.Self_Employed.map({'Yes':1,'No':0})
df['Self_Employed'].value_counts()
df.Property_Area=df.Property_Area.map({'Urban':2,'Rural':0,'Semiurban':1})
df['Property_Area'].value_counts()
df['LoanAmount'].value_counts()
df['Loan_Amount_Term'].value_counts()
df['Credit_History'].value_counts()
df['Loan_Status'].value_counts()
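One caveat with `map` is that any value missing from the dictionary silently becomes NaN. A minimal sketch of a check for this, on made-up data (the value "4" is deliberately not part of the mapping):

```python
import pandas as pd

# Made-up toy column; "4" is intentionally absent from the mapping below.
toy = pd.DataFrame({"Dependents": ["0", "1", "3+", "4"]})
mapping = {"0": 0, "1": 1, "2": 2, "3+": 3}

encoded = toy["Dependents"].map(mapping)

# Values that the mapping did not cover show up as NaN after map().
unmapped = toy.loc[encoded.isna(), "Dependents"].unique().tolist()
print("unmapped values:", unmapped)
```

Running a check like this after each `map` call helps catch typos or unexpected categories before they reach the model.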

In this dataset, “Credit_History” (an independent variable) has the highest correlation with “Loan_Status” (the dependent variable), which indicates that “Loan_Status” depends heavily on “Credit_History”.
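Value counts alone do not show correlation, so here is a hedged sketch of how such a ranking could be computed with `DataFrame.corr()`. The toy DataFrame is made up for illustration and does not reproduce the tutorial's numbers.

```python
import pandas as pd

# Made-up, already-encoded toy data (not the Kaggle file).
toy = pd.DataFrame({
    "Credit_History": [1, 1, 1, 0, 0, 1, 0, 1],
    "Married":        [1, 0, 1, 0, 1, 0, 1, 1],
    "Loan_Status":    [1, 1, 1, 0, 0, 1, 0, 0],
})

# Pearson correlation of every feature with the target column.
corr_with_target = toy.corr()["Loan_Status"].drop("Loan_Status")

# Rank the features by the absolute strength of the correlation.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

On the tutorial's `df`, non-numeric columns such as `Loan_ID` would need to be excluded before calling `corr()`.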

Finally, let’s dive into creating and working on our model for loan prediction.

Step 5 – Classification Model Preparation

Importing the required libraries for model preparation.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

Step 6 – Splitting dataset into training and testing

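# Select all rows; columns 1-11 are the features (Loan_ID in column 0 is skipped).
X = df.iloc[:,1:12].values
# Column 12 (Loan_Status) is the target.
y = df.iloc[:,12].values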
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0)

We have successfully divided the dataset into training and testing sets. Now, we will use the Logistic Regression model to predict the loan approvals.
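As a quick sanity check, it can help to look at the shapes and class balance of the two sets. The sketch below uses randomly generated toy data rather than the loan dataset, and it adds the optional `stratify=y` argument, which the tutorial's own call does not use, to keep the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up toy data: 100 samples, 3 features, 70% positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 70 + [0] * 30)

# stratify=y is an optional addition here, not part of the tutorial's call.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print("train shape:", X_train.shape, "test shape:", X_test.shape)
print("share of approvals in train:", y_train.mean())
print("share of approvals in test:", y_test.mean())
```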

What is Logistic Regression?

Logistic regression is a supervised machine learning classification algorithm that predicts the probability of a target variable belonging to a class. It is one of the simplest ML algorithms and can be used for various classification problems. It is mostly preferred when the target variable is categorical, such as approved or not approved.
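To make the idea concrete, here is a minimal sketch of how logistic regression turns a weighted sum of features into a probability using the sigmoid function. The weights, bias, and input values are hypothetical numbers chosen for illustration, not values learned from the loan dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and input (not learned from any data).
weights = np.array([2.0, -1.0])
bias = -0.5
x = np.array([1.0, 0.5])

z = weights @ x + bias   # linear score: 2.0 - 0.5 - 0.5 = 1.0
p = sigmoid(z)           # probability of the positive class
label = int(p >= 0.5)    # predict class 1 when the probability is at least 0.5

print("score:", z, "probability:", round(float(p), 4), "label:", label)
```

Training the model amounts to finding the weights and bias that make these probabilities match the known labels as closely as possible.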

Representation of Logistic regression.

We will use the logistic regression classification model on our dataset to predict the loan approvals.

model = LogisticRegression()
model.fit(X_train,y_train)

lr_prediction = model.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print('Logistic Regression accuracy = ', metrics.accuracy_score(y_test,lr_prediction))

The accuracy of the logistic regression model turns out to be 0.8852760 (i.e., approximately 88%).

print("y_predicted",lr_prediction)
print("y_test",y_test)

OUTPUT:

Comparing the Predicted values with test values.
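Comparing the two arrays by eye gets hard quickly, so a confusion matrix and a majority-class baseline can summarize the same comparison. The labels below are made up for illustration; with the tutorial's variables you would pass `y_test` and `lr_prediction` instead.

```python
import numpy as np
from sklearn import metrics

# Made-up true labels and predictions (1 = approved, 0 = not approved).
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# Rows are the true classes (0, 1); columns are the predicted classes (0, 1).
cm = metrics.confusion_matrix(y_true, y_pred)
acc = metrics.accuracy_score(y_true, y_pred)

# Accuracy obtained by always predicting the most frequent class.
baseline = max(y_true.mean(), 1 - y_true.mean())

print(cm)
print("accuracy:", acc, "baseline:", baseline)
print(metrics.classification_report(y_true, y_pred))
```

An accuracy figure is easier to interpret next to the baseline, since a dataset with mostly approved loans can give a high accuracy even to a model that approves everyone.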

Conclusion for the Loan prediction Model

In this tutorial, we have successfully built a loan prediction model using machine learning.

  1. We have seen that the attribute “Loan Status” is heavily dependent on “Credit History” (an independent attribute) for predictions.
  2. The Logistic Regression algorithm gives us an accuracy of approximately 88%.
  3. This model can be saved and used with different datasets for predictions.
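As a hedged sketch of the last point, a fitted scikit-learn model can be saved and reloaded with Python's built-in `pickle` module. The tiny training set and the file name `loan_model.pkl` are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data, just to have a fitted model to save.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Hypothetical file name, written to the system's temporary directory.
path = os.path.join(tempfile.gettempdir(), "loan_model.pkl")

with open(path, "wb") as f:
    pickle.dump(model, f)      # save the fitted model to disk

with open(path, "rb") as f:
    restored = pickle.load(f)  # load it back later

# The restored model should give the same predictions as the original.
same = bool((restored.predict(X) == model.predict(X)).all())
print("predictions match:", same)
```

New data passed to a reloaded model must have the same columns, in the same order and with the same encoding, as the data used for training.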

We have successfully achieved a good accuracy rate using Logistic Regression. In the future, other machine learning algorithms can be tried for the predictions.

I hope you all enjoyed this tutorial. Happy Learning 🙂
