Stroke detection using Regression | Python

A stroke is a medical condition that occurs when the blood supply to part of the brain is reduced or interrupted. This cuts the oxygen supply to the brain tissue, which in turn causes brain cells to die, so a stroke can have severe effects on a person. Major complications of stroke include paralysis, emotional problems, pain, and difficulty swallowing.

Therefore, in this article, we build a logistic regression model to identify stroke and evaluate its predictions with standard metrics (recall, precision, accuracy, AUC, the ROC curve, and F1 score) using Python. An explanation of each of these metrics is also covered.

The workflow of the article is as follows:

  • Importing Python libraries
  • CSV formatting
  • Data Analysis
  • Pre-processing
  • Building
  • Prediction

Happy Reading!!!


Importing the necessary libraries.

import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve 
from sklearn.metrics import f1_score
from sklearn.metrics import auc


Importing the CSV file and displaying its first rows using head().

df = pd.read_csv(os.path.join(dirname, filename))
df.head()

Removing the unnecessary IDs from the imported CSV file.

df.drop(columns = ['id'], inplace = True)

Checking for NULL values in the CSV file so that they can be removed. Counting them per column with isnull() shows that 201 of the 5,110 rows have a missing bmi value. Those rows are then dropped.

df.apply(lambda x: sum(x.isnull()),axis=0)
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
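As a minimal sketch of the check above, assuming a small made-up DataFrame (the values are illustrative and not taken from the stroke dataset), `isnull().sum()` gives the same per-column count more directly, and `dropna()` removes the affected rows:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with two missing bmi entries
toy = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
toy_clean = toy.dropna()
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Dropping rows is the approach this article takes; filling the missing values instead would be a different design choice and is not covered here.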

After dropping the rows with NULL values, checking the per-column counts again confirms that none remain.

df.dropna(inplace=True)
df.apply(lambda x: sum(x.isnull()),axis=0)
gender            0 
age               0 
hypertension      0 
heart_disease     0 
ever_married      0 
work_type         0 
Residence_type    0 
avg_glucose_level 0 
bmi               0 
smoking_status    0 
stroke            0 
dtype: int64


Plotting the number of records with and without stroke.

sns.countplot(x="stroke", data=df, palette="bwr")


Plotting the number of records with and without hypertension.

sns.countplot(x="hypertension", data=df, palette="rocket")


Plotting the number of records for each gender (male, female, or other).

sns.countplot(x="gender", data=df, palette="deep")



Counting the records for each gender in the CSV file.

df['gender'].value_counts()

Female    2897
Male      2011
Other        1
Name: gender, dtype: int64

Dropping the single 'Other' row, as it occurs only once and will not make an actual impact. Then, converting gender to binary category codes, 0 for female and 1 for male. The ever_married column is coded the same way, which gives 0 for not married ('No') and 1 for married ('Yes').

Others = df[(df['gender'] == 'Other')].index
df.drop(Others , inplace=True)
# Category codes follow the sorted label order: Female -> 0, Male -> 1
df["gender"] = df["gender"].astype('category')
df["gender"] = df["gender"].cat.codes
# Likewise: No -> 0, Yes -> 1
df["ever_married"] = df["ever_married"].astype('category')
df["ever_married"] = df["ever_married"].cat.codes
df["work_type"].value_counts()
Private          2810
Self-employed     775
children          671
Govt_job          630
Never_worked       22
Name: work_type, dtype: int64
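The binary encoding of gender and marital status described above can be sketched with pandas category codes. This is a minimal illustration on a made-up three-row DataFrame, not the stroke data. Category codes follow the sorted order of the labels, so 'Female' and 'No' map to 0 while 'Male' and 'Yes' map to 1:

```python
import pandas as pd

# Toy data (hypothetical rows) with the two string columns to encode
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "No", "Yes"],
})

for col in ["gender", "ever_married"]:
    as_category = toy[col].astype("category")
    # Show which label maps to which integer code
    print(col, dict(enumerate(as_category.cat.categories)))
    toy[col] = as_category.cat.codes

print(toy)
```

Printing the mapping before overwriting the column is a cheap way to confirm which label received which code.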

The work types have no natural ordering, so they are converted into dummy (one-hot) columns. Also, smoking status is made binary: current and former smokers are assigned 1, while people who never smoked or whose status is unknown are assigned 0.

df = pd.get_dummies(df, prefix=['w_type'], columns=['work_type'])
df['smoking_status'] = df['smoking_status'].map( 
                   {'formerly smoked':1 ,'smokes':1,'never smoked':0,'Unknown':0}) 
df['Residence_type'].value_counts()
Urban    2490
Rural    2418
Name: Residence_type, dtype: int64

Now, the residence type is converted into dummy columns as well, so that every feature in the table is numeric.

df = pd.get_dummies(df, prefix=['residency_'], columns=['Residence_type'])
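To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up DataFrame (the rows are illustrative only). Each distinct value of the encoded column becomes its own indicator column, named with the given prefix, and the original column is removed:

```python
import pandas as pd

# Toy data (hypothetical rows): one categorical column and the label
toy = pd.DataFrame({
    "work_type": ["Private", "Govt_job", "Private"],
    "stroke": [0, 1, 0],
})

# One indicator column per distinct work type, prefixed with 'w_type'
encoded = pd.get_dummies(toy, prefix=["w_type"], columns=["work_type"])
print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns hold either 0/1 integers or booleans; both work as numeric input to scikit-learn.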


Starting with Logistic regression.

from sklearn.linear_model import LogisticRegression

Separating the features (X) from the label (y). The split is wrapped in a function for ease of use.

def XYsplit(df, label_col):
    y = df[label_col].copy()
    X = df.drop(label_col,axis=1)
    return X,y

Then, setting aside 30% of the dataset for testing and fitting the logistic regression model on the remaining 70%.

X,y = XYsplit(df,'stroke')
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3,random_state=0)
LogReg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
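As a quick sanity check on the 70/30 split, the sketch below applies `train_test_split` to placeholder arrays with 4,908 rows, which is the row count implied by the gender counts above once the 'Other' row is dropped (2,897 + 2,011). The arrays `X_demo` and `y_demo` are stand-ins introduced for illustration and contain no real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4908  # rows left after dropping missing bmi values and the 'Other' row
X_demo = np.zeros((n_rows, 1))  # placeholder features (assumption, not real data)
y_demo = np.zeros(n_rows)       # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
print(len(X_tr), len(X_te))
```

The test portion comes to 1,473 rows, which agrees with the total of the four cells in the confusion matrix reported later in this article.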


Calculating the accuracy, precision, recall, AUC, ROC curve, and F1 score, then displaying them for analysis.

Accuracy is the fraction of test samples that the trained logistic regression classifies correctly. Precision is the number of true positives divided by the sum of true positives and false positives. Recall is the number of true positives divided by the sum of true positives and false negatives. The F1 score combines recall and precision into a single figure for comparison: it is 2 times the product of recall and precision, divided by their sum. AUC is the area under the ROC curve, which plots the true positive rate against the false positive rate.

predictions = LogReg.predict(X_test)
accuracy = accuracy_score(y_test, predictions)*100
precision = precision_score(y_test, predictions,pos_label=1,labels=[0,1])*100
recall = recall_score(y_test, predictions,pos_label=1,labels=[0,1])*100
fpr , tpr, _ = roc_curve(y_test, predictions)
auc_val = auc(fpr, tpr) 
f_score = f1_score(y_test, predictions)

print("Accuracy: \n", accuracy)
print("Precision of event Happening: \n", precision)
print("Recall of event Happening: \n", recall)
print("AUC: \n",auc_val)
print("F-Score:\n", f_score)
plt.title('ROC Curve')
plt.plot(fpr, tpr, label='AUC = {:.2f}'.format(auc_val))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
Accuracy: 96.334
Precision of event Happening: 100
Recall of event Happening: 1.818
AUC: 0.509
F-Score: 0.035
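The code above passes the hard 0/1 predictions to `roc_curve`, so the curve has only a single operating point between its two corners. A common alternative is to pass the predicted probability of the positive class instead, which traces the curve across all thresholds. The sketch below illustrates this on synthetic, imbalanced data from `make_classification`; it is a stand-in for illustration only, and all variable names in it are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the stroke dataset)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)

model = LogisticRegression(max_iter=1000).fit(X_a, y_a)

# Probability of the positive class for each test row
scores = model.predict_proba(X_b)[:, 1]

# ROC curve and AUC computed from probabilities
fpr_p, tpr_p, thresholds = roc_curve(y_b, scores)
auc_from_scores = roc_auc_score(y_b, scores)

# The same metric computed from hard 0/1 predictions, for comparison
auc_from_labels = roc_auc_score(y_b, model.predict(X_b))

print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
```

Applied to this article's model, the equivalent change would be to score with `LogReg.predict_proba(X_test)[:, 1]`; the resulting numbers would differ from those reported above and are not shown here.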


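Displaying the confusion matrix. It is a 2 x 2 matrix holding the true negatives, false positives, false negatives, and true positives. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so with labels=[0,1] the matrix reads [[TN, FP], [FN, TP]]. This makes it easy to see where the model succeeds and where it fails.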

print("Confusion Matrix: \n", confusion_matrix(y_test, predictions,labels=[0,1]))
Confusion Matrix: 
 [[1418    0]
 [  54    1]]
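The metric definitions given earlier can be checked by hand against this confusion matrix. The short sketch below plugs the four counts into the formulas; it uses only plain Python, and the variable names are chosen for illustration:

```python
# Counts read from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 1418, 0, 54, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy (%): ", accuracy * 100)
print("Precision (%):", precision * 100)
print("Recall (%):   ", recall * 100)
print("F1 score:     ", f1)
```

These values agree with the figures reported above up to rounding. The counts also show that the model flags only one of the 55 stroke cases in the test set, which is why recall and F1 are low even though accuracy is high.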


In this article, we have discussed stroke prediction from data stored in a CSV file. A logistic regression model was fitted to the data and then used to make predictions on a held-out test set. The evaluation metrics used here include recall, precision, accuracy, F1 score, AUC, and the ROC curve. On the test set the model reaches about 96.3% accuracy but a recall of only about 1.8%, since it identifies just one of the 55 stroke cases.

The source code for the stroke prediction can be found and downloaded from here.

The CSV file for the project can be downloaded from here.

To learn from my other machine learning blogs, refer here.

Thank you. Hope this article was helpful for all!
