Stroke detection using Regression | Python
Stroke is a medical condition that occurs when the blood supply to a part of the brain is reduced or interrupted. This, in turn, reduces the oxygen supply to the brain tissue and leads to the death of brain cells, so it can have serious consequences for the patient. Some of the major complications of stroke include paralysis, emotional problems, pain, and difficulty swallowing.
Therefore, in this article, we build a logistic regression model for the identification of stroke, and the model's predictions are evaluated using standard metrics (Recall, Precision, Accuracy, AUC, ROC Curve, F1 Score, etc.) with the help of Python programming. An explanation of each evaluation metric is also covered.
The workflow of the article is as follows:
- Importing Python libraries
- CSV formatting
- Data Analysis
- Pre-processing
- Building
- Prediction
Happy Reading!!!
IMPORT PYTHON LIBRARIES
Importing the necessary libraries.
import numpy as np
import pandas as pd
import os

for dirname, _, filenames in os.walk('/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
CSV FORMATTING
Importing the CSV file and displaying it using head().
df = pd.read_csv(os.path.join(dirname, filename))
df.head()
Removing the unnecessary ID column from the imported data.
df.drop(columns=['id'], inplace=True)
df.head()
Checking for NULL values in the data so that they can be removed. The isnull() check shows that 201 of the 5,111 rows contain a NULL value, all in the bmi column. Those rows are then dropped with dropna().
df.apply(lambda x: sum(x.isnull()), axis=0)
df.dropna(inplace=True)
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
After dropping the NULL rows, re-running the check confirms that no NULL values remain.
df.apply(lambda x: sum(x.isnull()),axis=0)
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
DATA ANALYSIS
Plotting the count of patients with and without stroke.
sns.countplot(x="stroke", data=df, palette="bwr") plt.show()
Plotting the count of patients with and without hypertension.
sns.countplot(x="hypertension", data=df, palette="rocket") plt.show()
Plotting the count of patients by gender (male, female, or other).
sns.countplot(x="gender", data=df, palette="deep") plt.show()
PRE-PROCESSING
Counting the number of records for each gender.
df['gender'].value_counts()
Female    2897
Male      2011
Other        1
Name: gender, dtype: int64
Dropping the single 'Other' gender row, as it occurs only once and will not have an actual impact. Then, converting gender to binary: 0 for female and 1 for male. Similarly, encoding ever_married as 1 for married and 0 for not married. Finally, inspecting the counts of each work type.
Others = df[(df['gender'] == 'Other')].index
df.drop(Others, inplace=True)
df["gender"] = df["gender"].astype('category')
df["gender"] = df["gender"].cat.codes
df["ever_married"] = df["ever_married"].astype('category')
df["ever_married"] = df["ever_married"].cat.codes
df['work_type'].value_counts()
Private          2810
Self-employed     775
children          671
Govt_job          630
Never_worked       22
Name: work_type, dtype: int64
There is no natural ordering over the work types, so they are converted into one-hot (dummy) columns. The smoking status is mapped to 1 for current and former smokers and 0 for non-smokers and unknown cases. The residence types are then counted.
df = pd.get_dummies(df, prefix=['w_type'], columns=['work_type'])
df['smoking_status'] = df['smoking_status'].map(
    {'formerly smoked': 1, 'smokes': 1, 'never smoked': 0, 'Unknown': 0})
df['Residence_type'].value_counts()
Urban    2490
Rural    2418
Name: Residence_type, dtype: int64
Finally, the residence type is also converted into dummy columns, and the fully pre-processed data frame is displayed.
df = pd.get_dummies(df, prefix=['residency_'], columns=['Residence_type'])
df.head()
BUILD MODEL
Starting with logistic regression.
from sklearn.linear_model import LogisticRegression
Splitting the data into features (X) and labels (y), wrapped in a helper function for ease of use.
def XYsplit(df, label_col):
    y = df[label_col].copy()
    X = df.drop(label_col, axis=1)
    return X, y
Then, holding out 30% of the dataset for testing and fitting the logistic regression model on the remaining 70%.
X, y = XYsplit(df, 'stroke')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
LogReg = LogisticRegression(max_iter=1000)
LogReg.fit(X_train, y_train)
LogisticRegression(max_iter=1000)
PREDICTIONS
Calculating the accuracy, precision, recall, AUC, ROC curve, and F1 score, then displaying them for visualization and analysis.
Accuracy is the fraction of predictions the trained logistic regression gets right. Precision is the number of true positives divided by the sum of true positives and false positives. Recall is the number of true positives divided by the sum of true positives and false negatives. The F1 score combines recall and precision into a single number instead of tracking both separately: it is 2 times the product of recall and precision, divided by the sum of recall and precision.
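To make these definitions concrete, the snippet below writes them out directly from the raw counts of true positives (tp), false positives (fp), and false negatives (fn). This is only an illustrative sketch with made-up helper names, not part of the original notebook; the scikit-learn functions used in the next block compute the same quantities.

def precision_from_counts(tp, fp):
    # Precision = TP / (TP + FP)
    return tp / (tp + fp)

def recall_from_counts(tp, fn):
    # Recall = TP / (TP + FN)
    return tp / (tp + fn)

def f1_from_counts(tp, fp, fn):
    # F1 = 2 * precision * recall / (precision + recall)
    p = precision_from_counts(tp, fp)
    r = recall_from_counts(tp, fn)
    return 2 * p * r / (p + r)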
predictions = LogReg.predict(X_test)

accuracy = accuracy_score(y_test, predictions) * 100
precision = precision_score(y_test, predictions, pos_label=1, labels=[0, 1]) * 100
recall = recall_score(y_test, predictions, pos_label=1, labels=[0, 1]) * 100
fpr, tpr, _ = roc_curve(y_test, predictions)
auc_val = auc(fpr, tpr)
f_score = f1_score(y_test, predictions)

print("Accuracy: \n", accuracy)
print("Precision of event Happening: \n", precision)
print("Recall of event Happening: \n", recall)
print("AUC: \n", auc_val)
print("F-Score:\n", f_score)

plt.title('ROC Curve')
plt.plot(fpr, tpr, label='AUC = {:.2f}'.format(auc_val))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
Accuracy:
 96.334
Precision of event Happening:
 100
Recall of event Happening:
 1.818
AUC:
 0.509
F-Score:
 0.035
Depiction of the confusion matrix. It is a 2 x 2 matrix that holds the true negatives, false positives, false negatives, and true positives, and it makes the behaviour of the model easy to inspect.
print("Confusion Matrix: \n", confusion_matrix(y_test, predictions,labels=[0,1]))
Confusion Matrix:
 [[1418    0]
  [  54    1]]
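Reading the matrix with labels=[0, 1], the counts are TN = 1418, FP = 0, FN = 54, and TP = 1, i.e. the model predicts a stroke for only a single test sample. As a quick sanity check (an illustrative computation, not part of the original notebook), plugging these counts into the metric definitions reproduces the scores printed above:

tn, fp, fn, tp = 1418, 0, 54, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)          # ~0.9633 -> ~96.33%
precision = tp / (tp + fp)                          # 1.0 -> 100%
recall = tp / (tp + fn)                             # ~0.0182 -> ~1.82%
f1 = 2 * precision * recall / (precision + recall)  # ~0.036
print(accuracy, precision, recall, f1)

The near-perfect accuracy and precision combined with the very low recall show that the model almost never predicts a stroke, which the accuracy figure alone would hide.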
FINAL THOUGHTS
In this article, we have thus discussed stroke prediction from data logged and stored in a CSV file. A logistic regression model was fitted to the data and then used to make predictions on a held-out test set. The evaluation metrics used here include Recall, Precision, Accuracy, F1 Score, AUC, and the ROC Curve.
The source code for the stroke prediction can be found and downloaded from here.
The CSV file for the project can be downloaded from here.
To learn from my other machine learning blogs, refer here.
Thank you. Hope this article was helpful for all!