Credit Card Fraud Detection using Logistic Regression
This tutorial walks through credit card fraud detection with Logistic Regression in Python, using the scikit-learn machine learning library.
The goal is to predict, from the given transaction data, whether a transaction is legitimate or fraudulent.
You can download the dataset here: credit data.
Let's implement the code.
Import the necessary Python libraries.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
The first step is data collection.
1. Load data
# Load the dataset
credit_data = pd.read_csv("C:\\Users\\usersdrive\\OneDrive\\Desktop\\creditcard.csv")
To check that the dataset loaded correctly, print the first five rows with the head method.
credit_data.head()
OUTPUT:-
Now print the last five rows by using the tail method.
credit_data.tail()
OUTPUT:-
Remember this step whenever you work on a machine learning problem: the data needs to be explored first.
DATASET INFORMATION
credit_data.info()
OUTPUT:-
All the columns hold float or integer values, so the model can use them directly without any encoding.
Let’s check whether there are any missing values present in the given dataset.
credit_data.isnull().sum()
OUTPUT:-
No missing values are present in the dataset, so there is no need to fill any gaps with mean or median values.
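Had the check turned up gaps, one common remedy is to fill them with a column's mean or median. The following is only a minimal sketch on a made-up three-row frame; the `toy` data and the choice of the median are illustrative assumptions, not part of the credit card dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Amount value.
toy = pd.DataFrame({"Amount": [10.0, np.nan, 30.0], "Class": [0, 0, 1]})

# Count missing values per column, as done for the real dataset.
print(toy.isnull().sum())

# Fill the gap in Amount with the median of the observed values.
toy_filled = toy.fillna({"Amount": toy["Amount"].median()})
print(toy_filled)
```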
In the Class column, 0 represents a legitimate transaction and 1 represents a fraudulent transaction.
Check how many zeros and ones are present in the dataset.
credit_data['Class'].value_counts()
OUTPUT:-
0    284315
1       492
Name: Class, dtype: int64
The output above shows only 492 fraudulent transactions (ones) against 284315 legitimate transactions (zeros).
The dataset is highly unbalanced.
If you train a model directly on this dataset, it will be biased toward the majority class: it can label almost every transaction as legitimate and still miss most of the frauds.
To balance the data, keep all 492 fraudulent transactions and take a random sample of 492 legitimate transactions.
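One way to see the problem with the imbalance is to score a trivial baseline that labels every transaction as legitimate. The sketch below rebuilds the labels from the class counts printed above; the all-zeros "model" is a hypothetical stand-in, not something the tutorial trains.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the value_counts() output above.
n_legit, n_fraud = 284315, 492

# True labels: 0 = legitimate, 1 = fraudulent.
y_true = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])

# Hypothetical baseline that predicts "legitimate" for every transaction.
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)  # share of frauds that were caught

print(f"Accuracy: {baseline_accuracy:.4f}")
print(f"Fraud recall: {baseline_recall:.4f}")
```

The baseline scores roughly 99.8 percent accuracy while catching no fraud at all, which is why the raw accuracy number is not a useful guide on this data.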
Separate the data for analysis
legit = credit_data[credit_data.Class == 0]
fraud = credit_data[credit_data.Class == 1]
print(legit.shape)
print(fraud.shape)
OUTPUT:-
(284315, 31)
(492, 31)
Statistical measures of the data.
legit.Amount.describe()
OUTPUT:-
Do the same for the fraudulent transactions.
fraud.Amount.describe()
OUTPUT:-
The mean transaction amount differs noticeably between the two classes.
Compare the mean of every column for both kinds of transaction.
credit_data.groupby('Class').mean()
OUTPUT:-
The two classes differ in their feature means, and the model relies on these differences to decide whether a transaction is fraudulent or legitimate.
Dealing with the unbalanced data.
UNDERSAMPLING
Build a sample dataset containing a similar distribution of normal and fraudulent transactions.
legit_sample = legit.sample(n=492)
Concatenate the two data frames.
new_dataset = pd.concat([legit_sample, fraud], axis=0)
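The undersampling step can be tried end to end on a small synthetic frame. This is a sketch under stated assumptions: the `toy` data is made up, and the `random_state` arguments and the final shuffle are additions for repeatability that the tutorial's own code does not use.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 1000 legitimate rows, 20 fraudulent rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.uniform(1, 500, size=1020),
    "Class": [0] * 1000 + [1] * 20,
})

toy_legit = toy[toy.Class == 0]
toy_fraud = toy[toy.Class == 1]

# Draw as many legitimate rows as there are fraudulent ones.
# random_state (an addition here) makes the draw repeatable.
toy_legit_sample = toy_legit.sample(n=len(toy_fraud), random_state=1)

# Stack the two parts, then shuffle so the classes are not grouped in blocks.
toy_balanced = (
    pd.concat([toy_legit_sample, toy_fraud], axis=0)
    .sample(frac=1, random_state=1)
    .reset_index(drop=True)
)

print(toy_balanced["Class"].value_counts())
```

Without a fixed `random_state`, each run draws a different set of legitimate rows, so results such as the accuracy scores can vary slightly from run to run.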
Print the first five rows of the new dataset with the head method.
new_dataset.head()
OUTPUT:-
Now print the last five rows by using the tail method.
new_dataset.tail()
OUTPUT:-
Let's count how many rows of each class are present in the new dataset.
new_dataset['Class'].value_counts()
OUTPUT:-
0    492
1    492
Name: Class, dtype: int64
Let’s check the mean difference between the two classes.
new_dataset.groupby('Class').mean()
OUTPUT:-
Compared with the earlier output, the difference between the class means is smaller, but it is still present.
That remaining difference is what the model will use to tell the two classes apart.
Splitting the data into features and target.
X = new_dataset.drop(columns='Class')
Y = new_dataset['Class']
print(X)
OUTPUT:-
Now let's print the Y values.
print(Y)
OUTPUT:-
Split the Training and Test data.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
print(X.shape,X_train.shape,X_test.shape)
OUTPUT:-
(984, 30) (787, 30) (197, 30)
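The `stratify=Y` argument asks train_test_split to keep the class proportions the same in both splits. A minimal sketch on synthetic data of the same size (the `X_demo` and `Y_demo` names and the random features are assumptions for illustration) shows the effect:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic balanced data with as many rows as the undersampled set (984).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(984, 3)), columns=["f1", "f2", "f3"])
Y_demo = pd.Series([0] * 492 + [1] * 492, name="Class")

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, stratify=Y_demo, random_state=2
)

print(X_demo.shape, X_tr.shape, X_te.shape)

# With stratify, both splits keep close to a 50/50 class mix.
print(Y_tr.value_counts(normalize=True))
print(Y_te.value_counts(normalize=True))
```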
MODEL TRAINING
model=LogisticRegression()
print(model)
OUTPUT:-
LogisticRegression()
Training logistic regression model with training data.
model.fit(X_train,Y_train)
OUTPUT:-
LogisticRegression()
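When the feature columns sit on very different scales, LogisticRegression with default settings may stop before it converges and print a warning. One possible variation, not used in this tutorial, is to standardize the features in a pipeline and raise `max_iter`. The sketch below runs on synthetic data from `make_classification`; the inflated first column is an assumption meant to mimic a raw, unscaled feature.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the balanced dataset (984 rows, 30 features).
X_syn, y_syn = make_classification(
    n_samples=984, n_features=30, n_informative=10, random_state=0
)
# Put one feature on a much larger scale than the others (illustrative assumption).
X_syn[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=2
)

# Standardize each feature, then fit logistic regression with a higher iteration cap.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

scaled_test_accuracy = scaled_model.score(X_te, y_te)
print("Test accuracy:", scaled_test_accuracy)
```

Because the scaler is fitted inside the pipeline, it only ever sees the training rows, so no information from the test split leaks into training.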
MODEL EVALUATION
Accuracy score
# Accuracy on training data (true labels first, predictions second)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print("Accuracy on training data : ",training_data_accuracy)
OUTPUT:-
Accuracy on training data : 0.9491740787801779
Accuracy score on test data.
# Accuracy on test data (true labels first, predictions second)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print("Accuracy on test data : ",test_data_accuracy)
OUTPUT:-
Accuracy on test data : 0.9187817258883249
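Accuracy is above 90 percent on both the training data and the test data, and the two scores are close to each other, so the model generalizes reasonably well on this balanced sample and is a reasonable starting point for credit card fraud detection.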
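Accuracy alone does not show how many frauds were missed or how many legitimate transactions were flagged by mistake. A confusion matrix and per-class precision and recall give that breakdown. The sketch below is not part of the original walkthrough and runs on synthetic data, so its numbers say nothing about the credit card dataset; the `clf`, `y_hat`, and `cm` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced dataset (984 rows, 30 features).
X_syn, y_syn = make_classification(
    n_samples=984, n_features=30, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=2
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Precision, recall, and F1 for each class.
print(classification_report(y_te, y_hat, target_names=["legit", "fraud"]))
```

On the real data the same calls would take `Y_test` and `X_test_prediction` in place of `y_te` and `y_hat`.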