Credit Card Fraud Detection using Logistic Regression

This tutorial covers credit card fraud detection with logistic regression in Python, using the scikit-learn machine learning library.

Given the data for a transaction, the goal is to predict whether the transaction is legitimate or fraudulent.

You can download the dataset here: credit data.

Let's implement the code.

Import the necessary Python libraries.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

The first step is data collection.

1. Load data

#Load the dataset
credit_data=pd.read_csv("C:\\Users\\usersdrive\\OneDrive\\Desktop\\creditcard.csv")

To confirm that the dataset loaded correctly, print the first five rows with the head method.

credit_data.head()

OUTPUT:-

Now print the last five rows by using the tail method.

credit_data.tail()

OUTPUT:-

Inspecting the first and last rows like this is a useful habit whenever you work on a machine learning problem that needs data exploration.

DATASET INFORMATION

credit_data.info()

OUTPUT:-

All the columns hold float or integer values, so the data can be passed to the model without encoding any categorical columns.
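As a complement to reading the info() output by eye, the same check can be done programmatically. The sketch below uses a small made-up frame in place of credit_data (the values are illustrative, not taken from the dataset) and lists any column whose dtype is not numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame standing in for credit_data; the values are made up.
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "Amount": [10.0, 25.5, 99.9],
    "Class": [0, 0, 1],
})

# Names of columns whose dtype is not numeric (an empty list means all are numeric).
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
print(non_numeric)
```

An empty list means every column can be fed to the model as it is.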

Let’s check whether there are any missing values present in the given dataset.

credit_data.isnull().sum()

OUTPUT:-

No missing values are present in the dataset, so there is no need to fill gaps with mean or median values.
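For comparison, here is a minimal sketch of what filling a gap would look like if a column did contain missing values. The frame and its values are made up for illustration, and the median is only one possible choice of fill value.

```python
import numpy as np
import pandas as pd

# Toy data with one missing Amount; values are made up for illustration.
toy = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0, 100.0],
    "Class": [0, 0, 1, 0],
})

# Count missing values per column, as done above for credit_data.
missing_per_column = toy.isnull().sum()

# Fill the gap with the column median, which is less sensitive to outliers than the mean.
median_amount = toy["Amount"].median()
filled = toy.fillna({"Amount": median_amount})
print(filled)
```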

In the Class column, 0 represents a legitimate transaction and 1 represents a fraudulent transaction.

Check how many zeros and ones are present in the dataset.

credit_data['Class'].value_counts()

OUTPUT:-

0    284315
1       492
Name: Class, dtype: int64

The output above shows only 492 ones in the dataset against 284315 zeros.

The dataset is highly unbalanced.

If you train directly on this dataset, the accuracy score is misleading: a model can score very high simply by predicting the majority class while catching almost no fraud.

To balance the dataset, take a random sample of 492 legitimate transactions and combine it with all 492 fraudulent transactions.
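One way to see why the class imbalance is a problem is to score a "model" that always answers 0. The sketch below builds label arrays from the class counts printed above and measures such a baseline. It is an illustration, not part of the tutorial's pipeline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

n_legit, n_fraud = 284315, 492  # class counts from value_counts() above

y_true = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])
y_pred = np.zeros_like(y_true)  # always predict "legit"

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)  # share of frauds that were caught
print(baseline_accuracy, baseline_recall)
```

The baseline reaches an accuracy above 99.8 percent while detecting none of the fraudulent transactions, which is why accuracy on the raw data says little about fraud detection.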

Separate the data for analysis

legit=credit_data[credit_data.Class==0]
fraud=credit_data[credit_data.Class==1]
print(legit.shape)
print(fraud.shape)

OUTPUT:-

(284315, 31)
(492, 31)

Statistical measures of the Amount column for legitimate transactions.

legit.Amount.describe()

OUTPUT:-

Do the same for the fraudulent transactions.

fraud.Amount.describe()

OUTPUT:-

The mean amounts of the two groups differ considerably.

Compare the mean values of every column for both kinds of transaction.

credit_data.groupby('Class').mean()

OUTPUT:-

The two classes differ in their mean values, and the model uses that difference to decide whether a transaction is fraudulent or legitimate.
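To make the comparison of class means concrete, the following sketch subtracts the legitimate row of the groupby result from the fraudulent row. The frame is a toy example, and `V_demo` is a hypothetical column name, not a column of the real dataset.

```python
import pandas as pd

# Toy data; values and the V_demo column are made up for illustration.
toy = pd.DataFrame({
    "Amount": [10.0, 20.0, 30.0, 200.0, 400.0],
    "V_demo": [0.1, -0.1, 0.0, -3.0, -5.0],
    "Class": [0, 0, 0, 1, 1],
})

class_means = toy.groupby("Class").mean()

# Fraud row minus legit row: large absolute values mark features that separate the classes.
mean_gap = class_means.loc[1] - class_means.loc[0]
print(mean_gap)
```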

Dealing with the unbalanced data.

UNDERSAMPLING

Build a sample dataset containing a similar distribution of normal and fraudulent transactions.

legit_sample=legit.sample(n=492)

Concatenate the two data frames.

new_dataset=pd.concat([legit_sample,fraud],axis=0)

To confirm that the concatenation worked, print the first five rows with the head method.

new_dataset.head()

OUTPUT:-

Now print the last five rows by using the tail method.

new_dataset.tail()

OUTPUT:-

Let’s count how many rows of each class are present in the new dataset.

new_dataset['Class'].value_counts()

OUTPUT:-

0    492
1    492
Name: Class, dtype: int64
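The undersampling steps can also be wrapped in a small helper. The sketch below is one possible version, not the tutorial's code: the function name `undersample` and the `random_state` default are assumptions added for illustration. Fixing `random_state` makes the sample reproducible, and the final shuffle keeps the classes from sitting in two contiguous blocks.

```python
import pandas as pd

def undersample(df, label_col="Class", random_state=0):
    """Return a shuffled frame in which every class has as many rows as the rarest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts, axis=0).sample(frac=1, random_state=random_state)

# Toy data: 16 legit rows and 4 fraud rows.
toy = pd.DataFrame({"Amount": range(20), "Class": [0] * 16 + [1] * 4})
balanced = undersample(toy)
print(balanced["Class"].value_counts())
```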

Let’s check the mean difference between the two classes.

new_dataset.groupby('Class').mean()

OUTPUT:-

Comparing this with the earlier output, the values have shifted because the legitimate rows are now a random sample, but the two classes still differ.

That remaining difference is what the model learns from.

Split the data into features and target.

X=new_dataset.drop(columns='Class')
Y=new_dataset['Class']
print(X)

OUTPUT:-

Now let’s print the Y values.

print(Y)

OUTPUT:-

Split the Training and Test data.

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)
print(X.shape,X_train.shape,X_test.shape)

OUTPUT:-

(984, 30) (787, 30) (197, 30)
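The `stratify=Y` argument keeps the class proportions the same in both splits. The sketch below shows this on a synthetic stand-in with the same class counts as new_dataset. The feature column is a placeholder, so only the label counts are meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as new_dataset (492 per class).
y = np.array([0] * 492 + [1] * 492)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Rows per class in each split.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Both splits end up with a near-even mix of the two classes, so the test accuracy is not skewed by an unlucky split.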

MODEL TRAINING

model=LogisticRegression()
print(model)

OUTPUT:-

LogisticRegression()

Training logistic regression model with training data.

model.fit(X_train,Y_train)

OUTPUT:-

LogisticRegression()
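If the features are on very different scales, the default solver may stop before it has converged and emit a ConvergenceWarning. A common variation, which is not what the tutorial does, is to standardize the features first and raise `max_iter`. The sketch below shows that variation on synthetic data from `make_classification`, so its accuracy says nothing about the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X and Y; this is not the credit card data.
X_demo, y_demo = make_classification(
    n_samples=984, n_features=30, n_informative=10, random_state=2
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Standardize each feature, then fit logistic regression with a larger iteration cap.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)
test_accuracy = scaled_model.score(X_te, y_te)
print(test_accuracy)
```

Because the scaler lives inside the pipeline, it is fitted on the training rows only and then reused unchanged on the test rows.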

MODEL EVALUATION

Accuracy score on training data.

X_train_prediction=model.predict(X_train)
#accuracy_score expects the true labels first, then the predictions
training_data_accuracy=accuracy_score(Y_train,X_train_prediction)
print("Accuracy on training data : ",training_data_accuracy)

OUTPUT:-

Accuracy on training data :  0.9491740787801779

Accuracy score on test data.

#accuracy score on test data (true labels first, then predictions)
X_test_prediction=model.predict(X_test)
test_data_accuracy=accuracy_score(Y_test,X_test_prediction)
print("Accuracy on test data : ",test_data_accuracy)

OUTPUT:-

Accuracy on test data :  0.9187817258883249
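Accuracy alone does not show which kind of mistake the model makes. For fraud detection it can help to also look at the confusion matrix, precision, and recall, all available in sklearn.metrics. The labels in the sketch below are hand-made for illustration and are not output from the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels for illustration only; 1 = fraud, 0 = legit.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)  # of the predicted frauds, how many were real
recall = recall_score(y_true, y_pred)        # of the real frauds, how many were caught
print(cm)
print(precision, recall)
```

In this tutorial the same calls would take Y_test and X_test_prediction as their two arguments.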

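The accuracy on both the training and test data is above 90 percent, and the two scores are close to each other, so the model generalizes reasonably well on this balanced sample and logistic regression is a suitable algorithm for credit card fraud detection.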
