Phishing website detection using auto-encoders in Keras using Python

Phishing is one of the most common cyber-attacks of the modern era. In such attacks, the phisher targets unsuspecting users and steals their details. Detecting phishing websites is therefore becoming a very effective way of protecting ourselves.

In this article, you will learn about auto-encoders and how to implement one in Keras for phishing website detection using Python.

The first part of this article is installing the required library.

# Run this in a Jupyter notebook cell (drop the leading ! in a regular command prompt)
!pip install keras

Importing the dataset

The next step is to make sure you have the dataset in your working directory. You can download the dataset from here. After you download the dataset, place it in your working directory for easier access. The next snippet of code shows how to load the dataset into your environment.

import pandas as pd

data0 = pd.read_csv(r'path to your dataset(urldata.csv)')

data0.head()
#head() displays the first 5 rows of the dataset.

#Checking the shape of the dataset
data0.shape

#Listing the features of the dataset
data0.columns
"""OUTPUT: Index(['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record','Web_Traffic', 'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label'],dtype='object')"""

#Information about the dataset
data0.info()
"""<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Domain         10000 non-null  object
 1   Have_IP        10000 non-null  int64 
 2   Have_At        10000 non-null  int64 
 3   URL_Length     10000 non-null  int64 
 4   URL_Depth      10000 non-null  int64 
 5   Redirection    10000 non-null  int64 
 6   https_Domain   10000 non-null  int64 
 7   TinyURL        10000 non-null  int64 
 8   Prefix/Suffix  10000 non-null  int64 
 9   DNS_Record     10000 non-null  int64 
 10  Web_Traffic    10000 non-null  int64 
 11  Domain_Age     10000 non-null  int64 
 12  Domain_End     10000 non-null  int64 
 13  iFrame         10000 non-null  int64 
 14  Mouse_Over     10000 non-null  int64 
 15  Right_Click    10000 non-null  int64 
 16  Web_Forwards   10000 non-null  int64 
 17  Label          10000 non-null  int64 
dtypes: int64(17), object(1)
memory usage: 1.4+ MB"""

Creation of the auto-encoder network

An auto-encoder compresses its input into a lower-dimensional code, tries to retain the content, and then reconstructs the original input from that code. Even though auto-encoders are lossy, their ability to learn a compact representation automatically is a useful property. The code below implements the auto-encoder; do check the comments for a better understanding.

# Separating & assigning features and target columns to X & y
y = data0['Label']
X = data0.drop(['Domain', 'Label'], axis=1)  # Domain is a string column, so it is dropped along with the target

# Splitting the dataset into train and test sets: 80-20 split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12)


#importing required packages
import keras
from keras.layers import Input, Dense
from keras import regularizers
import tensorflow as tf
from keras.models import Model
from sklearn import metrics

#building autoencoder model

input_dim = X_train.shape[1]  # number of input features
encoding_dim = input_dim

input_layer = Input(shape=(input_dim, ))
encoder = Dense(encoding_dim, activation="relu",
                activity_regularizer=regularizers.l1(10e-4))(input_layer)  # first encoder layer: relu activation with an L1 activity regularizer to encourage sparsity
encoder = Dense(int(encoding_dim), activation="relu")(encoder)
encoder = Dense(int(encoding_dim-2), activation="relu")(encoder)
code = Dense(int(encoding_dim-4), activation='relu')(encoder)  # bottleneck: the compressed representation

decoder = Dense(int(encoding_dim-2), activation='relu')(code)  # from here the decoding part starts, where the model tries to reconstruct the input
decoder = Dense(int(encoding_dim), activation='relu')(decoder)
decoder = Dense(input_dim, activation='relu')(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)  # build the full auto-encoder model from the architecture above
autoencoder.summary()  # display a summary of the model

#compiling the model
autoencoder.compile(optimizer='adam',
                    loss='binary_crossentropy',
                    metrics=['accuracy'])

#Training the model
history = autoencoder.fit(X_train, X_train, epochs=10, batch_size=64, shuffle=True, validation_split=0.2)

The code above shows how to build an auto-encoder model and how to prepare the data for training and testing. After training, you can evaluate the accuracy your model achieves. Also, try manipulating the parameters and the dataset; this will give you a better understanding of auto-encoders and neural networks in general. The next snippet shows how to compute accuracy scores for the trained model. Note that because the model is trained to reconstruct its own input, these scores measure reconstruction quality rather than phishing classification directly, but they still tell you whether the model is functioning properly.

acc_train_auto = autoencoder.evaluate(X_train, X_train)[1]
acc_test_auto = autoencoder.evaluate(X_test, X_test)[1]

print('\nAutoencoder: Accuracy on training Data: {:.3f}'.format(acc_train_auto))
print('Autoencoder: Accuracy on test Data: {:.3f}'.format(acc_test_auto))
#The output which I received
#Autoencoder: Accuracy on training Data: 0.817
#Autoencoder: Accuracy on test Data: 0.818
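Beyond accuracy, a common way to put an auto-encoder to work for detection is to look at the per-sample reconstruction error: samples the model reconstructs poorly are treated as suspicious. The sketch below illustrates the idea with NumPy, using made-up arrays in place of the real X_test and the output of autoencoder.predict(X_test); the threshold value is an assumption you would tune on validation data.

```python
import numpy as np

# Hypothetical stand-ins: in practice X_test is your feature matrix and
# reconstructions comes from autoencoder.predict(X_test).
X_test = np.array([[0., 1., 0.],
                   [1., 0., 1.],
                   [1., 1., 1.]])
reconstructions = np.array([[0.1, 0.9, 0.1],
                            [0.9, 0.1, 0.9],
                            [0.2, 0.1, 0.3]])  # the last row reconstructs poorly

# Per-sample mean squared reconstruction error
mse = np.mean((X_test - reconstructions) ** 2, axis=1)

# Samples whose error exceeds the threshold are flagged as suspicious;
# 0.1 is an arbitrary illustrative value, tuned on validation data in practice.
threshold = 0.1
flagged = mse > threshold
print(mse)
print(flagged)  # only the badly reconstructed sample is flagged
```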

This article walked you through implementing an auto-encoder for a real-world application in Python using Keras. You can also extend this work further and turn it into a real-time product.
