Data Mining in Python: Phishing Detection

Abstract

Web spoofing lures users into interacting with fake websites rather than the real ones. The
main objective of this attack is to steal sensitive information from users. The attacker
creates a ‘shadow’ website that looks similar to the legitimate website. This fraudulent act allows
the attacker to observe and modify any information the user submits. In this project, phishing
websites are detected by checking the Uniform Resource Locators (URLs) of web
pages.

Goal

The proposed solution distinguishes between a legitimate web page and a
fake web page by checking the Uniform Resource Locators (URLs) of suspected web pages.
URLs are inspected for particular characteristics that indicate phishing web pages. The
detected attacks are reported for prevention. The performance of the proposed solution is
evaluated using the PhishTank and Yahoo directory datasets.
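
As a rough illustration of inspecting a URL for particular characteristics, the sketch below derives a few lexical features using only the Python standard library. The function name extract_url_features and the specific features are assumptions chosen for illustration; the project's actual feature set is not listed here.

```python
import re
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Return a few illustrative lexical features of a URL.

    The feature set is hypothetical and is not taken from the project.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        # Very long URLs are often used to hide the real destination.
        "url_length": len(url),
        # A raw IPv4 address in place of a domain name is a common warning sign.
        "has_ip_host": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        # An '@' anywhere in the URL can be used to disguise the real host.
        "has_at_symbol": "@" in url,
        # Many dots in the host suggest deeply nested subdomains.
        "num_dots_in_host": host.count("."),
        "uses_https": parsed.scheme == "https",
    }


features = extract_url_features("http://192.168.0.1/login@secure-update")
for name, value in features.items():
    print(f"{name}: {value}")
```

Each feature becomes one column of the table that a classifier is later trained on.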

Results

The obtained results show that the
detection mechanism is deployable and capable of detecting various types of phishing attacks
while maintaining a low rate of false alarms.

To detect and predict phishing websites, we propose an intelligent, flexible and effective system named “Phishnet” that is based on a Random Forest classifier. The project includes modules that extract the criteria (features) from the phishing datasets and use them to classify the legitimacy of websites.
Furthermore, a Chrome extension is provided to display the status of the website to the user.
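
A minimal sketch of how a Random Forest classifier could be trained on URL feature vectors is shown below, using scikit-learn's RandomForestClassifier. The feature columns, the sample values and the labels are made up for illustration and are not the project's data or its actual configuration.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy feature vectors (assumed columns):
# [url_length, has_ip_host, has_at_symbol, num_dots_in_host]
X = [
    [24, 0, 0, 2], [31, 0, 0, 2], [19, 0, 0, 1], [27, 0, 0, 2],
    [88, 1, 1, 3], [95, 1, 0, 3], [76, 0, 1, 5], [102, 1, 1, 4],
]
# Made-up labels: 0 = legitimate, 1 = phishing
y = [0, 0, 0, 0, 1, 1, 1, 1]

# A fixed random_state keeps the sketch reproducible.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Classify a new, unseen feature vector.
prediction = clf.predict([[90, 1, 1, 4]])[0]
print("phishing" if prediction == 1 else "legitimate")
```

In a real deployment the predicted status would be what a browser extension reports to the user.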

API Libraries Used

Pandas

Pandas is an open source Python package that is widely used for
data science, data analysis and machine learning tasks.

It is used to import the dataset and manipulate it further. In this project it is essentially used to work with dataframes.
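
The sketch below shows the kind of dataframe work described above: loading a dataset and selecting rows and columns. The inline CSV and its column names are hypothetical stand-ins for the project's dataset file, which is not shown here.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for the project's dataset file.
csv_text = """url,url_length,has_at_symbol,label
https://www.example.com/,24,0,legitimate
http://192.168.0.1/login@secure-update,38,1,phishing
http://example.org/,19,0,legitimate
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Separate the numeric feature columns from the label column.
feature_columns = ["url_length", "has_at_symbol"]
X = df[feature_columns]
y = df["label"]

# Filter rows with a boolean mask.
phishing_rows = df[df["label"] == "phishing"]
print(df.shape, len(phishing_rows))
```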

Keras

A high-level deep learning API developed by Google for implementing neural networks.

The following were imported: MaxPooling2D, Dense, Dropout, Activation, Flatten, Convolution2D and Sequential.

Sklearn

Sklearn (scikit-learn) is a widely used machine learning library for Python. It includes algorithms and tools for statistical modeling. For this project I opted for the XGBoost machine learning algorithm, which is provided by the separate XGBoost library described below and is used alongside Sklearn.

XGBoost

XGBoost stands for Extreme Gradient Boosting. It is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is a leading library for regression, classification and ranking problems.

Matplotlib

This library is often used to display graphical representations of a dataset so that stakeholders can better understand it and extract information. In this project it is used alongside NumPy, which helps in splitting the arrays in the dataset to plot the bar graphs.

NumPy

NumPy is a Python library used to perform mathematical operations, specifically on arrays. It is used under Matplotlib as a numerical handling resource for large data. In this project we use NumPy to help us split the arrays of data into training and testing sets.
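
One way to split arrays into training and testing sets with NumPy alone is sketched below. The 80/20 ratio, the seed and the toy arrays are assumptions for illustration; the project's actual split is not specified.

```python
import numpy as np

# Toy data: 10 samples with 2 features each, plus alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Shuffle the row indices so the split is not biased by row order.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(len(X))

# Assumed 80/20 split between training and testing rows.
split_at = int(0.8 * len(X))
train_idx, test_idx = indices[:split_at], indices[split_at:]

# Index features and labels with the same index arrays to keep them aligned.
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```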

Tkinter

Tkinter is Python's standard library for creating graphical user interfaces for desktop applications.

In this project we have used Tkinter to help construct the GUI, including labels, buttons, dialog boxes and frames.

Seaborn

Seaborn is a Python data visualization library based on Matplotlib. It helps in making statistical graphics in Python and lets you focus on what the different elements of your plots mean. Here we use Seaborn to display a heatmap and a histogram comparing the 13 URL features.

System Architecture

1. Network Attack Dataset - A KDD dataset consisting of approximately 4,900,000 single-connection vectors, each of which contains 41 features and is labelled as either normal or an attack, with exactly one specific attack type.

2. Data Preprocessing - The preprocessor removes null parameters and clears empty datasets. This is referred to as data normalization.

3. Machine Learning and Deep Learning Algorithms - SVM, LSTM, CNN, Random Forest, Decision Tree, Naïve Bayes and KNN algorithms are implemented.

4. Algorithm Comparison - Based on the comparison, a bar chart is displayed showing which algorithm gives the best accuracy and is most suitable for network-based attacks.
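
The preprocessing and comparison steps above can be sketched as follows, assuming pandas and scikit-learn. The data is synthetic, the column names f1 and f2 are placeholders, and only three of the listed algorithms are compared; the accuracies are printed rather than drawn as a bar chart. This is an illustration of the flow, not the project's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: two numeric features and a binary label.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "f1": rng.normal(loc=labels * 3.0, scale=1.0),
    "f2": rng.normal(loc=labels * -2.0, scale=1.0),
    "label": labels,
})
# Inject a few missing values so the cleaning step has something to remove.
df.loc[df.index[::25], "f1"] = np.nan

# Step 2: drop rows that contain null values.
clean = df.dropna()
X = clean[["f1", "f2"]].to_numpy()
y = clean["label"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 3: train a subset of the listed algorithms.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Step 4: compare test-set accuracy across the models.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

best = max(accuracies, key=accuracies.get)
for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
print("Best:", best)
```

The accuracies dictionary holds exactly the values a bar chart would plot, one bar per algorithm.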