Phishing URL Detection: Random Forest Classification

IMPORTANT: The notebook for this model can either be run via the instructions below or viewed directly on Kaggle.

Description

The purpose of this model is to classify a URL as either legitimate or a phishing attempt. Phishing is a type of social engineering attack often used to steal user data, including login credentials and credit card numbers. In this case, phishing is detected from the URL itself by analyzing features such as special characters, redirects, SSL certificates, and domain information. Additionally, the model's predictions are complemented by checks against VirusTotal results and a whitelist of known legitimate domains. These external verification results are cached for future use, so repeated checks of the same URL do not require a new lookup.
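The whitelist check and the caching of external results described above could look roughly like the sketch below. It is not the project's actual code: the whitelist entries, the `VerdictCache` class, the JSON cache file, and the `fetch` callable (a stand-in for whatever performs the VirusTotal lookup) are all assumptions made for illustration.

```python
import json
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical whitelist; the project's real list is not shown in this README.
WHITELIST = {"example.com", "python.org"}


def hostname_of(url: str) -> str:
    """Return the lowercase hostname, tolerating URLs without a scheme."""
    if "://" not in url:
        url = "http://" + url
    return (urlparse(url).hostname or "").lower()


def is_whitelisted(url: str, whitelist=WHITELIST) -> bool:
    """True if the hostname is a whitelisted domain or a subdomain of one."""
    host = hostname_of(url)
    return any(host == d or host.endswith("." + d) for d in whitelist)


class VerdictCache:
    """Tiny JSON-file cache mapping hostname -> external verdict (assumed design)."""

    def __init__(self, path):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get_or_fetch(self, url, fetch):
        """Return the cached verdict, calling fetch(url) only on a cache miss."""
        host = hostname_of(url)
        if host not in self.data:
            self.data[host] = fetch(url)
            self.path.write_text(json.dumps(self.data))
        return self.data[host]
```

Matching on the parsed hostname rather than on a substring of the URL matters here: a URL such as http://python.org.evil.test/login contains a whitelisted name but does not belong to that domain.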

The model uses Random Forest classification to make these predictions. Random Forest is a machine learning algorithm that builds an ensemble of decision trees, and it is used here for classification. It combines the predictions of many decision trees, each trained on a different random sample of the data so that their errors are only weakly correlated, to improve accuracy and robustness.
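To make the ensemble idea concrete, here is a toy sketch in plain Python. It is not the project's training code and not a full Random Forest: each "tree" is a one-split decision stump trained on a bootstrap sample with one randomly chosen feature, and the final prediction is a majority vote. All function names and the sample data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        for above in (0, 1):  # label predicted when the value is > t
            errors = sum(
                (above if r[feature] > t else 1 - above) != y
                for r, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return feature, t, above


def predict_stump(stump, row):
    feature, t, above = stump
    return above if row[feature] > t else 1 - above


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample using one random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(n_features)         # random feature per stump
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(fit_stump(sample_rows, sample_labels, feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Made-up rows of (url_length, n_at); 1 = phishing, 0 = legitimate.
rows = [(20, 0), (25, 0), (30, 0), (22, 0), (90, 1), (110, 2), (95, 1), (120, 1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (100, 1)), predict_forest(forest, (15, 0)))
```

A production model would normally rely on a library implementation with deeper trees and per-split feature sampling; the sketch only shows why averaging many differently trained trees smooths out the mistakes of any single one.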

To test the final model, execute streamlit run app.py. You can also run it via the deployed model here.

Data Acquisition

The original data acquired from Kaggle can be accessed through the link provided below:

Download Data

Key Features of the Dataset

url_length: The length of the URL.
n_slash: The count of ‘/’ characters in the URL.
n_questionmark: The count of ‘?’ characters in the URL.
n_equal: The count of ‘=’ characters in the URL.
n_at: The count of ‘@’ characters in the URL.
n_and: The count of ‘&’ characters in the URL.
n_exclamation: The count of ‘!’ characters in the URL.
n_asterisk: The count of ‘*’ characters in the URL.
n_hastag: The count of ‘#’ characters in the URL.
n_percent: The count of ‘%’ characters in the URL.
dots_per_length: The number of '.' characters per URL query.
hyphens_per_length: The number of '-' characters per URL query.
is_long_url: Whether the URL query is an abnormally long string.
has_many_dots: Whether the URL has an abnormal number of '.' characters.
has_ssl: Whether the URL has an SSL certificate.
is_cloudflare_protected: Whether the URL is Cloudflare protected.
special_char_density: The ratio of special characters (*&@#) within the URL.
suspicious_tld_risk: The risk that the URL contains suspicious extensions, domains, or patterns.
has_redirects: Whether the URL has redirects.
risk_score: The overall risk score of the URL based on its characteristics.
url_complexity: The overall complexity of the URL based on its characteristics.
phishing: The label of the URL. 1 is phishing and 0 is legitimate.
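As an illustration of how the character-based features in this list could be derived from a raw URL, here is a minimal sketch. The function name `count_features` is hypothetical, and the formulas for the three ratio features are assumptions (counts divided by the total URL length), since the exact definitions used in the dataset are not given here.

```python
def count_features(url: str) -> dict:
    """Compute the character-count features listed above for one URL."""
    symbols = {
        "n_slash": "/",
        "n_questionmark": "?",
        "n_equal": "=",
        "n_at": "@",
        "n_and": "&",
        "n_exclamation": "!",
        "n_asterisk": "*",
        "n_hastag": "#",  # spelling matches the dataset column name
        "n_percent": "%",
    }
    features = {"url_length": len(url)}
    for name, ch in symbols.items():
        features[name] = url.count(ch)

    # Assumed definitions (not confirmed by the dataset): count / URL length.
    length = max(len(url), 1)  # avoid division by zero on an empty string
    features["dots_per_length"] = url.count(".") / length
    features["hyphens_per_length"] = url.count("-") / length
    features["special_char_density"] = sum(url.count(c) for c in "*&@#") / length
    return features


example = count_features("http://x.co/a?b=1&c=2#d")
print(example["url_length"], example["n_slash"], example["n_equal"])
```

Features such as has_ssl, is_cloudflare_protected, and has_redirects cannot be computed from the string alone and would need network lookups, so they are left out of this sketch.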

Features

Data cleaning and preprocessing.
Statistical, univariate, and bivariate analysis.
Visualization of data distributions and relationships.
Training, evaluation, and deployment of the Random Forest model.

Project Structure

data/: Contains the dataset used for modelling.
model/:
- notebook.ipynb: Jupyter notebook detailing the training process.
- requirements.txt: Requirements for the Jupyter notebook.
streamlit/app.py: Streamlit application code for deployment.
requirements.txt: Requirements for the Streamlit app.
README.md: Project documentation.

Installation

Prerequisites

Python Version: 3.13.2 | packaged by Anaconda
Jupyter Notebook Version: 7.3.3
Install the required libraries using: pip install -r requirements.txt.

Running the Notebook

Open the .ipynb file in Jupyter by running: jupyter notebook.
Run all cells in the notebook.

Sample Visualization

*The site flagged as phishing was sourced from PhishTank.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions or suggestions, please contact me via the email on my profile or LinkedIn.

christinec-dev/PhishingURLModel
