[🇧🇷 Português] [🇺🇸 English]
This repository provides a hands-on guide to two powerful data mining techniques: Principal Component Analysis (PCA) and Isolation Forest.
It explains how to simplify high-dimensional data and detect outliers with visual and practical examples.
Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (PhD in Mathematics)
🎶 Prelude Suite no. 1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4
📺 For better resolution, watch the video on YouTube.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanistic AI and Data Science at PUC-SP.
➜ Access Data Mining Main Repository
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
- Introduction
- What is PCA?
- How does PCA Work?
- What is the Isolation Forest Algorithm?
- How does Isolation Forest Work?
- Steps to Prepare Your Dataset (Data Cleaning)
- Data Normalization and Feature Scaling
- Visualizations: Scatter Plots and Box Plots
- Dimensionality Reduction Comparison: PCA vs. t-SNE
- How to Choose the Right Algorithm
- Spiral-shaped Datasets: K-Means vs. DBSCAN
- Evaluation Metrics for Clustering and Anomaly Detection
- Code Implementation
- Results and Insights
- Conclusion
- References
This project introduces core Data Mining concepts using simple, interpretable examples in Python.
It demonstrates how to clean data, reduce dimensionality using PCA, and detect anomalies with Isolation Forest.
All code examples are easy to follow and include visualizations for learning and experimentation.
PCA (Principal Component Analysis) simplifies large datasets by keeping the most important patterns while reducing noise and redundancy.
It's like summarizing a book: you lose the extra words but keep the full meaning.
1. Center and normalize your data.
2. Compute covariance matrix.
3. Extract eigenvectors (principal components).
4. Keep the top components explaining most of the variance.
5. Visualize results using biplots or scree plots.
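For intuition, these steps can be reproduced with plain NumPy before reaching for a library. Below is a minimal sketch on made-up data; all variable names and values are illustrative (step 1's normalization is omitted for brevity):

import numpy as np

# Toy data: 100 samples, 3 features, with feature 3 nearly redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)

# Steps 1-2: center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 3: the eigenvectors of the covariance matrix are the principal components
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the top 2 components and project the data onto them
X_projected = X_centered @ eigvecs[:, :2]
print("Explained variance ratio:", eigvals[:2] / eigvals.sum())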
Isolation Forest is used for anomaly detection: spotting items that differ significantly from the majority.
It isolates anomalies by randomly splitting data and measuring how quickly a sample can be separated.
1. Build multiple random trees.
2. Each split isolates points based on random features.
3. Fewer splits = more likely an anomaly.
4. Compute anomaly scores and visualize them.
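The "fewer splits" intuition from step 3 can be checked with scikit-learn's score_samples, which returns lower scores for points that are easier to isolate. A minimal sketch on made-up data:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_normal = rng.normal(loc=0, scale=1, size=(100, 2))  # dense cluster
X_anomaly = np.array([[6.0, 6.0]])                    # far-away point

clf = IsolationForest(random_state=0).fit(X_normal)

# The isolated point needs fewer random splits, so it scores lower
print("normal point score:", clf.score_samples(X_normal[:1]))
print("anomalous point score:", clf.score_samples(X_anomaly))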
- Remove missing or duplicate values.
- Normalize and scale features.
- Visualize your dataset with scatter and box plots to spot trends and outliers.
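The first two steps in pandas might look like this (a minimal sketch; the DataFrame and its values are made up):

import pandas as pd

# Hypothetical raw data with one duplicate row and one missing value
df = pd.DataFrame({
    "feature1": [1.0, 2.0, 2.0, None],
    "feature2": [10.0, 20.0, 20.0, 40.0],
})

df = df.drop_duplicates()  # remove duplicate rows
df = df.dropna()           # remove rows with missing values
print(df)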
Scaling ensures that all features contribute equally to the analysis.
Use StandardScaler or MinMaxScaler from sklearn.preprocessing.
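For example (a short sketch on made-up numbers), StandardScaler rescales each feature to zero mean and unit variance, while MinMaxScaler maps each feature to the [0, 1] range:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

print(StandardScaler().fit_transform(X))  # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))    # each column mapped to [0, 1]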
Visual tools help interpret the structure of your data and spot anomalies.
import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be an existing pandas DataFrame with numeric columns feature1 and feature2
sns.scatterplot(x='feature1', y='feature2', data=df)
plt.title("Scatter Plot Example")
plt.show()

- For structured numeric data → use PCA
- For anomaly detection → use Isolation Forest
- For non-linear structures → try t-SNE or DBSCAN
- For spiral or irregular data → prefer DBSCAN over K-Means
| Method | Type | Use Case | Notes |
|---|---|---|---|
| PCA | Linear dimensionality reduction | Large, structured datasets | Fast, interpretable |
| t-SNE | Non-linear dimensionality reduction | Visualizing high-dimensional data | Better for clusters, slower |
| Isolation Forest | Unsupervised anomaly detection | Outlier / anomaly detection | Works well with high-dimensional data |
| DBSCAN | Density-based clustering | Detecting clusters of arbitrary shape | Does not require the number of clusters, handles noise |
| K-Means | Centroid-based clustering | Partitioning data into k clusters | Assumes spherical clusters, fast on large datasets |
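As a quick illustration of the t-SNE row above, here is a minimal sketch on the Iris data (the perplexity value is an arbitrary illustrative choice):

from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X = load_iris().data

# Non-linear embedding into 2-D; perplexity controls the neighborhood size
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
print(X_embedded.shape)  # (150, 2)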
- K-Means assumes circular (convex) clusters → not good for complex shapes.
- DBSCAN can detect arbitrary-shaped clusters (e.g., spirals) and noise, as the sketch below shows.
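A quick way to see the difference is scikit-learn's make_moons, a non-convex toy dataset (a minimal sketch; the eps and min_samples values are illustrative choices):

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# DBSCAN recovers the two crescents; K-Means cuts them with a straight boundary
print("K-Means ARI:", adjusted_rand_score(y, kmeans_labels))
print("DBSCAN ARI:", adjusted_rand_score(y, dbscan_labels))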
Here's a simple example of PCA using the Iris dataset:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Visualize
plt.figure(figsize=(8,6))
sns.scatterplot(x=X_pca[:,0], y=X_pca[:,1], hue=y, palette='viridis', s=100)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Projection of Iris Dataset")
plt.show()

📘 Full notebook: PCA_Analysis.ipynb
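To decide how many components to keep, it helps to inspect how much variance each one explains (a small follow-up to the example above, reusing the fitted pca object):

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
# For the standardized Iris data, the first two components together
# capture most of the variance
print(pca.explained_variance_ratio_.sum())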
Example using synthetic data:
from sklearn.ensemble import IsolationForest
import numpy as np
import matplotlib.pyplot as plt
# Generate synthetic data
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# Fit the model; contamination=0.1 tells it to expect ~10% of points to be outliers
clf = IsolationForest(contamination=0.1, random_state=rng)
clf.fit(X)
# Predict anomalies: -1 marks outliers, +1 marks inliers
y_pred = clf.predict(X)
# Visualization
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='coolwarm')
plt.title("Isolation Forest - Anomaly Detection")
plt.show()

📘 Full notebook: IsolationForest_Detection.ipynb
⚙️ Interactive Notebooks
| Notebook | Google Colab | Binder |
|---|---|---|
| PCA Analysis | | |
| Isolation Forest | | |
1. Abdi, H. & Williams, L. J. (2010). Principal Component Analysis. Wiley Interdisciplinary Reviews: Computational Statistics.
2. Castro, L. N. & Ferrari, D. G. (2016). Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações. Saraiva.
3. Dunteman, G. H. (1989). Principal Components Analysis. SAGE Publications.
4. Ferreira, A. C. P. L. et al. (2024). Inteligência Artificial: Uma Abordagem de Aprendizado de Máquina. 2nd ed. LTC.
5. Larson, R. & Farber, B. (2015). Estatística Aplicada. Pearson.
6. Liu, F. T., Ting, K. M. & Zhou, Z.-H. (2008). Isolation Forest. IEEE International Conference on Data Mining (ICDM).
My Contacts Hub
---
⬆ Back to Top
Copyright 2025 Quantum Software Development. Code released under the MIT License.