
👩🏻‍🚀 13-DataMining: Clear, beginner-friendly explanations and hands-on resources on Principal Component Analysis (PCA) and Isolation Forest for outlier detection, designed to make unsupervised learning approachable for everyone. ✠💚✠


Quantum-Software-Development/13-DataMining_PCA_and_IsolationForest-Guide



[🇧🇷 Português] [🇺🇸 English]



A practical guide to dimensionality reduction and anomaly detection. An easy guide for everyone!



This repository provides a hands-on guide to two powerful data mining techniques: Principal Component Analysis (PCA) and Isolation Forest.
It explains how to simplify high-dimensional data and detect outliers with visual and practical examples.







Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva, Doctor in Mathematics









Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

☞ Access Data Mining Main Repository










This project introduces core Data Mining concepts using simple, interpretable examples in Python.

It demonstrates how to clean data, reduce dimensionality using PCA, and detect anomalies with Isolation Forest.

All code examples are easy to follow and include visualizations for learning and experimentation.



PCA (Principal Component Analysis) simplifies large datasets by keeping the most important patterns while reducing noise and redundancy.

It's like summarizing a book: you drop the extra words but keep most of the meaning.



1. Center the data and scale each feature to unit variance.
2. Compute the covariance matrix.
3. Extract its eigenvectors (the principal components) and eigenvalues (the variance along each component).
4. Keep the top components that explain most of the variance.
5. Visualize the results using biplots or scree plots.
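
The first four steps can be sketched directly in NumPy. This is a minimal illustration, not the repository's own implementation: the toy matrix `X` and the choice `k = 2` are made up for the example.

```python
import numpy as np

# Toy data (made up for illustration): 6 samples, 3 correlated features
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.6],
    [2.3, 2.7, 1.0],
])

# 1. Center each feature and scale it to unit (sample) variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized features (columns are features)
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition; eigh is meant for symmetric matrices and
#    returns eigenvalues in ascending order, so re-sort them descending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Keep the top k components and project the data onto them
explained_ratio = eigenvalues / eigenvalues.sum()
k = 2  # assumed number of components for this example
X_projected = X_std @ eigenvectors[:, :k]

print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Projected shape:", X_projected.shape)
```

In practice `sklearn.decomposition.PCA` does the same job (it uses an SVD internally rather than an explicit covariance matrix), so the manual version is mainly useful for seeing what each step contributes.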



Isolation Forest is used for anomaly detection: spotting items that differ significantly from the majority.

It isolates anomalies by randomly splitting data and measuring how quickly a sample can be separated.



1. Build multiple random trees.
2. At each split, pick a random feature and a random split value.
3. Fewer splits needed to isolate a point = more likely an anomaly.
4. Compute anomaly scores and visualize them.
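
To make the "fewer splits" idea concrete, here is a simplified sketch that counts how many random splits it takes to isolate one point. It is not the full algorithm: the helper names `isolation_depth` and `mean_depth` are hypothetical, the data is synthetic, and a real Isolation Forest also subsamples the data and converts depths into a normalized score.

```python
import numpy as np

def isolation_depth(X, index, rng, max_depth=50):
    """Count random splits until X[index] is alone in its partition."""
    subset = X
    point = X[index]
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])          # random feature
        low = subset[:, feature].min()
        high = subset[:, feature].max()
        if low == high:
            break  # identical values: stop here (simplification)
        split = rng.uniform(low, high)              # random split value
        # Keep only the side of the split that contains our point
        if point[feature] < split:
            subset = subset[subset[:, feature] < split]
        else:
            subset = subset[subset[:, feature] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # dense cluster
X = np.vstack([cluster, [[8.0, 8.0]]])                   # one far-away point

outlier_index = 200
# Use the cluster point closest to the origin as a typical inlier
inlier_index = int(np.argmin(np.linalg.norm(cluster, axis=1)))

def mean_depth(index, n_trees=200):
    """Average isolation depth over many independent random trees."""
    return np.mean([isolation_depth(X, index, rng) for _ in range(n_trees)])

outlier_depth = mean_depth(outlier_index)
inlier_depth = mean_depth(inlier_index)
print(f"average splits, outlier: {outlier_depth:.1f}")
print(f"average splits, inlier:  {inlier_depth:.1f}")
```

The far-away point is usually cut off within a few splits, while the central point needs many more, which is the signal the anomaly score is built on.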



- Remove missing or duplicate values.
- Normalize and scale features.
- Visualize your dataset with scatter and box plots to spot trends and outliers.
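
A minimal cleaning sketch with pandas, assuming a small made-up table; the column names `feature1` and `feature2` are placeholders for your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one duplicate row and two missing values
df = pd.DataFrame({
    "feature1": [1.0, 2.0, 2.0, np.nan, 5.0, 40.0],
    "feature2": [10.0, 20.0, 20.0, 30.0, np.nan, 60.0],
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows

clean = (
    df.dropna()              # drop rows with any missing value
      .drop_duplicates()     # drop repeated rows, keeping the first
      .reset_index(drop=True)
)
print(clean)
```

Dropping rows is the simplest option; when too many rows would be lost, filling gaps (for example with `df.fillna(df.median())`) is a common alternative.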



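Scaling puts features on comparable ranges so that no feature dominates the analysis just because of its units. This matters for PCA in particular, which is driven by variance.

Use StandardScaler (zero mean, unit variance) or MinMaxScaler (rescales each feature to [0, 1]) from sklearn.preprocessing.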
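
A short comparison of the two scalers on made-up data (the age and income values are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up features on very different scales: age (years), income (currency units)
X = np.array([
    [25, 30_000],
    [32, 48_000],
    [47, 120_000],
    [51, 95_000],
], dtype=float)

X_standard = StandardScaler().fit_transform(X)  # per column: mean 0, std 1
X_minmax = MinMaxScaler().fit_transform(X)      # per column: min 0, max 1

print("standardized mean:", X_standard.mean(axis=0).round(6))
print("standardized std: ", X_standard.std(axis=0).round(6))
print("min-max range:    ", X_minmax.min(axis=0), X_minmax.max(axis=0))
```

When you have separate training and test sets, fit the scaler on the training data only and reuse it with `transform` on the test data.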



Visual tools help interpret the structure of your data and spot anomalies.


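import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Placeholder data; replace with your own DataFrame and column names
df = pd.DataFrame({"feature1": [1, 2, 3, 4], "feature2": [2, 4, 5, 9]})

sns.scatterplot(x="feature1", y="feature2", data=df)
plt.title("Scatter Plot Example")
plt.show()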



- For structured numeric data → use PCA

- For anomaly detection → use Isolation Forest

- For non-linear structures → try t-SNE or DBSCAN

- For spiral or irregular data → prefer DBSCAN over K-Means
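
For the non-linear case, a t-SNE embedding can be sketched as follows. The dataset (Iris) and the parameter values (`perplexity=30`, `init="pca"`) are assumptions chosen for a quick demo, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Standardize first so that all four features weigh equally in the distances
X = StandardScaler().fit_transform(load_iris().data)

# Embed the 4-dimensional data into 2 dimensions for plotting
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)
```

Unlike PCA, t-SNE has no `transform` for new points and its axes have no direct interpretation, so it is best treated as a visualization tool.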




| Method | Type | Use Case | Notes |
| --- | --- | --- | --- |
| PCA | Linear | Large, structured datasets | Fast, interpretable, reduces dimensionality |
| t-SNE | Non-linear | Visualizing high-dimensional data | Better for clusters, slower |
| Isolation Forest | Unsupervised | Outlier / anomaly detection | Works well with high-dimensional data |
| DBSCAN | Clustering | Detecting clusters of arbitrary shape | Does not require number of clusters, handles noise |
| K-Means | Clustering | Partitioning data into k clusters | Assumes spherical clusters, fast on large datasets |



- K-Means assumes roughly spherical (circular in 2D) clusters, so it handles complex shapes poorly.

- DBSCAN can detect arbitrary-shaped clusters (e.g., spirals) and noise.
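
The difference shows up on a simple non-spherical dataset. The sketch below uses scikit-learn's two-moons generator as a stand-in for spiral-like data; `eps=0.2` and `min_samples=5` are assumed values that happen to suit this dataset and would need tuning elsewhere.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are clearly not spherical
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # -1 marks noise

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
```

K-Means cuts the plane with a straight boundary and splits each moon, while DBSCAN follows the dense curves.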



Here's a simple example of PCA using the Iris dataset:


from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize
plt.figure(figsize=(8,6))
sns.scatterplot(x=X_pca[:,0], y=X_pca[:,1], hue=y, palette='viridis', s=100)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Projection of Iris Dataset")
plt.show()

🔗 Full notebook: PCA_Analysis.ipynb
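
To check how much information a two-component projection keeps, you can inspect the explained variance ratio. This is a small sketch that refits PCA with all components on the same standardized Iris data; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit with all components to see the full variance breakdown
pca_full = PCA().fit(X_scaled)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (ratio, total) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {ratio:.3f} (cumulative {total:.3f})")
```

On standardized Iris the first two components together account for about 96% of the variance, which is why the 2D projection separates the species so well. Plotting `ratios` against the component number gives a scree plot.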



Example using synthetic data:


from sklearn.ensemble import IsolationForest
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]

# Fit the model
clf = IsolationForest(contamination=0.1, random_state=rng)
clf.fit(X)

# Predict anomalies (1 = inlier, -1 = outlier)
y_pred = clf.predict(X)

# Visualization
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='coolwarm')
plt.title("Isolation Forest - Anomaly Detection")
plt.show()

🔗 Full notebook: IsolationForest_Detection.ipynb
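
Beyond the hard labels from `predict`, the fitted model exposes continuous anomaly scores. The sketch below rebuilds the same kind of synthetic data and uses scikit-learn's `score_samples` and `decision_function`; the variable names and the choice to list the five most anomalous points are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same kind of synthetic data: two tight clusters plus 20 scattered points
rng = np.random.RandomState(42)
X_inliers = 0.3 * rng.randn(100, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X_inliers + 2, X_inliers - 2, X_outliers]

clf = IsolationForest(contamination=0.1, random_state=42).fit(X)

scores = clf.score_samples(X)       # lower = more anomalous
margins = clf.decision_function(X)  # scores shifted so that < 0 means outlier
labels = clf.predict(X)             # -1 = outlier, 1 = inlier

n_flagged = int((labels == -1).sum())
most_anomalous = X[np.argsort(scores)[:5]]  # the five lowest-scoring points

print("flagged as outliers:", n_flagged)
print("five most anomalous points:")
print(most_anomalous)
```

Because `contamination` fixes the share of points flagged, ranking by `scores` is often more informative than the labels when the true outlier rate is unknown.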




| Notebook | Google Colab | Binder |
| --- | --- | --- |
| PCA Analysis | Colab | Binder |
| Isolation Forest | Colab | Binder |




1. Abdi, H. & Williams, L. J. (2010). Principal Component Analysis. Wiley Interdisciplinary Reviews: Computational Statistics.

2. Castro, L. N. & Ferrari, D. G. (2016). Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações. Saraiva.

3. Dunteman, G. H. (1989). Principal Components Analysis. SAGE Publications.

4. Ferreira, A. C. P. L. et al. (2024). Inteligência Artificial - Uma Abordagem de Aprendizado de Máquina. 2nd Ed. LTC.

5. Larson & Farber (2015). Estatística Aplicada. Pearson.

6. Liu, F. T. et al. (2008). Isolation Forest. IEEE ICDM.






Copyright 2025 Quantum Software Development. Code released under the MIT License.
