[🇧🇷 Português] [🇺🇸 English]
This repository provides a hands-on guide to two powerful data mining techniques: Principal Component Analysis (PCA) and Isolation Forest.
It explains how to simplify high-dimensional data and detect outliers with visual and practical examples.
Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (PhD in Mathematics)
🎶 Prelude Suite no. 1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4
📺 For better resolution, watch the video on YouTube.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanistic AI and Data Science at PUC-SP.
➜ Access Data Mining Main Repository
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
- Introduction
- What is PCA?
- How does PCA Work?
- What is the Isolation Forest Algorithm?
- How does Isolation Forest Work?
- Steps to Prepare Your Dataset (Data Cleaning)
- Data Normalization and Feature Scaling
- Visualizations: Scatter Plots and Box Plots
- Dimensionality Reduction Comparison: PCA vs. t-SNE
- How to Choose the Right Algorithm
- Spiral-shaped Datasets: K-Means vs. DBSCAN
- Evaluation Metrics for Clustering and Anomaly Detection
- Code Implementation
- Results and Insights
- Conclusion
- References
This project introduces core Data Mining concepts using simple, interpretable examples in Python.
It demonstrates how to clean data, reduce dimensionality using PCA, and detect anomalies with Isolation Forest.
All code examples are easy to follow and include visualizations for learning and experimentation.
PCA (Principal Component Analysis) simplifies large datasets by keeping the most important patterns while reducing noise and redundancy.
It's like summarizing a book: you lose the extra words but keep the full meaning.
1. Center and normalize your data.
2. Compute covariance matrix.
3. Extract eigenvectors (principal components).
4. Keep the top components explaining most of the variance.
5. Visualize results using biplots or scree plots.
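For intuition, these steps can be reproduced with plain NumPy before reaching for a library. Below is a minimal sketch on made-up data; all variable names and values are illustrative (step 1's normalization is omitted for brevity):

import numpy as np

# Toy data: 100 samples, 3 features, with feature 3 nearly redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)

# Steps 1-2: center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 3: the eigenvectors of the covariance matrix are the principal components
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the top 2 components and project the data onto them
X_projected = X_centered @ eigvecs[:, :2]
print("Explained variance ratio:", eigvals[:2] / eigvals.sum())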
Isolation Forest is used for anomaly detection: spotting items that differ significantly from the majority.
It isolates anomalies by randomly splitting data and measuring how quickly a sample can be separated.
1. Build multiple random trees.
2. Each split isolates points based on random features.
3. Fewer splits = more likely an anomaly.
4. Compute anomaly scores and visualize them.
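The "fewer splits" intuition from step 3 can be checked with scikit-learn's score_samples, which returns lower scores for points that are easier to isolate. A minimal sketch on made-up data:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_normal = rng.normal(loc=0, scale=1, size=(100, 2))  # dense cluster
X_anomaly = np.array([[6.0, 6.0]])                    # far-away point

clf = IsolationForest(random_state=0).fit(X_normal)

# The isolated point needs fewer random splits, so it scores lower
print("normal point score:", clf.score_samples(X_normal[:1]))
print("anomalous point score:", clf.score_samples(X_anomaly))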
- Remove missing or duplicate values.
- Normalize and scale features.
- Visualize your dataset with scatter and box plots to spot trends and outliers.
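The first two steps in pandas might look like this (a minimal sketch; the DataFrame and its values are made up):

import pandas as pd

# Hypothetical raw data with one duplicate row and one missing value
df = pd.DataFrame({
    "feature1": [1.0, 2.0, 2.0, None],
    "feature2": [10.0, 20.0, 20.0, 40.0],
})

df = df.drop_duplicates()  # remove duplicate rows
df = df.dropna()           # remove rows with missing values
print(df)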
Scaling ensures that all features contribute equally to the analysis.
Use StandardScaler or MinMaxScaler from sklearn.preprocessing.
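For example (a short sketch on made-up numbers), StandardScaler rescales each feature to zero mean and unit variance, while MinMaxScaler maps each feature to the [0, 1] range:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

print(StandardScaler().fit_transform(X))  # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))    # each column mapped to [0, 1]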
Visual tools help interpret the structure of your data and spot anomalies.
import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be an existing pandas DataFrame with numeric columns feature1 and feature2
sns.scatterplot(x='feature1', y='feature2', data=df)
plt.title("Scatter Plot Example")
plt.show()

- For structured numeric data → use PCA
- For anomaly detection → use Isolation Forest
- For non-linear structures → try t-SNE or DBSCAN
- For spiral or irregular data → prefer DBSCAN over K-Means
| Method | Type | Use Case | Notes |
|---|---|---|---|
| PCA | Linear dimensionality reduction | Large, structured datasets | Fast, interpretable |
| t-SNE | Non-linear dimensionality reduction | Visualizing high-dimensional data | Better for clusters, slower |
| Isolation Forest | Unsupervised anomaly detection | Outlier / anomaly detection | Works well with high-dimensional data |
| DBSCAN | Density-based clustering | Detecting clusters of arbitrary shape | Does not require the number of clusters, handles noise |
| K-Means | Centroid-based clustering | Partitioning data into k clusters | Assumes spherical clusters, fast on large datasets |
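As a quick illustration of the t-SNE row above, here is a minimal sketch on the Iris data (the perplexity value is an arbitrary illustrative choice):

from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X = load_iris().data

# Non-linear embedding into 2-D; perplexity controls the neighborhood size
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
print(X_embedded.shape)  # (150, 2)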
- K-Means assumes circular (convex) clusters → not good for complex shapes.
- DBSCAN can detect arbitrary-shaped clusters (e.g., spirals) and noise, as the sketch below shows.
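A quick way to see the difference is scikit-learn's make_moons, a non-convex toy dataset (a minimal sketch; the eps and min_samples values are illustrative choices):

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# DBSCAN recovers the two crescents; K-Means cuts them with a straight boundary
print("K-Means ARI:", adjusted_rand_score(y, kmeans_labels))
print("DBSCAN ARI:", adjusted_rand_score(y, dbscan_labels))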
Here's a simple example of PCA using the Iris dataset:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Visualize
plt.figure(figsize=(8,6))
sns.scatterplot(x=X_pca[:,0], y=X_pca[:,1], hue=y, palette='viridis', s=100)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Projection of Iris Dataset")
plt.show()

📘 Full notebook: PCA_Analysis.ipynb
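To decide how many components to keep, it helps to inspect how much variance each one explains (a small follow-up to the example above, reusing the fitted pca object):

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
# For the standardized Iris data, the first two components together
# capture most of the variance
print(pca.explained_variance_ratio_.sum())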
Example using synthetic data:
from sklearn.ensemble import IsolationForest
import numpy as np
import matplotlib.pyplot as plt
# Generate synthetic data
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# Fit the model; contamination=0.1 tells it to expect ~10% of points to be outliers
clf = IsolationForest(contamination=0.1, random_state=rng)
clf.fit(X)
# Predict anomalies: -1 marks outliers, +1 marks inliers
y_pred = clf.predict(X)
# Visualization
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='coolwarm')
plt.title("Isolation Forest - Anomaly Detection")
plt.show()

📘 Full notebook: IsolationForest_Detection.ipynb
⚙️ Interactive Notebooks
| Notebook | Google Colab | Binder |
|---|---|---|
| PCA Analysis | | |
| Isolation Forest | | |
1. Abdi, H. & Williams, L. J. (2010). Principal Component Analysis. Wiley Interdisciplinary Reviews: Computational Statistics.
2. Castro, L. N. & Ferrari, D. G. (2016). Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações. Saraiva.
3. Dunteman, G. H. (1989). Principal Components Analysis. SAGE Publications.
4. Ferreira, A. C. P. L. et al. (2024). Inteligência Artificial: Uma Abordagem de Aprendizado de Máquina. 2nd ed. LTC.
5. Larson, R. & Farber, B. (2015). Estatística Aplicada. Pearson.
6. Liu, F. T., Ting, K. M. & Zhou, Z.-H. (2008). Isolation Forest. IEEE International Conference on Data Mining (ICDM).
My Contacts Hub
---
⬆ Back to Top
Copyright 2025 Quantum Software Development. Code released under the MIT License.