An end-to-end data engineering pipeline on Microsoft Azure leveraging the publicly available Netflix dataset. This project covers:
- Data Ingestion (Bronze)
- Data Processing & Cleaning (Silver)
- Data Quality & Delivery (Gold)
- Automation & Orchestration
Medallion Layers:
| Layer | Purpose |
|---|---|
| Bronze | Ingest raw data (Autoloader & ADF) into Delta format |
| Silver | Clean, dedupe, enrich; enforce schemas with PySpark |
| Gold | Apply Delta Live Tables for quality checks & aggregations |
**Sources**
- `Netflix_titles.csv` in ADLS Gen2 (`rawdata/Netflix_titles.csv`)
- Lookup tables (directors, cast, categories, countries) from GitHub
**Orchestration**
- Azure Data Factory pipelines using Copy Data, ForEach, Validation, and If Condition activities
- Parameterized datasets & pipelines for reusability
**Autoloader**
- Incremental ingest of new CSV files into `bronze.netflix_titles_delta` using Databricks Autoloader (see the sketch below)
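A minimal sketch of this Auto Loader ingest; the storage-account and checkpoint paths are placeholders, and only the target table `bronze.netflix_titles_delta` comes from the project itself:

```python
# Sketch: incrementally ingest new CSVs into the Bronze Delta table with Auto Loader.
# The storage-account and checkpoint paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks notebooks

raw_path = "abfss://rawdata@<storage_account>.dfs.core.windows.net/"
checkpoint = "abfss://bronze@<storage_account>.dfs.core.windows.net/_checkpoints/netflix_titles"

stream = (
    spark.readStream.format("cloudFiles")                 # Auto Loader source
         .option("cloudFiles.format", "csv")              # incoming files are CSV
         .option("cloudFiles.schemaLocation", checkpoint) # where the inferred schema is tracked
         .option("header", "true")
         .load(raw_path)
)

(
    stream.writeStream.format("delta")
          .option("checkpointLocation", checkpoint)       # exactly-once progress tracking
          .trigger(availableNow=True)                     # process pending files, then stop
          .toTable("bronze.netflix_titles_delta")
)
```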
**Storage**
- All raw ingestions stored as Delta tables in the `bronze/` container
**Compute**
- Azure Databricks PySpark notebooks
**Transformations**
- Split multi-valued columns (e.g., `rating`)
- Remove duplicates and filter out invalid records
- Fill null values
- Cast data types for analytics readiness
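An illustrative PySpark pass over these steps; the column names follow the public Netflix titles schema, while the fill values and split delimiter are assumptions:

```python
# Illustrative Silver-layer cleaning; column names follow the public Netflix
# titles schema, fill values and the split delimiter are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.table("bronze.netflix_titles_delta")

cleaned = (
    df.dropDuplicates(["show_id"])                            # remove duplicates
      .filter(F.col("show_id").isNotNull())                   # filter out invalid records
      .fillna({"rating": "Unrated", "country": "Unknown"})    # fill null values
      .withColumn("release_year", F.col("release_year").cast("int"))  # cast data types
      .withColumn("rating_parts", F.split(F.col("rating"), "-"))      # split multi-valued column
)

cleaned.write.format("delta").mode("overwrite").saveAsTable("silver.netflix_titles")
```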
**Orchestration**
- Databricks Workflows chaining parameterized notebooks
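In Workflows this chaining is configured as job tasks; inside a notebook, the same parameter-passing can be sketched with `dbutils.notebook.run` (the notebook paths here are hypothetical):

```python
# Hypothetical driver notebook chaining parameterized child notebooks in order.
# dbutils is available automatically inside Databricks notebooks.
steps = [
    ("./bronze_ingest", {"source_file": "rawdata/Netflix_titles.csv"}),
    ("./silver_transform", {"input_table": "bronze.netflix_titles_delta"}),
]

for path, params in steps:
    # dbutils.notebook.run(path, timeout_seconds, arguments) blocks until the
    # child notebook exits and returns its dbutils.notebook.exit() value.
    result = dbutils.notebook.run(path, 3600, params)
    print(f"{path} finished with: {result}")
```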
**Output**
- Cleaned Delta tables in the `silver/` container
**Framework**
- Delta Live Tables (DLT) for declarative pipelines
**Data Quality**
- Define expectations (e.g., `NOT NULL`, `UNIQUE`)
- Configure failure actions (e.g., drop records that violate a rule)
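A minimal DLT sketch of a Gold table gated by an expectation; the table name, rule, and aggregation are illustrative, not the project's exact definitions:

```python
# Illustrative Gold-layer DLT table; the table name, expectation, and
# aggregation below are assumptions, not the project's exact definitions.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Titles per country, gated by quality expectations")
@dlt.expect_or_drop("valid_country", "country IS NOT NULL")  # drop rows failing the rule
def gold_titles_by_country():
    return (
        dlt.read("silver_netflix_titles")   # upstream table within the same pipeline
           .groupBy("country")
           .agg(F.count("show_id").alias("title_count"))
    )
```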
Tech Stack:
| Component | Purpose |
|---|---|
| Azure Data Factory (ADF) | Data orchestration & ingestion |
| Azure Data Lake Storage | Scalable storage for Delta tables |
| Azure Databricks | Spark-based ETL & Delta Live Tables |
| Delta Lake | ACID-compliant, performant data format |
| Python / PySpark | Data transformation logic |



