ADF Pipeline: Save CSV to ADLS and Create Delta Table

Why ADF + Databricks?

Azure Data Factory (ADF) can connect to a wide variety of data sources across different vendors, both on-premises and in the cloud. This makes it an attractive option for enterprises that want flexible orchestration of ETL/ELT pipelines without being locked into a single vendor ecosystem. Databricks, built on Apache Spark, provides significant performance advantages for large-scale data processing through advanced parallelism. Its scalability can lead to cost efficiencies when handling ingestion and transformation of big data.

Overview

This Azure Data Factory pipeline integrates Azure Databricks with Azure Data Lake Storage Gen2 (ADLS Gen2) to provide a best-practice quick start for ingesting raw CSV files and transforming them into Delta tables.
The pipeline implements the Landing → Bronze phase of the Medallion architecture (Landing → Bronze → Silver → Gold), establishing a foundation for scalable and structured data processing.

image

The pipeline runs in two key stages:

  1. Source to Landing

    • Retrieves a CSV file from a Microsoft Learn GitHub repository.
    • Copies it into the Landing zone of ADLS Gen2 under a unique file name.
  2. Landing to Bronze

    • Dynamically creates a Bronze Delta table in Unity Catalog with tracking metadata columns (see the sketch after this list).
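
To make the Landing → Bronze step concrete, here is a rough PySpark sketch of what the Landing to Bronze notebook does. The paths, catalog/schema/table names, and metadata column names below are placeholders, not necessarily the ones used by the notebooks in this repository, and the snippet assumes a Databricks notebook context where `spark` is predefined.

```python
# Hypothetical sketch of the Landing -> Bronze logic (all names and paths are placeholders).
from pyspark.sql import functions as F

landing_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/00_landing/products/"
bronze_table = "<catalog>.<schema>.products_bronze"  # Unity Catalog three-level name

# Read the raw CSV(s) dropped into the Landing zone.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(landing_path)
)

# Add tracking metadata columns before writing to Bronze.
df = (
    df.withColumn("_ingested_at", F.current_timestamp())
      .withColumn("_source_file", F.input_file_name())
)

# Create (or append to) the Bronze Delta table in Unity Catalog.
df.write.format("delta").mode("append").saveAsTable(bronze_table)
```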

Setup Steps

Below are the steps to prepare your Azure environment so that the ADF pipeline + Databricks notebooks will work correctly.
They follow the pattern:

create ADLS → create Databricks & access connector → grant identities access → import notebooks → import ADF template → wire up connections


Prerequisites & Resource Setup

1. Create an ADLS Gen2 storage account

  • Enable Hierarchical Namespace when creating the account (this is what makes it ADLS Gen2).
  • Note the storage account name, resource group, etc. A scripted alternative to the portal is sketched below.
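
The portal is the simplest way to create the account; if you prefer to script it, here is a hedged sketch using the Azure SDK for Python (requires azure-identity and azure-mgmt-storage). The subscription, resource group, account name, and region are placeholders.

```python
# Hypothetical sketch: create an ADLS Gen2 account (Hierarchical Namespace enabled)
# with the Azure SDK for Python. All names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<subscription-id>"
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.storage_accounts.begin_create(
    resource_group_name="<resource-group>",
    account_name="<storageaccountname>",
    parameters=StorageAccountCreateParameters(
        location="<region>",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,  # Hierarchical Namespace = ADLS Gen2
    ),
)
account = poller.result()
print(account.name, account.primary_endpoints.dfs)
```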

2. Create folders in ADLS

Within the storage account’s container/file system, create the following folders (one way to script this is sketched after the list):

  • 00_landing/products/
  • 01_bronze/products/
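
These folders can be created with the Azure Portal or Storage Explorer; if you would rather create them from a Databricks notebook once storage access is configured, a minimal dbutils sketch is shown below. The container and storage account names are placeholders.

```python
# Hypothetical sketch: create the Landing and Bronze folders from a Databricks notebook.
# Assumes the workspace already has access to the storage account (e.g. via the
# Access Connector / an external location). Names are placeholders.
base = "abfss://<container>@<storage-account>.dfs.core.windows.net"

for folder in ("00_landing/products", "01_bronze/products"):
    dbutils.fs.mkdirs(f"{base}/{folder}")
```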

3. Create an Azure Databricks workspace

  • In Azure Portal (or via CLI/ARM).

4. Create a Databricks Access Connector

  • In the Azure Portal, create an Access Connector for Azure Databricks; its Managed Identity is what you grant ADLS access to in the next section.

Permissions / Identity Granting

1. Grant the Access Connector Managed Identity access to ADLS

  • In the Azure Storage account → Access control (IAM) → Add role assignment.
  • Assign Storage Blob Data Contributor as a minimum (see the reference below for recommended additional access). A scripted alternative is sketched after this list.
  • Reference: Access control model for MI
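
The portal steps above are the straightforward route; for completeness, here is a hypothetical scripted version using the Azure SDK for Python (azure-identity and azure-mgmt-authorization). The scope, principal object ID, and subscription are placeholders; the GUID is the built-in Storage Blob Data Contributor role definition.

```python
# Hypothetical sketch: assign Storage Blob Data Contributor to the Access Connector's
# managed identity on the storage account. IDs and names are placeholders.
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"
scope = (
    f"/subscriptions/{subscription_id}"
    "/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
)
# Built-in role definition ID for Storage Blob Data Contributor.
role_definition_id = (
    f"/subscriptions/{subscription_id}"
    "/providers/Microsoft.Authorization/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # role assignment name must be a new GUID
    RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id="<access-connector-managed-identity-object-id>",
        principal_type="ServicePrincipal",
    ),
)
```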

2. Enable ADF’s Managed Identity and grant access

  • Turn on System-Assigned Managed Identity in the ADF resource.
  • Grant permissions on ADLS (read/write as needed).
  • Grant permissions on the Databricks workspace.

3. Ensure Databricks has correct permissions

  • The Access Connector / Service Principal / Managed Identity should have rights to:
    • ADLS (via the role assignment above)
    • Unity Catalog (catalog/schema access and table creation; example grants are sketched after this list)
    • Compute (to run jobs)
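
On the Unity Catalog side, grants along the following lines are typically what the table-creating identity needs. The catalog, schema, external location, and principal names are placeholders, and your exact privilege set may differ; this is meant to run from a Databricks notebook where `spark` is predefined.

```python
# Hypothetical sketch: Unity Catalog grants for the identity that creates the Bronze table.
# Catalog, schema, external location, and principal names are placeholders.
principal = "<service-principal-or-group>"

for stmt in (
    f"GRANT USE CATALOG ON CATALOG <catalog> TO `{principal}`",
    f"GRANT USE SCHEMA, CREATE TABLE ON SCHEMA <catalog>.<schema> TO `{principal}`",
    f"GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION <external_location> TO `{principal}`",
):
    spark.sql(stmt)
```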

Import & Wire Up

1. Import Notebooks into Databricks

  • Download and import the notebooks (Process Data, Landing to Bronze) from this repository into your Databricks workspace.
  • Note the full notebook path (exactly as shown in Databricks); you will reference it in ADF.
  • Run each notebook manually in Databricks to confirm it executes end-to-end.

2. Import ADF Pipeline Template

  • In Azure Data Factory, go to Manage → Templates → Import template / Gallery to import the pipeline.

3. Configure Linked Services & Connections in ADF

  • Databricks Linked Service
    • Create a Linked Service to your Databricks workspace.
    • Authentication options: Databricks access token or Managed Identity.
    • Ensure the linked service points to the correct workspace URL and has the chosen auth method configured.

4. Set Pipeline Parameters & Map to Databricks Notebook Activity

  • Define pipeline parameters:
image
  • Under the Settings of each Notebook activity, reference the Databricks notebook by its full path (or use the ADF UI to browse and select it from the saved location).

  • Confirm that the pipeline parameter names map correctly to the notebook base parameters (a notebook-side sketch follows the screenshot below).

    image
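
On the notebook side, ADF base parameters arrive as Databricks widgets, so the mapping can be sanity-checked with a snippet like the one below. The parameter names here are illustrative, not necessarily the exact ones this pipeline defines.

```python
# Hypothetical sketch: read ADF base parameters inside the Databricks notebook.
# The widget names must match the base parameter names set on the Notebook activity;
# the names below are placeholders.
dbutils.widgets.text("landing_path", "")
dbutils.widgets.text("file_name", "")

landing_path = dbutils.widgets.get("landing_path")
file_name = dbutils.widgets.get("file_name")
print(f"Processing {file_name} from {landing_path}")
```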

5. Validate & Run ADF Pipeline

  • Validate the pipeline, then run it.
    image

Success

You now have a scalable ADF pipeline that can:

  • Save CSVs to ADLS Gen2 (Landing).
  • Create a Bronze Delta table for further processing.
