Azure Data Factory (ADF) can connect to a wide variety of data sources across different vendors, both on-premises and in the cloud. This makes it an attractive option for enterprises that want flexible orchestration of ETL/ELT pipelines without being locked into a single vendor ecosystem. Databricks, built on Apache Spark, provides significant performance advantages for large-scale data processing through advanced parallelism. Its scalability can lead to cost efficiencies when handling ingestion and transformation of big data.
This Azure Data Factory pipeline integrates Azure Databricks with Azure Data Lake Storage Gen2 (ADLS Gen2) to provide a best-practice quick start for ingesting raw CSV files and transforming them into Delta tables.
The pipeline implements the Landing → Bronze phase of the Medallion architecture (Landing → Bronze → Silver → Gold), establishing a foundation for scalable and structured data processing.
The pipeline consists of two key stages:

- Source to Landing
  - Retrieves a CSV file from a Microsoft Learn GitHub repository.
  - Copies it into the Landing zone of ADLS Gen2 with a unique file name.
- Landing to Bronze
  - Dynamically creates a Bronze Delta table in Unity Catalog with tracking metadata columns (a notebook sketch follows this list).
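To make the Landing → Bronze stage concrete, below is a minimal PySpark sketch of the kind of logic such a notebook performs; it is not the repository's actual notebook code. The path, table name, and metadata column names are illustrative assumptions, and `spark` is the session predefined in a Databricks notebook.

```python
from pyspark.sql import functions as F

# Illustrative locations; replace with your storage account, container, catalog, and schema.
landing_path = "abfss://<container>@<storage_account>.dfs.core.windows.net/00_landing/products/"
bronze_table = "main.bronze.products"  # hypothetical catalog.schema.table

# Read the raw CSV files dropped into the Landing zone.
raw = (
    spark.read
    .option("header", "true")
    .csv(landing_path)
)

# Add tracking metadata columns: ingestion timestamp and source file path
# (the hidden _metadata column is available for file-based sources on recent Databricks runtimes).
bronze = (
    raw
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.col("_metadata.file_path"))
)

# Create the Bronze Delta table in Unity Catalog (created on first run, appended to afterwards).
(
    bronze.write
    .format("delta")
    .mode("append")
    .saveAsTable(bronze_table)
)
```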
Below are the steps to prepare your Azure environment so that the ADF pipeline + Databricks notebooks will work correctly.
They follow the pattern:
create ADLS → create Databricks & access connector → grant identities access → import notebooks → import ADF template → wire up connections
Create an ADLS Gen2 storage account:
- Enable Hierarchical Namespace when creating it.
- Note the Storage Account name, Resource Group, etc.
Within the storage account’s container/file system, create the following directories (a scripted alternative is sketched below):
- 00_landing/products/
- 01_bronze/products/
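The directories can also be created with the Azure SDK for Python instead of the Portal or Storage Explorer. A minimal sketch, assuming the container already exists, placeholder account and container names, and an identity that already has data-plane access (e.g. Storage Blob Data Contributor) on the account:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_name = "<storage_account>"   # placeholder
container_name = "<container>"       # placeholder; must already exist

# Authenticate with whatever identity DefaultAzureCredential resolves
# (Azure CLI login, managed identity, environment variables, ...).
service = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client(container_name)

# Create the Landing and Bronze zone directories.
for path in ("00_landing/products", "01_bronze/products"):
    fs.create_directory(path)
```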
Create the Azure Databricks workspace and the Access Connector for Azure Databricks:
- In the Azure Portal (or via CLI/ARM).
- The Access Connector provides a managed identity used for secure access to storage.
- Reference: Access Storage with Managed Identities
Grant the managed identity access to the storage account:
- In the Azure Storage account → IAM → Add a role assignment.
- Assign Storage Blob Data Contributor as a minimum (see the reference below for recommended additional access).
- Reference: Access control model for MI
Configure the Data Factory's managed identity:
- Turn on the System-Assigned Managed Identity in the ADF resource.
- Grant it permissions on ADLS (read/write as needed).
- Grant it permissions on the Databricks workspace.
- The Access Connector / Service Principal / Managed Identity should have rights to:
  - ADLS
  - Unity Catalog (catalog permissions; see the sketch after this list)
  - Compute (to run jobs)
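If you prefer to manage the Unity Catalog permissions in code, the grants can be issued from a Databricks notebook or SQL warehouse. A minimal sketch with a hypothetical `main` catalog, `main.bronze` schema, and a placeholder principal; run it as a user allowed to manage grants and substitute your real securables and principal:

```python
# Hypothetical catalog/schema names and a placeholder principal; adjust to your environment.
catalog = "main"
schema = "main.bronze"
principal = "<service-principal-application-id-or-group-name>"

# Minimum grants for creating Bronze tables in the target schema.
for stmt in (
    f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
    f"GRANT USE SCHEMA ON SCHEMA {schema} TO `{principal}`",
    f"GRANT CREATE TABLE ON SCHEMA {schema} TO `{principal}`",
):
    spark.sql(stmt)  # spark is predefined in a Databricks notebook
```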
- Download and import the notebooks (Process Data, Landing to Bronze) from this repository into your Databricks workspace.
- Verify:
  - The full notebook path (exactly as shown in Databricks); you will reference this in ADF.
  - That each notebook runs end-to-end when executed manually in Databricks.
- In Azure Data Factory:
- Go to Manage → Templates → Import template / Gallery to import the pipeline.
- Databricks Linked Service
- Create a Linked Service to your Databricks workspace.
- Authentication options: Databricks access token or Managed Identity.
- Ensure the linked service points to the correct workspace URL and has the chosen auth method configured.
- Define pipeline parameters.
- Under the Settings of each Notebook activity, reference the notebooks from Databricks using the full path (or use the ADF UI to browse and select from the saved location).
- Confirm that pipeline parameter names map correctly to the notebooks' base parameters (see the notebook-side sketch below).
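On the notebook side, the base parameters passed by the ADF Notebook activity arrive as Databricks widgets. A minimal sketch with hypothetical parameter names (`landing_path`, `bronze_table`) that may not match the repository's notebooks; the widget names must match the names configured under the activity's base parameters:

```python
# Declare the widgets (with empty defaults) and read the values supplied by ADF.
dbutils.widgets.text("landing_path", "")
dbutils.widgets.text("bronze_table", "")

landing_path = dbutils.widgets.get("landing_path")
bronze_table = dbutils.widgets.get("bronze_table")

print(f"Landing path: {landing_path}")
print(f"Bronze table: {bronze_table}")
```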
You now have a scalable ADF pipeline that can:
- Save CSVs to ADLS Gen2 (Landing).
- Create a Bronze Delta table for further processing.
