Designing Incremental ELT Pipelines with Delta Lake and Azure Data Factory: A Practical Walkthrough
When the data lake starts to look more like a data swamp, it’s time to rethink how you move data. Incremental ELT (Extract‑Load‑Transform) lets you pull only the changes, keep costs low, and avoid long‑running jobs that tie up your cloud budget. In this post I’ll walk you through a simple, production‑ready pattern that uses Delta Lake and Azure Data Factory (ADF). It’s the same approach I use on the Data Pipeline Chronicles team, and it works whether you’re on a small startup or a big enterprise.
Why Incremental Matters Right Now
Most modern data platforms are built on cheap object storage—Azure Blob, ADLS Gen2, you name it. The storage itself is cheap, but compute isn’t. If you keep re‑processing the whole source every day, you’ll see your bill climb faster than a trending hashtag. Incremental pipelines solve that by:
- Reading only new or changed rows.
- Reducing the amount of data that has to be shuffled.
- Allowing you to schedule more frequent runs without breaking the bank.
The Core Idea: Delta Lake as a Change Tracker
Delta Lake adds a transaction log on top of your parquet files. That log records every add, delete, or update, and it gives you a simple way to ask “what changed since X?”. In ADF you can use a Copy activity to read from the source, then a Data Flow or Mapping Data Flow to write into a Delta table. The key is to use the timestamp or version column that Delta maintains.
Step 1: Set Up Your Delta Table
- Create a folder in ADLS Gen2, e.g.
raw/sales/. - In a Databricks notebook (or Synapse Spark pool) run:
from delta.tables import *
spark.sql("""
CREATE TABLE IF NOT EXISTS sales_delta
USING DELTA
LOCATION 'abfss://[email protected]/raw/sales/'
""")
That command creates an empty Delta table that will store your sales data. The first time you load data, you’ll do a full load; after that, you’ll only pull the delta.
Step 2: Capture the High‑Water Mark
A “high‑water mark” is just the latest timestamp you have processed. Store it in a small Azure SQL Database or even a JSON file in the same storage account. For example, after each run you could write:
{ "last_processed": "2024-06-14T23:59:59Z" }
When the pipeline starts, read that value and pass it to the source query.
Building the ADF Pipeline
Below is the skeleton of the pipeline. I keep it short on purpose—feel free to add logging or error handling as needed.
H2: 1️⃣ Create a Parameter for the Watermark
In ADF, add a pipeline parameter called watermark. Set its default to a far‑past date (1970-01-01). At the end of each run you’ll update the stored value, then the next run will read it back.
H2: 2️⃣ Source – Incremental Query
If your source is Azure SQL Database, the query looks like:
SELECT *
FROM dbo.Sales
WHERE ModifiedDate > @watermark
ADF lets you map the @watermark parameter directly into the query. For other sources (e.g., Snowflake, PostgreSQL) the same idea applies—just use the appropriate syntax.
H2: 3️⃣ Copy Activity – Load into Staging
Copy the result set into a staging folder, say staging/sales/. Keep the data in parquet format; it’s cheap and fast to read later.
H2: 4️⃣ Data Flow – Merge into Delta
Add a Mapping Data Flow with a Delta Lake sink. Use the MERGE operation, matching on your primary key (e.g., SaleID). The flow will:
- Insert rows that don’t exist.
- Update rows where the source
ModifiedDateis newer. - Optionally delete rows if you have a soft‑delete flag.
The Delta sink automatically updates the transaction log, so you always have a reliable history.
H2: 5️⃣ Update the Watermark
After the merge succeeds, add a Stored Procedure activity (or a simple Azure Function) that writes the new max ModifiedDate back to your watermark store. You can compute the max value inside the Data Flow and pass it out as a pipeline variable.
A Quick Personal Tale
The first time I tried an incremental load, I forgot to update the watermark. The pipeline kept pulling the same 10,000 rows over and over. My Azure bill spiked, and I spent an afternoon chasing a “duplicate data” alert. The lesson? Always make the watermark part of the pipeline’s success path. Once I added the update step, the runs dropped from 30 minutes to under 5, and the cost went down dramatically. That’s the kind of win I love to share on Data Pipeline Chronicles.
Tips for a Smooth Run
- Use a reliable timestamp column. If your source doesn’t have a
ModifiedDate, consider adding a CDC (Change Data Capture) feature or using a row‑version column. - Keep the Delta table small enough for quick scans. Partition by date if you expect many years of data.
- Validate row counts. After each merge, compare the source row count with the Delta table’s new rows. A simple
assertcan catch silent failures early. - Monitor the transaction log size. Over time it can grow large; run
VACUUMin Databricks to clean up old files.
Scaling the Pattern
If you need to handle dozens of tables, you can turn the whole thing into a template pipeline. Use ADF’s parameterized pipelines and feed a JSON list of tables, each with its own source query and target Delta path. The same logic applies, and you get a repeatable, auditable process.
Wrapping Up
Incremental ELT with Delta Lake and Azure Data Factory gives you a clean, cost‑effective way to keep your data lake fresh without re‑processing everything. The key steps are:
- Create a Delta table that will hold the data.
- Capture a high‑water mark after each run.
- Use that mark in your source query to pull only changes.
- Merge the changes into Delta with a Data Flow.
- Update the watermark for the next run.
Give it a try on a small table first, watch the run time shrink, and then roll it out across your data domain. Your future self (and your finance team) will thank you.