Data Pipelines and Orchestration for Data Engineers (2026)
In short
Pipeline orchestration in 2026 is a tiered choice, not a religious one. Apache Airflow remains the dominant scheduler at Fortune 500 and SaaS-tier shops because of its plugin ecosystem, the managed offerings (MWAA, Cloud Composer, Astronomer), and a decade of operational tribal knowledge. Dagster is the strongest challenger when the team treats data as software-defined assets and wants stronger typing, native IO managers, and asset lineage out of the box. Prefect wins on developer ergonomics and dynamic-DAG flexibility for ML and event-driven workloads. dbt has quietly become the de-facto in-warehouse orchestrator for SQL transformations — most modern stacks run an EL loader (Airbyte, Fivetran, or a custom Singer tap) into the warehouse, then dbt for transformation, then a thin Airflow / Dagster wrapper to schedule the whole graph. The senior bar in 2026: idempotent pipelines, deterministic backfills, retry semantics that distinguish transient from permanent failures, and a clear ELT-not-ETL bias for warehouse-resident data.
Key takeaways
- Apache Airflow is still the default orchestrator at most enterprise data shops in 2026 — not because it's the best on every axis, but because the operator ecosystem, the managed offerings (AWS MWAA, Google Cloud Composer, Astronomer), and the operational know-how dwarf every alternative. The official docs (airflow.apache.org/docs) are canonical.
- Dagster is the principled challenger. Its software-defined-assets model treats every table, file, and model as a typed asset with lineage; the framework owns the materialization graph instead of treating it as a side effect of task execution. Dagster wins where teams want strong typing, asset-level observability, and native IO managers.
- Prefect wins on dynamic DAGs and developer ergonomics. Prefect 2.x / 3.x flows are plain Python; the orchestrator infers the DAG from execution rather than requiring a pre-declared graph. This is a strong fit for ML pipelines, event-driven workloads, and teams that don't want a heavy scheduler.
- dbt has become the de-facto in-warehouse orchestrator for SQL transformations. The modern data stack pattern: an EL tool (Fivetran, Airbyte, custom tap) lands raw data, dbt transforms it inside the warehouse, and a thin Airflow / Dagster job schedules the dbt run. The dbt blog (getdbt.com/blog) covers the orchestration patterns.
- ETL is dead for warehouse-resident analytics; ELT is the modern pattern. Land the raw data in the warehouse first, then transform inside the warehouse using SQL (dbt) where the compute is cheap and parallel. ETL still applies where the source can't tolerate raw extracts (PII redaction at source) or the destination isn't a warehouse.
- Idempotent pipelines are the senior-DE bar. Every run with the same inputs must produce the same output regardless of how many times it ran or what state the destination was in beforehand. The pattern: deterministic partitions, MERGE / INSERT-ON-CONFLICT semantics, and stateless task logic.
- Backfill strategy is more important than first-build correctness. Production pipelines fail; partitions need to be replayed. Senior data engineers design pipelines so that backfilling a single day, a single hour, or a six-month window is a one-command operation that produces identical output to the original run.
Airflow remains dominant — but here's when to consider alternatives
Apache Airflow has been the dominant data-orchestration tool since roughly 2018 and remains so in 2026. The reasons are operational, not aesthetic: the provider / operator ecosystem covers every major data system (Snowflake, BigQuery, Databricks, S3, dbt, Spark, Postgres, Kafka, every cloud provider); the managed offerings (AWS MWAA, Google Cloud Composer, Astronomer's hosted Airflow) take the operational burden off the data team; and a decade of accumulated tribal knowledge means most senior data engineers can debug an Airflow problem from memory. Maxime Beauchemin, Airflow's creator, has written extensively about the problem space (maximebeauchemin.medium.com) and his "functional data engineering" essay is still the canonical case for idempotent, partition-driven pipelines.
The modern Airflow pattern is the TaskFlow API (Airflow 2.0+), which lets you write DAGs as decorated Python functions. The TaskFlow style eliminates most of the boilerplate of the older PythonOperator / XCom pattern:
from datetime import datetime, timedelta
from airflow.decorators import dag, task

@dag(
    dag_id="user_events_daily",
    start_date=datetime(2026, 1, 1),
    schedule="0 2 * * *",  # 02:00 UTC daily
    catchup=False,
    max_active_runs=1,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
    tags=["analytics", "user_events"],
)
def user_events_daily():
    @task
    def extract(logical_date: str) -> str:
        # Idempotent extract — partition path is deterministic.
        return f"s3://raw/user_events/dt={logical_date}/"

    @task
    def load(s3_uri: str) -> int:
        # MERGE INTO warehouse; returns row count.
        return run_warehouse_merge(s3_uri)

    @task
    def validate(row_count: int) -> None:
        if row_count == 0:
            raise ValueError("Zero rows landed — upstream broken")

    s3_uri = extract("{{ ds }}")
    rows = load(s3_uri)
    validate(rows)

user_events_daily()
What this DAG gets right: the schedule is a plain cron string; catchup=False stops Airflow from retroactively backfilling on deploy; max_active_runs=1 stops two concurrent runs from racing on the same partition; retries=3 with a 5-minute delay handles transient network errors; {{ ds }} templates the logical date so the partition path is deterministic. The Airflow docs (airflow.apache.org/docs/apache-airflow/stable/index.html) are canonical.
When to consider an alternative:
- Dagster when the team wants software-defined assets, asset lineage as a first-class concept, strong typing on inputs / outputs, and IO managers that decouple compute from storage. Dagster's blog (dagster.io/blog) covers the asset-first model in depth.
- Prefect when DAGs are dynamic (number of tasks unknown until runtime), workloads are event-driven (not cron-driven), or developer ergonomics matter more than the operator ecosystem. Prefect 3.x flows are plain Python without DAG declaration overhead; a short sketch follows this list.
- Kestra, Mage, Argo Workflows for niche use cases (Kestra for declarative-YAML-friendly teams, Mage for notebook-flavored workflows, Argo for Kubernetes-native compute graphs). None of these have crossed Airflow's share at general data-engineering shops in 2026.
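To make Prefect's dynamic-DAG point concrete, here is a minimal sketch (assuming Prefect 3.x's flow and task decorators; list_partitions() is a hypothetical discovery helper). The graph is whatever the Python executes at runtime, so the number of tasks doesn't need to be known when the flow is defined:
from prefect import flow, task

@task(retries=3, retry_delay_seconds=300)
def load_partition(partition: str) -> int:
    # Placeholder body: load one partition idempotently and return a row count.
    print(f"loading {partition}")
    return 0

def list_partitions() -> list[str]:
    # Hypothetical discovery step, e.g. listing new prefixes under s3://raw/.
    return ["dt=2026-04-28", "dt=2026-04-29"]

@flow(name="user-events-dynamic")
def user_events_dynamic() -> None:
    # The task graph is built by execution: one load task per discovered partition.
    for partition in list_partitions():
        load_partition(partition)

if __name__ == "__main__":
    user_events_dynamic()
Contrast this with the Airflow DAG above, which declares its schedule and structure up front and relies on the scheduler to expand it.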
The senior bar: you can articulate why your team is on Airflow (or off it), and the answer is operational and ecosystem-driven, not "the docs looked nice."
ETL vs ELT in 2026
The ETL-vs-ELT distinction is one of the few terminology debates in data engineering that genuinely matters. The order of the letters reflects where the transformation happens — and that choice cascades through your stack architecture, cost model, and debugging story.
ETL (Extract, Transform, Load) — the legacy pattern. A pipeline tool (Informatica, Talend, custom Python) reads from the source, transforms in-flight (often on a dedicated transformation server), then loads the result into the destination. The transform happens before the data lands.
ELT (Extract, Load, Transform) — the modern pattern. The pipeline lands raw or near-raw data into the warehouse first; transformation happens inside the warehouse using SQL (typically dbt). The transform is a downstream concern, not part of the loader.
Why ELT won for warehouse-resident analytics:
- Warehouse compute became cheap and parallel. Snowflake, BigQuery, Databricks SQL, Redshift, and ClickHouse all execute SQL transformations at massive parallelism. The economics flipped: it's cheaper to land raw data and let the warehouse transform than to maintain a dedicated ETL fleet.
- Raw data is a debugging asset. When a downstream metric looks wrong, the data engineer needs the original raw rows to triage. ELT preserves them; ETL discards them.
- SQL is the universal transformation language. Every analyst, analytics engineer, and data scientist on the team can read and contribute to SQL transformations. ETL pipelines written in proprietary tools or one-off Python are illegible to most of the team.
- dbt + Snowflake / BigQuery / Databricks emerged as a stack. The combination of dbt for transformation and a modern columnar warehouse for compute became the de-facto modern data stack. The dbt blog (getdbt.com/blog) and Joe Reis's Fundamentals of Data Engineering (O'Reilly) cover this stack and its trade-offs in depth.
When ETL still applies:
- PII redaction at source. When the destination team is not authorized to see raw PII, transformation must happen before landing. The redaction layer is part of the pipeline, not a downstream model.
- Streaming ingestion with tight transformation requirements. Kafka Streams, Flink, and ksqlDB transform in-flight because the destination is a real-time consumer, not a warehouse. Confluent's blog (confluent.io/blog) covers CDC and streaming transformation patterns extensively.
- Compute-bound transformations the warehouse can't handle. ML feature engineering with custom Python, image / video processing, large-scale graph algorithms — none of these run well in SQL, so the transformation happens in Spark / Beam / Ray before the warehouse load.
- Operational data stores (not warehouses). If the destination is Postgres / MySQL serving an operational app, ELT in the destination is rarely an option. The transform must happen upstream.
The connector layer is its own decision. Fivetran dominates the managed-connector market for SaaS-source ingestion (Salesforce, HubSpot, Stripe, Zendesk); the trade-off is per-MAR pricing that scales unpredictably. Airbyte is the open-source / self-hosted alternative; the trade-off is operational burden. Custom Singer / Meltano taps are the build-your-own option for sources without managed coverage. The 2026 default at Fortune 500 / SaaS-tier shops: Fivetran for SaaS sources where the spend is justified, Airbyte for everything else, Debezium for CDC from operational databases.
Idempotency + backfill: the senior-DE bar
The single biggest signal of a senior+ data engineer is that their pipelines are idempotent. The pipeline can run once, twice, or fifty times on the same logical partition and produce identical results. This is not a nice-to-have; it is the precondition for backfill, retry, disaster-recovery, and most operational sanity.
An idempotent pipeline has three properties:
- Deterministic partitioning. Every input row belongs to a stable partition (commonly a date / hour). The pipeline operates on one partition at a time, never on "everything since last run."
- Stateless task logic. Tasks read inputs by partition key and write outputs by partition key. They do not maintain across-run state in local files, in-memory variables, or task-instance side channels.
- Replace-or-merge writes. Loads either fully replace the destination partition or use MERGE / INSERT-ON-CONFLICT semantics that produce the same result regardless of pre-existing state.
The canonical idempotent INSERT pattern in a Postgres / Redshift / Snowflake-flavored warehouse — note the deduplication via row_number on a stable business key:
-- Idempotent merge of a single daily partition.
-- Safe to run any number of times; result depends only on inputs.
BEGIN;

-- 1. Delete the target partition (idempotent — works whether or not it exists).
DELETE FROM analytics.user_events
WHERE event_date = DATE '2026-04-29';

-- 2. Insert deduplicated rows from the raw landing table.
--    Dedup on (user_id, event_id) keeping the latest by ingest_ts.
INSERT INTO analytics.user_events
    (event_date, user_id, event_id, event_type, event_payload, ingest_ts)
SELECT event_date, user_id, event_id, event_type, event_payload, ingest_ts
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, event_id
               ORDER BY ingest_ts DESC
           ) AS rn
    FROM raw.user_events_landing
    WHERE event_date = DATE '2026-04-29'
) ranked
WHERE rn = 1;

COMMIT;
What this pattern gets right: the entire operation is wrapped in a transaction so it commits atomically; the DELETE-then-INSERT replaces the partition wholesale (idempotent); ROW_NUMBER deduplicates on a stable business key with a deterministic tiebreaker (ingest_ts); the partition column (event_date) is the only filter — no "all rows since X" semantics that depend on prior state.
The pattern this replaces — the broken anti-pattern: INSERT INTO analytics.user_events SELECT * FROM raw.user_events_landing WHERE ingest_ts > (SELECT MAX(ingest_ts) FROM analytics.user_events). This anti-pattern reads existing destination state to decide what to write, so two concurrent runs race, replays produce different results, and a run that fails partway through leaves the destination in a partial state that is hard to recover.
Backfill is the operational consequence of idempotency. A backfill is just running the pipeline for an older partition. Senior data engineers design pipelines so that backfilling one day, one hour, or six months is a single command:
- Airflow: airflow dags backfill --start-date 2026-04-01 --end-date 2026-04-29 user_events_daily
- Dagster: backfill a partition set from the UI or via the GraphQL API; Dagster's asset model makes partition-aware backfill native.
- dbt: dbt run --select tag:daily --vars '{"start_date": "2026-04-01", "end_date": "2026-04-29"}' against incremental models with date-aware SQL.
Retry semantics distinguish transient from permanent failures. Senior+ pipelines retry only on transient failures (network blips, rate limits, transient warehouse errors) and fail fast on permanent ones (schema mismatch, auth failure, malformed data). A small retry cap with exponential backoff (Airflow's retries plus retry_exponential_backoff) is the right starting point; the wrong move is "retries=10" on every task without distinguishing failure modes.
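A minimal sketch of that distinction in Airflow's TaskFlow style, against a hypothetical HTTP source: responses that are plausibly transient (429s, 5xx) consume the retry budget with backoff, while permanent 4xx failures raise AirflowFailException, which fails the task without retrying:
from datetime import timedelta

import requests
from airflow.decorators import task
from airflow.exceptions import AirflowFailException

@task(retries=4, retry_delay=timedelta(minutes=2), retry_exponential_backoff=True)
def fetch_source_page(url: str) -> dict:
    resp = requests.get(url, timeout=30)
    if 400 <= resp.status_code < 500 and resp.status_code != 429:
        # Permanent: auth failures, bad requests, missing endpoints.
        # AirflowFailException fails the task without consuming retries.
        raise AirflowFailException(f"permanent HTTP error {resp.status_code} from {url}")
    # Transient: 429s and 5xx raise HTTPError here, so Airflow retries with backoff.
    resp.raise_for_status()
    return resp.json()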
dbt-as-orchestrator pattern
One of the quieter shifts in the modern data stack is that dbt has become the in-warehouse orchestrator. The pattern: a thin scheduler (Airflow, Dagster, dbt Cloud) triggers a dbt run; dbt itself owns the dependency graph, the materialization order, and the test execution for everything inside the warehouse. The external orchestrator is a thin wrapper around dbt, not a parallel DAG of SQL templates.
The reasons this pattern won:
- dbt's ref() macro builds the DAG automatically. Every {{ ref('model_name') }} creates a dependency edge; dbt computes the topological order at compile time. You don't maintain a separate DAG file.
- Tests, docs, and lineage live alongside the SQL. dbt tests (not_null, unique, custom data tests) run against the models dbt just materialized, either in the same dbt build or in a dbt test immediately after dbt run. A separate orchestrator can't do this without re-implementing dbt's graph.
- Selectors enable fine-grained partial runs. dbt run --select state:modified+ tag:nightly runs only changed models and their downstream graph; --exclude skips known-broken parts. This is the operational primitive for incremental development.
- on-run-end / on-run-start hooks handle warehouse-level concerns. Grants, vacuum, materialized-view refreshes, and audit-log writes run as hooks tied to the dbt run, not as a separate Airflow task.
The canonical pattern — a dbt run with selectors and on-run-end hooks:
# Run only the modified models and their downstream graph (state-aware).
dbt run \
--select state:modified+ tag:nightly \
--exclude tag:experimental \
--vars '{"start_date": "2026-04-29"}' \
--target prod
# In dbt_project.yml, on-run-end hooks run at the end of the dbt invocation.
# on-run-end:
# - "GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO ROLE bi_reader"
# - "{{ audit_log_run_complete() }}"
# Run tests separately so failures are visible in CI without re-materializing.
dbt test --select state:modified+ --target prod
The external orchestrator (Airflow / Dagster) typically wraps this with a single task per dbt project (or per major dbt subgraph), plus tasks for the upstream loaders (Fivetran sync, Airbyte sync) and downstream consumers (Reverse-ETL push, dashboard refresh, ML feature export). The orchestrator does not maintain its own task-per-SQL-model DAG — that would duplicate dbt's graph and create drift.
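A sketch of that thin wrapper in Airflow (the trigger_fivetran_sync and trigger_reverse_etl helpers and the dbt project path are hypothetical stand-ins): one task per logical step, and a single shell invocation for the entire dbt project so dbt keeps ownership of the model graph:
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator

def trigger_fivetran_sync(connector_id: str) -> None:
    # Hypothetical stand-in for the EL vendor's "trigger sync" API call.
    ...

def trigger_reverse_etl(sync_name: str) -> None:
    # Hypothetical stand-in for a Hightouch / Census sync trigger.
    ...

@dag(schedule="0 3 * * *", start_date=datetime(2026, 1, 1), catchup=False)
def nightly_analytics():
    @task
    def el_sync() -> None:
        trigger_fivetran_sync(connector_id="salesforce")

    @task
    def reverse_etl_push() -> None:
        trigger_reverse_etl(sync_name="salesforce_accounts")

    dbt_build = BashOperator(
        task_id="dbt_build",
        # One task for the whole dbt project; dbt owns the model-level graph.
        bash_command="cd /opt/dbt/analytics && dbt build --select tag:nightly --target prod",
    )

    el_sync() >> dbt_build >> reverse_etl_push()

nightly_analytics()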
The 2026 idiomatic stack:
- EL layer: Fivetran or Airbyte lands raw data on a schedule.
- Warehouse: Snowflake / BigQuery / Databricks / Redshift holds raw + staging + analytics schemas.
- dbt: owns staging → intermediate → marts transformation, plus tests and docs.
- Airflow / Dagster: schedules the EL sync, then the dbt run, then any downstream consumers. One DAG per logical pipeline, one task per logical step.
- Reverse-ETL (Hightouch, Census): syncs warehouse marts back to operational tools (Salesforce, HubSpot, Marketo).
The senior+ bar: you can describe this stack from memory, you understand which tool owns which slice of the orchestration graph, and you don't reach for the external orchestrator to duplicate work dbt already does. The dbt blog (getdbt.com/blog) and Reis & Housley's Fundamentals of Data Engineering (O'Reilly) cover this layered architecture in depth.
Frequently asked questions
- Is Airflow still worth learning in 2026 given Dagster and Prefect?
- Yes — Airflow remains the dominant orchestrator at most enterprise data shops, and the operator ecosystem, managed offerings (MWAA, Cloud Composer, Astronomer), and tribal knowledge dwarf every alternative. Senior data engineers should be Airflow-fluent first, Dagster / Prefect second. The right learning order: Airflow TaskFlow API → idempotency / backfill patterns → Dagster software-defined assets → Prefect dynamic flows. Knowing all three lets you pick the right tool per use case.
- When should I pick Dagster over Airflow?
- When asset lineage is a first-class concern, when the team wants strong typing on inputs and outputs, when IO managers (decoupling compute from storage) reduce duplicated boilerplate, or when you're building a greenfield platform without legacy Airflow operators. Dagster's software-defined-assets model is genuinely better than Airflow's task-centric model for analytics-engineering work — but you trade ecosystem breadth for it. The dagster.io/blog covers the trade-offs in depth.
- Should I use dbt Cloud or self-hosted dbt Core?
- dbt Core is free, open-source, and runs anywhere. dbt Cloud adds a hosted scheduler, IDE, CI integration, and semantic-layer features for a per-developer SaaS price. Most large enterprises run dbt Core with their existing Airflow / Dagster scheduler; smaller teams or analytics-engineering-led shops favor dbt Cloud for the integrated experience. The decision is operational (do you want to run the scheduler yourself?) more than technical.
- What's the right retry policy for an Airflow task?
- Distinguish transient from permanent failures. Transient (network blips, rate limits, warehouse transient errors): retry 3-5 times with exponential backoff. Permanent (schema mismatch, auth failure, malformed input): fail immediately and alert. Setting 'retries=3, retry_delay=timedelta(minutes=5), retry_exponential_backoff=True' in default_args is the right starting point. The anti-pattern: 'retries=10' on every task without distinguishing failure modes — this masks bugs and delays incident response.
- How do I backfill a six-month range without overwhelming the warehouse?
- Three operational levers: (1) cap concurrency on the orchestrator side — Airflow's max_active_runs / max_active_tis_per_dag limit how many partitions execute in parallel; (2) size the partition appropriately — daily partitions backfill faster than monthly; hourly is overkill for most analytics; (3) prioritize cold partitions during off-peak hours so the production daily run doesn't compete for warehouse compute. Senior+ teams instrument warehouse credit usage so backfills don't burn the monthly budget.
- Should every pipeline use ELT?
- For warehouse-resident analytics, almost always yes. The exceptions are real but narrow: PII redaction at source (transform must happen before landing), streaming ingestion to non-warehouse consumers (transform happens in Flink / Kafka Streams), compute-bound transformations the warehouse can't handle (ML feature engineering, large-scale graph algorithms), and operational data stores as the destination (Postgres / MySQL serving an app). Default to ELT; deviate only when one of these conditions applies and document why.
- What's the difference between Fivetran and Airbyte?
- Fivetran is a fully managed SaaS — you point it at a source and a destination and it handles connector code, schema drift, scheduling, and operational monitoring. Pricing is per-MAR (monthly active rows) and scales unpredictably. Airbyte is open-source — you host it yourself (or use Airbyte Cloud), and you take on the operational burden. The trade-off: Fivetran's spend can hit six figures annually for a mid-size data team; Airbyte's operational cost is a half-time SRE / data-platform engineer. Most 2026 teams use both: Fivetran for high-value, complex SaaS sources; Airbyte for everything else.
- How do I handle CDC (change data capture) from a Postgres source?
- Debezium reading the Postgres logical replication slot, writing to Kafka, then consuming into the warehouse via Kafka Connect / Snowflake Connector / BigQuery Subscriptions / Databricks Auto Loader. Confluent's blog (confluent.io/blog) is the canonical source for CDC patterns. The senior+ bar: you understand replication-slot management (slots that fall behind block WAL recycling and fill the source disk), schema-evolution handling (dropped columns, type changes), and exactly-once vs at-least-once delivery semantics in your downstream consumer.
- What's the right way to test a data pipeline?
- Three layers. (1) Unit tests on transformation logic (pytest for Python, dbt unit tests for SQL since dbt 1.8). (2) Data tests on the output (dbt's not_null, unique, accepted_values, custom singular tests; Great Expectations for non-dbt pipelines). (3) Integration tests that run the full pipeline against a synthetic dataset and assert on the output. Senior+ teams run all three in CI; the data tests run in production after every materialization and alert on failure.
- How do I pick a partition strategy?
- Date-based partitions are the default — daily for analytics, hourly for high-volume event streams, monthly for slow-changing dimensions. The partition column should be derived from the source event time (when the event happened), not the ingest time (when the pipeline saw it), because ingest-time partitioning makes backfills non-deterministic. Maxime Beauchemin's 'functional data engineering' essay on Medium (maximebeauchemin.medium.com) is the canonical reference for partition design.
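A small Python sketch of that last point (the event_time field name is illustrative): the partition key is derived from the event's own timestamp, so replaying the pipeline assigns rows to the same partitions every time:
from datetime import datetime

def partition_key(event: dict) -> str:
    # Derive the partition from when the event happened at the source.
    # Assumes the source emits an ISO-8601 timestamp, e.g. "2026-04-29T14:03:00+00:00".
    event_ts = datetime.fromisoformat(event["event_time"])
    return f"dt={event_ts.date().isoformat()}"

# Anti-pattern: deriving the key from the current wall clock (when the pipeline
# ran) means a backfill lands the same rows in different partitions than the
# original run did.
print(partition_key({"event_time": "2026-04-29T14:03:00+00:00"}))  # dt=2026-04-29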
Sources
- Apache Airflow Documentation. Canonical reference for DAG authoring, TaskFlow API, scheduling, and the operator ecosystem.
- Dagster Blog. Canonical for software-defined assets, asset lineage, and the modern data-platform architecture argument.
- dbt Labs Blog. Canonical for dbt-as-orchestrator patterns, incremental models, semantic layer, and analytics-engineering practice.
- Maxime Beauchemin (Airflow creator) on Medium. Canonical for functional data engineering, idempotency, and partition-driven design philosophy.
- Confluent Blog. Canonical for CDC patterns, Kafka Streams, exactly-once semantics, and streaming-vs-batch trade-offs.
- Reis & Housley — Fundamentals of Data Engineering (O'Reilly). Canonical textbook on the data-engineering lifecycle, undercurrents, and modern stack architecture.
- Airbyte Documentation. Canonical for open-source EL connector patterns, custom connector development, and self-hosted ingestion.
About the author. Blake Crosley founded ResumeGeni and writes about data engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.