Data Engineer Hub

Data Modeling and Warehousing

In short

Data modeling is the part of data engineering that survives the longest and rewards the most disciplined thinking. Pipelines get rewritten every two years; table designs you ship today will still be feeding dashboards, ML features, and regulatory reports a decade from now. Get the grain wrong, denormalize the wrong axis, or pick the wrong table format and you will be paying for that decision every time the business asks a new question. The modern stack has not abolished this craft; it has multiplied the choices. You now have to decide between Kimball star schemas, Data Vault 2.0, and the Activity Schema at the modeling layer; between Iceberg, Delta, and Hudi at the table-format layer; and between bronze / silver / gold layering, staging / intermediate / marts in dbt, or a Data-Vault-fed Kimball stack at the architecture layer. The senior bar in 2026 is making these choices deliberately, with reasons that survive a calibration review, not by default or by whatever the last team did.

Key takeaways

  • Star schemas win for human-readable analytics; snowflake only when the dimension is genuinely large or shared.
  • SCD Type 2 with effective-from/effective-to windows and a row-hash change detector is the production default.
  • Bronze/Silver/Gold is a contract architecture; each layer promises something different to its consumers.
  • Joining facts to dimensions on the surrogate key valid at the time of the event is what makes SCD Type 2 worth the cost.
  • Iceberg for multi-engine, Delta for Databricks-native, Hudi for streaming-first CDC. The catalog is the real lock-in.
  • Data Vault 2.0 belongs in the integration layer, not the serving layer; never expose hubs and satellites to BI tools.

Kimball stars beat Data Vault and Activity Schema for serving

The right way to frame the modeling-paradigm question in 2026 is not which is best but which belongs at which layer. Three paradigms are alive: Ralph Kimball's dimensional model (star and snowflake schemas), Dan Linstedt's Data Vault 2.0 (hubs / links / satellites), and Ahmed Elsamadisi's Activity Schema (the Narrator model). Bill Inmon's third-normal-form enterprise data warehouse is the fourth historical paradigm; it is mostly gone from greenfield builds in 2026, surviving in finance and regulated industries where the original warehouse predates the lakehouse era.

Kimball dimensional modeling is the default for the serving layer (the Gold / marts layer that BI tools, analysts, and ML features read). A fact table holds the events at a chosen grain (one row per order line, one row per page view, one row per session); dimension tables hold the descriptive context (customer, product, store, date). The Kimball Group's published patterns at kimballgroup.com/data-warehouse-business-intelligence-resources are still the canonical reference; the discipline is durable across warehouse, lakehouse, and federated query engines because the model is about how humans and BI tools think about the business, not about the physical storage. A senior DE who cannot draw a star schema for a checkout funnel on a whiteboard inside five minutes will fail a Snowflake or Databricks loop; Tristan Handy and the dbt-Labs team have been blunt that the analytics-engineering hire bar is fluency with Kimball.

Data Vault 2.0 belongs at the integration layer, not the serving layer. The hub / link / satellite decomposition (one hub per business key, one link per relationship, one satellite per attribute group with load timestamps) buys two things the star schema does not: painless schema evolution as upstream systems drift, and auditable lineage at the row level for regulated environments. It pays for those benefits with brutal query complexity; analysts cannot read a Data Vault. Use it as the Silver layer for many-source integrations (think a bank merging fifteen core-banking systems after an acquisition) and build Kimball stars on top for Gold. Skipping straight from raw Bronze to a Kimball star is the right call for most teams under fifty sources.

The Activity Schema (the Narrator model, popularized by Ahmed Elsamadisi) compresses an entire warehouse into one append-only table with the columns customer_id, ts, activity, feature_json, plus computed activity_occurrence and activity_repeated_at. Every metric is a SQL pattern over this single table. The pitch is real (one table, two columns of indexing discipline, every behavioural metric becomes a window-function query); the critique is also real. The Activity Schema punishes you at the join boundary; once analytics has to combine behavioural events with non-event dimensional data (price tier, product hierarchy, marketing attribution windows) the elegance dissolves and you end up rebuilding a star schema next to it. The wrong answer is to ship a pure Activity Schema as the entire warehouse and tell the BI team to write window functions; the right answer is to use it for the behavioural-funnel subdomain where it shines and keep Kimball as the rest of the Gold layer.

The senior decision boils down to a paragraph the candidate has to be able to deliver in a system-design interview: Kimball at the serving layer, Data Vault only at the integration layer when the source count or the audit requirement forces it, Activity Schema in a specific behavioural domain when it earns its keep, and Inmon-style 3NF mostly off the table for new builds. Fundamentals of Data Engineering (O'Reilly) argues this layered position at depth, and it is what calibration committees at Snowflake, Databricks, and dbt Labs hear when they hear a senior answer.

SCD Type 2 is the production default; know when Type 4 and Type 6 win

Slowly Changing Dimensions are where modeling discipline pays off most concretely. A dimension attribute changes (customer moves from segment 'sme' to segment 'enterprise', product moves from category 'health' to category 'fitness', employee moves departments). The question is what your historical fact rows say about that customer, product, or employee. Ralph Kimball's original SCD taxonomy (Types 0 through 7) is still canonical at the Kimball Group's published techniques page; four of those types actually ship in production.

Type 1 (overwrite) is for attributes where history does not matter and the new value is the only one anyone should ever see. Customer email is the canonical example: when it changes, every historical fact row should attribute to the new email because there is only one customer. Type 1 is cheap; one MERGE statement, no extra rows, no surrogate-key gymnastics.

Type 2 (history rows) is the production default for any attribute whose history matters for analytics. The dimension carries an effective_from timestamp, an effective_to timestamp (or NULL for the current row), an is_current flag, and a surrogate key that changes every time the natural key gets a new historical state. Facts join on the surrogate key valid at the fact event timestamp, which is what makes the model worth its cost. The production failure mode is the one that hits every team eventually: finance reorganises customer segments in March, the dimension overwrites segment values without versioning, and the December board deck no longer matches the December numbers you ran in January. The whole point of Type 2 is that the December board deck never changes; the December fact rows are joined to the segment value that was true in December. If you cannot reproduce a board deck six months after the fact, you do not have a Type 2 dimension; you have a Type 1 dimension with extra columns.

Type 4 (history table) splits the dimension into a current-state table and a separate history table. The current-state table is small and fast (BI tools and operational queries hit it); the history table is large and append-only (the few analyses that need history hit it). Type 4 is the right call when the dimension is wide and most queries do not need the historical view; the split-table pattern keeps the hot path narrow. The trade-off is two tables to maintain and a slightly more complex join pattern.

Type 6 (hybrid) is Type 2 plus a Type 1 current-value column on every row. The historical rows still hold the value that was true at the time, but every row also carries a current_segment column that gets overwritten when the segment changes. This is the right call when analysts need to answer two questions with the same join: what was the segment at the time of the order (historical join, the Type 2 column) and what is the segment now for the customer who placed that order (current join, the Type 1 column). Type 6 is more expensive to maintain than Type 2 because every segment change has to fan out across all historical rows for that customer; do not adopt it on a dimension with rapid attribute churn.

The 2026 idiomatic implementation pattern in dbt is dbt snapshots with the check strategy and a row-hash column. The snapshot config detects any change to the watched columns, closes out the existing row by stamping dbt_valid_to, and writes a new row with dbt_valid_from set to the run timestamp. The row hash keeps the diff cheap on wide dimensions:

{% snapshot dim_customer_snapshot %}
{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='check',
    check_cols=['segment','region','tier','status'],
    invalidate_hard_deletes=True
) }}

select
    customer_id,
    segment,
    region,
    tier,
    status,
    updated_at
from {{ source('crm','customers') }}
{% endsnapshot %}

The dbt snapshot manages the dbt_scd_id surrogate, the dbt_valid_from / dbt_valid_to windows, and the dbt_updated_at column without further ceremony. Downstream marts then join facts to the snapshot using the standard between valid_from and coalesce(valid_to, '9999-12-31') predicate on the event timestamp. Type 2 done right is durable; Type 2 done wrong corrupts every historical analysis.

The dbt-layered warehouse is the 2026 default; treat each layer as a contract

The modern dbt stack has converged on a three-layer model that maps onto the lakehouse Bronze / Silver / Gold pattern: staging for one-to-one source cleanups, intermediate for re-usable transformations, and marts for the fact and dimension tables that BI tools, ML features, and reverse-ETL syncs consume. The dbt project-structure guide is the canonical reference; the architecture is enforced by naming convention (stg_*, int_*, fct_*, dim_*), folder structure (models/staging/, models/intermediate/, models/marts/), and dependency discipline (a mart never depends on a raw source directly, only on a staging model).

The contract per layer is what makes the model worth the discipline. Staging promises one model per source table with column renaming, type casting, and light cleanup (trim whitespace, NULL-out sentinel values), nothing else. Intermediate promises re-usable joins and aggregations that more than one mart will consume. Marts promise the analytical model the business reads; the names match the business domain, the grain is documented, and the tests are tight. A senior DE never lets a stakeholder query a staging model directly; staging is a kitchen, not a menu.

The 2026 senior bar runs dbt tests at every layer, not just at marts. The four built-in generic tests (unique, not_null, accepted_values, relationships) catch most schema drift; dbt-utils adds unique_combination_of_columns, recency, and expression_is_true; dbt-expectations ports a Great-Expectations suite into the dbt test runner. Source freshness checks (dbt source freshness) catch the case where the upstream EL loader has stopped landing rows but the dbt run still succeeds against stale data; without freshness tests the warehouse looks healthy while every dashboard goes silently wrong.

Incremental models are where most teams trip. The default materialized='incremental' config with an {% if is_incremental() %} where event_ts > (select max(event_ts) from {{ this }}) {% endif %} predicate works for pure append-only event tables and breaks the first time a late-arriving fact lands. The right pattern depends on the source guarantees:

  • Append-only, no late arrivals: the high-water-mark predicate is fine. incremental_strategy='append' on BigQuery / Snowflake / Databricks.
  • Late arrivals within a known window (e.g., 72 hours): use a lookback predicate (event_ts > dateadd(hour, -72, current_timestamp)) with incremental_strategy='merge' on a unique_key. The MERGE pattern overwrites any late-arriving row that matches the key.
  • Late arrivals are unbounded (e.g., correction events that backdate by months): either re-process the full window on each run (cost trade-off) or model the late arrivals as a separate correction stream that updates the original mart via an explicit MERGE.
  • Idempotency: always set unique_key on incremental MERGE models. The model must produce identical output regardless of how many times it ran or whether yesterday's run partially failed.

Tristan Handy and the dbt Labs team have argued at the dbt Labs blog that incremental modelling is where the analytics-engineering profession actually gets measured; teams that ship pure-table materializations for everything are paying warehouse compute that does not need to be paid, and teams that ship broken incrementals are silently corrupting their marts.

Partition pruning and clustering keys decide query cost more than the model

Once the model is right, query performance lives or dies at the physical-layout layer. Each of the three big warehouses in 2026 (Snowflake, BigQuery, Databricks SQL) exposes a slightly different abstraction over the same fundamental trick: cluster the data on disk so the query planner can skip the partitions, files, or micro-partitions that cannot contain matching rows. Get this wrong and a senior data engineer will watch a dashboard scan the whole fact table on every refresh while the warehouse bill climbs into the kind of bands levels.fyi compensation surveys show senior DEs drawing in cash comp; you are paying for the same scan over and over.

Snowflake micro-partitions and clustering keys. Snowflake auto-partitions every table into immutable 50-to-500-MB micro-partitions and tracks the min / max of every column per micro-partition in the metadata; this is the zone-map index. Queries that filter on a high-cardinality column (event_ts, customer_id) prune at the metadata layer and never read the bulk of the table. Snowflake's documentation on micro-partitions and data clustering spells the model out. A CLUSTER BY (event_date) directive tells the auto-clustering service to keep the table physically sorted on event_date so the zone maps stay tight; without it, large unsorted tables degrade and the auto-clustering credits earn back the lost scan time. The senior decision: cluster on the column the high-volume queries filter on, not on the primary key.

BigQuery partitioning + clustering. BigQuery exposes the two layers explicitly. Partitioned tables split the data into physical partitions by date or by an integer range; the query planner skips partitions whose pseudo-column (_PARTITIONDATE / _PARTITIONTIME) does not match the filter. Clustered tables then sort rows within each partition on up to four clustering columns. The idiomatic pattern is partitioning on event_date and clustering on (customer_id, country, event_type); this collapses a fact-table scan that would cost many gigabytes of bytes-billed into a query that reads a few hundred MB. Be ruthless about it; BigQuery pricing is bytes-scanned, and partition-pruning failures show up directly on the monthly bill.

Databricks Delta Z-ordering and Liquid Clustering. Delta tables expose OPTIMIZE ... ZORDER BY for the legacy clustering strategy and Liquid Clustering for the 2024-onward default. Z-order computes a space-filling-curve order across multiple columns so that multi-dimensional range queries can co-locate matching files; Liquid Clustering removes the physical-partition coupling and lets the cluster keys evolve over time without rewriting the table. Reynold Xin and the Databricks engineering blog have written extensively about why Liquid Clustering supersedes the old PARTITIONED BY + Z-order pattern; greenfield Delta tables in 2026 should default to Liquid Clustering on (event_date, customer_id) and rely on the cluster-by spec rather than physical partitions.

Reading the query plan is the senior skill. Snowflake's EXPLAIN output exposes the per-step partition-prune ratio. BigQuery's query plan visualization shows bytes-billed and slot-ms per stage. Databricks SQL exposes the Spark UI's physical plan and the I/O metrics per file. A senior DE inspects the plan before promoting a model to production and asserts two things: the partition / cluster filter resolved at the scan stage (not after a join), and the expensive join is hash-distributed on a key with low skew. The wrong answer is to denormalize first and inspect the plan later; the right answer is to inspect the plan, decide whether a denormalization is the right intervention, and only then duplicate the column.

The interview tell that separates a senior+ candidate from a mid-level one at Snowflake, Databricks, dbt Labs, or any warehouse-native shop is the second-order knowledge: not just cluster on event_date but why clustering on customer_id alongside event_date actually hurts the cost of the dashboards that filter only on the date range, and how the auto-clustering service decides when to re-cluster. That is what the calibration loop is listening for.

Frequently asked questions

Should I model OLTP and OLAP with the same schema?
No. OLTP optimizes for row-level writes and referential integrity (3NF, narrow tables, lots of joins). OLAP optimizes for wide scans and aggregations (denormalized facts, columnar storage, few joins). Treating them as one schema starves both: writes contend with reads, indexes bloat, and analytical queries time out. Replicate from OLTP into a warehouse and remodel.
When does denormalization stop paying off?
When the redundant column starts changing faster than the fact rows do. Denormalizing `customer_segment` onto `fct_order` is fine if segments are stable; it becomes a nightmare when segments are recomputed nightly and you have to backfill billions of rows to keep history correct. The rule of thumb: denormalize attributes that are stable or whose history you do not care about; keep volatile attributes in a Type 2 dimension and join.
Is Data Vault 2.0 worth the complexity?
Only at scale and only when auditability is a hard requirement. Hubs/links/satellites give you perfect lineage and painless schema evolution at the integration layer, but they are miserable to query directly. Use Data Vault as your Silver layer, then build Kimball stars on top for Gold. Skipping straight from Bronze to a star schema is fine for most teams under 50 sources.
Which open table format should I pick: Iceberg, Delta, or Hudi?
Iceberg if more than one query engine reads the table (Snowflake external tables + Trino + Spark + Flink), because Iceberg's catalog (REST, Glue, Nessie, Polaris) is the most engine-agnostic in 2026. Delta if the team is already Databricks-native; Delta Lake's integration with Unity Catalog, Liquid Clustering, and the Photon engine is best-in-class on Databricks and weaker off it. Hudi if the workload is streaming-first CDC with frequent upserts and you need record-level indexing. The real lock-in is the catalog, not the table format; pick the catalog first.
How do I handle a late-arriving fact in an incremental dbt model?
Three patterns by source guarantee. (1) If late arrivals fall inside a known window (say 72 hours), use a lookback predicate (`where event_ts > dateadd(hour, -72, current_timestamp)`) plus `incremental_strategy='merge'` on a `unique_key`; the MERGE overwrites the late row. (2) If late arrivals are unbounded, either re-process the full window each run (cost trade-off) or model corrections as a separate stream that MERGEs into the original mart. (3) Always set `unique_key` on incremental MERGE models so the run is idempotent; without it, a retry can duplicate rows.
What is the right grain for a fact table?
The lowest grain the source can sustain that still answers the questions the business asks. One row per order line beats one row per order if the line-level discount or line-level product mix matters; one row per session beats one row per page view if the page-view volume would blow up the table but the session-level metric is what the business reports on. Document the grain in the model's dbt docstring and assert it with a `unique_combination_of_columns` test; the grain is the contract.
Why does SCD Type 2 keep biting teams in production?
Three failure modes recur. (1) The change-detection logic misses a watched column, the dimension overwrites silently, and historical fact joins corrupt without warning; the dbt snapshot `check` strategy with an explicit `check_cols` list and a row-hash column guards against this. (2) The fact-to-dimension join uses the current surrogate key instead of the surrogate key valid at the event time; every Type 2 dimension needs a `between valid_from and coalesce(valid_to, '9999-12-31')` predicate. (3) Hard deletes upstream do not propagate; set `invalidate_hard_deletes=True` on dbt snapshots so a deleted natural key closes its current row.

Sources

  1. Kimball Group; Dimensional Modeling Techniques. Canonical reference for star schemas, snowflake schemas, and the full Slowly Changing Dimension taxonomy (Types 0 through 7).
  2. Apache Iceberg; latest documentation. Canonical reference for table-format spec, hidden partitioning, schema evolution, and the REST / Glue / Polaris catalog options.
  3. Delta Lake; official documentation. Canonical reference for Delta protocol, ACID transactions, OPTIMIZE / ZORDER, and Liquid Clustering.
  4. Apache Hudi; overview and concepts. Canonical reference for record-level indexing, Copy-on-Write vs Merge-on-Read tables, and streaming CDC ingestion patterns.
  5. dbt Labs; Snapshots documentation. Canonical reference for SCD Type 2 implementation in dbt, the check vs timestamp strategy, and the dbt_valid_from / dbt_valid_to / dbt_scd_id metadata columns.
  6. dbt Labs; project structure guide. Canonical reference for the staging / intermediate / marts layering pattern and naming conventions.
  7. Snowflake; micro-partitions and data clustering. Canonical reference for the Snowflake zone-map index, auto-clustering service, and CLUSTER BY directive.
  8. Google BigQuery; partitioned tables. Canonical reference for date / integer-range partitioning, partition pruning, and the _PARTITIONDATE pseudo-column.
  9. Google BigQuery; clustered tables. Canonical reference for clustering-column ordering, the four-key limit, and partition + clustering interaction.
  10. Reis & Housley; Fundamentals of Data Engineering (O'Reilly). Canonical textbook on the data-engineering lifecycle, modeling-layer trade-offs, and lakehouse architecture argument.
  11. levels.fyi; per-company compensation by level. Source for Data Engineer total-compensation bands at Snowflake, Databricks, dbt Labs, and FAANG-tier shops; use per-company filters and the published bands rather than inferring.

About the author. Blake Crosley founded ResumeGeni and writes about data engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.