Data Engineer Hub

Data Modeling and Warehousing

In short

Data modeling is the part of data engineering that survives the longest and rewards the most disciplined thinking. Pipelines get rewritten every two years; table designs you ship today will still be feeding dashboards, ML features, and regulatory reports a decade from now. Get the grain wrong, denormalize the wrong axis, or pick the wrong table format and you will pay for that decision every time the business asks a new question. The modern stack has not abolished this craft; it has multiplied the choices. You now have to pick a modeling style (Kimball stars, Data Vault), a layering strategy (Bronze/Silver/Gold), and an open table format (Iceberg, Delta, Hudi), and each of those decisions outlives the pipeline that implements it.

Key takeaways

  • Star schemas win for human-readable analytics; snowflake only when the dimension is genuinely large or shared.
  • SCD Type 2 with effective-from/effective-to windows and a row-hash change detector is the production default; a sketch follows this list.
  • Bronze/Silver/Gold is a contract architecture — each layer promises something different to its consumers.
  • Joining facts to dimensions on the surrogate key valid at the time of the event is what makes SCD Type 2 worth the cost.
  • Iceberg for multi-engine, Delta for Databricks-native, Hudi for streaming-first CDC. The catalog is the real lock-in.
  • Data Vault 2.0 belongs in the integration layer, not the serving layer — never expose hubs and satellites to BI tools.
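
To make the Type 2 bullet concrete, here is a minimal pandas sketch of the merge step. The dimension shape (customer_id, surrogate_key, effective_from/effective_to, is_current), the tracked columns, and the integer surrogate keys are all placeholder choices rather than a prescribed layout; in production this usually runs as a MERGE inside the warehouse, not in pandas.

```python
import hashlib

import pandas as pd

# Columns whose changes should create a new dimension row (placeholders).
TRACKED = ["name", "segment", "region"]
# Open-ended "current" sentinel (pandas' max timestamp stands in for 9999-12-31).
HIGH_DATE = pd.Timestamp.max


def row_hash(df: pd.DataFrame) -> pd.Series:
    """Hash the tracked columns so a change in any one of them flips the hash."""
    concat = df[TRACKED].astype(str).agg("|".join, axis=1)
    return concat.map(lambda s: hashlib.md5(s.encode()).hexdigest())


def apply_scd2(dim: pd.DataFrame, staged: pd.DataFrame, load_ts: pd.Timestamp) -> pd.DataFrame:
    """Close changed rows and append new versions (SCD Type 2).

    dim    -- existing dimension: customer_id, surrogate_key, TRACKED columns,
              row_hash, effective_from, effective_to, is_current
    staged -- today's snapshot: customer_id + TRACKED columns
    """
    dim = dim.copy()
    staged = staged.copy()
    staged["row_hash"] = row_hash(staged)

    current = dim.loc[dim["is_current"], ["customer_id", "row_hash"]]
    merged = staged.merge(current, on="customer_id", how="left", suffixes=("", "_dim"))

    # Brand-new keys (missing hash) and changed hashes both need a new version.
    changed = merged[merged["row_hash"] != merged["row_hash_dim"]]

    # 1. Close the superseded "current" rows.
    to_close = dim["customer_id"].isin(changed["customer_id"]) & dim["is_current"]
    dim.loc[to_close, "effective_to"] = load_ts
    dim.loc[to_close, "is_current"] = False

    # 2. Append the new versions with fresh surrogate keys and an open window.
    next_key = int(dim["surrogate_key"].max()) + 1 if len(dim) else 1
    new_rows = changed[["customer_id", *TRACKED, "row_hash"]].copy()
    new_rows["surrogate_key"] = list(range(next_key, next_key + len(new_rows)))
    new_rows["effective_from"] = load_ts
    new_rows["effective_to"] = HIGH_DATE
    new_rows["is_current"] = True

    return pd.concat([dim, new_rows], ignore_index=True)
```

The point-in-time join that makes this history usable is sketched in the FAQ further down.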

When to use Kimball dimensional modeling

SCD Type 2 in practice

Lakehouse architecture: Bronze/Silver/Gold layers

Iceberg vs Delta vs Hudi — when each wins

Frequently asked questions

Should I model OLTP and OLAP with the same schema?
No. OLTP optimizes for row-level writes and referential integrity (3NF, narrow tables, lots of joins). OLAP optimizes for wide scans and aggregations (denormalized facts, columnar storage, few joins). Treating them as one schema starves both: writes contend with reads, indexes bloat, and analytical queries time out. Replicate from OLTP into a warehouse and remodel.
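
As a toy pandas illustration of that remodeling step, with made-up table and column names: the join from orders to customers is materialized once on the way into the warehouse, so analytical queries read one wide fact instead of hitting the OLTP schema.

```python
import pandas as pd

# Toy OLTP tables: normalized, keyed, write-friendly (names are illustrative).
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["SMB", "Enterprise"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 2, 2],
                       "amount": [100.0, 250.0, 75.0]})

# Remodeling on the way into the warehouse: materialize the join once,
# producing a wide, scan-friendly fact table.
fct_order = orders.merge(customers, on="customer_id")

# "Revenue by segment" is now a single scan-and-aggregate over the fact.
print(fct_order.groupby("segment")["amount"].sum())
```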
When does denormalization stop paying off?
When the redundant column starts changing faster than the fact rows do. Denormalizing `customer_segment` onto `fct_order` is fine if segments are stable; it becomes a nightmare when segments are recomputed nightly and you have to backfill billions of rows to keep history correct. The rule of thumb: denormalize attributes that are stable or whose history you do not care about; keep volatile attributes in a Type 2 dimension and join.
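
Here is what that "keep it in a Type 2 dimension and join" pattern looks like as a minimal pandas sketch; dim_customer, customer_sk, and the validity-window columns are hypothetical names, and pandas' maximum timestamp stands in for an open-ended effective_to. This is also the point-in-time join from the key takeaways: each fact row picks up the surrogate key that was valid when the order happened.

```python
import pandas as pd

# Hypothetical Type 2 customer dimension: one row per version of customer 1.
dim_customer = pd.DataFrame({
    "customer_sk": [101, 102],
    "customer_id": [1, 1],
    "segment": ["SMB", "Enterprise"],
    "effective_from": pd.to_datetime(["2023-01-01", "2024-06-01"]),
    "effective_to": [pd.Timestamp("2024-06-01"), pd.Timestamp.max],  # max = still current
})

fct_order = pd.DataFrame({
    "order_id": [10, 11],
    "customer_id": [1, 1],
    "order_date": pd.to_datetime(["2024-03-15", "2024-09-01"]),
    "amount": [100.0, 250.0],
})

# Join on the natural key, then keep the version whose validity window
# contains the order date; that version's surrogate key goes on the fact.
joined = fct_order.merge(dim_customer, on="customer_id", how="left")
as_of_event = joined[(joined["order_date"] >= joined["effective_from"])
                     & (joined["order_date"] < joined["effective_to"])]

# Order 10 falls in the SMB era, order 11 in the Enterprise era,
# even though both belong to the same customer.
print(as_of_event[["order_id", "customer_sk", "segment", "amount"]])
```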
Is Data Vault 2.0 worth the complexity?
Only at scale and only when auditability is a hard requirement. Hubs/links/satellites give you perfect lineage and painless schema evolution at the integration layer, but they are miserable to query directly. Use Data Vault as your Silver layer, then build Kimball stars on top for Gold. Skipping straight from Bronze to a star schema is fine for most teams under 50 sources.
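
For a feel of the mechanics, here is a toy sketch of how one source row fans out into a hub row and a satellite row, assuming MD5 hash keys over normalized business keys and a hashdiff over the descriptive attributes; the table and column names are illustrative, not a Data Vault standard.

```python
import hashlib
from datetime import datetime, timezone


def md5(*parts: str) -> str:
    """Deterministic hash of normalized business-key parts or attributes."""
    return hashlib.md5("||".join(p.strip().upper() for p in parts).encode()).hexdigest()


def to_vault(source_row: dict, record_source: str) -> tuple[dict, dict]:
    """Split a source customer row into a hub row and a satellite row.

    Hub: the business key and its hash key, nothing else.
    Satellite: descriptive attributes keyed by the hub hash, plus a hashdiff
    so a new satellite row is inserted only when something actually changed.
    """
    load_ts = datetime.now(timezone.utc)
    hub_hash = md5(source_row["customer_id"])

    hub_customer = {
        "hub_customer_hk": hub_hash,
        "customer_id": source_row["customer_id"],
        "load_ts": load_ts,
        "record_source": record_source,
    }
    descriptive = {k: v for k, v in source_row.items() if k != "customer_id"}
    sat_customer = {
        "hub_customer_hk": hub_hash,
        "hashdiff": md5(*map(str, descriptive.values())),  # column order must stay fixed
        **descriptive,
        "load_ts": load_ts,
        "record_source": record_source,
    }
    return hub_customer, sat_customer


hub, sat = to_vault({"customer_id": "C-42", "name": "Acme", "segment": "SMB"}, "crm")
```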

About the author. Blake Crosley founded ResumeGeni and writes about data engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.