Data Quality and Observability

In short

Data quality decides whether your warehouse is a decision-making asset or a liability dashboard. Barr Moses (Monte Carlo's co-founder) coined 'data downtime' to make the analogy explicit: the hours or days when data is missing, late, or wrong, plus the cascading damage to dashboards, ML features, and executive trust. Modern data quality has three layers. In-pipeline assertions: dbt tests that fail the build when an invariant breaks. Observability: passive monitoring of freshness, volume, schema, and distribution to catch failures dbt cannot anticipate. Contracts: schemas and SLAs negotiated with upstream producers, validated at the ingest gateway, versioned like APIs. Teams that ship only the first layer miss silent failures; teams that buy only the second pay for monitoring without the assertions that catch the predictable breaks. The senior bar is matching each problem to the layer that should own it.

Key takeaways

  • dbt tests on every primary key, foreign key, and accepted-values column are the non-negotiable baseline. Run them on Bronze and Silver, not just Gold.
  • Data observability (Monte Carlo, Soda, Anomalo, Bigeye) catches the freshness, volume, schema, and distribution failures that dbt tests cannot anticipate.
  • Barr Moses's 'data downtime' framing made data quality a measurable business metric: minutes of staleness or wrongness, reported with the same severity as application downtime.
  • Lineage tools (Atlan, OpenMetadata, DataHub) answer 'what breaks if I change this column?' in minutes rather than days. Column-level lineage beats table-level for impact analysis.
  • Data contracts are the 2026 senior-DE bar: schemas, freshness SLAs, and ownership negotiated with upstream producers, validated at the gateway, versioned like APIs.
  • Pinterest's Data Quality Canvas and the Convoy/GoCardless contract case studies report fewer downstream incidents after quality enforcement moves from downstream cleanup to upstream contract validation; specific incident-reduction percentages vary by team and aren't directly attributable to a single primary source.
  • The wrong architecture: zero in-pipeline tests plus an expensive observability tool. The right architecture: dbt tests as assertions, observability for the unknown unknowns, contracts on the interfaces that matter.

The 2026 data quality vocabulary: data downtime, SLOs, and the Six Dimensions

Senior data engineers in 2026 talk about data quality in two vocabularies: the older DAMA Six Dimensions framing (completeness, validity, accuracy, consistency, uniqueness, timeliness) and the newer SLO framing pushed by Barr Moses and the Monte Carlo team. The DAMA dimensions are useful for governance documents and audit checklists; they are less load-bearing for the operational work of running a data platform on call. The SLO framing treats data the way Google SRE treats services: pick a small set of measurable signals (freshness, volume, schema, distribution, lineage), set targets, page when the target is missed, blameless-postmortem the misses. That language ports directly from application reliability and gets you a shared on-call vocabulary with platform and SRE peers.

'Data downtime' as the unit of measure. Barr Moses's reframe was the rhetorical move that made data quality a budget line. Instead of arguing about whether a Gold dashboard was 'pretty accurate', a team can report 'forty-eight minutes of data downtime on the revenue mart this week, two incidents, one self-healed, one paged' in the same forum where the application team reports site uptime. The metric is computed as the minutes between the start of an incident (freshness miss, schema break, distribution anomaly, volume drop) and the time downstream consumers can trust the table again, summed over the reporting window. Pinterest's Data Quality Canvas captures the same shift: quality as a shared discipline with named owners across producers, platform, and consumers, not an undifferentiated 'data team' responsibility.
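
The computation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each incident is recorded as a (start, trusted-again) pair and that incidents on one table do not overlap; the function name and the timestamps are hypothetical.

```python
from datetime import datetime


def downtime_minutes(incidents):
    """Sum the minutes from each incident's start until consumers can trust the table again."""
    total = 0.0
    for started_at, trusted_again_at in incidents:
        total += (trusted_again_at - started_at).total_seconds() / 60
    return total


# Hypothetical week on the revenue mart: one self-healed miss, one paged incident.
incidents = [
    (datetime(2026, 3, 2, 6, 0), datetime(2026, 3, 2, 6, 18)),     # 18 minutes
    (datetime(2026, 3, 5, 14, 10), datetime(2026, 3, 5, 14, 40)),  # 30 minutes
]
weekly_downtime = downtime_minutes(incidents)
print(weekly_downtime)  # 48.0
```

Overlapping incidents would need to be merged before summing, otherwise the same minutes are counted twice.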

Five signals that matter on every fact table. Freshness: when did the table last receive new rows; is the lag inside the SLA the downstream contract committed to. Volume: is today's row count inside the expected band for this day-of-week and hour; a fact table that usually lands 4.2 million rows and today landed 600 thousand is the canonical silent failure. Schema: did upstream add a column, drop a column, change a column's type, change its nullability. Distribution: did the values in a load-bearing column shift (more NULLs, a new category code, the median doubled). Lineage: what upstream tables feed this; what downstream tables and dashboards consume it. Every modern observability vendor (Monte Carlo, Bigeye, Anomalo, Soda) instruments these five; the differentiator is how good their detection is on distribution and how cheaply the freshness checks scale.
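
The freshness and volume signals are simple enough to sketch directly. The example below is an illustration under stated assumptions, not a vendor's detection logic: the band is a plain mean plus or minus three standard deviations over counts from the same day-of-week, and the function names, history values, and two-hour SLA are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def is_fresh(last_loaded_at, now, sla):
    """True when the lag since the last load is inside the freshness SLA."""
    return (now - last_loaded_at) <= sla


def volume_in_band(todays_rows, same_weekday_history, tolerance_sigmas=3.0):
    """True when today's row count is within N standard deviations of the
    historical counts for the same day-of-week."""
    center = mean(same_weekday_history)
    spread = stdev(same_weekday_history)
    return abs(todays_rows - center) <= tolerance_sigmas * spread


# Hypothetical history: the last five Mondays, averaging 4.2 million rows.
mondays = [4_150_000, 4_230_000, 4_190_000, 4_260_000, 4_170_000]

print(volume_in_band(4_210_000, mondays))  # True: an ordinary Monday
print(volume_in_band(600_000, mondays))    # False: the silent-failure case
print(is_fresh(datetime(2026, 3, 2, 7, 30), datetime(2026, 3, 2, 9, 0), timedelta(hours=2)))  # True
```

A fixed-sigma band is the simplest possible detector; seasonality, holidays, and growth trends are the reasons teams eventually move to something more adaptive.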

dbt tests vs data observability: when each is the right tool

The most common mistake is treating dbt tests and an observability vendor as substitutes. They solve different problems and a serious data platform runs both.

dbt tests are write-time assertions. They run inside the transformation, against the rows the model just produced, and they fail the build if an invariant breaks. The built-in suite covers the boring foundational checks: unique on primary keys, not_null on the columns that have to be populated, accepted_values on bounded enumerations (status codes, country codes), relationships for referential integrity between models. The dbt-utils and dbt-expectations packages extend the surface to row-count comparisons, equal row counts between two relations, recency, mutually exclusive ranges, and a long tail of statistical checks. Singular tests (a raw SQL file under tests/) cover the team-specific invariants the generic tests cannot articulate: revenue equals sum of line items, no orders with no customer, the SCD-2 end-date column is non-null only on closed records.

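-- dbt singular test: revenue reconciliation invariant
-- tests/revenue_reconciles_to_line_items.sql
-- The test passes when this query returns zero rows.
WITH order_revenue AS (
    SELECT order_id, total_revenue
    FROM {{ ref('fct_orders') }}
),
line_revenue AS (
    -- One row per order with the summed line-item revenue.
    SELECT order_id, SUM(line_revenue) AS summed_lines
    FROM {{ ref('fct_order_lines') }}
    GROUP BY order_id
)
SELECT o.order_id, o.total_revenue, l.summed_lines
FROM order_revenue o
-- LEFT JOIN so an order with no line items is compared against zero
-- instead of silently dropping out of the check.
LEFT JOIN line_revenue l USING (order_id)
WHERE ABS(o.total_revenue - COALESCE(l.summed_lines, 0)) > 0.01
-- No trailing semicolon: dbt wraps this query when it runs the test.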
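
For readers who think in code rather than YAML, the four built-in generic tests named above can be read as simple predicates over rows. The Python below is a sketch of that reading, not dbt's implementation; each function returns the failing values or rows, so an empty list means the test passes. NULL handling is an assumption made here to mirror SQL semantics (NULLs are left to not_null), and the sample rows are hypothetical.

```python
from collections import Counter


def unique(rows, column):
    """Non-null values that appear more than once."""
    counts = Counter(row[column] for row in rows if row[column] is not None)
    return [value for value, n in counts.items() if n > 1]


def not_null(rows, column):
    """Rows where the column is missing."""
    return [row for row in rows if row.get(column) is None]


def accepted_values(rows, column, allowed):
    """Rows whose non-null value falls outside the bounded enumeration."""
    return [row for row in rows if row[column] is not None and row[column] not in allowed]


def relationships(child_rows, column, parent_rows, parent_column):
    """Child rows whose non-null foreign key has no matching parent key."""
    parent_keys = {row[parent_column] for row in parent_rows}
    return [row for row in child_rows
            if row[column] is not None and row[column] not in parent_keys]


customers = [{"customer_id": 10}, {"customer_id": 11}]
orders = [
    {"order_id": 1, "status": "placed", "customer_id": 10},
    {"order_id": 2, "status": "shipped", "customer_id": 11},
    {"order_id": 2, "status": "teleported", "customer_id": 99},  # fails three tests
]

print(unique(orders, "order_id"))                                                 # [2]
print(len(accepted_values(orders, "status", {"placed", "shipped", "returned"})))  # 1
print(len(relationships(orders, "customer_id", customers, "customer_id")))        # 1
```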

Observability is runtime monitoring. Monte Carlo, Bigeye, Anomalo, and Soda watch tables you have not thought to write tests against. They learn the freshness rhythm of every table in the warehouse, the volume profile by day-of-week, the distribution of every column above some cardinality threshold, and they alert when today's reality drifts from yesterday's baseline. That catches the failure mode dbt tests structurally cannot: the column whose default behavior changed upstream and started landing mostly NULL, the categorical column that suddenly has a new value the warehouse model has never seen, the join key that silently lost referential integrity because a source-system migration paused the foreign-key constraint for one weekend.
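
Two of the failure modes above, a column that starts landing mostly NULL and a categorical column that grows an unseen value, can be sketched as a comparison against a baseline. This is a deliberately naive illustration; commercial tools learn their baselines over longer windows, and the function names, the ten-point threshold, and the sample values are assumptions.

```python
def null_rate(values):
    """Fraction of values that are missing."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def drift_findings(baseline, today, max_null_rate_increase=0.10):
    """Compare today's column values against a baseline sample."""
    findings = []
    increase = null_rate(today) - null_rate(baseline)
    if increase > max_null_rate_increase:
        findings.append(f"null rate rose by {increase:.0%}")
    unseen = {v for v in today if v is not None} - {v for v in baseline if v is not None}
    if unseen:
        findings.append(f"unseen categories: {sorted(unseen)}")
    return findings


# Hypothetical plan_tier column: yesterday's sample versus today's load.
baseline = ["free", "pro", "pro", "free", "team", None, "pro", "free", "pro", "team"]
today = ["free", None, None, None, "pro", None, "enterprise", None, None, "free"]

findings = drift_findings(baseline, today)
for finding in findings:
    print(finding)
```

Neither check needs anyone to have predicted the failure in advance, which is the property that separates monitoring from assertions.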

The senior decision matrix. dbt tests are mandatory; they are cheap to write, run inside CI, and document the invariants the model holds. Run them on Bronze (ingest correctness), Silver (conformance and referential integrity), and Gold (the business-meaningful checks). An observability vendor earns its keep past series-B, past a few thousand tables, when the cost of an undetected silent failure exceeds the platform fee. Below that scale, a dozen cron-scheduled freshness and volume queries plus dbt's freshness check on sources is usually enough.

Lineage as the operational backbone: OpenLineage, DataHub, Atlan, Marquez

Lineage is the question 'if I change this column, what breaks downstream?' answered in minutes instead of an afternoon of grepping the dbt manifest and pinging analysts on Slack. The dbt DAG covers warehouse-internal lineage; it stops at the source ingest and at the warehouse boundary. A serious lineage layer extends the graph upstream into application databases and Kafka topics, and downstream into BI dashboards, reverse-ETL syncs, and feature stores.

OpenLineage (openlineage.io) is the standard the 2026 ecosystem is consolidating on. It is an open spec for emitting lineage events from any pipeline runtime (Airflow, Spark, dbt, Flink) into any backend (Marquez, DataHub, Atlan, Monte Carlo). The value of an open spec is that swapping the metadata backend does not mean rewriting every integration. Pick OpenLineage emission at every pipeline boundary you can, even if the current backend is one vendor; the optionality is worth the small wiring cost.
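
To make 'emitting a lineage event' concrete, the sketch below builds a run event as plain JSON using only the Python standard library. The top-level field names follow the OpenLineage run-event shape as I understand it (eventType, eventTime, run, job, inputs, outputs, producer, schemaURL); the namespaces, job name, producer URL, and the spec version in schemaURL are placeholders, so check the current spec before relying on them. In practice the official client libraries or the Airflow, Spark, and dbt integrations would emit this for you.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_namespace, job_name, inputs, outputs, run_id=None):
    """Assemble a lineage run event as a plain dict; inputs and outputs are (namespace, name) pairs."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        "producer": "https://example.com/hypothetical-pipeline",  # placeholder
        # Placeholder version; use the schema URL for the spec release you target.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event(
    "COMPLETE",
    job_namespace="warehouse",
    job_name="dbt.fct_orders",
    inputs=[("postgres://app-db", "public.orders")],
    outputs=[("snowflake://analytics", "marts.fct_orders")],
)
payload = json.dumps(event, indent=2)
print(payload)
```

Because the payload is backend-neutral, the same event can be posted to whichever metadata service is in use, which is the optionality argument made above.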

The build-vs-buy decision. Three credible open-source backends ship column-level lineage in 2026: DataHub (LinkedIn's open-sourced metadata platform; see LinkedIn Engineering's architecture writeup), OpenMetadata, and Marquez (the OpenLineage reference backend). Three commercial options dominate: Atlan, Monte Carlo's catalog product, and Acryl (commercial DataHub). The honest senior calculus: at a one-warehouse, one-BI-tool shop, dbt docs plus dbt's exposures block is the right answer; the lineage need is small and the operational cost of running a metadata service is real. At twenty-plus source systems, multiple BI tools, ML feature stores, and reverse-ETL pushing data back to Salesforce and HubSpot, the dbt DAG is a fraction of the picture and a dedicated lineage tool earns its operations cost. Atlan tends to win on the producer/consumer UX; DataHub wins when the platform team wants to extend the metadata model.

Column-level beats table-level. A common 2026 trap is buying a lineage product that only tracks table-to-table edges. The interesting question is column-level: 'we are about to drop the legacy_user_email column on the user source; which dbt models, which dashboards, which feature-store features, which reverse-ETL syncs consume it?' Table-level lineage answers 'something in these forty tables maybe'; column-level lineage answers 'these eleven dbt models and these three dashboards, here are the owners'. The latter is the version that saves the migration RFC.
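
Under the hood, the impact-analysis question is a graph traversal over column-to-column edges. The sketch below shows the idea with a breadth-first search; the edge map is a hypothetical, hand-written stand-in for what a lineage backend would return, and the column names other than legacy_user_email are invented for illustration.

```python
from collections import deque

# Hypothetical column-level edges: upstream column -> direct downstream consumers.
COLUMN_EDGES = {
    "src.users.legacy_user_email": ["stg_users.email"],
    "stg_users.email": ["dim_customers.email", "feature_store.email_domain"],
    "dim_customers.email": ["dashboard.retention.email_cohort", "reverse_etl.hubspot.email"],
}


def downstream_of(column, edges):
    """Every column, dashboard field, or sync that transitively consumes `column`."""
    seen, queue = set(), deque([column])
    while queue:
        current = queue.popleft()
        for consumer in edges.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)


impacted = downstream_of("src.users.legacy_user_email", COLUMN_EDGES)
for name in impacted:
    print(name)
```

With table-level edges the same traversal would return every column of every downstream table, which is why the column-level graph gives the shorter, more useful answer.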

Incident response: on-call rotation, severity tiers, and 5 Whys for data

Mature data platforms run an on-call rotation against a paging tool (PagerDuty, Opsgenie) with the same operational discipline as the application platform. The senior DE bar at any data-platform-centric shop in 2026 is articulating what gets paged, what waits for business hours, and how the blameless postmortem closes the loop.

Severity tiers, mapped to who gets paged.

  • Sev0: revenue-affecting dashboards are wrong, or a regulatory report is wrong; the CEO, the CFO, or a regulator is the next person to notice. Page the primary on-call immediately.
  • Sev1: an ML model retrained on bad data, or a production feature store is serving stale or wrong features. Page during business hours, escalate after thirty minutes.
  • Sev2: a freshness SLA on a Gold table is breached beyond four hours, or a Silver-layer schema test failed but the Gold consumers are still on yesterday's snapshot. Slack alert, review the same business day.
  • Sev3: distribution drift, soft volume anomalies, an experimental dbt model that nobody consumes. Weekly review, no page.
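
The tiers above translate naturally into a routing function that an alerting layer could call. This is a sketch of one possible encoding, not a PagerDuty or Opsgenie integration; the Incident fields and the classify function are hypothetical names chosen to mirror the four tiers.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    revenue_or_regulatory_wrong: bool = False
    ml_or_feature_store_affected: bool = False
    gold_freshness_breach_hours: float = 0.0
    silver_schema_test_failed: bool = False


def classify(incident):
    """Return (severity, action), checking the most severe condition first."""
    if incident.revenue_or_regulatory_wrong:
        return "Sev0", "page primary on-call immediately"
    if incident.ml_or_feature_store_affected:
        return "Sev1", "page during business hours, escalate after 30 minutes"
    if incident.gold_freshness_breach_hours > 4 or incident.silver_schema_test_failed:
        return "Sev2", "Slack alert, review same business day"
    return "Sev3", "weekly review, no page"


print(classify(Incident(revenue_or_regulatory_wrong=True)))
print(classify(Incident(gold_freshness_breach_hours=6)))
print(classify(Incident()))
```

Checking conditions in descending severity means an incident that matches several tiers is always routed at the highest one.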

Blameless postmortems and the 5 Whys. The Google SRE Book's postmortem template ports almost without edit to data incidents. The shape: what happened, when did each step happen, who was affected, what was the root cause, what is the corrective action, what is the preventive action. The 5 Whys discipline lands the conversation on a real cause instead of a symptom.

Worked example: a Sev1 fires because the customer SCD-2 effective_date column landed wrong for the previous Tuesday's batch.

  • Why did it land wrong? Because the transformation read the source's updated_at as UTC when the source had just migrated to America/New_York.
  • Why did the transformation read it as UTC? Because the column type in the source schema did not change; only the timezone of the producing service changed.
  • Why did we not catch the timezone change? Because we had no schema contract with the upstream producer covering semantic changes, only structural ones.
  • Why did we have no semantic contract? Because the data platform owned the SCD-2 logic but did not own the producer's deploy review.
  • Why did the producer's deploy review not include us? Because there was no documented downstream-consumer notification in the release checklist.

Corrective: backfill the affected period. Preventive: add the data platform as a required reviewer on any schema change to the customer service and capture the semantic contract in Protobuf field documentation.

Alert fatigue is the failure mode that kills the discipline. The fastest way to ruin a quality program is to page on warnings. If a distribution anomaly with low precision pages the on-call twice a week, the third real incident gets clicked through and ignored. The honest discipline: every alert that pages must have an action; every alert that does not have an action is a Slack message at most. Review the page-rate weekly and ratchet down noisy detectors until the on-call trusts the pager again.
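
The weekly page-rate review can be made mechanical by tracking, per detector, how many pages led to an action. The sketch below assumes the on-call tags each page as actionable or not; the detector names and the 0.8 precision floor are hypothetical and would be calibrated by the team.

```python
def noisy_detectors(pages, min_precision=0.8):
    """pages: (detector_name, led_to_action) tuples for the review window.
    Returns detectors whose share of actionable pages is below the floor."""
    totals, actionable = {}, {}
    for detector, led_to_action in pages:
        totals[detector] = totals.get(detector, 0) + 1
        actionable[detector] = actionable.get(detector, 0) + int(led_to_action)
    return sorted(d for d in totals if actionable[d] / totals[d] < min_precision)


# Hypothetical week of pages.
pages = [
    ("freshness_revenue_mart", True),
    ("freshness_revenue_mart", True),
    ("distribution_order_value", False),
    ("distribution_order_value", False),
    ("distribution_order_value", True),
]
to_demote = noisy_detectors(pages)
print(to_demote)  # ['distribution_order_value']
```

Detectors on the returned list are candidates for demotion from a page to a Slack message until their thresholds are retuned.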

Data contracts and distribution drift: the senior FAANG/fintech bar

The 2026 senior-DE interview signal splits into two questions: do you write distribution checks on the columns that feed models, and do you negotiate schema contracts with the upstream producers your warehouse depends on. Both are downstream of the same insight: most silent data failures originate outside the warehouse, and shifting the validation upstream is the only durable fix.

Distribution drift checks on every model-feeding fact table. For continuous columns, the two-sample Kolmogorov-Smirnov test (scipy.stats.ks_2samp in SciPy; the scikit-learn documentation covers the surrounding statistical workflow) gives a principled answer to 'is today's distribution of order_value the same as last week's'. For categorical features (country code, plan tier, device family), the Population Stability Index is the industry default; financial-services teams have used PSI for credit-risk model monitoring for two decades, and the literature is widely available, including on arXiv. The senior practice: pick a small set of load-bearing columns (the ones whose drift would retrain a model or skew an exec dashboard), compute the drift statistic daily, and alert on a threshold the team calibrates against historical noise. Do not boil the ocean and run drift checks on every column in every table; the precision drops and the on-call ignores the page.
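
Both statistics are small enough to sketch with the standard library. The PSI function below uses the usual sum over categories of (actual share minus expected share) times the log of their ratio, with a small floor to avoid dividing by zero for unseen categories; the floor value is an assumption. The KS function computes only the test statistic (the largest gap between the two empirical CDFs), not a p-value; for real use, scipy.stats.ks_2samp provides both. The sample data is hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter


def psi(expected, actual, floor=1e-6):
    """Population Stability Index between two samples of a categorical column."""
    exp_counts, act_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in set(exp_counts) | set(act_counts):
        e = max(exp_counts[category] / len(expected), floor)
        a = max(act_counts[category] / len(actual), floor)
        total += (a - e) * math.log(a / e)
    return total


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical country_code samples: last week versus today.
last_week = ["US"] * 70 + ["CA"] * 20 + ["MX"] * 10
shifted = ["US"] * 40 + ["CA"] * 20 + ["MX"] * 40

print(psi(last_week, last_week))            # 0.0, identical distributions
print(round(psi(last_week, shifted), 3))    # clearly above zero
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # 1.0, fully separated samples
```

The alerting threshold for either statistic is something to calibrate against the column's own historical noise rather than copy from a rule of thumb.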

Schema contracts as APIs. Chad Sanderson's 'data contracts' movement (his Substack at dataproducts.substack.com is the canonical writing) reframes the warehouse-upstream relationship as a producer-consumer API contract, not a wishful 'we will reverse-engineer your schema from your application database'. The operational shape: producers declare a schema in Protobuf or Avro, register it in a schema registry (Confluent Schema Registry, AWS Glue Schema Registry, Buf Schema Registry), and any breaking change (field removal, type narrowing, semantic shift) requires a contract version bump and consumer signoff. The ingest layer at the warehouse validates inbound payloads against the registered schema and routes the violations to a dead-letter topic instead of corrupting Bronze. The Convoy and GoCardless engineering blogs have the most useful public case studies on rolling contracts out without breaking the producer-side teams.
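
The validate-and-route step at the ingest layer can be sketched without any registry client. The example below stands in a hard-coded dict for the registered schema and two Python lists for the Bronze sink and the dead-letter topic; all of these, along with the field names, are hypothetical. A real deployment would fetch the schema from the registry and rely on Protobuf or Avro deserialization for type checks.

```python
# Hypothetical registered contract for an orders event: field name -> expected type.
REGISTERED_SCHEMA = {
    "order_id": int,
    "customer_id": int,
    "total_revenue": float,
    "currency": str,
}


def violations(payload, schema):
    """List every way the payload departs from the registered schema."""
    problems = []
    for field, expected_type in schema.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}: {type(payload[field]).__name__}")
    return problems


def route(payloads, schema):
    """Split inbound payloads into clean rows for Bronze and violations for dead-letter."""
    bronze, dead_letter = [], []
    for payload in payloads:
        problems = violations(payload, schema)
        if problems:
            dead_letter.append({"payload": payload, "problems": problems})
        else:
            bronze.append(payload)
    return bronze, dead_letter


inbound = [
    {"order_id": 1, "customer_id": 10, "total_revenue": 42.5, "currency": "USD"},
    {"order_id": 2, "customer_id": 11, "total_revenue": "42.50", "currency": "USD"},
    {"order_id": 3, "total_revenue": 9.99, "currency": "EUR"},
]
bronze, dead_letter = route(inbound, REGISTERED_SCHEMA)
print(len(bronze), len(dead_letter))  # 1 2
```

Keeping the rejected payload alongside its list of problems is what makes the dead-letter topic useful for the conversation with the producing team.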

When contracts are over-engineered and when they are the right answer. Pre-series-B with one source system, a weekly chat with the application team covers the same ground a contract would. Series-B and beyond, with five or more source services and a contract-free ingest layer, the data team is structurally exposed to every silent deploy. The Stack Overflow Developer Survey 2024 (survey.stackoverflow.co/2024) documents how widely developers have adopted AI tools, which is one reason to expect upstream changes to arrive faster; contracts become non-optional once the volume crosses the threshold where the data team cannot manually triage every upstream change. The FAANG and fintech senior bar in 2026: every load-bearing producer-consumer interface has a versioned schema, a freshness SLA, and a named owner on both sides; everything else gets the lighter dbt-tests-plus-observability treatment.

Frequently asked questions

Do I need both dbt tests and a data observability tool?
Yes, at meaningful scale. They solve different problems. dbt tests catch the assertions you can articulate (uniqueness, referential integrity, accepted values). Observability tools catch the failures you cannot anticipate: a 30% volume drop you never thought to test, a schema change in an upstream Salesforce field, a distribution shift after a marketing experiment. At pre-series-B, dbt tests plus a few cron-based freshness checks are usually enough; past series-B with thousands of tables, an observability tool earns its keep.
Where should data quality tests live in the pipeline?
On every layer that has a contract with downstream consumers. Bronze should test for ingest completeness and source-system fidelity. Silver should test for deduplication, conformance, and referential integrity across sources. Gold should test for the business-meaningful invariants: revenue equals sum of line items, no negative quantities, every order has a valid customer. Catching a violation in Bronze is far cheaper than catching it in a CEO's dashboard.
How do I get application engineers to care about data contracts?
Three things shift the conversation. (1) Make the cost visible: track data incidents caused by upstream changes and report them in the same forum as application incidents. (2) Make contracts low-friction: a YAML file in their repo with a CI check is tolerable; a 30-minute meeting per schema change is not. (3) Tie it to outcomes they care about: ML model regressions, exec dashboards going dark, regulatory reports needing rework. Contract adoption is a culture change, and the data team has to do the political work to make it stick.
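
The low-friction CI check in point (2) can be as small as a diff between two versions of the contract file. The sketch below assumes the YAML has already been parsed into dicts with a version number and a field-to-type map; that layout, the function names, and the rule that any type change counts as breaking are all assumptions made for illustration.

```python
def breaking_changes(old_fields, new_fields):
    """Field removals and type changes between two contract versions."""
    problems = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            problems.append(f"removed field: {name}")
        elif new_fields[name] != old_type:
            problems.append(f"type changed for {name}: {old_type} -> {new_fields[name]}")
    return problems


def ci_check(old_contract, new_contract):
    """Return the problems that should fail CI: breaking changes without a version bump."""
    problems = breaking_changes(old_contract["fields"], new_contract["fields"])
    bumped = new_contract["version"] > old_contract["version"]
    return [] if bumped else problems


# Hypothetical contract versions for a customer event.
old = {"version": 3, "fields": {"customer_id": "int64", "email": "string", "updated_at": "timestamp"}}
new = {"version": 3, "fields": {"customer_id": "int64", "updated_at": "string", "plan": "string"}}

failures = ci_check(old, new)
for failure in failures:
    print(failure)
```

Adding a field passes silently, while removing or retyping one fails the build until the producer bumps the version, which is the point at which consumer signoff would be requested.
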
Is Great Expectations still relevant in 2026?
Yes, but narrower than its 2020 peak. dbt tests absorbed most in-warehouse assertion use cases; observability tools absorbed monitoring. Great Expectations remains the right pick for batch validation outside the warehouse: validating vendor CSVs before load, asserting Pandas DataFrames in ML pipelines, certifying data products across team boundaries. It is a library, not a platform, and it shines when expectation suites must travel with the data.
What is the right alerting threshold for data quality issues?
At least two tiers, mapped to who gets paged. Errors block the build and page in business hours: PK violations, freshness SLA breaches on revenue-critical tables, schema incompatibilities. Warnings go to Slack and get reviewed weekly: distribution drift, soft volume anomalies, optional foreign-key warns. The fastest way to destroy a quality program is to page on warnings; alert fatigue makes the team ignore real signals.
How do lineage tools compare to just reading the dbt DAG?
The dbt DAG covers the warehouse but stops at the source and the dashboard. A real lineage tool extends upstream (application databases, Kafka topics, SaaS sources) and downstream (BI tools, reverse-ETL syncs, ML feature stores). Small team with one warehouse: dbt docs is enough. 20+ source systems with multiple BI tools and reverse-ETL: a dedicated lineage tool is where end-to-end impact analysis lives.
What's the Pinterest Data Quality Canvas?
A framework Pinterest's data engineering team published for structuring quality work across producers, consumers, and platform teams. It maps quality dimensions (completeness, accuracy, consistency, timeliness, validity, uniqueness) against responsibility owners and surfaces the gaps where nobody owns the failure mode. It is the canonical template for treating quality as a shared discipline.

About the author. Blake Crosley founded ResumeGeni and writes about data engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.