Data Scientist / ML Engineer Hub

SQL and Data Modeling for Data Scientists (2026)

In short

SQL is the dividing line between mid and senior product DS. The junior bar is single-table SELECT plus GROUP BY plus INNER vs LEFT JOIN under stress. The senior bar is window functions written from instinct (ROW_NUMBER, SUM OVER PARTITION BY, LAG, LEAD, PERCENTILE_CONT), cohort retention written without looking it up, a coherent opinion on dimensional modeling vs the Activity Schema, query-plan literacy on at least one of BigQuery, Snowflake, or Databricks, and dbt fluency at the staging-intermediate-marts level. A senior DS who falls back to self-joins where a window function is correct, or who writes SELECT star in a production pipeline, gets read as mid-level no matter the years on the resume.

Key takeaways

  • Window functions are the senior tell. ROW_NUMBER OVER (PARTITION BY user_id ORDER BY event_time) for deduplication, SUM OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) for running totals, LAG and LEAD for time-deltas, PERCENTILE_CONT(0.5) for medians without GROUP BY gymnastics. Stack Overflow's 2024 Developer Survey (survey.stackoverflow.co) shows SQL at the top of the most-used languages among data professionals.
  • Cohort retention is a recurring interview prompt. The senior answer uses a single CTE for cohort assignment, a JOIN against the activity table, and a window function or aggregation to compute the rate; not five nested self-joins. The bar is recognizing that retention is a denominator question (week-0 active is the denominator), and that the trigger event (signup vs first session vs first purchase) must be explicit before any SQL gets written.
  • Dimensional modeling (Kimball star schema) is still the production default in 2026, with the Activity Schema (Narrator) and Data Vault as named alternatives. The senior conversation: when do facts and dimensions win, when does the Activity Schema's single-table-of-events shape pay off, and why slowly-changing dimensions (SCD Type 2 specifically) keep coming back even when teams swear they will never need history.
  • Query-engine literacy is the differentiator at staff+. BigQuery partition pruning and clustering keys, Snowflake micro-partitions and zone maps, Databricks SQL Photon and Z-ordering, Redshift sort keys and dist keys; the surface looks like ANSI SQL but the optimization layer differs. The senior debugging instinct on a slow query: read the plan first; do not start adding indexes or rewriting clauses on instinct.
  • dbt has consolidated as the dominant SQL transformation framework. The staging-intermediate-marts layering, ref() over hardcoded table names, tests on uniqueness and not-null and accepted-values, snapshots for SCD Type 2; these are the things a senior analytics engineer recognizes on sight. dbt fluency separates mid from senior at any analytics-DS-shape shop with a modern warehouse.

The window-function bar separates mid from senior

The strongest single signal in a senior product-DS SQL interview is window functions written without hesitation. A mid-level candidate, asked to deduplicate to the most recent event per user, reaches for a correlated subquery or a GROUP BY with MAX, then a self-join back to retrieve the row. A senior candidate writes:

SELECT * FROM (
  SELECT user_id, event_name, event_time,
         ROW_NUMBER() OVER (
           PARTITION BY user_id
           ORDER BY event_time DESC
         ) AS rn
  FROM events
) WHERE rn = 1;

The difference is not the result. The mid solution works. The difference is what the interviewer reads off the screen in 90 seconds: does this person think in window-function primitives, or did they learn SQL from one tutorial in 2018 and stop?

The senior surface to know cold:

  • ROW_NUMBER, RANK, DENSE_RANK. ROW_NUMBER for stable picking; RANK for tie-aware ranking; DENSE_RANK when the gap-in-ranks behavior of RANK is wrong. The interview tell: a candidate who reaches for ROW_NUMBER on a problem that needs RANK is treating window functions as one mechanical pattern instead of a family.
  • SUM and AVG with frame clauses. Running totals with ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Rolling 7-day averages with RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW. The frame clause is the part most candidates skip; missing the frame is the senior debugging story.
  • LAG and LEAD. Time-delta between consecutive events per user. Sessionization on a gap threshold (when LAG over time is greater than 30 minutes, start a new session). The interview tell: candidates who self-join events to themselves on a row-offset get marked as not having internalized LAG.
  • PERCENTILE_CONT and PERCENTILE_DISC. Median and quantile computation without GROUP BY gymnastics. PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) for continuous interpolation; PERCENTILE_DISC(0.5) for the discrete-value version. Most warehouses ship both; knowing which one to use on integer-valued metrics is a small but real signal.
  • FIRST_VALUE and LAST_VALUE with frame clauses. First-touch attribution. Last-touch attribution. The frame-clause requirement on LAST_VALUE (the default frame stops at the current row, which is rarely what you want) is the canonical SQL gotcha and a frequent interview probe.

The reason this surface is load-bearing: it is what the senior DS reaches for daily. Cohort retention, sessionization, funnel definitions, time-on-task, last-purchase-before-churn; all of them are window-function problems. A senior who falls back to self-joins ships pipelines that are slower, harder to read, and easier to break.

A worked example: 7-day retention by signup cohort

One of the most common senior product-DS interview prompts. Setup: event_logs(user_id, event_name, ts). Question: compute 7-day retention rate by signup cohort week.

Before any SQL gets written, the senior asks two questions the mid does not. First: what is the cohort trigger? First session? Signup event? First purchase? The wrong choice silently shifts the numerator and denominator and produces a defensible-looking number that is not the metric the stakeholder wants. Second: what is the retention denominator? Anyone in the cohort, or only users active in week 0? The first definition is harsher; the second is the convention at most analytics-DS shops, because it controls for users who signed up and never came back at all.

The senior answer, assuming "first session is the cohort trigger" and "denominator is users active in week 0":

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('week', MIN(ts)) AS cohort_week
  FROM event_logs
  GROUP BY user_id
),
week_zero_active AS (
  SELECT DISTINCT c.user_id, c.cohort_week
  FROM cohorts c
  JOIN event_logs e ON e.user_id = c.user_id
  WHERE e.ts >= c.cohort_week
    AND e.ts < c.cohort_week + INTERVAL '7 days'
),
week_one_active AS (
  SELECT DISTINCT c.user_id, c.cohort_week
  FROM cohorts c
  JOIN event_logs e ON e.user_id = c.user_id
  WHERE e.ts >= c.cohort_week + INTERVAL '7 days'
    AND e.ts < c.cohort_week + INTERVAL '14 days'
)
SELECT w0.cohort_week,
       COUNT(DISTINCT w0.user_id) AS week_0_active,
       COUNT(DISTINCT w1.user_id) AS week_1_retained,
       1.0 * COUNT(DISTINCT w1.user_id)
             / NULLIF(COUNT(DISTINCT w0.user_id), 0) AS retention_rate
FROM week_zero_active w0
LEFT JOIN week_one_active w1
  ON w1.user_id = w0.user_id
 AND w1.cohort_week = w0.cohort_week
GROUP BY w0.cohort_week
ORDER BY w0.cohort_week;

Three things the interviewer is grading. First: the NULLIF in the denominator. A senior never divides without guarding against zero; getting back NULL when a cohort week has no week-0 actives is correct behavior. Second: the LEFT JOIN, not INNER JOIN, on week_one. Inner-joining drops cohort weeks where nobody retained, which silently hides the worst weeks. Third: the explicit cohort_week match in the join condition. Skipping it lets a week-N user count as week-1-retained for a different cohort and inflates the rate.

The window-function variant, which several interviewers prefer for the senior round:

WITH user_events AS (
  SELECT user_id, ts,
         MIN(ts) OVER (PARTITION BY user_id) AS first_ts
  FROM event_logs
)
SELECT DATE_TRUNC('week', first_ts) AS cohort_week,
       COUNT(DISTINCT user_id) AS week_0_active,
       COUNT(DISTINCT CASE
         WHEN ts >= first_ts + INTERVAL '7 days'
          AND ts < first_ts + INTERVAL '14 days'
         THEN user_id END) AS week_1_retained
FROM user_events
WHERE ts >= first_ts
  AND ts < first_ts + INTERVAL '14 days'
GROUP BY 1
ORDER BY 1;

This variant uses MIN OVER PARTITION BY to attach the first-event timestamp to every row, eliminating one of the CTEs and one of the joins. It is faster on most query engines because the optimizer can plan a single scan with one window pass instead of three CTEs materialized in sequence. The trade-off is readability; the conditional COUNT(DISTINCT CASE WHEN) is denser than the LEFT JOIN form. Senior candidates who can write both and articulate the trade-off get marked strong-hire; candidates who can only write one and treat the other as exotic get marked hire-but-not-senior.

Dimensional modeling is still the default, but know the alternatives

The dimensional-modeling conversation at the senior bar covers three named approaches and one honest answer about which one is winning in 2026.

  • Kimball star schema. Fact tables in the middle (one row per event or transaction), dimension tables on the spokes (one row per entity, conformed across facts). The dominant production pattern at every analytics-DS shop the author has seen. Ralph Kimball's "The Data Warehouse Toolkit" (kimballgroup.com) defined the playbook in the 1990s; it has aged better than any peer methodology. The reason it persists: BI tools and warehouse query engines are both optimized for the star shape. Join a single big fact to a handful of small dimensions, push filters to dimensions, aggregate the fact; the planner handles it well.
  • Inmon (Corporate Information Factory). Normalized 3NF in the core warehouse, denormalized data marts downstream. The right call when source-of-truth integrity matters more than query speed and when the analytics layer can absorb the join cost. In modern cloud warehouses with columnar storage and cheap compute, the Inmon case is weaker than it was in the on-prem era; teams that adopt it in 2026 usually have a strong governance reason, not a performance reason.
  • Data Vault 2.0. Hubs (business keys), links (relationships), satellites (descriptive attributes with history). Engineered for source-system change resilience and audit-grade history. The right call at regulated-data shops (financial services, healthcare) where the audit trail is part of the product. The wrong call at consumer-product shops where the engineering cost dwarfs the analyst benefit.
  • Activity Schema (Narrator). A single wide events table with a small number of standard columns (customer, activity, timestamp, feature_json, revenue_impact). One table, every analysis. The intellectual case is real; the empirical case in 2026 is mixed. Shops that adopted it report fast iteration for analysts who know it well and a steep onboarding cost for analysts trained on the star schema. As of 2026, the Activity Schema is a credible alternative on small teams; it has not displaced the Kimball default at most public-facing analytics shops.

The honest answer at the senior bar: dimensional modeling is the production default; the Activity Schema is the intellectually interesting alternative; Data Vault is the right call when audit history is a product requirement; Inmon is rare in 2026 outside legacy estates. A senior who can name all four, name when each wins, and not pretend the answer is one-size-fits-all gets marked strong-hire on the modeling round.

One thing that does not go away: slowly-changing dimensions. SCD Type 1 (overwrite, no history). SCD Type 2 (new row per change, with valid-from and valid-to). SCD Type 4 (history in a separate table). SCD Type 6 (a hybrid that keeps current value on the dim row plus a Type-2 history table). Junior candidates treat SCD as a textbook concept. Senior candidates have lived through the meeting where the team swore they would never need history, then six months later were paying engineers to rebuild it. The senior tell: knowing that the cheapest moment to put SCD Type 2 on a dimension is the moment you create the dimension.

Modern warehouse query engines: the SQL surface is similar, the optimizer differs

BigQuery, Snowflake, Databricks SQL, and Redshift all speak ANSI-ish SQL. The senior conversation is not "do you know the syntax" (a competent candidate can read any of them). The senior conversation is "do you understand the optimizer you are writing for."

  • BigQuery. Slot-based serverless execution. Partition pruning is the first thing the query plan does; a query that fails to filter on the partition column scans the full table and bills the full table. Clustering on high-cardinality columns (user_id, customer_id) lets the engine prune block ranges within a partition. The canonical senior bug: a query that uses WHERE _PARTITIONTIME instead of WHERE _PARTITIONDATE on a date-partitioned table and quietly scans every partition because the predicate type mismatched.
  • Snowflake. Micro-partition architecture with automatic clustering. Zone maps on each micro-partition store min and max for every column, and the query planner uses them to skip partitions whose ranges fall outside the filter. The senior tell: knowing that ORDER BY data load (or an explicit clustering key) materially improves zone-map pruning for high-cardinality filters; queries on randomly-loaded data scan more partitions than queries on naturally-ordered data, even with the same filter.
  • Databricks SQL (Photon). Vectorized C++ execution engine over Delta Lake. Z-ordering on multiple columns (a space-filling-curve sort) provides multi-dimensional locality for filters on any of the Z-ordered columns. The senior tell: understanding when to Z-order, when to partition, and when to do both; Z-ordering on a low-cardinality column wastes the sort, partitioning on a high-cardinality column produces too many small files.
  • Redshift. The veteran. Sort keys (compound or interleaved) determine block ordering; distribution keys determine how data is sharded across nodes. The senior tell: knowing that a broadcast join (small table replicated to every node) is cheaper than a redistributed merge join for most dim-on-fact patterns, and being able to read the EXPLAIN output to confirm the planner picked the right one.

The cross-engine senior instinct: when a query is slow, read the plan first. EXPLAIN ANALYZE on Postgres-family engines, query history with execution graph on Snowflake and BigQuery, Spark UI on Databricks. The wrong first move is to start rewriting clauses or adding indexes on instinct. The right first move is to find out what the engine actually did. Senior candidates open the plan; mid candidates rewrite the query and hope.

One pattern worth naming because it surfaces in every senior interview: the broadcast vs shuffle vs sort-merge join trade-off. A broadcast join replicates the small side to every executor; cheap when the small side fits in memory, catastrophic when it does not (out-of-memory failures, spill to disk). A shuffle hash join repartitions both sides on the join key; cheap when the join keys distribute evenly, catastrophic when one key holds most of the rows (skew). A sort-merge join sorts both sides on the join key; consistent performance, higher constant cost. The senior who can name when the planner picked the wrong one, and how to override the choice with a hint or a rewrite, has lived in the plans.

dbt is the dominant transformation framework in 2026

dbt (data build tool) has consolidated as the standard SQL transformation framework at analytics-DS shops with a modern cloud warehouse. The pattern is familiar enough that a candidate who has not seen it reads as out-of-date at most companies hiring in 2026.

  • The staging-intermediate-marts layering. Staging models are one-to-one with source tables, renaming columns and casting types. Intermediate models do joins and transformations that multiple marts will reuse. Marts are the analyst-facing tables (fact_orders, dim_customer). The discipline that a senior recognizes on sight: a staging model that does anything other than rename and cast is a staging model written by a mid-level engineer.
  • ref() over hardcoded table names. {{ ref('stg_orders') }} instead of warehouse.staging.stg_orders. The ref() function lets dbt compute the dependency DAG, parallelize the run, and switch between dev and prod warehouses without code changes. A senior code review catches every hardcoded table reference.
  • Tests on uniqueness, not-null, accepted-values, relationships. Four built-in tests plus the dbt_utils package add roughly ten more. The senior pattern: tests on the grain of every fact and dim table (order_id is unique on fact_orders, customer_id is unique on dim_customer), accepted-values on every enum-like column, relationships from every foreign key on the fact back to the dim. A dbt project without tests at the grain is a project that has not been touched by a senior analytics engineer.
  • Snapshots for SCD Type 2. dbt snapshots track changes to a source table over time, producing a Type-2 history table with dbt_valid_from and dbt_valid_to columns. The senior tell: starting snapshots from day one on every dimension that might need history, not waiting for the meeting where the team realizes they need to rebuild it.
  • Macros for repeated SQL. A senior dbt project uses macros to capture cross-model patterns (date spine generation, slowly-changing-dim builders, conditional aggregations across many metric columns). Junior projects copy-paste; senior projects parameterize.

The senior dbt anti-pattern to avoid in the interview: presenting a project where every model is in one folder, every transformation happens in marts, and the only tests are on primary keys. The interviewer reads "this person used dbt because the company adopted it, not because they internalized the discipline."

Performance discipline and the query plan

The mid-vs-senior tell on a slow query is what the engineer does in the first five minutes. The mid rewrites clauses, adds indexes, swaps joins for IN-clauses; instinct over evidence. The senior opens the query plan.

The plan-reading instinct on each major engine:

  • Postgres-family (Redshift, Aurora, plain Postgres). EXPLAIN ANALYZE produces actual row counts plus actual timings per operator. The senior reads bottom-up, looks for the row-count divergence between estimated and actual (the planner trusting bad statistics is the most common slow-query root cause), and watches for hash-join spills to disk or sequential scans on filtered columns.
  • BigQuery. The query plan UI shows stages, slot time per stage, bytes processed, and bytes shuffled. The senior reads slot time first (where the engine actually spent compute), bytes-shuffled second (the cross-stage data movement that costs the most on large queries). A query that processes 100GB and shuffles 80GB has join-key skew; a query that processes 100GB and shuffles 5GB is well-planned.
  • Snowflake. The query profile shows operator-level execution with bytes scanned, percentage scanned from cache, and partitions scanned vs total. The senior watches partitions-scanned; a query that scans most partitions when the filter should prune to a thin slice has either a non-clustering-aligned filter or a function on the partition column that defeats pruning.
  • Databricks (Spark). The Spark UI exposes the DAG, the stage durations, the shuffle read and write sizes, and the executor-level skew distribution. The senior watches for stages where one executor takes ten times the time of the median; that is the skew signal.

The performance anti-patterns the senior catches on every code review:

  • SELECT star in a production pipeline. Forces the engine to read every column of every row, defeats column-pruning, and breaks downstream contracts when the source table schema changes. Acceptable in ad-hoc exploration; unacceptable in a dbt model or production view.
  • Missing WHERE clause on a time-partitioned table. Full-table scan; on BigQuery and Snowflake this is a billable mistake; on most warehouses it is a multi-hour query. The senior tell: every query against a partitioned table has a partition-column filter, even when it looks redundant.
  • UDFs that block predicate pushdown. A user-defined function in the WHERE clause prevents the engine from pushing the filter down to the storage layer. The query reads the full table, applies the UDF in memory, then filters. The senior rewrites the UDF into native SQL or pushes the predicate up the query.
  • LATERAL VIEW EXPLODE on unbounded arrays. Cardinality explosion. A column with arrays of length 100 turns a 1B-row table into a 100B-row intermediate. The senior tell: bounding the explode with a LIMIT or pre-filtering the arrays before the explode.
  • COUNT(DISTINCT) on huge cardinality. Exact distinct counting is expensive at scale. Approximate counting (APPROX_COUNT_DISTINCT, HyperLogLog) trades accuracy for orders-of-magnitude speedup. The senior knows when the accuracy trade is acceptable (dashboard counts, exploratory analysis) and when it is not (revenue reporting).

The thing the senior never says: "the query is slow because the warehouse is slow." The thing the senior says: "the plan shows a broadcast join that should have been sort-merge; the small side is 12GB and we are running it on default broadcast threshold of 10MB; let me add a hint."

Frequently asked questions

How important is SQL relative to Python at senior product DS?
At product / analytics DS shops (Meta, Airbnb, Netflix Studio, Stripe), SQL is more important than Python in day-to-day work and more important in interviews. The metric definition lives in SQL. The cohort analysis lives in SQL. The pre-experiment audit lives in SQL. Python is the readout layer (pandas, plotting, statsmodels for the test). A senior product DS who is strong on Python and weak on SQL fails the screen; the reverse pattern is more forgivable because Python skills transfer from any engineering background.
Do I need to memorize syntax for multiple SQL dialects?
No, but you need to be able to read any of them and know the dialect-specific gotchas of the one you write. The cross-dialect surface (window functions, CTEs, joins, GROUP BY) is portable. The dialect-specific surface (BigQuery STRUCT and UNNEST, Snowflake VARIANT and FLATTEN, Postgres-style array operators, Spark SQL pivot syntax) is what surprises engineers moving between shops. Senior candidates name the engine they have written the most against, demonstrate fluency there, and read the others.
Is the Activity Schema the future, or is Kimball still right?
Kimball is still the production default in 2026, but the Activity Schema is the most interesting recent alternative. The pragmatic answer at the senior bar: dimensional modeling for the marts that BI tools query and that PMs read on dashboards; Activity Schema or a wide events table for the rapid-iteration analyst workflow that needs minimum modeling friction. They coexist; they are not in opposition. A senior who treats this as a religious war misses the point.
How deep does the senior bar go on dbt?
Deep enough to design the project structure, not just to add a model. At staff+, the bar is opinions on how to organize models (subject-area folders vs layer folders), conventions for naming (stg, int, fct, dim), the tests that should be on every fact and dim grain, the macros worth writing vs the patterns that should stay as copy-paste, and the seam between dbt and the orchestrator (Airflow, Dagster, Prefect). dbt fluency at the model level is mid-level; dbt fluency at the project-architecture level is senior.
What is the right answer when the SQL interviewer asks 'how would you optimize this query?'
Not a rewrite. The right first move is 'I would read the plan.' Then the conversation: which operator is the bottleneck, what does the row-count divergence between estimated and actual say about the statistics, where is the shuffle coming from. The interviewer is grading whether the candidate has the plan-reading instinct or whether the candidate's optimization vocabulary is folklore. Folklore answers (add an index, rewrite as a CTE, change INNER to LEFT) without plan evidence get marked as mid-level instinct.
Are graph databases or NoSQL relevant for senior product DS?
Rarely. The dominant 2026 stack at analytics-DS shops is a columnar cloud warehouse (BigQuery, Snowflake, Databricks SQL, Redshift) plus dbt plus a BI tool. Graph databases (Neo4j, Neptune) appear in specific use cases (fraud-network analysis, recommendation graph traversal) but are not on the critical path for most senior product DS roles. NoSQL stores (DynamoDB, MongoDB) appear in the operational layer; analysts almost never query them directly. Senior product DS interview prep should optimize for warehouse SQL fluency; the graph and NoSQL surface is bonus.

Sources

  1. Kimball Group; the canonical dimensional-modeling reference. Star schema, fact and dim tables, slowly-changing dimensions.
  2. Stack Overflow Developer Survey (annual); SQL is consistently among the most-used languages by professional developers and data professionals.
  3. U.S. Bureau of Labor Statistics; Data Scientists occupational profile. Median wage and projected growth (SOC 15-2051).
  4. levels.fyi data-scientist track; per-company compensation self-reports by level (2026).
  5. Evan Miller; A/B testing sample-size calculator. Cohort sizing for retention metrics.
  6. Google Cloud; BigQuery performance best practices. Partition pruning, clustering, query plan reading.
  7. Snowflake; clustering keys and micro-partitions. Zone-map pruning mechanics.
  8. dbt Labs documentation; staging-intermediate-marts layering, ref() macro, tests, snapshots for SCD Type 2.

About the author. Blake Crosley founded ResumeGeni and writes about data science, machine learning, hiring technology, and ATS optimization. More writing at blakecrosley.com.