Data Scientist / ML Engineer Hub

Statistics and Experimentation for Data Scientists (2026)

In short

Statistics and experimentation are the foundation of product data science in 2026. Senior DS at Meta, Airbnb, Netflix, and Stripe are expected to design A/B tests with correct power analysis (typically 80% at 5% alpha), apply variance-reduction techniques like CUPED (Controlled experiments using Pre-Experiment Data, Microsoft 2013), detect Sample Ratio Mismatch as a quality gate, and reason about multiple-comparisons corrections. The bar at senior+ is not 'I can run a t-test' — it's 'I can design an experiment whose conclusion will hold up under replication, articulate the failure modes of common-practice experimentation, and explain why we got a wrong result when we did.'

Key takeaways

  • Power analysis is non-negotiable. The standard convention is 80% power at 5% alpha; for senior DS roles, you should be able to compute the minimum detectable effect (MDE) given a sample size, or the required sample size given an MDE. Evan Miller's calculator (evanmiller.org/ab-testing/sample-size.html) is the canonical reference for the simple-proportions case.
  • CUPED (Controlled experiments using Pre-Experiment Data, Deng et al. Microsoft 2013) is the canonical variance-reduction technique. It uses pre-experiment metric values as covariates to reduce variance by 30–50% in many practical settings. Real adoption: Microsoft, LinkedIn, Netflix, Airbnb, Doordash all use CUPED-style variance reduction in production experimentation.
  • Sample Ratio Mismatch (SRM) is the canonical quality gate. If the actual ratio of users in treatment vs control deviates from the intended split by more than chance allows (typically flagged beyond ~5σ from expected), the experiment's results cannot be trusted. SRM detection is a gating check in every mature experimentation platform.
  • Multiple-comparisons corrections matter when running many tests. Bonferroni correction (alpha / N) is the most conservative; Benjamini-Hochberg false-discovery-rate correction is more powerful for many tests. Senior DS interviews probe this; candidates who claim 'we ran 50 metrics and 3 were significant at 5%' without correction fail the round.
  • Sequential testing (group-sequential designs, alpha-spending functions, mSPRT) lets you peek at experiments without inflating the false-positive rate. Optimizely's Stats Engine (optimizely.com/insights/blog/stats-engine) is a real production framework; the canonical academic reference is the alpha-spending literature (Lan & DeMets 1983).

Power analysis: what senior DS actually do

Power analysis at senior+ DS is not a textbook exercise; it is a practical conversation about whether the experiment is worth running. The standard convention — 80% power at 5% alpha — answers: given the true effect size we expect, what is the probability we would detect it? The conversation senior DS have is the inverse:

  • Given the sample size we can realistically afford, what is the minimum detectable effect (MDE)? If the MDE is larger than any plausible business effect, the experiment is not worth running; it will almost certainly end in an uninformative failure to reject the null.
  • What is the variance of the metric in the population? Conversion rate (binary) has lower variance than continuous revenue per user. Senior DS are expected to know which transformations reduce variance — log-transform for revenue, CUPED with pre-experiment data, capping outliers above the 99th percentile.
  • What is the minimum runtime? Weekly seasonality typically forces a minimum 14-day runtime; for some metrics (retention, monetization) the minimum is 28 days or longer.

A worked example. Suppose you are running an experiment on a search-ranking change at a streaming platform with 100M monthly active users. Treatment is 5% of users for 14 days. Primary metric is 28-day retention (binary, baseline 78%, variance ~0.78×0.22=0.17). The MDE at 80% power and 5% alpha works out to roughly 0.09% relative, so the experiment is well-powered. If the same experiment ran on 5% of a 1M MAU product, the MDE would balloon to roughly 0.9% relative, larger than most plausible UI changes, and a null result would tell you nothing. Senior DS conversation: this experiment is not worth running on a smaller user base; consider increasing the rollout, lengthening the duration, or running a between-version comparison instead. Evan Miller's sample-size calculator (evanmiller.org/ab-testing/sample-size.html) is the canonical reference.
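
For the simple two-proportion case, the arithmetic behind this example fits in a few lines. The sketch below is illustrative rather than any platform's implementation: it uses the standard normal-approximation formula and assumes equal-sized treatment and control arms (each roughly 5% of MAU), two-sided 5% alpha, and 80% power. The exact numbers shift if the rollout is split differently between arms.

    # Minimal MDE sketch under the normal approximation for a two-proportion test.
    # Assumptions (illustrative): equal-sized arms, two-sided alpha = 0.05,
    # power = 0.80, baseline 28-day retention p = 0.78.
    from scipy.stats import norm

    def mde_absolute(p_baseline, n_per_arm, alpha=0.05, power=0.80):
        """Minimum detectable absolute difference for a two-sample proportion test."""
        z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for two-sided alpha = 0.05
        z_power = norm.ppf(power)           # 0.84 for 80% power
        se = (2 * p_baseline * (1 - p_baseline) / n_per_arm) ** 0.5
        return (z_alpha + z_power) * se

    p = 0.78
    for n_per_arm in (5_000_000, 50_000):   # ~5% of 100M MAU vs ~5% of 1M MAU per arm
        mde = mde_absolute(p, n_per_arm)
        print(f"n per arm = {n_per_arm:>9,}: MDE = {mde:.4f} absolute, {mde / p:.2%} relative")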

CUPED variance reduction: a worked example

CUPED (Controlled experiments using Pre-Experiment Data) is the variance-reduction technique most widely deployed in 2026 experimentation platforms. Deng, Xu, Kohavi, Walker (Microsoft, 2013) is the canonical paper (exp-platform.com). The intuition:

If a user's pre-experiment behavior is correlated with their during-experiment behavior, you can use the pre-period as a covariate to reduce metric variance. The CUPED-adjusted metric is Y_adjusted = Y - theta * (X - X_mean) where theta = Cov(Y, X) / Var(X), Y is the during-experiment metric, and X is the pre-experiment metric. Real-world variance reduction in production: 30–50% on revenue-per-user (high autocorrelation), 10–20% on engagement metrics, near-zero on conversion-rate (low autocorrelation in many cases). Microsoft's ExP Platform team published variance-reduction figures of 50%+ in their original paper; LinkedIn and Doordash report similar magnitudes in their public engineering blogs.
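
A minimal sketch of the adjustment itself, using NumPy and synthetic data; the variable names and the synthetic correlation are illustrative, not from the Deng et al. paper:

    import numpy as np

    def cuped_adjust(y, x, theta=None):
        # Y_adjusted = Y - theta * (X - mean(X)), with theta = Cov(Y, X) / Var(X).
        # For brevity theta is estimated on (y, x) here; the production pattern
        # described below estimates it on a prior period instead.
        y = np.asarray(y, dtype=float)
        x = np.asarray(x, dtype=float)
        if theta is None:
            theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
        return y - theta * (x - x.mean())

    # Synthetic check: a pre-period metric correlated with the in-experiment metric.
    rng = np.random.default_rng(0)
    x = rng.gamma(2.0, 10.0, size=100_000)          # pre-experiment metric per user
    y = 0.4 * x + rng.normal(0, 7, size=100_000)    # in-experiment metric per user
    print(f"variance reduction: {1 - cuped_adjust(y, x).var() / y.var():.0%}")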

Implementation pattern in practice (the Microsoft / LinkedIn / Netflix approach; a code sketch follows the steps):

  1. Compute the pre-experiment metric (typically last 14 or 28 days before randomization).
  2. Compute theta on held-out data (do not fit on the experiment data — use a previous period).
  3. Apply CUPED adjustment to the during-experiment metric.
  4. Run the t-test on the adjusted metric.
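
A sketch of steps 2 through 4 on synthetic stand-in data. The data layout, variable names, and lift size are assumptions for illustration, not the Microsoft / LinkedIn / Netflix implementation:

    import numpy as np
    from scipy import stats

    def cuped_ttest(y_t, x_t, y_c, x_c, theta):
        """Steps 3-4: apply a pre-computed theta to both arms, then t-test."""
        x_mean = np.concatenate([x_t, x_c]).mean()
        adj_t = y_t - theta * (x_t - x_mean)
        adj_c = y_c - theta * (x_c - x_mean)
        return stats.ttest_ind(adj_t, adj_c, equal_var=False)

    rng = np.random.default_rng(1)

    # Step 2: estimate theta on a prior period, not on the experiment itself.
    x_prior = rng.gamma(2.0, 10.0, size=50_000)
    y_prior = 0.4 * x_prior + rng.normal(0, 7, size=50_000)
    theta = np.cov(y_prior, x_prior)[0, 1] / np.var(x_prior, ddof=1)

    # Steps 1 and 3-4 on synthetic experiment data with a small true lift.
    x_t = rng.gamma(2.0, 10.0, size=50_000)
    y_t = 0.4 * x_t + rng.normal(0.3, 7, size=50_000)   # treatment
    x_c = rng.gamma(2.0, 10.0, size=50_000)
    y_c = 0.4 * x_c + rng.normal(0.0, 7, size=50_000)   # control
    result = cuped_ttest(y_t, x_t, y_c, x_c, theta)
    print(f"CUPED-adjusted t-test p-value: {result.pvalue:.3g}")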

Senior DS interview prompt: your team's primary metric is too noisy to detect business-relevant changes in 14-day experiments. What variance-reduction techniques would you apply, and how would you validate that they do not bias the experiment? The expected answer covers CUPED, stratified randomization, multivariate regression adjustment, and outlier capping.

Sample Ratio Mismatch (SRM) and other quality gates

Sample Ratio Mismatch (SRM) is the canonical experiment-quality gate. The principle: if you intended to assign users 50/50 to treatment and control, but the actual split is 50.3/49.7 with millions of users, that deviation is vanishingly unlikely to arise by chance; something is wrong with your randomization, your data pipeline, or your tracking. Microsoft's canonical paper (Fabijan, Gupchup, Gupta, et al. 2019, exp-platform.com) walks through real-world SRM cases at Microsoft.

Real-world SRM root causes:

  • Triggering disparity. Treatment ships a feature that takes 200ms longer to load; users who bounce before the trigger fires are logged in control but dropped from treatment, skewing the ratio.
  • Tracking disparity. Treatment instruments a different event; users in treatment who never fire that event are dropped from analysis. In the mid-2010s, LinkedIn reported an SRM caused by a logging bug in the treatment branch.
  • Bot filtering disparity. Bot detection rules apply differently to the two arms; bot-traffic users are over-represented in one arm.

Detection: chi-squared test with a very tight threshold (typically 5σ from expected ratio). At Meta and Microsoft scale, SRM-flagged experiments are aborted — the conclusion cannot be trusted regardless of the headline metric. The senior DS conversation: we got a 1.4% lift on the headline metric, but SRM flagged at 7σ — the lift is unreliable; we need to find the root cause before we can interpret.
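
The check itself is a one-liner with scipy. The sketch below is illustrative; the p-value threshold of roughly 5.7e-7 stands in for the ~5σ gate described above, and the exact cutoff varies by platform:

    from scipy import stats

    def srm_check(n_treatment, n_control, expected_share=0.5, p_threshold=5.7e-7):
        """Chi-squared goodness-of-fit of observed assignment counts against the
        intended split; flag SRM when p falls below a very tight threshold."""
        total = n_treatment + n_control
        expected = [total * expected_share, total * (1 - expected_share)]
        chi2, p = stats.chisquare([n_treatment, n_control], f_exp=expected)
        return p < p_threshold, p

    # 50.15 / 49.85 on 10M users: a tiny relative skew, but a decisive SRM flag.
    flagged, p = srm_check(5_015_000, 4_985_000)
    print(f"SRM flagged: {flagged} (p = {p:.1e})")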

Other quality gates: stratification balance (do covariates match across arms?), trigger-time variance (are users in the two arms exposed at the same time of day?), and metric-cardinality stability (does the user-count dropoff from impression to conversion match across arms?).

Sequential testing and the peeking problem

The peeking problem: if you run an A/B test, peek at the results along the way, and stop early as soon as the result looks significant, you have inflated your false-positive rate. Naive repeated significance testing at a nominal 5% pushes the true false-positive rate to roughly 19% with ten equally spaced peeks, and higher still with more frequent peeking. Sequential testing methodologies fix this.
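
The inflation is easy to reproduce by simulation. A minimal sketch, assuming an A/A comparison (true effect zero), ten equally spaced peeks, and a naive fixed-horizon t-test at each peek:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_sims, n_per_arm, n_peeks = 2_000, 5_000, 10
    checkpoints = np.linspace(n_per_arm // n_peeks, n_per_arm, n_peeks, dtype=int)

    false_positives = 0
    for _ in range(n_sims):
        a = rng.normal(size=n_per_arm)   # control, true effect = 0
        b = rng.normal(size=n_per_arm)   # treatment, true effect = 0
        # Peek at each checkpoint; stop at the first nominally significant result.
        if any(stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in checkpoints):
            false_positives += 1

    # Prints roughly 0.19 rather than the nominal 0.05.
    print(f"false-positive rate with {n_peeks} peeks: {false_positives / n_sims:.2f}")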

  • Group-sequential designs. Pre-specify N peeking checkpoints; use alpha-spending functions to allocate the total alpha budget across checkpoints. Lan & DeMets (1983) is the canonical reference. At each checkpoint, the threshold for significance is tighter than the nominal alpha; the cumulative false-positive rate stays at 5%.
  • mSPRT (mixture Sequential Probability Ratio Test). Always-valid p-values that allow continuous peeking. Optimizely's Stats Engine (optimizely.com/insights/blog/stats-engine) uses mSPRT in production. Trade-off: tighter thresholds early; weaker statistical efficiency than fixed-horizon tests at the planned end.
  • Bayesian credible intervals. Some companies (notably Convoy and a few Stripe teams per their public blogs) use Bayesian decision rules for experimentation. The framework: define prior on effect size; update as data arrives; stop when credible interval excludes zero with sufficient probability.

Senior DS interview prompt: your stakeholder wants to peek at results daily and stop early when significant. How do you design the experiment so this is statistically valid? Expected answer: pre-specified sequential design with alpha-spending function (e.g., O'Brien-Fleming boundaries), or mSPRT-based always-valid p-values, or Bayesian credible intervals with stopping rule.
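
A sketch of what the group-sequential answer looks like in practice: calibrate an O'Brien-Fleming-shaped boundary c * sqrt(K / k) by Monte Carlo so the overall two-sided false-positive rate stays at 5%. The function name, simulation parameters, and equally-spaced-looks assumption are illustrative, not a production implementation:

    import numpy as np

    def obrien_fleming_constant(n_looks=10, alpha=0.05, n_sims=200_000, seed=0):
        """Find c such that rejecting when |Z_k| > c * sqrt(K / k) at any of K
        equally spaced looks has overall type-I error alpha under the null."""
        rng = np.random.default_rng(seed)
        increments = rng.normal(size=(n_sims, n_looks))   # i.i.d. data batches
        k = np.arange(1, n_looks + 1)
        z = np.cumsum(increments, axis=1) / np.sqrt(k)    # Z-statistic at look k
        boundary_shape = np.sqrt(n_looks / k)             # O'Brien-Fleming shape
        worst = np.max(np.abs(z) / boundary_shape, axis=1)
        return float(np.quantile(worst, 1 - alpha))

    c = obrien_fleming_constant()
    print(f"reject at look k out of 10 if |Z_k| > {c:.2f} * sqrt(10 / k)")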

Frequently asked questions

How important is causal inference vs A/B testing at senior DS?
Both are required. A/B testing handles experimentation; causal inference handles observational analyses where randomization isn't possible. The Meta and Airbnb senior DS interviews probe both. A senior DS at Meta might use A/B testing for a product change and difference-in-differences to analyze a policy that was rolled out market by market rather than randomized. The right framing: A/B testing is the gold standard; causal inference is the toolkit when A/B testing isn't an option.
Should I use frequentist or Bayesian methods?
Most production experimentation in 2026 is frequentist (t-tests, chi-squared, sequential designs). Bayesian methods are used at some companies (Convoy, a few Stripe teams, several startups with statistician-led teams) but the dominant pattern is frequentist with CUPED-style variance reduction. The Bayesian case strengthens for sequential decision-making and rare events; the frequentist case strengthens for well-powered standard A/B tests.
What's the canonical reference for experimentation methodology?
Three canonical references. (1) 'Trustworthy Online Controlled Experiments' (Kohavi, Tang, Xu, 2020 — Cambridge University Press). The most-cited practitioners' textbook. (2) The Microsoft ExP Platform papers at exp-platform.com — the original CUPED paper, the SRM paper, the Twyman's-law paper. (3) The 'Causal Inference: What If' book by Hernán & Robins (free PDF at hsph.harvard.edu/miguel-hernan/causal-inference-book) for the observational-causal side.
How do I handle multiple-metric and multiple-comparison problems?
Two layers. (1) Per-experiment, if you're running many metric tests, use Benjamini-Hochberg false-discovery-rate correction. (2) Across experiments over time, accept that some experiments will spuriously hit significance — adopt a stricter shipping threshold (e.g., 1% alpha rather than 5%) for borderline-significant results, or require independent replication. Microsoft and LinkedIn both publicly recommend pre-specifying primary vs secondary metrics with stricter alpha for primary.
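
For the per-experiment layer, statsmodels implements both corrections. A minimal sketch; the p-values are placeholders standing in for one experiment's metric-level results:

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    # Placeholder p-values for, say, 8 metrics tested in one experiment.
    p_values = np.array([0.001, 0.008, 0.020, 0.041, 0.049, 0.120, 0.350, 0.800])

    reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
    reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

    # Bonferroni controls family-wise error (most conservative); Benjamini-Hochberg
    # controls the false-discovery rate and typically flags more metrics when many
    # tests are run.
    print("Bonferroni rejects:", reject_bonf.sum(), "| BH rejects:", reject_bh.sum())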
What's the role of pre-experiment power analysis vs post-hoc?
Pre-experiment is required; post-hoc 'observed power' is a statistical antipattern. The pre-experiment calculation answers: given the effect size we'd accept as meaningful, do we have enough sample? Post-hoc 'observed power' (computing power based on the observed effect size after the experiment) is a tautology — if the effect is significant, post-hoc power is high; if it isn't, post-hoc power is low. Andrew Gelman's blog (statmodeling.stat.columbia.edu) has the canonical critique.
How do I detect whether my experiment platform's randomization is correct?
Three checks. (1) AA tests: randomly split users 50/50 with no treatment difference and confirm that, across many runs, metrics come up significant at only about the nominal 5% rate. (2) SRM detection on every experiment, using a chi-squared test at a tight threshold. (3) Covariate balance: for stratified designs, confirm pre-period covariates are balanced across arms within tolerance. AA-test failures and SRM detection are the gold-standard quality gates; mature experimentation platforms (Microsoft ExP, LinkedIn XLNT, Netflix's experimentation platform) run all three continuously.

Sources

  1. Kohavi, Tang, Xu — Trustworthy Online Controlled Experiments (2020). The canonical practitioners' textbook.
  2. Deng, Xu, Kohavi, Walker — CUPED (Microsoft 2013, the original variance-reduction paper).
  3. Fabijan et al. — Diagnosing Sample Ratio Mismatch (Microsoft 2019).
  4. Evan Miller — A/B testing sample-size calculator (canonical reference).
  5. Optimizely Stats Engine — mSPRT-based always-valid p-values in production.
  6. Hernán & Robins — Causal Inference: What If (free PDF, observational-causal reference).

About the author. Blake Crosley founded ResumeGeni and writes about data science, machine learning, hiring technology, and ATS optimization. More writing at blakecrosley.com.