Statistics and Experimentation for Data Scientists (2026)
In short
Statistics and experimentation are the foundation of product data science in 2026. Senior product-DS interviews at FAANG-tier and large-AI-lab companies typically test correct power analysis (commonly power=0.8 at alpha=0.05), application of variance-reduction techniques like CUPED (Controlled experiments using Pre-Experiment Data, Microsoft 2013), Sample Ratio Mismatch detection as a quality gate, and reasoning about multiple-comparisons corrections. The bar at senior+ is not 'I can run a t-test'; it's 'I can design an experiment whose conclusion will hold up under replication, articulate the failure modes of common-practice experimentation, and explain why we got a wrong result when we did.'
Key takeaways
- Power analysis is non-negotiable. The common convention is power=0.8 at alpha=0.05; for senior DS roles, you should be able to compute the minimum detectable effect (MDE) given a sample size, or the required sample size given an MDE. Evan Miller's calculator (evanmiller.org/ab-testing/sample-size.html) is the canonical reference for the simple-proportions case.
- CUPED (Controlled experiments using Pre-Experiment Data, Deng et al. Microsoft 2013) is the canonical variance-reduction technique. It uses pre-experiment metric values as covariates to reduce variance materially. Per the original paper, several user-level activity metrics see variance roughly halved (~0.45–0.50 fractional reduction); revenue-per-user is the named exception with much smaller gains because per-user revenue auto-correlation across periods is low. Real adoption: Microsoft published CUPED; DoorDash published the related CUPAC method; Airbnb and Netflix have published related experiment-sensitivity work. Treat exact lift ranges as platform-specific.
- Sample Ratio Mismatch (SRM) is the canonical quality gate. If the actual ratio of users in treatment vs control deviates from the intended split beyond a pre-set chi-square / z-test threshold, the experiment's results cannot be trusted. Mature platforms calibrate the threshold to their experiment volume rather than relying on a universal sigma rule, and treat SRM-flagged experiments as untrustworthy regardless of headline metric.
- Multiple-comparisons corrections matter when running many tests. Bonferroni correction (alpha / N) is simple but conservative; Benjamini-Hochberg false-discovery-rate correction retains more power across many tests. Senior DS interviews probe this: if no metric truly moved, 50 metrics tested at 5% yield about 2.5 false positives by chance alone, so candidates who claim 'we ran 50 metrics and 3 were significant at 5%' without correction fail the round.
- Sequential testing (group-sequential designs, alpha-spending functions, mSPRT) lets you peek at experiments without inflating the false-positive rate. A real production framework is Optimizely's Stats Engine (optimizely.com/insights/blog/stats-engine); the canonical academic reference is the alpha-spending literature (Lan & DeMets 1983).
Power analysis: what senior DS actually do
Power analysis at senior+ DS is not a textbook exercise; it is a practical conversation about whether the experiment is worth running. The common convention (power=0.8 at alpha=0.05) answers the question: given the true effect size we expect, what is the probability we would detect it? The conversation senior DS have is the inverse:
- Given the sample size we can realistically afford, what is the minimum detectable effect (MDE)? If the MDE is larger than any plausible business effect, the experiment is not worth running; it can only fail to reject the null.
- What is the variance of the metric in the population? A binary metric such as conversion rate has bounded variance (at most 0.25), while continuous revenue per user is heavy-tailed and much noisier. Senior DS are expected to know which techniques reduce variance: log-transforming revenue, CUPED with pre-experiment data, and capping outliers above the 99th percentile.
- What is the minimum runtime? Weekly seasonality typically forces a minimum 14-day runtime; for some metrics (retention, monetization) the minimum is 28 days or longer.
A worked example. Suppose you are running an experiment on a search-ranking change at a streaming platform with 100M monthly active users. Treatment is 5% of users (5M) for 14 days, compared against an equal-sized control arm. The primary metric is 28-day retention (binary, baseline 78%, variance ~0.78×0.22=0.17). The MDE at 80% power and 5% alpha works out to roughly 0.09% relative (about 0.07 percentage points), so the experiment is well-powered. If the same experiment ran on 5% of a 1M MAU product, each arm would be 100 times smaller and the MDE would grow tenfold to roughly 0.9% relative, larger than the effect of most plausible UI changes; a failure to reject the null would then be uninformative. Senior DS conversation: this experiment is not worth running on a smaller user base; consider increasing the rollout, lengthening the duration, or running a between-version comparison instead. Evan Miller's sample-size calculator (evanmiller.org/ab-testing/sample-size.html) is the canonical reference.
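The arithmetic behind an MDE figure like the one above can be sketched in a few lines. This is a minimal sketch, not a platform implementation: it assumes a two-sided two-sample z-test, equal-sized arms, and the baseline variance p(1-p) in both arms. The function names `mde_two_proportions` and `required_n_per_arm` are hypothetical, introduced here for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def mde_two_proportions(baseline, n_per_arm, alpha=0.05, power=0.8):
    """Approximate absolute MDE for a two-sided z-test with equal arms.

    Assumption: both arms share the baseline variance p * (1 - p).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at alpha=0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 at power=0.8
    standard_error = sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return (z_alpha + z_power) * standard_error


def required_n_per_arm(baseline, mde_abs, alpha=0.05, power=0.8):
    """Invert the formula above: users per arm needed for a given absolute MDE."""
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * baseline * (1 - baseline) * (z_total / mde_abs) ** 2)


baseline = 0.78
for n_per_arm in (5_000_000, 50_000):  # 5% of 100M MAU and 5% of 1M MAU
    mde_abs = mde_two_proportions(baseline, n_per_arm)
    print(f"n per arm={n_per_arm:>9,}  "
          f"MDE abs={mde_abs:.5f}  MDE relative={mde_abs / baseline:.3%}")
```

Because the MDE scales with 1/sqrt(n), cutting the arm size by a factor of 100 inflates the MDE by a factor of 10; that scaling is usually the fastest sanity check in the power conversation.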
CUPED variance reduction: a worked example
CUPED (Controlled experiments using Pre-Experiment Data) is the variance-reduction technique most widely deployed in 2026 experimentation platforms. Deng, Xu, Kohavi, Walker (Microsoft, 2013) is the canonical paper (exp-platform.com). The intuition:
If a user's pre-experiment behavior is correlated with their during-experiment behavior, you can use the pre-period as a covariate to reduce metric variance. The CUPED-adjusted metric is Y_adjusted = Y - theta * (X - X_mean) where theta = Cov(Y, X) / Var(X), Y is the during-experiment metric, and X is the pre-experiment metric. Real-world variance reduction in production, per the original paper: roughly 45-50 percent on user-level activity metrics; revenue-per-user is the named exception at under 5 percent because per-user revenue auto-correlation across periods is low. The variance reduction tracks the autocorrelation of the metric, not the metric type per se.
Public adoption: Microsoft published CUPED; DoorDash published the related CUPAC method (Control Using Predictions As Covariate, generalizing CUPED to ML-derived covariates); Airbnb has published related experiment-sensitivity work; Netflix has published related variance-reduction work in their tech blog. Treat exact lift ranges as platform-specific; the broad pattern is well-supported.
Implementation pattern in practice:
- Compute the pre-experiment metric (typically last 14 or 28 days before randomization).
- Compute theta on held-out data (do not fit on the experiment data; use a previous period).
- Apply CUPED adjustment to the during-experiment metric.
- Run the t-test on the adjusted metric.
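The four steps above can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the ExP implementation: the data generator `simulate`, its parameters, and the helper names `estimate_theta`, `cuped_adjust`, and `welch_z` are all hypothetical, and the final test statistic uses a large-sample normal approximation to the Welch t-test.

```python
import random
from math import sqrt
from statistics import fmean, variance

rng = random.Random(0)


def simulate(n, effect):
    """Hypothetical users: pre-period metric x, in-experiment metric y."""
    x = [rng.gauss(10, 2) for _ in range(n)]
    y = [0.8 * xi + rng.gauss(0, 1) + effect for xi in x]
    return x, y


def estimate_theta(y_ref, x_ref):
    """theta = Cov(Y, X) / Var(X), fitted on a reference (earlier) period."""
    mx, my = fmean(x_ref), fmean(y_ref)
    cov = sum((x - mx) * (y - my) for x, y in zip(x_ref, y_ref)) / (len(x_ref) - 1)
    return cov / variance(x_ref)


def cuped_adjust(y, x, theta, x_mean):
    """Y_adjusted = Y - theta * (X - X_mean), applied per user."""
    return [yi - theta * (xi - x_mean) for yi, xi in zip(y, x)]


def welch_z(a, b):
    """Welch statistic for mean(a) - mean(b); roughly N(0, 1) under the null."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (fmean(a) - fmean(b)) / standard_error


# Step 2: fit theta on a previous period, not on the experiment itself.
x_ref, y_ref = simulate(5000, effect=0.0)
theta = estimate_theta(y_ref, x_ref)

# Steps 1, 3, 4: pre-period covariate, adjustment, then the test.
x_c, y_c = simulate(5000, effect=0.0)   # control
x_t, y_t = simulate(5000, effect=0.1)   # treatment, true lift of 0.1
x_mean = fmean(x_c + x_t)               # one pooled mean shared by both arms
adj_c = cuped_adjust(y_c, x_c, theta, x_mean)
adj_t = cuped_adjust(y_t, x_t, theta, x_mean)

print("variance ratio (adjusted / raw):", variance(adj_c) / variance(y_c))
print("raw z:", welch_z(y_t, y_c), " adjusted z:", welch_z(adj_t, adj_c))
```

Using a single pooled `x_mean` for both arms matters: centering each arm on its own pre-period mean would subtract a different constant from each arm and shift the estimated treatment effect.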
Senior DS interview prompt: your team's primary metric is too noisy to detect business-relevant changes in 14-day experiments. What variance-reduction techniques would you apply, and how would you validate they do not bias the experiment? The expected answer covers CUPED, stratified randomization, MV-RA (multi-variate regression adjustment), and capping outliers.
Sample Ratio Mismatch (SRM) and other quality gates
Sample Ratio Mismatch (SRM) is the canonical experiment-quality gate. The principle: if you intended to assign users 50/50 to treatment and control, but the actual split is 50.3/49.7 (with millions of users), the deviation is far too large to plausibly arise by chance; something is wrong with your randomization, your data pipeline, or your tracking. Microsoft's canonical paper (Fabijan, Gupchup, Gupta, et al. 2019, exp-platform.com) walks through real-world SRM cases at Microsoft.
Real-world SRM root causes:
- Triggering disparity. Treatment ships a feature that takes 200ms longer to load; treatment users who bounce before the trigger event fires never enter the analysis, so treatment is under-counted relative to control.
- Tracking disparity. Treatment instruments a different event; users in treatment without the event are dropped from analysis. In the mid-2010s, LinkedIn reported an SRM caused by a logging bug in the treatment branch.
- Bot filtering disparity. Bot detection rules apply differently to the two arms; bot-traffic users are over-represented in one arm.
Detection: chi-squared (or equivalent z-test) at a pre-set significance threshold calibrated to the platform's experiment volume; mature platforms gate SRM at a tight p-value rather than relying on a universal sigma rule. SRM-flagged experiments are aborted; the conclusion cannot be trusted regardless of the headline metric. The senior DS conversation: we got a meaningful lift on the headline metric, but SRM flagged; the lift is unreliable, and we need to find the root cause before we can interpret.
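The detection step can be sketched as a one-degree-of-freedom chi-squared goodness-of-fit test. This is a minimal sketch: the function name `srm_check` and the default `threshold=0.001` are hypothetical placeholders, not a recommended setting, since the threshold should be calibrated to the platform's experiment volume. With one degree of freedom the p-value has the closed form erfc(sqrt(chi2 / 2)), so no statistics library is needed.

```python
from math import erfc, sqrt


def srm_check(n_control, n_treatment, expected_treatment_share=0.5, threshold=0.001):
    """Chi-squared goodness-of-fit test of observed arm counts vs the intended split.

    Returns (chi2, p_value, flagged). The threshold is an assumed placeholder.
    """
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_treatment - expected_treatment) ** 2 / expected_treatment
            + (n_control - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, P(chi-squared > c) = erfc(sqrt(c / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# A 50.3 / 49.7 split on one million users, intended 50 / 50.
print(srm_check(n_control=497_000, n_treatment=503_000))
# A 50.03 / 49.97 split on the same volume.
print(srm_check(n_control=499_700, n_treatment=500_300))
```

The same deviation in percentage terms becomes more or less alarming as volume changes, which is why the check is run on counts rather than on the observed ratio alone.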
Other quality gates: stratification balance (do covariates match across arms?), trigger-time variance (are users in the two arms exposed at the same time of day?), and metric-cardinality stability (does the user-count dropoff from impression to conversion match across arms?).
Sequential testing and the peeking problem
The peeking problem: if you run an A/B test, peek at the results halfway through, and stop early when significant, you have inflated your false-positive rate. With ten equally spaced uncorrected looks at a two-sided 5 percent threshold, the false-positive rate runs roughly 19 percent in a simple null simulation; with frequent optional stopping it can climb past 25 percent (see Evan Miller, "How Not to Run an A/B Test"). Sequential testing methodologies fix this.
- Group-sequential designs. Pre-specify N peeking checkpoints; use alpha-spending functions to allocate the total alpha budget across checkpoints. Lan & DeMets (1983) is the canonical reference. At each checkpoint, the threshold for significance is tighter than the nominal alpha; the cumulative false-positive rate stays at 5%.
- mSPRT (mixture Sequential Probability Ratio Test). Always-valid p-values that allow continuous peeking. Optimizely's Stats Engine uses mSPRT in production. Trade-off: tighter thresholds early; weaker statistical efficiency than fixed-horizon tests at the planned end.
- Bayesian credible intervals. Some companies use Bayesian decision rules for experimentation; Convoy has published its Bayesian A/B-testing framework. The framework: define a prior on effect size; update as data arrives; stop when the credible interval excludes zero with sufficient probability.
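The false-positive inflation from uncorrected peeking, which the methods above are designed to prevent, can be reproduced with a short null simulation. This is a sketch under simplifying assumptions: there is no true effect, looks are equally spaced, the variance is known, and each look adds one standardized increment to the running z-statistic. The function name `false_positive_rate` and its defaults are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_looks, n_sims=4000, alpha=0.05, seed=0):
    """Share of null experiments declared significant at ANY of n_looks peeks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # uncorrected two-sided threshold
    hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Each equally sized batch contributes an independent N(0, 1) increment.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(look)  # z-statistic on all data seen so far
            if abs(z) > z_crit:
                hits += 1  # stopped early on a spurious "win"
                break
    return hits / n_sims


print("1 look  :", false_positive_rate(1))
print("10 looks:", false_positive_rate(10))
```

A single look should land near the nominal 5 percent, while ten uncorrected looks should land in the high teens. A group-sequential design would replace the fixed `z_crit` with a per-look boundary so that the cumulative rate returns to 5 percent.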
Senior DS interview prompt: your stakeholder wants to peek at results daily and stop early when significant. How do you design the experiment so this is statistically valid? Expected answer: pre-specified sequential design with alpha-spending function (e.g., O'Brien-Fleming boundaries), or mSPRT-based always-valid p-values, or Bayesian credible intervals with stopping rule.
Frequently asked questions
- How important is causal inference vs A/B testing at senior DS?
- Both are required. A/B testing handles experimentation; causal inference handles observational analyses where randomization isn't possible. The Meta and Airbnb senior DS interviews probe both. A senior DS at Meta might use A/B testing for a product change and use difference-in-differences for an analysis of when a new policy was rolled out across markets. The right framing: A/B testing is the gold standard; causal inference is the toolkit when A/B testing isn't an option.
- Should I use frequentist or Bayesian methods?
- Most production experimentation in 2026 is frequentist (t-tests, chi-squared, sequential designs). Bayesian methods are used at some companies (Convoy, a few Stripe teams, several startups with statistician-led teams) but the dominant pattern is frequentist with CUPED-style variance reduction. The Bayesian case strengthens for sequential decision-making and rare events; the frequentist case strengthens for well-powered standard A/B tests.
- What's the canonical reference for experimentation methodology?
- Three canonical references. (1) 'Trustworthy Online Controlled Experiments' (Kohavi, Tang, Xu, 2020; Cambridge University Press), the most-cited practitioners' textbook. (2) The Microsoft ExP Platform papers at exp-platform.com: the original CUPED paper, the SRM paper, the Twyman's-law paper. (3) The 'Causal Inference: What If' book by Hernán & Robins (free PDF at hsph.harvard.edu/miguel-hernan/causal-inference-book) for the observational-causal side.
- How do I handle multiple-metric and multiple-comparison problems?
- Two layers. (1) Per-experiment, if you're running many metric tests, use Benjamini-Hochberg false-discovery-rate correction. (2) Across experiments over time, accept that some experiments will spuriously hit significance; adopt a stricter shipping threshold (e.g., 1% alpha rather than 5%) for borderline-significant results, or require independent replication. Common practice across mature experimentation platforms is to pre-specify primary vs secondary metrics with stricter alpha for primary.
- What's the role of pre-experiment power analysis vs post-hoc?
- Pre-experiment is required; post-hoc 'observed power' is a statistical antipattern. The pre-experiment calculation answers: given the effect size we'd accept as meaningful, do we have enough sample? Post-hoc 'observed power' (computing power based on the observed effect size after the experiment) is a tautology; if the effect is significant, post-hoc power is high; if it isn't, post-hoc power is low. Andrew Gelman's blog (statmodeling.stat.columbia.edu) has the canonical critique.
- How do I detect whether my experiment platform's randomization is correct?
- Three checks. (1) AA tests: randomly split users 50/50 with no treatment and confirm that, over many runs, metrics reach significance at about the expected 5% rate and no more. (2) SRM detection on every experiment with chi-squared at a tight threshold. (3) Covariate balance check: for stratified designs, confirm pre-period covariates are balanced across arms within tolerance. AA tests and SRM detection are the gold-standard quality gates; mature experimentation platforms (Microsoft ExP, LinkedIn XLNT, Netflix's experimentation platform) run all three continuously.
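The Benjamini-Hochberg correction recommended in the multiple-comparisons answer above is short enough to write out by hand, alongside Bonferroni for contrast. This is a minimal sketch of the standard step-up procedure; the function names and the example p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / N."""
    return [p <= alpha / len(p_values) for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Step-up FDR procedure: find the largest rank k with p_(k) <= (k / m) * q,
    then reject every hypothesis whose p-value has rank k or smaller."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for ten metrics in one experiment.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejections        :", sum(bonferroni(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p_values)))
```

Note the step-up behavior: a p-value that misses its own rank threshold is still rejected when a larger p-value further down the sorted list clears its threshold, which is where the extra power over Bonferroni comes from.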
Sources
- Kohavi, Tang, Xu; Trustworthy Online Controlled Experiments (2020). The canonical practitioners' textbook.
- Deng, Xu, Kohavi, Walker; CUPED (Microsoft 2013, the original variance-reduction paper).
- Fabijan et al.; Diagnosing Sample Ratio Mismatch (Microsoft 2019).
- Evan Miller; A/B testing sample-size calculator (canonical reference).
- Optimizely Stats Engine; mSPRT-based always-valid p-values in production.
- Hernán & Robins; Causal Inference: What If (free PDF, observational-causal reference).
About the author. Blake Crosley founded ResumeGeni and writes about data science, machine learning, hiring technology, and ATS optimization. More writing at blakecrosley.com.