Data Scientist / ML Engineer Hub

ML Fundamentals for Data Scientists / ML Engineers (2026)

In short

ML fundamentals are the foundation that every MLE interview round at FAANG-tier companies and AI labs probes. The 2026 bar is materially the same as it was a decade ago: bias-variance trade-off, regularization (L1/L2, dropout, early stopping), gradient-based optimization (SGD, Adam, AdamW), tree-based methods (gradient boosting via XGBoost / LightGBM / CatBoost) for tabular work, and basic neural-network architecture for unstructured data. The senior bar adds: an opinion on when to use what, what fails on small data vs large data, and which regularization technique is right for which architecture.

Key takeaways

  • Bias-variance trade-off is the canonical ML interview question. The senior bar: articulate the trade-off, name two regularization techniques per architecture (L2 + dropout for neural nets; max_depth + min_child_weight for XGBoost), and explain when underfitting vs overfitting is the binding constraint.
  • Gradient boosting (XGBoost github.com/dmlc/xgboost, LightGBM github.com/microsoft/LightGBM, CatBoost catboost.ai) is the dominant approach for tabular ML at most companies. Companies including Stripe, Uber, Airbnb, and many fintechs have described gradient-boosted models in production; junior MLE candidates with no XGBoost experience are at a disadvantage.
  • Neural-net training fundamentals: SGD vs Adam vs AdamW (Loshchilov & Hutter, 2019, arxiv.org/abs/1711.05101), learning-rate scheduling (cosine schedule, warmup), batch normalization vs layer normalization, and the role of weight initialization. The 'Deep Learning' book by Goodfellow, Bengio, Courville (deeplearningbook.org) is the canonical reference.
  • Cross-validation methodology matters. Time-series cross-validation (forward-chaining splits) is required for any model where data has temporal structure; k-fold CV without time-awareness produces overly optimistic estimates. The scikit-learn cross-validation guide (scikit-learn.org/stable/modules/cross_validation.html), which documents TimeSeriesSplit, is the canonical reference.
  • The bias-variance / regularization conversation extends to LLM fine-tuning in 2026. PEFT methods (LoRA, prefix-tuning) are regularization techniques: they shrink the trainable-parameter count and constrain the fine-tune to a low-dimensional subspace, mitigating catastrophic forgetting. The canonical paper is Hu et al., 'LoRA' (2021, arxiv.org/abs/2106.09685).

Bias-variance: the canonical ML interview question

Every FAANG-tier ML interview probes the bias-variance trade-off. The textbook framing: expected error = bias² + variance + irreducible error. High-bias models underfit (a depth-1 decision tree); high-variance models overfit (a depth-50 decision tree). The senior bar is to articulate the trade-off in terms of the specific architecture you are using.
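
A minimal sketch of the two failure modes, using scikit-learn decision trees on synthetic data (the dataset, depths, and scores are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 50):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # depth 1: train and test accuracy both low  -> high bias (underfit)
    # depth 50: train accuracy ~1.0, test lower  -> high variance (overfit)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))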

For tree-based gradient boosting (XGBoost, LightGBM, CatBoost):

  • Bias controls (reduce underfitting). Increase max_depth (more complex trees), increase n_estimators (more boosting rounds), reduce min_child_weight / min_data_in_leaf (allow more, smaller leaf nodes).
  • Variance controls (reduce overfitting). Reduce max_depth, reduce n_estimators, increase min_child_weight, add column subsampling (colsample_bytree), increase L2 regularization (reg_lambda).
  • Diagnostic. Train and validation loss diverging means high variance: regularize. Train and validation loss both high means high bias: increase capacity. A sketch of this diagnostic follows below.
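
The diagnostic in code: a short sketch using XGBoost's evals_result dictionary on synthetic data (the data and the size of the gap are illustrative):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
dtrain = xgb.DMatrix(X[:1500], label=y[:1500])
dval = xgb.DMatrix(X[1500:], label=y[1500:])

results = {}
xgb.train(
    {"objective": "binary:logistic", "eval_metric": "logloss", "max_depth": 8},
    dtrain, num_boost_round=200,
    evals=[(dtrain, "train"), (dval, "val")],
    evals_result=results, verbose_eval=False,
)
gap = results["val"]["logloss"][-1] - results["train"]["logloss"][-1]
print(f"train/val logloss gap: {gap:.3f}")  # large gap -> high variance -> regularize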

For neural networks (PyTorch / JAX), three orthogonal regularization techniques:

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # High-variance fix 1: dropout regularization
        self.fc1 = nn.Linear(input_dim, 256)
        self.dropout1 = nn.Dropout(0.3)  # zero out 30% of activations during training
        self.fc2 = nn.Linear(256, output_dim)

    def forward(self, x):
        return self.fc2(self.dropout1(torch.relu(self.fc1(x))))

# High-variance fix 2, in the training loop: weight decay (L2 regularization on parameters)
model = MLP(input_dim=64, output_dim=2)  # illustrative dimensions
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# High-variance fix 3: early stopping on validation loss.

The senior interview answer: name three orthogonal regularization techniques for the architecture and articulate when each binds (dropout binds when co-adapted activations overfit the training set; weight decay binds when parameter norms grow too large; early stopping binds when validation loss starts rising while training loss keeps falling).

Gradient boosting: the production-ML workhorse

Gradient boosting is the dominant approach for tabular ML in 2026. XGBoost (Chen & Guestrin, 2016, arxiv.org/abs/1603.02754) is the most widely deployed; LightGBM (Microsoft) and CatBoost (Yandex) are alternatives with different optimization choices.

A canonical production pattern at fintech, ad-tech, and recommendation companies:

import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit

# 1. Time-series cross-validation (NEVER use simple k-fold for temporal data).
#    Assumes X, y are a pandas DataFrame / Series sorted in time order.
tscv = TimeSeriesSplit(n_splits=5, gap=7)  # gap of 7 samples (7 days at one row per day) to avoid leakage
for train_idx, val_idx in tscv.split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    # 2. Train with early stopping on the validation fold
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    params = {
        "objective": "binary:logistic",
        "eval_metric": "logloss",
        "max_depth": 6,                # bias-variance: 6 is a strong default
        "min_child_weight": 5,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "learning_rate": 0.05,
        "reg_lambda": 1.0,
    }
    model = xgb.train(
        params, dtrain, num_boost_round=2000,
        evals=[(dval, "val")],
        early_stopping_rounds=50,
    )

# 3. Production deployment: persist with the native XGBoost JSON format.
#    (In practice, retrain on the full dataset after CV before persisting.)
model.save_model("model.json")  # JSON for cross-version stability

The senior MLE conversation around this code: why the 7-day gap? (The gap prevents leakage from features with look-back windows, e.g. 7-day rolling aggregates, that would otherwise straddle the train/validation boundary.) Why a time-series split rather than k-fold? (k-fold over temporal data trains on the future and validates on the past, producing optimistic estimates.) Why save_model to JSON rather than a binary serializer? (Binary formats are fragile across XGBoost versions; JSON is the documented stable format.)

Real-world hyperparameter ranges that work well across many tabular problems: max_depth 4–8, learning_rate 0.01–0.1, num_boost_round 500–3000 with early stopping, subsample / colsample 0.7–0.9. Senior MLE candidates can articulate why these defaults work and when they need to change.
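
A sketch of how those ranges translate into a tuning search space, using scikit-learn's RandomizedSearchCV over XGBoost's sklearn wrapper (synthetic data; early stopping is elided for brevity, and any tuner works):

import numpy as np
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 30))  # stand-in for a real feature matrix
y = (X[:, 0] > 0).astype(int)

search = RandomizedSearchCV(
    XGBClassifier(n_estimators=1000, eval_metric="logloss"),
    param_distributions={
        "max_depth": randint(4, 9),              # 4-8
        "learning_rate": uniform(0.01, 0.09),    # 0.01-0.1
        "subsample": uniform(0.7, 0.2),          # 0.7-0.9
        "colsample_bytree": uniform(0.7, 0.2),   # 0.7-0.9
    },
    n_iter=20,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_log_loss",
)
search.fit(X, y)
print(search.best_params_)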

Neural-net optimization: SGD, Adam, AdamW

The optimization landscape in 2026: SGD with momentum (Polyak 1964; Nesterov's accelerated variant, 1983) is still used for ConvNets at scale; Adam (Kingma & Ba 2015, arxiv.org/abs/1412.6980) is the default for transformers; AdamW (Loshchilov & Hutter 2019, arxiv.org/abs/1711.05101) is the default at AI labs for foundation-model training.

Why AdamW over Adam: Adam applies the L2 penalty through the gradient, so the decay term gets rescaled by the per-parameter adaptive learning rate; parameters with a large gradient history receive less effective decay, which breaks the intended regularization. AdamW applies weight decay directly to the weights, decoupled from the adaptive update, making it behave the way an L2 regularizer should. Practical implication: AdamW generalizes better than Adam at the same hyperparameters on transformer-scale training.

import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR, SequentialLR

model = nn.Linear(512, 512)  # placeholder model for illustration

# Standard Adam: coupled weight decay, weaker regularization
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

# AdamW: decoupled weight decay, the default at AI labs in 2026
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Cosine learning-rate schedule with warmup (the foundation-model standard)
warmup_steps = 1000
total_steps = 100000
warmup = LambdaLR(optimizer, lr_lambda=lambda s: min(s / warmup_steps, 1.0))
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)
# Chain warmup -> cosine, switching schedulers at warmup_steps
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

The senior MLE conversation: why warmup? (At training start, gradient estimates are noisy; large learning rates cause divergence. Warmup ramps the LR up over the first 1–5% of training.) Why a cosine schedule rather than a constant LR? (Empirically it produces better final loss across many transformer training runs; the original GPT-3 and Llama papers used cosine schedules.) Why decoupled weight decay? (See Loshchilov & Hutter for the formal argument; the empirical answer is that it makes regularization behave consistently across architectures.)

Cross-validation: time-series, group, and stratified

Cross-validation methodology is one of the most common interview blind spots for junior MLE candidates. The wrong methodology produces overly optimistic estimates that do not hold in production.

  • Standard k-fold. Use only when data points are truly independent. Most production data has at least one of: temporal correlation, group structure (multiple observations per user), or distribution shift. Naive k-fold across these structures produces leakage.
  • TimeSeriesSplit (forward-chaining). Use when data has temporal structure. Each fold trains on the past and validates on the future. Optionally include a gap between train and validation to avoid leakage from features with look-back windows. In production systems with temporal data (fraud detection, forecasting), forward-chaining splits are the default.
  • GroupKFold. Use when data has group structure (multiple rows per user, per session, per device). Splits ensure that all rows for a given group are in the same fold — preventing the model from memorizing individual users.
  • StratifiedKFold. Use for classification with class imbalance. Each fold has roughly the same class distribution as the overall data.
  • Combinations. Real production often combines: a time-series split with a group constraint (e.g., users registered in the same month must be in the same fold). Scikit-learn has no `GroupTimeSeriesSplit` out of the box; teams typically implement custom splitters, as in the sketch after this list.
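
Since scikit-learn ships no group-aware time-series splitter, here is a minimal custom sketch (a hypothetical helper, not a library API; groups are assigned to folds by the timestamp of their first row):

import numpy as np

def group_time_series_splits(times, groups, n_splits=5):
    # Order groups by first-seen timestamp, then split that ordering
    # forward-chaining style so every group stays on one side of the split.
    times, groups = np.asarray(times), np.asarray(groups)
    first_seen = {}
    for t, g in zip(times, groups):
        if g not in first_seen or t < first_seen[g]:
            first_seen[g] = t
    ordered = sorted(first_seen, key=first_seen.get)
    fold = len(ordered) // (n_splits + 1)
    idx = np.arange(len(groups))
    for i in range(1, n_splits + 1):
        train_g = set(ordered[: i * fold])
        val_g = set(ordered[i * fold : (i + 1) * fold])
        yield idx[np.isin(groups, list(train_g))], idx[np.isin(groups, list(val_g))]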

The scikit-learn cross-validation guide (scikit-learn.org/stable/modules/cross_validation.html) is the canonical introduction. The senior bar: identify the leakage risks in a given dataset and pick a CV strategy that prevents them.

Frequently asked questions

Should I use XGBoost, LightGBM, or CatBoost?
All three are credible. XGBoost is the most-deployed and best-documented (Chen & Guestrin 2016, github.com/dmlc/xgboost); LightGBM (github.com/microsoft/LightGBM) is faster on large datasets due to histogram-based splits; CatBoost (catboost.ai) handles categorical features natively without one-hot encoding. Real production patterns: most teams pick one and stick with it for cross-team tooling consistency. Performance differences across the three on well-tuned models are typically <1% AUC.
When do I use neural nets vs gradient boosting?
For tabular data with <100k rows and <100 features, gradient boosting almost always wins. Tabular data with millions of rows and complex feature interactions still usually favors gradient boosting, but neural nets become competitive. Unstructured data (text, images, audio) requires neural nets. Borisov et al., 'Deep Neural Networks and Tabular Data: A Survey' (2022, arxiv.org/abs/2110.01889), is the canonical empirical reference; the survey shows gradient boosting winning on most tabular benchmarks.
How do I handle class imbalance?
Three standard approaches. (1) Resample — oversample the minority class (SMOTE, github.com/scikit-learn-contrib/imbalanced-learn) or undersample the majority. (2) Reweight the loss — class_weight='balanced' in scikit-learn, scale_pos_weight in XGBoost. (3) Pick a metric that handles imbalance — precision-recall AUC instead of ROC AUC, F1 instead of accuracy. The right choice depends on the deployment context: if false-positives are costly, optimize precision; if false-negatives are costly, optimize recall. Stripe's fraud-detection blog posts have good worked examples.
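A minimal sketch of option (2), the loss reweighting (assuming a binary 0/1 label array; the imbalance ratio here is illustrative):

import numpy as np

y = np.array([0] * 900 + [1] * 100)  # hypothetical 9:1 imbalance
neg, pos = np.bincount(y)
# XGBoost heuristic: weight positive examples by (# negative) / (# positive)
xgb_params = {"objective": "binary:logistic", "scale_pos_weight": neg / pos}
# scikit-learn equivalent: pass class_weight="balanced" to the estimator
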
What's the right validation strategy for time-series data?
TimeSeriesSplit (forward-chaining cross-validation). Each fold trains on past, validates on future. Include a temporal gap between train and validation to avoid leakage from features that look back. Don't use simple k-fold — it allows the model to train on future and validate on past, producing optimistic estimates that don't hold in production. The scikit-learn TimeSeriesSplit and the Walk-Forward Validation pattern (commonly used in finance and forecasting) are canonical.
What are the canonical neural-net regularization techniques?
Five orthogonal techniques. (1) Weight decay (L2 on parameters via AdamW). (2) Dropout (zero random activations during training, Srivastava et al. 2014). (3) Batch normalization (Ioffe & Szegedy 2015) for ConvNets / layer normalization for transformers. (4) Early stopping on validation loss. (5) Data augmentation (the most underrated regularizer for vision and audio). The senior bar is articulating which combination is right for which architecture and what the failure modes are when you over-regularize.
How does PEFT / LoRA fit into the bias-variance picture?
PEFT (Parameter-Efficient Fine-Tuning) methods like LoRA are regularization techniques. By constraining the fine-tune to a low-rank subspace of the parameter space, you reduce variance (the fine-tune has fewer effective degrees of freedom) at the cost of some bias (the constrained subspace may not contain the optimal fine-tuned model). The trade-off in practice: LoRA produces fine-tunes that generalize better and avoid catastrophic forgetting, at the cost of slightly weaker peak performance vs full fine-tune. Hu et al. 'LoRA' (2021, arxiv.org/abs/2106.09685) is the canonical paper.
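
A minimal sketch of that constraint in code, using the Hugging Face peft library (the base model and target_modules value are illustrative and depend on the architecture):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small base model for illustration
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, target_modules=["c_attn"])
model = get_peft_model(base, config)
# Only the low-rank adapter matrices are trainable; the base weights stay frozen,
# which is the variance-reduction / forgetting-mitigation argument above.
model.print_trainable_parameters()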

Sources

  1. Goodfellow, Bengio, Courville — Deep Learning (canonical textbook).
  2. Chen & Guestrin — XGBoost: A Scalable Tree Boosting System (KDD 2016).
  3. Loshchilov & Hutter — Decoupled Weight Decay Regularization (AdamW, ICLR 2019).
  4. Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models (2021).
  5. scikit-learn — cross-validation user guide.
  6. Borisov et al. — Deep Neural Networks and Tabular Data: A Survey (2022).

About the author. Blake Crosley founded ResumeGeni and writes about data science, machine learning, hiring technology, and ATS optimization. More writing at blakecrosley.com.