Data Scientist / ML Engineer Hub

Mid-Level Data Scientist / ML Engineer Guide (2026): What Senior Promotion Actually Looks Like

In short

Mid-level data scientist or ML engineer (3–5 years) is where the workflow becomes self-sustaining: you scope your own analyses or experiments, drive technical decisions for your project area, partner with PMs and researchers end-to-end, and mentor juniors. FAANG-tier total comp clusters at $250k–$380k per levels.fyi's 2026 data; AI labs (Anthropic MTS-3, OpenAI mid-level MTS) sit at $400k–$700k+. The transition to senior takes 2–3 years on average and is bottlenecked on three things — leading at least one cross-team initiative, designing an ML system end-to-end (data → model → eval → deployment → monitoring), and demonstrating LLM / foundation-model fluency in production code or research output.

Key takeaways

  • FAANG-tier mid total comp $250k–$380k per levels.fyi 2026; Meta E4 DS $300k–$420k (levels.fyi/companies/facebook/salaries/data-scientist), Google L4 MLE $300k–$430k (levels.fyi/companies/google/salaries/machine-learning-engineer), Anthropic MTS-3 $450k–$700k+ (anthropic.com/careers leveling).
  • Mid is where ML technical leadership starts: you drive ML-system architecture for your area, mentor juniors, own experiments end-to-end including eval design / deployment / monitoring / drift detection.
  • Senior promotion (2–3 years from mid at most large tech companies) is bottlenecked on 'cross-team impact' — a model or analysis that touches stakeholders outside your immediate team's ownership.
  • ML-system design is non-negotiable at mid+: data pipeline, feature store, model training loop, eval harness, deployment surface (batch vs online), monitoring (data drift, prediction drift, performance drift). Chip Huyen's 'Designing Machine Learning Systems' and her ML interviews book (huyenchip.com/ml-interviews-book) are the canonical prep.
  • LLM / foundation-model fluency is now expected at mid+: fine-tuning workflows (PEFT/LoRA via peft library), eval-set design, RAG architectures, prompt engineering with Claude/GPT/Llama. The 'Attention Is All You Need' (Vaswani et al., arxiv.org/abs/1706.03762) → Llama-4 / GPT-5 / Claude 4 progression should be conversational background.

What companies expect at mid

The day-to-day shifts from junior in three concrete ways:

  • Scope ownership. You're given a problem ('our recommendation model is plateauing on engagement, find the next 5%') not a ticket ('add this feature to the model'). You scope the work — interview the eng team, profile the existing model, propose three approaches with trade-offs, write a one-pager, get sign-off, then execute. Senior+ engineers review only the architecture-level decisions; you handle the implementation details.
  • Cross-functional partnership. PM conversations are direct, not mediated by a senior DS, and engineering partner conversations happen at planning time. You're expected to push back on a statistically flawed metric definition and to articulate why in PM-readable language.
  • Mentoring at the analysis-review level. Juniors on your team route analyses to you for review. You leave the kind of dense feedback you used to receive — sample-size sanity, confounders, the right counterfactual, the right baseline model. The signal that you've internalized this: a junior on your team improves measurably under your review.

Five concrete capabilities that show up at mid+ in production:

  1. Drive ML system design for your area. Train-vs-buy decisions (fine-tune vs API), batch-vs-online inference, feature-store strategy, eval harness design.
  2. Partner with eng on deployment. Vertex AI vs SageMaker vs custom Kubernetes; serving via Triton vs vLLM; batch via Spark / Beam; latency targets and capacity planning.
  3. Mentor juniors via analysis review. Dense feedback that teaches the principle, not just the fix.
  4. Show measurable outcomes. Lift on the north-star metric (with confidence interval), production accuracy / regret, p99 latency, cost per inference. You name the metrics in your project's success criteria.
  5. Modern LLM / foundation-model fluency. Fine-tuning workflows (LoRA, QLoRA via the peft library), eval-set design, RAG architectures, prompt engineering, distillation. The Hugging Face PEFT library (github.com/huggingface/peft) and DeepSpeed (github.com/microsoft/DeepSpeed) are mid-level table-stakes.
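To make item 5 concrete, here is a minimal sketch of attaching LoRA adapters with the Hugging Face peft library. The model id and hyperparameters are illustrative assumptions, not recommendations:

```python
# Minimal LoRA setup with Hugging Face transformers + peft.
# The model id and hyperparameters below are illustrative, not prescriptive.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B-Instruct"  # assumed model id; swap for your own
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                  # adapter rank: capacity vs. parameter count
    lora_alpha=32,                         # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of base parameters
```

From here the adapted model trains with a standard Trainer or custom loop; only the adapter weights update, which is what keeps fine-tuning cheap enough for a mid-level project budget.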

What a mid-level ML project actually looks like, end-to-end

A worked example — a mid-level MLE at a FAANG-tier company building a 'document-summarization model for the help-center search experience' over a 12-week project:

  • Weeks 1–2. Scope and design. Profile the existing search experience. Identify three approaches: (a) fine-tune a small open model (Llama-3 8B or Qwen-2.5 7B) with LoRA on internal data, (b) prompt-engineer the Anthropic Claude API with a structured RAG layer, (c) train a custom T5-small from scratch. Write a one-pager with cost / latency / quality trade-offs. Recommendation: (a) for cost-control + customization; (b) as the v0 baseline for time-to-impact. Get sign-off from senior research-engineer + PM + eng partner.
  • Weeks 3–4. Eval-set design and v0. Build a 500-example held-out eval set with three signal types: factual accuracy (does the summary contain claims not in the source), faithfulness (does it omit critical info), and concision (length distribution). Stand up the Anthropic Claude v0 with a structured-output schema. Measure on the eval set. Establish the baseline: 73% factual / 81% faithful / mean-length 142 words.
  • Weeks 5–7. Fine-tune the open model. Generate 10k synthetic training pairs using Claude-as-teacher. Fine-tune Qwen-2.5-7B with LoRA on 8x H100s for 18 hours (cost: ~$280 in cloud compute, tracked in W&B). Re-eval. Outperforms baseline at v3: 78% factual / 84% faithful / mean-length 118 words. Decision: ship the open model for cost-control; keep Claude as the fallback.
  • Weeks 8–10. Productionize. Quantize to INT8 with bitsandbytes to cut inference cost. Deploy via vLLM on the team's serving cluster. Build the monitoring layer: per-day eval-set replay, data drift on input length / language distribution, prediction drift on output length distribution, alert thresholds set at 2σ from baseline (a minimal sketch of the alerting logic follows this list). Latency target: p99 ≤ 1.2s.
  • Weeks 11–12. A/B test and rollout. 5% rollout on the help-center for 14 days. Primary metric: search-task completion rate. Secondary: user-reported helpfulness. Result: +4.2% on completion (95% CI [+2.8%, +5.6%]). 100% rollout. Write the retrospective. The junior on the team picks up the next iteration with you reviewing.
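The monitoring layer from weeks 8–10 is the part most write-ups hand-wave. A minimal sketch of the alerting logic, with metric names, thresholds, and the KS-test stand-in as assumptions rather than a production recipe:

```python
# Sketch of the alerting logic from weeks 8-10: replay the eval set daily,
# compare today's scores against the baseline run, and alert beyond 2 sigma.
# Metric names, thresholds, and the KS-test choice are assumptions.
import numpy as np
from scipy.stats import ks_2samp

def score_drifted(today_scores, baseline_scores, n_sigma: float = 2.0) -> bool:
    """True when today's mean eval score falls outside n_sigma of the baseline."""
    baseline = np.asarray(baseline_scores, dtype=float)
    mu, sigma = baseline.mean(), baseline.std(ddof=1)
    return abs(np.mean(today_scores) - mu) > n_sigma * sigma

def input_drifted(today_lengths, train_lengths, alpha: float = 0.01) -> bool:
    """Flag data drift when today's input-length distribution diverges from training."""
    _, p_value = ks_2samp(today_lengths, train_lengths)
    return p_value < alpha
```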

What made this mid-level scope: the engineer scoped, designed, and shipped a model end-to-end touching data / training / eval / deployment / monitoring without senior intervention beyond the architecture review. The same problem at junior level would have been split into four tickets each scoped by a senior. Cost discipline ($280 of compute for v3, replacing a $40k/month API spend) is mid-level seasoning. Eval-set design before any training is the staff-track signal — most mid-level engineers skip it and pay later.

ML system design at mid: the canonical interview question

Mid-level ML interviews introduce ML system design rounds. The canonical 45-minute prompt has the same shape across FAANG-tier and AI-labs: 'Design a [recommendation system / ranking model / fraud detection / search relevance / content moderation / LLM-eval] for [company-shaped scenario].' The senior interviewer is grading on:

  • Problem framing. What's the metric? What's the latency budget? What's the false-positive cost vs false-negative cost? Mid-level candidates who jump to model architecture without nailing the framing fail this round at every FAANG.
  • Data pipeline. Where does training data come from? How is it labeled? What's the freshness requirement? What's the distribution shift between train and serve? The Chip Huyen book (huyenchip.com/ml-interviews-book) has the canonical scaffolding here.
  • Model architecture. Tabular: gradient-boosted trees (XGBoost / LightGBM) or DLRM. NLP: an open foundation model with PEFT fine-tuning, or a frontier API. Vision: a fine-tuned ViT or DINOv2. Recommendation: a two-tower architecture with an ANN index. The interviewer wants you to articulate why this architecture for this problem, not the trendy answer.
  • Eval methodology. Offline (train/val/test split, the right metric), online (A/B test design, sample-size calculation, holdout duration; see the sample-size sketch after this list), counterfactual (off-policy correction for ranking systems). Eval-set leakage and Goodhart violations are common probes.
  • Deployment and monitoring. Batch vs online inference, p50 / p99 latency, cost per inference, data drift detection (KL divergence on input distributions), model drift detection (rolling eval-set replay), alert thresholds. The 'shadow deployment' pattern is mid+ table-stakes at most large tech companies.
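For the online half of the eval bullet, interviewers often expect you to do the sample-size arithmetic on the spot. A back-of-envelope sketch using statsmodels, with the baseline rate and minimum detectable lift as assumed inputs:

```python
# Back-of-envelope sample-size calculation for the online A/B portion of an
# ML system design answer. The baseline rate and target lift are assumed numbers.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.20   # current task-completion rate (assumed)
mde = 0.01        # minimum detectable lift, absolute (assumed)

effect = proportion_effectsize(baseline + mde, baseline)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"~{int(n_per_arm):,} users per arm")  # roughly 25-26k per arm under these assumptions
```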

The mid-level signal: don't recite a textbook architecture. Articulate three trade-offs explicitly, pick one, defend it, then describe how you'd validate the choice with offline + online evidence. Hello Interview's ML system design walkthroughs (hellointerview.com/learn/ml-system-design) cover the canonical rubric.

Compensation: the real bands at mid

Total comp at the mid level, FAANG-tier and AI labs, in 2026 (US, per levels.fyi):

Company | Level | Base | Total comp
Meta DS | E4 | $170k–$220k | $300k–$420k
Google MLE | L4 | $170k–$220k | $300k–$430k
Netflix MLE | L5 | $310k–$380k | $380k–$580k (single-band)
Anthropic MTS | MTS-3 (mid) | $300k–$400k | $450k–$700k+
OpenAI MTS | MTS (mid) | $310k–$420k | $500k–$900k+ (heavy PPU)
Databricks MLE | L4 | $190k–$250k | $310k–$480k
Scale AI | mid MLE | $210k–$280k | $340k–$560k
Hugging Face | mid | $170k–$220k (remote-friendly) | $240k–$380k

The structural fact at mid: AI labs pay total comp that materially exceeds FAANG. Mid-level MTS offers at OpenAI and Anthropic commonly clear $500k+, with public levels.fyi reports of peak comp on PPU vesting cycles exceeding $1M. The risk: AI-lab comp is heavier on equity whose value depends on the company's outcome; FAANG comp is more diversified. The right framing for negotiations: don't compare nominal numbers; compare risk-adjusted expected value.
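A toy version of that expected-value framing, with every number a hypothetical placeholder:

```python
# Toy risk-adjusted comparison of two offers. All numbers are hypothetical
# placeholders; the structure of the comparison is the point, not the values.
def expected_comp(base, liquid_equity, illiquid_equity, p_liquidity):
    """Discount equity that depends on a company outcome by its probability."""
    return base + liquid_equity + illiquid_equity * p_liquidity

faang = expected_comp(base=200_000, liquid_equity=150_000,
                      illiquid_equity=0, p_liquidity=1.0)
ai_lab = expected_comp(base=350_000, liquid_equity=0,
                       illiquid_equity=300_000, p_liquidity=0.5)
print(faang, ai_lab)  # 350000 vs 500000 under these assumed inputs
```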

Frequently asked questions

Should I specialize in NLP, CV, or recommendations at mid?
Both specialist and generalist paths work at mid. Specialization pays at companies where the specialty is core — NLP/LLM at Anthropic, CV at Cruise/Waymo/Tesla, recommendations at Netflix/TikTok/Spotify. Generalist ML pays at companies with broad ML surface area (Google, Meta, Amazon). The risk of specializing too early: an NLP specialist who hasn't built broad ML depth at mid will struggle to clear interviews for peer roles. The right pattern: build broad ML depth at mid, specialize into senior.
How important is causal inference at mid?
Increasingly weighted at analytics-DS shops. Meta and Airbnb explicitly hire for causal-inference depth at IC4+ — propensity scoring, instrumental variables, difference-in-differences, synthetic control. The canonical reference is Hernán & Robins, 'Causal Inference: What If' (free PDF at hsph.harvard.edu/miguel-hernan/causal-inference-book) and Susan Athey's NBER work (athey.people.stanford.edu). At AI labs, causal inference is less central; eval methodology is the closer cousin.
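For a flavor of what causal-inference depth looks like in code, here is a minimal difference-in-differences sketch with statsmodels; the dataframe, column names, and clustering key are assumptions for illustration:

```python
# Minimal difference-in-differences sketch of the kind causal-inference-heavy
# DS interviews probe. The dataframe, columns, and path are assumed.
import pandas as pd
import statsmodels.formula.api as smf

# df has one row per unit-period with 0/1 indicator columns:
#   outcome  - the metric of interest
#   treated  - 1 if the unit is in the treated group
#   post     - 1 if the period is after the intervention
#   unit_id  - clustering key for standard errors
df = pd.read_parquet("panel.parquet")  # hypothetical path

model = smf.ols("outcome ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit_id"]}
)
# The coefficient on treated:post is the diff-in-diff estimate of the effect.
print(model.params["treated:post"], model.conf_int().loc["treated:post"])
```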
What's the dominant LLM stack at mid in 2026?
Two stacks dominate. (1) The fine-tune-an-open-model stack: Hugging Face transformers + peft (LoRA / QLoRA) + DeepSpeed or FSDP for distributed training, vLLM or TGI for inference, Weights & Biases for experiment tracking. (2) The frontier-API + structured-output stack: Anthropic Claude API or OpenAI Responses API with structured outputs, prompt-engineered with chain-of-thought + few-shot, RAG via a vector DB. Mid-level engineers are expected to articulate trade-offs between these — cost, latency, customization, evaluability — and pick the right one for the problem.
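A minimal sketch of stack (2), assuming the anthropic Python SDK; the model id, output schema, and retrieval step are placeholders, not a recommendation:

```python
# Minimal sketch of the frontier-API + structured-output stack. The model id,
# schema, and retrieval step are illustrative assumptions.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

retrieved_chunks = ["...chunk 1...", "...chunk 2..."]  # from your vector DB

prompt = (
    "Summarize the help-center article below in at most 120 words. "
    "Respond with JSON: {\"summary\": str, \"confidence\": \"high|low\"}.\n\n"
    + "\n\n".join(retrieved_chunks)
)

message = client.messages.create(
    model="claude-sonnet-4-5",   # assumed model id; pin whatever you actually use
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}],
)
result = json.loads(message.content[0].text)  # parse defensively and validate against your schema
```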
Do I need to know Spark and distributed data processing at mid?
Yes at most large tech companies. Spark (or newer frameworks like Ray Data) is the dominant batch-data layer at FAANG scale. PySpark fluency, a working understanding of the Catalyst optimizer, and the ability to debug a slow query are mid-level expectations in every analytics-DS or large-scale-MLE role. Databricks is built on Spark; Netflix uses Spark with Iceberg; Uber runs on Spark + Pinot. The Spark documentation (spark.apache.org/docs/latest) and Databricks' free 'Apache Spark Programming with Databricks' course are the canonical references.
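A small PySpark sketch of the kind of batch work this implies; the paths and column names are assumptions:

```python
# Read, aggregate, and inspect the physical plan when a job is slow.
# The table paths and column names are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-agg").getOrCreate()

events = spark.read.parquet("s3://bucket/events/")        # hypothetical path
features = (
    events
    .where(F.col("event_date") >= "2026-01-01")
    .groupBy("user_id")
    .agg(F.count("*").alias("event_count"),
         F.avg("session_length").alias("avg_session_length"))
)
features.explain()   # prints the Catalyst / physical plan for debugging
features.write.mode("overwrite").parquet("s3://bucket/features/")
```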
What gets you promoted from mid to senior?
Three patterns show up in public mid-to-senior promotion-case writeups at FAANG and AI labs: (1) Lead at least one cross-team ML initiative — a project where you coordinate with engineers outside your immediate team. (2) Mentor at least one junior to the point where their work no longer needs you — measurable transfer. (3) Be the ML voice in cross-functional decisions — you're the one PM, design, and senior eng come to with ML-shaped questions. Promotion takes 2–3 years from mid at most companies; engineers who push for it at 18 months typically miss on the first attempt.
How much eval-design fluency is expected at mid at AI labs?
Substantial. Anthropic and OpenAI both publish research on evals (anthropic.com/research and openai.com/research) and explicitly hire mid-level engineers who can design a real eval. The bar: comfort designing a held-out eval set with documented inclusion criteria, ability to articulate the difference between accuracy and calibration and discrimination, fluency with adversarial-eval design (red-teaming for capability evals), and an opinion on the failure modes of common benchmarks (MMLU contamination, GSM8K leakage). The OpenAI evals repo (github.com/openai/evals) is canonical prep.
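To make the accuracy-versus-calibration distinction concrete, here is a minimal expected calibration error (ECE) sketch; the bin count and inputs are illustrative:

```python
# Accuracy vs. calibration on a held-out eval set: a minimal expected
# calibration error (ECE) computation. Bin count and inputs are illustrative.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted gap between stated confidence and observed accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Accuracy answers "how often is it right"; ECE answers "when it says 90%,
# is it right 90% of the time" - two different failure modes.
```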
Should I learn Rust or C++ at mid for ML?
Optional but increasingly valuable. Rust is gaining ground at ML-systems companies (Hugging Face uses Rust for its tokenizers, and Anthropic and OpenAI have Rust components in their inference stacks). C++ remains essential for performance-critical work — CUDA kernels, custom Triton extensions, low-level inference. The 80% case at mid level is still Python; the 20% advantage is Rust or C++ for inference-systems work. The right pattern: Python-first depth, with one Rust or C++ project on GitHub to signal you can drop down when needed.

Sources

  1. levels.fyi — mid-level DS / MLE comp comparison.
  2. Chip Huyen — Designing Machine Learning Systems / ML interviews book (canonical mid-level reference).
  3. Vaswani et al., 'Attention Is All You Need' (NeurIPS 2017) — the foundational transformer paper.
  4. Hugging Face PEFT library — LoRA / QLoRA / adapter fine-tuning.
  5. vLLM — high-throughput LLM inference (canonical mid-level deployment surface).
  6. Weights & Biases — MLOps and experiment tracking.
  7. Anthropic Research — evaluation and capability research (mid-level eval-design reference).

About the author. Blake Crosley founded ResumeGeni and writes about data science, machine learning, hiring technology, and ATS optimization. More writing at blakecrosley.com.