Data Engineering at Uber in 2026: Hudi's Birthplace

In short

Data engineering at Uber operates at one of the largest scales in the industry: hundreds of petabytes in the data lake, trillions of events per day across mobility and delivery, and a stack where multiple components were either invented or hardened at Uber. Apache Hudi, the incremental data lake table format, originated at Uber and is now an Apache top-level project. Apache Pinot powers user-facing real-time analytics. Trino (formerly PrestoSQL) handles interactive SQL across the lake. Cadence runs durable workflows. Michelangelo is the ML platform. DEs at Uber work across this stack and are expected to reason about correctness, latency, and cost at scales most companies never encounter.

Key takeaways

  • Apache Hudi was created at Uber in 2016 by Vinoth Chandar to bring incremental processing and upserts to the Hadoop data lake; it is now an Apache top-level project.
  • Uber runs Trino (formerly PrestoSQL) at petabyte+ scale for interactive analytics across the lake, with thousands of daily users.
  • Apache Pinot powers user-facing real-time analytics features like the rider and driver dashboards, with sub-second query latency on fresh streaming data.
  • Cadence is Uber's durable workflow engine, used internally for long-running orchestration; it inspired the Temporal fork.
  • Michelangelo is Uber's end-to-end ML platform, integrating feature engineering, training, deployment, and monitoring on the same lake.
  • Total comp for senior DEs (L5a/L5b) generally lands in the $300K-$450K range per Levels.fyi, with staff DEs pushing higher.
  • Uber's engineering blog (eng.uber.com) is one of the richest public sources on petabyte-scale data infrastructure design decisions.

DE at Uber in 2026

Uber is one of the few companies where the data engineering team operates at a scale that forces invention. Trips, deliveries, driver location pings, surge calculations, and fraud signals generate trillions of events per day. The data lake holds hundreds of petabytes. Latency requirements span from sub-second user-facing analytics (how many trips did I complete this week?) to multi-hour batch ETL feeding finance and ML training. That range is why so many components in the modern data stack were either born at Uber or significantly shaped by Uber's engineering team.

Apache Hudi is the most visible example. Vinoth Chandar and team built Hudi at Uber starting in 2016 to solve a concrete problem: HDFS-based data lakes could not efficiently handle upserts or incremental processing, which made trip data corrections and late-arriving events painful. Hudi introduced copy-on-write and merge-on-read table types, record-level indexing, and incremental queries. It was open-sourced in 2017, joined the Apache Incubator in 2019, and graduated to top-level in 2020. DEs at Uber work directly with Hudi tables every day, and many contribute upstream as part of their job.
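The upsert-plus-incremental-read pattern Hudi introduced can be sketched with a toy model. This is plain Python for illustration only, not Hudi's API; the record-key and commit-time mechanics are simplified stand-ins for Hudi's record keys and commit timeline.

```python
class ToyTable:
    """Toy model of a Hudi-style table: upserts by record key,
    incremental reads by commit time. Illustrative only."""

    def __init__(self):
        self.rows = {}    # record key -> (value, commit_time)
        self.commit = 0

    def upsert(self, batch):
        """One commit: insert new keys, overwrite existing ones
        (e.g. a late-arriving trip correction)."""
        self.commit += 1
        for key, value in batch.items():
            self.rows[key] = (value, self.commit)
        return self.commit

    def incremental_read(self, since_commit):
        """Return only records changed after `since_commit` --
        the incremental-query idea Hudi brought to the lake."""
        return {k: v for k, (v, c) in self.rows.items() if c > since_commit}


t = ToyTable()
c1 = t.upsert({"trip-1": {"fare": 12.0}, "trip-2": {"fare": 7.5}})
t.upsert({"trip-1": {"fare": 11.0}})          # late fare correction
changed = t.incremental_read(since_commit=c1)  # only the corrected record
```

The payoff is the last line: a downstream consumer re-processes only `trip-1`, not the whole table, which is what made corrections and late-arriving events tractable at lake scale.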

Beyond Hudi, the team operates Trino at a scale few companies match, runs Apache Pinot to serve user-facing real-time analytics, and uses Cadence (which inspired the Temporal fork) for durable workflows. Michelangelo, Uber's ML platform, ties feature engineering, training, and serving into the same data infrastructure. New DEs typically join a domain team (Mobility, Delivery, Financial Products, Maps, Safety) and own a slice of the pipelines, tables, and real-time aggregations that power that domain.

Interview process

The DE loop at Uber in 2026 is typically five to six rounds after the recruiter screen. Most rounds are run on Zoom, with optional onsite for finalists in San Francisco, Sunnyvale, Seattle, or New York depending on team. The bar is high on distributed-systems reasoning because the scale is real, not hypothetical.

  • Recruiter screen (30 min): background, level calibration, team match, comp expectations.
  • Technical phone screen (60 min): two SQL problems on a trip-and-driver schema and one Python or Java data manipulation question. Expect questions on window functions, anti-joins, and skew.
  • SQL and data modeling (60 min): design a warehouse model for a new product surface (e.g. a delivery promotion engine). Star vs. snowflake, slowly changing dimensions, grain choices, and how the model handles late-arriving events on Hudi.
  • Pipeline / systems design (60 min): design a streaming or batch pipeline end-to-end. Common prompts: real-time ETA aggregations on Pinot, an incremental Hudi pipeline for trip corrections, or a feature pipeline feeding Michelangelo. Backfill, idempotency, schema evolution, and cost are all fair game.
  • Coding (60 min): a medium-difficulty algorithmic problem in Python, Java, or Scala. Not LeetCode-hard, but tight on correctness and edge cases.
  • Behavioral / leadership (45-60 min): Uber's cultural norms, conflict, ambiguity, and a deep-dive on a project the candidate owned end-to-end. Staff candidates also get a cross-functional round.
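The window-function and anti-join patterns named in the phone screen are worth drilling. The sketch below uses a hypothetical trip-and-driver schema (the table and column names are invented for practice, not Uber's actual schema), run through Python's built-in sqlite3 so it is self-contained.

```python
import sqlite3

# Hypothetical practice schema -- not Uber's actual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drivers (driver_id INTEGER, city TEXT);
CREATE TABLE trips (trip_id INTEGER, driver_id INTEGER, fare REAL, day TEXT);
INSERT INTO drivers VALUES (1,'SF'),(2,'SF'),(3,'NYC');
INSERT INTO trips VALUES (10,1,12.0,'2026-01-01'),(11,1,8.0,'2026-01-02'),
                         (12,2,20.0,'2026-01-01');
""")

# Window function: rank each driver's trips by fare, highest first.
ranked = conn.execute("""
SELECT driver_id, trip_id,
       RANK() OVER (PARTITION BY driver_id ORDER BY fare DESC) AS rnk
FROM trips
""").fetchall()

# Anti-join via LEFT JOIN ... IS NULL: drivers with no trips at all.
idle = conn.execute("""
SELECT d.driver_id
FROM drivers d
LEFT JOIN trips t ON t.driver_id = d.driver_id
WHERE t.trip_id IS NULL
""").fetchall()
```

For the skew part of the question, be ready to explain what happens when one driver_id (or one city) dominates a join key, and mitigations like salting or broadcast joins.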

Strong signal in the systems-design and modeling rounds is what separates staff offers from senior ones. Knowing Hudi, Trino, or Pinot internals is not required, but candidates who can reason about lake-versus-OLAP trade-offs stand out.

Compensation by level

Uber levels DEs on the same IC ladder as software engineers: L3 (entry), L4 (mid), L5a (senior), L5b (senior II), L6 (staff), L7 (senior staff). Per Levels.fyi data for the Data Engineer role in 2026, total compensation ranges roughly as follows for Bay Area candidates. Numbers below are directional and shift with stock price, offer cycle, and team criticality.

  • L4 (mid): ~$220K-$280K total (base around $170K-$190K, equity vesting over four years, target bonus ~15%).
  • L5a (senior): ~$300K-$380K total, the most common offer band for experienced DEs.
  • L5b (senior II): ~$360K-$450K total, with equity becoming a larger component.
  • L6 (staff): ~$450K-$600K+ total, heavily weighted toward RSUs.
  • L7 (senior staff): $600K+ total, with significant refresher cadence.

Uber's benefits include health coverage, a 401(k) match, an Uber/Uber Eats credit, and standard parental leave. Refreshers at performance reviews are meaningful and often close the gap between offer and steady-state TC. The stock has been volatile historically; candidates should ask to have vesting cliffs and refresher cadence spelled out in writing.

Tech stack: Hudi (Uber-origin) + Trino + Pinot + Cadence + Michelangelo

The 2026 stack is best understood as five layers, each anchored by a component either invented or significantly hardened at Uber.

Storage and table format: Apache Hudi is the default table format for the data lake. Hudi was created at Uber by Vinoth Chandar in 2016 to bring upserts, deletes, and incremental processing to HDFS/parquet-based lakes. Copy-on-write tables suit read-heavy analytical workloads; merge-on-read tables suit high-frequency upserts like driver location and trip corrections. Record-level indexing keeps point lookups fast even on petabyte tables. The lake itself runs on HDFS and increasingly on cloud object storage.
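The copy-on-write versus merge-on-read trade-off can be made concrete with a toy contrast. This is an illustrative sketch of the cost profiles, not Hudi's implementation: copy-on-write pays the merge cost at write time by rewriting the base file, while merge-on-read appends to a delta log and pays the merge cost at query time.

```python
class CopyOnWrite:
    """Toy COW table: every upsert rewrites the base file."""
    def __init__(self):
        self.base = {}
        self.rewrites = 0
    def upsert(self, key, value):
        merged = dict(self.base)   # rewrite the whole base file
        merged[key] = value
        self.base = merged
        self.rewrites += 1
    def read(self):
        return dict(self.base)     # reads are cheap: already merged


class MergeOnRead:
    """Toy MOR table: upserts append to a delta log; reads merge."""
    def __init__(self):
        self.base = {}
        self.log = []              # delta log of pending upserts
    def upsert(self, key, value):
        self.log.append((key, value))   # cheap append, no rewrite
    def read(self):
        merged = dict(self.base)
        for key, value in self.log:     # merge cost paid at query time
            merged[key] = value
        return merged


cow, mor = CopyOnWrite(), MergeOnRead()
for k, v in [("trip-1", 10.0), ("trip-1", 11.0), ("trip-2", 7.5)]:
    cow.upsert(k, v)
    mor.upsert(k, v)
```

Both tables return the same merged result; they differ only in when the work happens, which is why read-heavy analytical workloads favor COW and high-frequency upsert streams like driver location favor MOR (with periodic compaction folding the log back into the base).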

Interactive SQL: Trino (formerly PrestoSQL) is the workhorse for interactive analytics. Uber runs one of the largest Trino deployments in the world, with thousands of daily users and queries spanning petabytes. Engineering blog posts have documented the team's contributions to Trino's reliability, cost-based optimizer tuning, and security model. For DE candidates, Trino is the tool you will use to answer ad-hoc questions and build dashboards.

Real-time OLAP: Apache Pinot powers user-facing real-time analytics. Pinot's strength is sub-second query latency on freshly ingested streaming data. At Uber it backs features like the rider and driver dashboards, surge analytics, and fraud monitoring consoles. DEs working on user-facing analytics will own Pinot tables, ingestion configs, and segment management.

Workflow orchestration: Cadence is Uber's open-source durable workflow engine, used internally for long-running, stateful orchestration across services. Cadence inspired the Temporal fork (several Cadence creators went on to build Temporal). For batch DAGs, Uber also uses Piper, an internal Airflow-derived scheduler.

ML platform: Michelangelo is Uber's end-to-end ML platform. It integrates feature stores, training pipelines, model registry, deployment, and monitoring on top of the same data infrastructure. DEs supporting ML teams build feature pipelines that land in Michelangelo's feature store, where they are reused across trip ETA, fraud, and pricing models.
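The shape of a feature pipeline that lands in a feature store can be sketched as follows. This is a minimal toy, assuming a generic entity-keyed layout (driver_id mapping to a dict of feature values); the feature names are invented for illustration and are not Michelangelo's schema.

```python
from collections import defaultdict
from statistics import mean

def build_features(events):
    """Aggregate raw trip events into per-driver features, keyed the
    way a feature store expects: entity_id -> {feature: value}.
    Feature names here are hypothetical."""
    by_driver = defaultdict(list)
    for event in events:
        by_driver[event["driver_id"]].append(event["fare"])
    return {
        driver_id: {"trips_count": len(fares),
                    "avg_fare": round(mean(fares), 2)}
        for driver_id, fares in by_driver.items()
    }

features = build_features([
    {"driver_id": 1, "fare": 12.0},
    {"driver_id": 1, "fare": 8.0},
    {"driver_id": 2, "fare": 20.0},
])
```

The key property is that the same entity-keyed output can be consumed by multiple models (ETA, fraud, pricing) without each team recomputing the aggregation, which is the reuse argument for a shared feature store.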

For DE candidates, the practical implication is that you will be asked, in interviews and on the job, to reason about lake-versus-OLAP trade-offs, incremental processing on Hudi, and the cost-and-latency profile of Trino versus Pinot at scale.

Frequently asked questions

Did Apache Hudi really start at Uber?
Yes. Vinoth Chandar and team created Hudi at Uber in 2016 to bring incremental processing and upserts to the Hadoop data lake. It was open-sourced in 2017, joined the Apache Incubator in 2019, and graduated to a top-level project in 2020.
What is Apache Pinot used for at Uber?
Pinot powers user-facing real-time analytics: the rider and driver dashboards, surge analytics, fraud monitoring consoles, and similar features that need sub-second query latency on streaming data. DEs working on those surfaces own the Pinot tables and ingestion configs.
Is Cadence the same as Temporal?
They share roots. Cadence is Uber's open-source durable workflow engine. Several Cadence creators later built Temporal as a separate fork with its own roadmap. Uber continues to run Cadence internally, though Temporal has more momentum in the broader ecosystem.
What is Michelangelo?
Michelangelo is Uber's end-to-end ML platform. It integrates feature engineering, training pipelines, model registry, deployment, and monitoring. DEs supporting ML build feature pipelines that land in Michelangelo's feature store and are reused across trip ETA, pricing, and fraud models.
What languages should I know for an Uber DE interview?
Strong SQL is non-negotiable. Python is expected for data manipulation and pipeline scripting. Java or Scala fluency is a plus for Spark and Hudi work, since much of Uber's data infrastructure is JVM-based. Go appears occasionally for backend services.
How long is the Uber DE interview loop?
Typically two to four weeks from recruiter screen to offer. The full loop is five to six rounds: phone screen, SQL and modeling, pipeline design, coding, and behavioral, with a possible cross-functional round for staff candidates.
What is the salary range for a senior data engineer at Uber?
Per Levels.fyi data, L5a senior DEs in the Bay Area generally see total compensation in the $300K-$380K range, with base around $190K-$220K and the rest in RSUs and bonus. L5b lands ~$360K-$450K. Numbers move with stock price and offer cycle.
Is Uber DE work remote-friendly in 2026?
Uber operates a hybrid model, with most engineering roles expected in-office at least three days per week in a hub city (San Francisco, Sunnyvale, Seattle, New York). Fully remote DE roles are rare and typically require strong existing tenure or specialized expertise.

About the author. Blake Crosley founded ResumeGeni and writes about data engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.