Data Engineer Job Description: Duties, Skills, Salary, and Career Path

Updated April 30, 2026

Data engineer is a senior technical role that the U.S. Bureau of Labor Statistics does not break out into its own occupational code. The closest BLS proxies are Database Architects (SOC 15-1242) at $141,210 and Software Developers (SOC 15-1252) at $133,080 (May 2024 medians) [1][2]. Neither captures the role; the most defensible public compensation data lives on levels.fyi. The job itself is well-defined: an engineer who designs and operates the systems that turn raw data into reliable, queryable, decision-ready assets, and who is on the hook when those systems break at 3 a.m.

Key Takeaways

  • A data engineer (DE) builds and operates the data platform: ingestion, storage, transformation, orchestration, modeling, quality, and serving. The role spans batch and streaming, warehouse and lakehouse.
  • BLS does not classify data engineers as a distinct occupation. The closest proxies — SOC 15-1242 Database Architects ($141,210 median, May 2024) and SOC 15-1252 Software Developers ($133,080) — both miss. The most reliable public source is levels.fyi's Data Engineer track [1][2][3].
  • Typical entry path is a CS or adjacent degree plus two to four shipped pipelines, or lateral movement from software engineering, analytics engineering, or DBA work.
  • Core skills are SQL, Python, distributed systems literacy, dimensional modeling, orchestration (Airflow, Dagster, Prefect), dbt, streaming (Kafka, Flink, Kinesis), and data quality. Writing is usually the differentiator between a competent engineer and a senior one.
  • The IC track typically progresses Junior -> Data Engineer -> Senior -> Staff -> Principal, with parallel platform-engineering and analytics-engineering tracks. For role context see our data engineer career hub.

What Does a Data Engineer Do?

A data engineer turns raw, fast-moving data into systems that analysts, scientists, and product teams can trust. Joe Reis and Matt Housley, in Fundamentals of Data Engineering, frame the job as disciplined operation of the data engineering lifecycle — generation, ingestion, transformation, storage, and serving — under cross-cutting concerns of security, DataOps, architecture, orchestration, and software engineering [4]. Maxime Beauchemin, who created Apache Airflow and Apache Superset, argued that a data engineer's real output is a queryable, well-modeled, well-documented asset that compounds in value as the org grows [5].

A typical week runs on three loops. The daily loop is on-call: triaging pipeline failures, freshness alerts, schema drift, "this number looks wrong" questions. The sprint loop is feature work: new ingestions, dbt models, tests, refactors. The quarterly loop is platform work: cost reviews, warehouse refactors, orchestration upgrades, lineage, governance.

Ralph Kimball's dimensional modeling — the canonical playbook from the Kimball Group — is still the working language of warehouse design: facts, dimensions, slowly changing dimensions, grain [6]. dbt Labs' analytics-engineering pattern, articulated across getdbt.com/blog, layered software-engineering discipline on top: version control, modular SQL, tests, docs, CI [7]. A working data engineer uses both maps every week.
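The Kimball vocabulary above — facts, dimensions, grain — can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database; the table and column names (`dim_customer`, `fact_order_line`) are hypothetical, not from any particular warehouse.

```python
import sqlite3

# Hypothetical star schema: one fact table at order-line grain,
# one customer dimension. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        customer_name TEXT,
        region TEXT
    );
    CREATE TABLE fact_order_line (
        order_id INTEGER,
        line_number INTEGER,   -- grain: one row per order line
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        amount_usd REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'AMER');
    INSERT INTO fact_order_line VALUES
        (100, 1, 1, 25.0), (100, 2, 1, 10.0), (101, 1, 2, 40.0);
""")

# A metric query joins the fact to its dimensions at the declared grain.
rows = conn.execute("""
    SELECT d.region, SUM(f.amount_usd) AS revenue_usd
    FROM fact_order_line f
    JOIN dim_customer d USING (customer_key)
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(rows)  # [('AMER', 40.0), ('EMEA', 35.0)]
```

Declaring the grain first ("one row per order line") is what makes the revenue metric unambiguous when the next question arrives.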

Core Responsibilities

Pipeline and platform work consumes roughly 50 percent of the time:

  1. Build and operate ingestion: connect source systems (application databases, event streams, third-party APIs, files) to the warehouse or lakehouse with the right cadence — batch, micro-batch, or streaming. CDC, Fivetran/Airbyte connectors, and Kafka-based event ingestion are all in scope.
  2. Model the warehouse: design fact and dimension tables that survive new questions and new product surfaces. Kimball's grain-first discipline remains the test of a working model [6].
  3. Write transformations: SQL-first in dbt, Python or Spark for heavier lifts. Modular models, tests, and docs are non-negotiable [7].
  4. Orchestrate: Airflow, Dagster, or Prefect DAGs that are idempotent, retryable, observable, and deterministic. Beauchemin's "Rise of the Data Engineer" remains the clearest articulation of why orchestration belongs to engineering [5].
  5. Operate streaming: Kafka or Kinesis with Flink, Spark Structured Streaming, or Materialize. confluent.io/blog is the working reference [8].
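The idempotent-and-retryable property in item 4 can be shown without any orchestrator at all. A rough sketch, with an in-memory dict standing in for a warehouse and illustrative names throughout — real orchestrators like Airflow or Dagster supply the retry machinery for you:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Retry a task with exponential backoff, as an orchestrator would."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Idempotent task: each run overwrites the same output partition, so a
# retry after a partial failure cannot double-write.
storage = {}

def load_partition(ds="2024-01-01"):
    rows = [{"ds": ds, "clicks": 42}]
    storage[ds] = rows  # overwrite the partition, never append to it
    return len(rows)

assert with_retries(load_partition) == 1
with_retries(load_partition)  # safe to re-run: identical end state
assert storage == {"2024-01-01": [{"ds": "2024-01-01", "clicks": 42}]}
```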

Reliability, quality, and partnership, roughly 30 percent:

  1. Own data quality: tests in dbt or Great Expectations, freshness SLAs, schema contracts at the source, and the runbooks that make 3 a.m. on-call survivable.
  2. Embed with stakeholders: standing time with analysts, scientists, and PMs. The strongest engineers translate fuzzy product questions into the right grain, metric, and model.
  3. Run incident response: when a pipeline breaks or a number is wrong, own the timeline, fix, and postmortem. The AWS Builders' Library is the strongest public collection on operational discipline at scale [9].
  4. Document the platform: lineage, ownership, freshness, and data contracts. Senior leverage is in turning tribal knowledge into searchable artifacts.
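The freshness SLA in item 1 reduces to a very small check. A minimal sketch in the spirit of dbt source freshness or a Great Expectations expectation — the six-hour threshold is an illustrative assumption, not a standard:

```python
from datetime import datetime, timedelta, timezone

# Illustrative SLA: page if the table has not loaded in six hours.
FRESHNESS_SLA = timedelta(hours=6)

def check_freshness(last_loaded_at: datetime, now: datetime) -> bool:
    """True if the table met its SLA; False means it should page on-call."""
    return (now - last_loaded_at) <= FRESHNESS_SLA

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
assert check_freshness(now - timedelta(hours=2), now)       # fresh
assert not check_freshness(now - timedelta(hours=7), now)   # stale -> page
```

The hard part in production is not the comparison; it is agreeing with the source team on what `last_loaded_at` means and who gets paged.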

Architecture, cost, and craft, the remaining 20 percent:

  1. Choose the stack: warehouse vs. lakehouse, Snowflake vs. BigQuery vs. Databricks vs. Redshift, Iceberg vs. Delta vs. Hudi for open table formats. The Databricks engineering team publishes the canonical lakehouse argument at databricks.com/blog/category/engineering [10].
  2. Tune for cost: warehouse credits, storage tiering, partition and cluster keys, materialization. At scale, a single bad model can cost five figures a month.
  3. Refactor: dead models, abandoned dashboards, undocumented columns.
  4. Pilot new techniques: open table formats, declarative orchestration, semantic layers.
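The five-figures-a-month claim in item 2 above is easy to sanity-check with back-of-envelope arithmetic. The per-credit rate and usage numbers here are illustrative assumptions, not any vendor's pricing:

```python
# Back-of-envelope warehouse cost check. All figures are assumptions.
credit_price_usd = 3.0
credits_per_run = 8    # e.g. an unpartitioned full-table rebuild
runs_per_day = 24      # hourly schedule
days_per_month = 30

monthly_cost = credit_price_usd * credits_per_run * runs_per_day * days_per_month
print(f"${monthly_cost:,.0f}/month")  # $17,280/month
```

One incremental model or a partition filter on that rebuild typically cuts the multiplier by an order of magnitude, which is why cost review belongs in the quarterly loop.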

Required Skills

Data engineering sits at the intersection of distributed systems, software engineering, and analytical modeling — and senior hires reflect that.

SQL comes first, and the bar is higher than most candidates expect. Window functions, CTEs, query plans, partitioning behavior, and the difference between a model that works on a 10M-row sample and one that survives a 10B-row production table are the working vocabulary [6][7].
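Window functions and CTEs, two items from that vocabulary, fit in one small query. This runs against an in-memory SQLite database (assuming a build with window-function support, SQLite 3.25+, which ships with modern Python); the `orders` table is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('a', '2024-01-01', 10.0),
        ('a', '2024-01-02', 20.0),
        ('b', '2024-01-01', 5.0);
""")

# CTE plus a window function: running revenue per customer.
rows = conn.execute("""
    WITH ordered AS (
        SELECT customer, order_date, amount FROM orders
    )
    SELECT customer, order_date,
           SUM(amount) OVER (
               PARTITION BY customer ORDER BY order_date
           ) AS running_total
    FROM ordered
    ORDER BY customer, order_date
""").fetchall()
print(rows)
# [('a', '2024-01-01', 10.0), ('a', '2024-01-02', 30.0), ('b', '2024-01-01', 5.0)]
```

The interview-level follow-up is what this query does to a 10B-row table: the partition-and-sort behind the window is where the plan, the shuffle, and the cost live.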

Python is the second pillar — pandas, PySpark, asyncio for I/O-bound ingestion, packaging, testing, type hints. Reis and Housley argue data engineering is a software engineering discipline with data-shaped concerns, and the codebase has to reflect that [4].
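The asyncio piece of that pillar is worth a concrete shape. A sketch of I/O-bound ingestion where fetches run concurrently instead of serially; `fetch_page` simulates an API call with a sleep, and all names are illustrative:

```python
import asyncio

async def fetch_page(page: int) -> list[dict]:
    """Stand-in for one paginated API request."""
    await asyncio.sleep(0.01)  # simulated network latency
    return [{"page": page, "id": page * 100}]

async def ingest(pages: int) -> list[dict]:
    # gather() runs all page fetches concurrently; total wall time is
    # roughly one request's latency, not pages * latency.
    results = await asyncio.gather(*(fetch_page(p) for p in range(pages)))
    return [row for batch in results for row in batch]

rows = asyncio.run(ingest(3))
assert [r["page"] for r in rows] == [0, 1, 2]
```

Packaging, tests, and type hints then apply to this code exactly as they would to any service codebase, which is the Reis-and-Housley point.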

Distributed systems literacy is non-negotiable above the junior band. A working engineer understands partitioning, shuffle, skew, late-arriving data, exactly-once vs. at-least-once delivery, and why most "real-time" requirements turn out to be five-minute batch on closer inspection [8][10].
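The delivery-semantics distinction above has a standard recovery pattern: at-least-once delivery means duplicates, and effectively-exactly-once processing is usually rebuilt with an idempotent sink keyed on a message id. A sketch where an in-memory set stands in for a keyed store:

```python
def process(messages, seen=None):
    """Consume an at-least-once stream, deduplicating on message id."""
    seen = set() if seen is None else seen
    out = []
    for msg in messages:
        if msg["id"] in seen:   # duplicate redelivery: skip it
            continue
        seen.add(msg["id"])
        out.append(msg["value"])
    return out

# The broker redelivers id=1 after a simulated consumer crash:
stream = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"},
          {"id": 1, "value": "a"}]
assert process(stream) == ["a", "b"]
```

In production the `seen` set becomes a keyed table or upsert target, and the hard questions are retention and late-arriving data, not the loop.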

Warehouse modeling is the discipline-defining skill. Taking a fuzzy stakeholder request and producing a grain, a fact table, dimensions, and a metric definition that survives the next five questions separates a junior engineer from a senior one [6].

Orchestration, dbt, streaming, and data quality are the working tooling layer. Airflow or Dagster for batch, dbt for transformations, Kafka with Flink or Spark Structured Streaming for events, and tests-as-code for quality. Beauchemin's "Functional Data Engineering" is the best public articulation of the immutability and idempotency principles that make these systems survivable [5]. Writing is the multiplier: the strongest engineers write clear runbooks, postmortems, and design docs.
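The immutability-and-idempotency principles from "Functional Data Engineering" can be stated as one property: a transformation is a pure function of an immutable input partition, so re-running any day reproduces the same output. A minimal sketch with illustrative names:

```python
def transform_partition(events: list[dict]) -> dict:
    """Pure: no clock reads, no mutation of inputs, no external state."""
    return {
        "event_count": len(events),
        "revenue_usd": round(sum(e["amount"] for e in events), 2),
    }

partition = [{"amount": 9.99}, {"amount": 20.0}]
first = transform_partition(partition)
second = transform_partition(partition)  # e.g. a backfill run
assert first == second == {"event_count": 2, "revenue_usd": 29.99}
```

Purity is what makes backfills safe: replaying a month of partitions cannot produce a different answer than the original run did.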

Education and Certifications

A bachelor's in computer science, software engineering, applied math, or a related quantitative field is the most common entry credential. Master's degrees are not required. The strongest senior candidates often have no graduate degree but five-plus years of shipped production pipelines.

Bootcamp and self-taught paths are real, particularly for engineers moving from analytics, BI development, or general software engineering. The strongest non-traditional path is a portfolio of two to four production-grade pipelines plus working experience operating a warehouse or lakehouse.

Cloud certifications carry real signal: AWS Certified Data Engineer - Associate, Google Professional Data Engineer, and Microsoft Azure Data Engineer Associate are the most common. Snowflake and Databricks run well-respected platform certifications. None are required; all signal seriousness.

The cited canon is Reis and Housley's Fundamentals of Data Engineering, Kimball and Ross's The Data Warehouse Toolkit, Kleppmann's Designing Data-Intensive Applications, Beauchemin's essays, the dbt Labs blog, the Confluent and Databricks engineering blogs, and the AWS Builders' Library [4][5][6][7][8][9][10][11].

Work Environment and Schedule

Data engineers work in tech, financial services, healthcare, retail, and increasingly any enterprise with a meaningful analytics or ML practice. Many roles are remote or hybrid. The work is desk-bound: SQL, Python, terminals, dashboards, runbooks.

The week splits between pipeline development, on-call, modeling, and stakeholder syncs. Maker time goes to writing — design docs, runbooks, postmortems. Warehouse design under interruption produces shallow decisions.

On-call rotations are standard at any company with production pipelines. A typical rotation is one week per month, with paging tied to freshness SLAs, pipeline failures, and warehouse cost anomalies. The on-call experience is the single best predictor of platform quality: a brutal rotation almost always means tests are missing, contracts are absent, and ownership is unclear.

Salary by Experience

BLS does not publish a Data Engineer occupation. SOC 15-1242 Database Architects ($141,210 May 2024 median) is the closest analog by content but understates FAANG-tier and AI-lab comp; SOC 15-1252 Software Developers ($133,080 median) sweeps the role into a much broader software population [1][2]. The most reliable public source is levels.fyi, which aggregates real offer data from Meta, Google, Apple, Stripe, Airbnb, Uber, Netflix, Databricks, Snowflake, OpenAI, Anthropic, and other major employers — see levels.fyi/t/data-engineer [3].

The structure that recurs across the industry:

  • Junior / Associate Data Engineer: zero to two years, often fresh from a CS bachelor's or lateral from analytics; scoped to ingestion work and dbt models on a single product surface.
  • Data Engineer (mid): two to five years, owns end-to-end pipelines, partners with one or two analytics teams, on-call participant.
  • Senior Data Engineer: five-plus years, owns a domain (a product surface, a warehouse subject area, or a platform component), mentors juniors, leads architecture for medium-scope projects. Total comp at FAANG-tier employers commonly clears the senior software engineer band.
  • Staff Data Engineer: cross-team scope, leads the data platform or a major subject area, sets technical direction. Comp approaches the staff software engineer band.
  • Principal Data Engineer: org-level scope, sets data architecture direction, mentors staff engineers. Comp overlaps the principal software engineer band.

AI-lab and FAANG-tier comp at staff and principal can move materially higher with equity, particularly where the data platform is a core competitive moat (Databricks, Snowflake, OpenAI, Anthropic, Stripe). Triangulate levels.fyi for the specific company tier and scope; the BLS proxies mislead at senior bands.

Career Outlook

BLS does not separate data engineers from broader categories, so there is no role-specific projection. Both proxy SOCs project growth above the national average through 2034: Software Developers (15-1252) at 17 percent and Database Architects (15-1242) at 9 percent over the decade, but each describes a different population than data engineering specifically [1][2].

Two structural pressures shape the outlook. The shift from Hadoop-era platforms to cloud warehouses and lakehouses has compressed the gap between analyst and engineer at the SQL layer while expanding it at the platform layer — analytics engineers absorbed dbt-modeling work, while data engineers moved deeper into streaming, lineage, governance, and platform. AI workloads — vector stores, feature stores, training-data pipelines, evaluation harnesses — have created sharp new demand for engineers who can operate these systems alongside the classical analytics stack. Teams that treat data engineering as a software discipline are growing; teams that treat it as tooling administration are shrinking.

How to Become a Data Engineer

Three paths are common. The first is CS-degree-to-industry: an internship, then a junior or mid-level role at a company with a meaningful data platform. The second is lateral from software engineering, analytics engineering, BI development, or DBA work. The third is bootcamp-plus-portfolio.

  1. Pick a platform spine. Decide whether your strength leans batch-warehouse (SQL, dbt, Airflow, Snowflake/BigQuery) or streaming-platform (Kafka, Flink, Spark, lakehouse). Read Reis and Housley for the lifecycle, Kimball for warehouse modeling, Beauchemin for orchestration, and Kleppmann's Designing Data-Intensive Applications for distributed systems [4][5][6][11].
  2. Build a portfolio of two to four production-grade pipelines. Each should show the source, ingestion, model, tests, orchestration, and runbook. The runbook is what distinguishes a portfolio from a class project.
  3. Develop the writing habit early. Publish design docs, postmortems, and modeling decisions.
  4. Get experience operating a warehouse. A free Snowflake, BigQuery, or Databricks tier plus a week of dbt models on a real dataset will teach more than any course.
  5. Learn one streaming tool deeply: Kafka with Flink, Spark Structured Streaming, or Kafka Streams is the most transferable [8].
  6. Find a mentor one or two levels ahead, in a similar organization type. The discipline differs sharply between FAANG, AI-lab, SaaS, and enterprise data teams.
  7. Plan the longer arc. Junior -> mid -> senior -> staff -> principal typically spans 8 to 15 years.

FAQ

Is data engineering the same as analytics engineering? No. An analytics engineer typically owns the dbt-modeling layer between the warehouse and the BI tool; a data engineer owns the platform — ingestion, orchestration, streaming, storage, lineage, and the warehouse itself. The roles overlap most heavily at small companies [7].

How is data engineering different from data science or software engineering? A data scientist asks questions of data and ships models; a software engineer ships product features; a data engineer ships the platform that makes both possible [4][11].

Do data engineers need a CS degree? Common at FAANG-tier employers, not strictly required. At senior bands, shipped production pipelines outweigh the credential.

What is the salary for a data engineer? BLS does not publish a Data Engineer median. The closest proxies — SOC 15-1242 ($141,210) and SOC 15-1252 ($133,080) — both miss. The most reliable public source is levels.fyi's Data Engineer track [1][2][3].

Is data engineering a good career? For people who get energy from operating systems, debugging at the seams between teams, and building durable assets that compound, yes. For people whose primary energy comes from analysis or product features, science or software engineering is usually a better fit.

What is the career path after senior data engineer? The IC track — staff, then principal — or the platform-engineering, analytics-engineering, and management tracks.

Does BLS publish data for data engineers? No. The closest occupations (SOC 15-1242 Database Architects and SOC 15-1252 Software Developers) both misrepresent the role; levels.fyi is the most defensible public compensation source [1][2][3].

Sources

  1. U.S. Bureau of Labor Statistics, OES, "Database Architects" (SOC 15-1242), May 2024 median wage $141,210.
  2. U.S. Bureau of Labor Statistics, Occupational Outlook Handbook, "Software Developers" (SOC 15-1252), May 2024 median wage $133,080.
  3. levels.fyi, Data Engineer compensation track.
  4. Joe Reis and Matt Housley, Fundamentals of Data Engineering, O'Reilly, 2022.
  5. Maxime Beauchemin, "The Rise of the Data Engineer" and "Functional Data Engineering"; maximebeauchemin.medium.com.
  6. Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, Wiley, 3rd ed. 2013; Kimball Group techniques.
  7. dbt Labs engineering blog, getdbt.com/blog.
  8. Confluent engineering blog, confluent.io/blog.
  9. Amazon Web Services, AWS Builders' Library.
  10. Databricks engineering blog, databricks.com/blog/category/engineering.
  11. Martin Kleppmann, Designing Data-Intensive Applications, O'Reilly, 2017.

Blake Crosley — Former VP of Design at ZipRecruiter, Founder of ResumeGeni

About Blake Crosley

Blake Crosley spent 12 years at ZipRecruiter, rising from Design Engineer to VP of Design. He designed interfaces used by 110M+ job seekers and built systems processing 7M+ resumes monthly. He founded ResumeGeni to help candidates communicate their value clearly.

