Data Engineer Hub

Streaming and Event Processing for Data Engineers: Kafka, Flink, CDC, and Exactly-Once (2026)

In short

Streaming and event processing is the modern alternative to batch ETL. Apache Kafka is the de-facto distributed log: producers append to partitioned topics, consumer groups read with tracked offsets, partition keys preserve per-entity ordering. For stateful work (joins, windowed aggregations, sessionization) Apache Flink is the low-latency reference engine; Kafka Streams is the lighter library when the workload is Kafka-native. CDC via Debezium turns Postgres/MySQL row changes into Kafka events without dual-writes. Exactly-once is effectively-once end-to-end — idempotent producers, Kafka transactions, and Flink two-phase-commit checkpoints close the gap. Watermarks govern late data; Kappa replaces Lambda; Materialize and RisingWave operationalize streaming SQL materialized views.

Key takeaways

  • Apache Kafka is the canonical distributed log. Topics are partitioned for parallelism; order is per-partition, never global. Partition-key choice is the single most important design decision.
  • Consumer groups give parallelism with one-consumer-per-partition semantics. Offsets are the durable cursor — auto-commit is convenient and dangerous; commit explicitly after the side effect succeeds.
  • Apache Flink is the reference stateful streaming engine: event-time watermarks, RocksDB-backed keyed state, incremental checkpoints to S3/GCS. The choice for non-trivial joins, sessionization, sub-second SLOs.
  • Kafka Streams is a JVM library, not a cluster. Pick it for Kafka-native, JVM-only, simpler topologies; pick Flink for cross-source workloads or Python-led teams using PyFlink.
  • CDC with Debezium reads the database transaction log (Postgres logical replication, MySQL binlog) and emits per-row INSERT/UPDATE/DELETE events. The correct way to stream a relational database — never poll, never dual-write.
  • Exactly-once is effectively-once end-to-end. The broker may physically deliver twice; idempotent producers, Kafka transactions, and Flink's two-phase-commit sinks make the observable outcome single-application.
  • Watermarks govern late data; Kappa replaces Lambda; Materialize and RisingWave operationalize streaming materialized views. AWS Kinesis and Apache Pulsar are the credible alternatives to self-hosted Kafka.

Kafka 101: topics, partitions, consumer groups, offsets

Apache Kafka is a distributed append-only log. The Kafka documentation is the canonical reference; the operational model is worth internalizing before any framework on top of it.

  • Topic. A named stream divided into partitions — the unit of parallelism, ordering, and replication. A topic with 12 partitions can be read in parallel by up to 12 consumers in one group.
  • Partition. An ordered, immutable sequence of records. Each record has a monotonically-increasing 64-bit offset. Order is guaranteed within a partition only. Partitioning by a stable key (e.g. user_id) ensures per-entity ordering.
  • Consumer group. Kafka assigns each partition to exactly one consumer in the group. Partitions are the parallelism ceiling — additional consumers sit idle.
  • Offset. The consumer's cursor, stored in the internal __consumer_offsets topic. Auto-commit is the dangerous default; production systems commit explicitly after the side effect succeeds.
  • Durability. min.insync.replicas plus producer acks=all is the durability contract — both must be set or the cluster will silently accept writes that can be lost.

A minimal Python producer and consumer using confluent-kafka-python:

from confluent_kafka import Producer, Consumer

producer = Producer({
    "bootstrap.servers": "broker:9092",
    "acks": "all",
    "enable.idempotence": True,
})
producer.produce("orders", key="user-42", value=b'{"order_id": 1001}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "orders-processor",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["orders"])
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print(f"consumer error: {msg.error()}")  # surface errors; don't silently skip
        continue
    process(msg.value())   # process() is the application's handler
    consumer.commit(msg)   # commit only after success

enable.idempotence=True gives per-partition deduplication on retries via producer-id and sequence numbers. enable.auto.commit=False with explicit commit(msg) after successful processing is the at-least-once primitive most production pipelines build on.

Stateful streaming: Flink vs Kafka Streams

Stateless transforms (filter, map, project) are trivial. Stateful work — joins, windowed aggregations, sessionization, deduplication, top-K — requires per-key local state that survives restarts and rescaling. The two dominant choices are Apache Flink and Kafka Streams.

Apache Flink is a JVM-based clustered runtime. You compose a DAG of operators; Flink schedules tasks across TaskManagers coordinated by a JobManager. Each keyed operator has local state backed by RocksDB on TaskManager local disk, incrementally checkpointed to S3/GCS/HDFS.

  • Best-in-class event-time processing with watermarks, allowed lateness, side outputs, and savepoints for versioned state migration.
  • SQL via Flink SQL plus the Table API; PyFlink for Python teams (with UDF overhead through Py4J).
  • Sub-second end-to-end latency at scale; production at Stripe, Netflix, Uber, Pinterest at millions of events per second.
  • Operational cost is real. Most teams use a managed offering (AWS Managed Service for Apache Flink, Confluent Cloud Flink, Aiven, Ververica).

Kafka Streams is a Java/Scala library, not a cluster. Your application JVM hosts the topology; Kafka itself stores state-store changelogs and coordinates rebalancing. No separate runtime to operate.

  • Lower operational footprint — an ordinary JVM service. Scale by running more instances.
  • Tightly coupled to Kafka in / Kafka out. Cross-system integration requires Kafka Connect.
  • JVM-only. No production-grade Python or Go equivalent.
  • Topology complexity scales worse than Flink; recovery is slower because state is rebuilt by replaying the changelog topic.

Decision rule: Flink for multi-source workloads, sub-second SLOs, complex event-time semantics, or Python-led teams. Kafka Streams for Kafka in / Kafka out, JVM-native teams, single-digit-second budgets, when avoiding a separate clustered system has organizational value. Both pass the exactly-once test below; both run at scale at tier-1 companies. Fundamentals of Data Engineering (Reis & Housley, O'Reilly 2022) is the canonical book reference.
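Whichever engine you pick, the core abstraction is the same: per-key local state keyed a second time by window. A framework-free Python sketch of per-key tumbling-window counting — the class and method names here are illustrative, not any Flink or Kafka Streams API:

```python
from collections import defaultdict

class TumblingCounter:
    """Toy per-key tumbling-window counter: the shape of local state a
    keyed windowed aggregation maintains in Flink or Kafka Streams."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        # state is keyed twice: (entity key, window start) -> count
        self.state = defaultdict(int)

    def add(self, key, event_ts_ms):
        # align the event timestamp to its window's start
        window_start = (event_ts_ms // self.window_ms) * self.window_ms
        self.state[(key, window_start)] += 1

    def result(self, key, window_start):
        return self.state[(key, window_start)]

counter = TumblingCounter(window_ms=60_000)   # 1-minute tumbling windows
for ts in (0, 10_000, 59_999, 60_000):
    counter.add("user-42", ts)

print(counter.result("user-42", 0))        # first minute: 3 events
print(counter.result("user-42", 60_000))   # second minute: 1 event
```

What the real engines add on top of this dict is exactly the hard part: durability (RocksDB plus checkpoints, or changelog topics), rescaling of keyed state, and window firing driven by watermarks.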

Change Data Capture (CDC) with Debezium

The wrong way to stream a relational database is to poll: SELECT * FROM orders WHERE updated_at > ? misses deletes, double-counts updates between polls, hammers the primary, and drifts under load. The right way is CDC: read the database's transaction log and emit one Kafka event per row mutation. Debezium is the open-source reference, deployed as Kafka Connect source connectors.

Debezium reads Postgres logical replication slots (pgoutput), MySQL binlog, SQL Server CDC tables, MongoDB oplog, or Oracle LogMiner. Each mutation becomes a Kafka message with a structured envelope: op (c/u/d/r), before and after images, source metadata (LSN, transaction id), and a tombstone for deletes when log compaction is enabled.

A representative Postgres connector configuration:

{
  "name": "orders-postgres-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/secrets/db.properties:password}",
    "database.dbname": "orders_prod",
    "topic.prefix": "orders_prod",
    "table.include.list": "public.orders,public.order_items",
    "slot.name": "debezium_orders",
    "publication.autocreate.mode": "filtered",
    "snapshot.mode": "initial",
    "tombstones.on.delete": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter"
  }
}

Operational notes that catch new teams:

  • Replication slots can fill the WAL — if the connector pauses, Postgres retains WAL until the slot advances; monitor pg_replication_slots.confirmed_flush_lsn.
  • snapshot.mode=initial reads every row of every included table as op=r; use schema_only when you have an alternative bulk-load path.
  • Schema evolution is additive-safe via the schema registry; breaking changes require explicit coordination.
  • One topic per table is the default — Flink stream-stream joins denormalize before publishing fact streams.
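To make the envelope concrete, here is a Python sketch that applies simplified Debezium-style events to an in-memory table replica. The op/before/after fields follow Debezium's envelope; the events themselves and the single-column primary key are made up for illustration:

```python
def apply_cdc(table, event):
    """Apply one simplified Debezium-style envelope to a dict keyed by
    primary key. op: c=create, u=update, d=delete, r=snapshot read."""
    op = event["op"]
    if op in ("c", "u", "r"):      # upsert the after-image
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                # remove by the before-image key
        table.pop(event["before"]["id"], None)
    return table

replica = {}
events = [
    {"op": "r", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
                "after":  {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]
for e in events:
    apply_cdc(replica, e)

print(replica)   # {1: {'id': 1, 'status': 'paid'}}
```

Replaying the topic from offset zero rebuilds the replica deterministically — which is exactly why a compacted CDC topic can serve as a durable table snapshot.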

Exactly-once semantics — what it actually guarantees

Exactly-once is the most-marketed and most-misunderstood feature in distributed streaming. The honest definition: effectively-once end-to-end. The broker may physically deliver a message twice on retry; the system is constructed so the observable downstream effect is identical to single delivery. Three components compose the guarantee in the Kafka + Flink stack.

  1. Idempotent producer. enable.idempotence=true assigns a producer-id and tracks per-partition sequence numbers; the broker deduplicates retries within a session. Local exactly-once for produce, not end-to-end.
  2. Transactional producer. transactional.id plus initTransactions(), beginTransaction(), commitTransaction() wraps sends across partitions and topics atomically. Consumers with isolation.level=read_committed see only committed transactions. This is the read-process-write pattern: consume, transform, produce, commit offsets and records together.
  3. Flink two-phase-commit checkpointing. TwoPhaseCommitSinkFunction coordinates external sinks with checkpoint barriers. Phase 1 pre-commits; phase 2 commits on global checkpoint completion. The Kafka sink is the canonical implementation; JDBC sinks need XA or application-level idempotency.
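The idempotent-producer mechanism in step 1 is easy to simulate: the broker accepts a record only when its sequence number is the next expected one for that (producer-id, partition) pair. A toy Python model — not the real broker protocol, which also handles batching and a bounded window of in-flight requests:

```python
class Broker:
    """Toy model of per-(producer-id, partition) sequence tracking."""

    def __init__(self):
        self.next_seq = {}   # (producer_id, partition) -> next expected seq
        self.log = []        # accepted records, in append order

    def append(self, producer_id, partition, seq, record):
        key = (producer_id, partition)
        expected = self.next_seq.get(key, 0)
        if seq < expected:        # retry of an already-appended record
            return "duplicate"
        if seq > expected:        # gap: reject, producer must re-send
            return "out_of_sequence"
        self.next_seq[key] = expected + 1
        self.log.append(record)
        return "ok"

broker = Broker()
print(broker.append("p1", 0, 0, "order-1001"))   # ok
print(broker.append("p1", 0, 0, "order-1001"))   # duplicate (network retry)
print(broker.append("p1", 0, 1, "order-1002"))   # ok
print(broker.log)                                # ['order-1001', 'order-1002']
```

This is why the guarantee is per-partition and per-producer-session: the sequence counter resets with a new producer-id, which is the gap transactions and two-phase commit exist to close.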

A minimal Flink Java configuration for exactly-once:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
CheckpointConfig cfg = env.getCheckpointConfig();
cfg.setMinPauseBetweenCheckpoints(30_000);
cfg.setCheckpointTimeout(120_000);
cfg.setMaxConcurrentCheckpoints(1);
cfg.setExternalizedCheckpointCleanup(
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));  // true = incremental checkpoints
// sink: KafkaSink.builder().setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE) ...

What exactly-once does not give you: protection against non-idempotent external side effects outside the checkpoint barrier (emails, credit-card charges, third-party REST calls — those need application-level idempotency keys); improved throughput (transactions add latency); freedom from thinking about replay, savepoints, and schema evolution. Practical rule: turn it on for financial, billing, and audit-grade pipelines; run at-least-once for analytics where downstream tables dedupe by key anyway.
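For side effects outside the checkpoint barrier, the application-level idempotency key mentioned above looks like this in miniature. The store and charge_card function are illustrative stand-ins; production uses a durable store (a database unique constraint, Redis SETNX) rather than an in-memory set:

```python
processed = set()   # stand-in: production needs a durable store with atomic insert
charges = []

def charge_card(order_id, amount_cents):
    charges.append((order_id, amount_cents))   # stand-in for a payment API call

def charge_once(idempotency_key, order_id, amount_cents):
    """Execute the non-idempotent side effect at most once per key,
    even when the stream processor replays the event after recovery."""
    if idempotency_key in processed:
        return "skipped"
    charge_card(order_id, amount_cents)
    processed.add(idempotency_key)
    return "charged"

print(charge_once("order-1001", 1001, 4999))  # charged
print(charge_once("order-1001", 1001, 4999))  # skipped (replay after restart)
print(len(charges))                            # 1
```

Note the check-then-act race: with concurrent consumers the membership test and the insert must be one atomic operation, which is why a unique-constraint insert is the usual production shape.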

Watermarks, Kappa, materialized views, and the Kafka alternatives

Watermarks. Processing-time is easy and wrong for analytics; event-time is correct. A watermark carrying timestamp T asserts that no event with a timestamp earlier than T is expected to arrive. Windows emit when the watermark crosses the window end; allowed lateness extends state retention; events arriving later still go to a side output. WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofMinutes(5)) is the practical default — derive the bound from observed delay, not optimism.
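The bounded-out-of-orderness strategy is simple to model: the watermark trails the maximum observed event timestamp by a fixed delay, and a window fires once the watermark passes its end. A Python sketch — class and method names are illustrative, not the Flink API:

```python
class BoundedOutOfOrderness:
    """Watermark = max observed event timestamp minus the allowed delay."""

    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self.max_ts = float("-inf")

    def observe(self, event_ts_ms):
        # out-of-order events never move the watermark backwards
        self.max_ts = max(self.max_ts, event_ts_ms)

    def watermark(self):
        return self.max_ts - self.max_delay_ms

wm = BoundedOutOfOrderness(max_delay_ms=5 * 60_000)   # 5-minute bound
window_end = 600_000                                   # a [0:00, 10:00) window

wm.observe(700_000)
print(wm.watermark() >= window_end)   # False: window stays open
wm.observe(900_001)
print(wm.watermark() >= window_end)   # True: watermark passed 10:00, fire
```

The trade is explicit: a larger bound admits more late data but delays every window result by the same amount.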

Kappa. Jay Kreps's 2014 essay introduced Kappa as the alternative to Nathan Marz's Lambda. Lambda's complaint is two implementations of the same business logic; Kappa runs one streaming pipeline and replays the log to recompute. Now the default at streaming-first orgs.

Streaming materialized views. Materialize (Timely Dataflow) and RisingWave (Rust, PostgreSQL-wire-compatible) operationalize Kappa: write standard SQL, get incrementally-maintained views over streams. Younger than Flink; appropriate when the workload fits SQL and latency budget is in seconds.

Kafka alternatives. AWS Kinesis is the AWS-native managed option (shards = partitions); Firehose sinks to S3/Redshift; Managed Service for Apache Flink runs jobs. Operationally simpler, lower per-shard throughput, smaller connector ecosystem. Apache Pulsar is the open-source alternative — segmented BookKeeper storage, native multi-tenancy, built-in geo-replication, tiered storage to S3, adopted at Yahoo, Tencent, Splunk. Right when multi-tenancy or geo-replication is first-order; Kafka remains the broader-ecosystem default.

Frequently asked questions

When should I pick Kafka Streams over Flink?
Kafka Streams when the workload is Kafka in / Kafka out, the team is JVM-native, latency budgets are single-digit seconds, and avoiding a separate clustered runtime has organizational value. Flink when you have multiple sources (Kafka plus CDC plus REST), need sub-second SLOs, require complex event-time semantics, or the team is Python-led on PyFlink.
Is exactly-once semantics actually exactly-once?
No. The accurate name is effectively-once end-to-end. The broker may physically redeliver on retry; the observable downstream effect is single-application via idempotent producers, transactional sends, and Flink's two-phase-commit sinks. Effects outside the transaction boundary — REST calls, emails, charges — still need application-level idempotency keys.
How do I choose a partition count and partition key?
Partition count is the parallelism ceiling — you cannot read with more consumers per group than partitions. Start with 2-3x peak expected consumer count, ~10 MB/s per partition; common defaults 12, 24, 48. Adding partitions later breaks per-key ordering, so over-provision modestly. Always pick a stable, high-cardinality partition key (user_id, account_id, device_id) — keyless records go round-robin and break stream-stream joins.
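The repartitioning hazard above can be demonstrated directly: a key's partition is hash(key) mod partition-count, so changing the count remaps keys. Kafka's default partitioner uses murmur2; this sketch substitutes CRC32 purely as a stand-in with the same structural behavior:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Kafka's default partitioner is murmur2-based; CRC32 here is an
    # illustrative stand-in, not the real partitioner.
    return zlib.crc32(key) % num_partitions

keys = [f"user-{i}".encode() for i in range(1000)]
moved = sum(1 for k in keys if partition_for(k, 12) != partition_for(k, 24))

print(moved > 0)   # True: growing 12 -> 24 partitions remaps many keys,
                   # so per-key order is broken across the resize boundary
```

New events for a remapped key land on a different partition than its old events, so any consumer relying on per-key order sees the two interleave — which is why the FAQ advises over-provisioning modestly instead of resizing later.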
How long should I retain a Kafka topic for Kappa-style replay?
The max of your longest expected backfill window and the longest downstream-incident recovery time. Typical production: 7 days for high-volume operational topics, 30-90 days for analytical topics, infinite for source-of-truth event-sourcing logs offloaded via Kafka Tiered Storage or Pulsar's segmented storage to S3.
What is a sensible Flink checkpoint interval?
60 seconds with 30-second min-pause is a reasonable starting point for sub-minute recovery. 10-30 seconds for tight RTO; 5-10 minutes for very large state where checkpoint duration becomes the bottleneck. Always enable incremental checkpoints on RocksDB at non-trivial state size, externalize to durable object storage, retain on cancellation.

Sources

  1. Apache Kafka — official documentation. Topics, partitions, consumer groups, idempotent producers, transactions, replication. The canonical reference.
  2. Debezium — official reference documentation. Postgres, MySQL, SQL Server, MongoDB, Oracle connectors. The canonical CDC implementation on Kafka Connect.
  3. Confluent Blog — long-running engineering content on Kafka design, transactions, streaming SQL, schema registry, ksqlDB, and Flink-on-Confluent. Multiple definitive posts on exactly-once semantics by the original authors.
  4. AWS Kinesis — Data Streams, Data Firehose, and Managed Service for Apache Flink. The AWS-native managed alternative to self-hosted Kafka with a different operational and cost profile.
  5. Reis & Housley — Fundamentals of Data Engineering (O'Reilly, 2022). Streaming chapters cover Kafka, Flink, CDC, exactly-once, and the Kappa pattern.

About the author. Blake Crosley founded ResumeGeni and writes about data engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.