Streaming and Event Processing for Data Engineers: Kafka, Flink, CDC, and Exactly-Once (2026)
In short
Streaming and event processing is the modern alternative to batch ETL. Apache Kafka is the de-facto distributed log: producers append to partitioned topics, consumer groups read with tracked offsets, partition keys preserve per-entity ordering. For stateful work (joins, windowed aggregations, sessionization) Apache Flink is the low-latency reference engine; Kafka Streams is the lighter library when the workload is Kafka-native. CDC via Debezium turns Postgres/MySQL row changes into Kafka events without dual-writes. Exactly-once is effectively-once end-to-end — idempotent producers, Kafka transactions, and Flink two-phase-commit checkpoints close the gap. Watermarks govern late data; Kappa replaces Lambda; Materialize and RisingWave operationalize streaming SQL materialized views.
Key takeaways
- Apache Kafka is the canonical distributed log. Topics are partitioned for parallelism; order is per-partition, never global. Partition-key choice is the single most important design decision.
- Consumer groups give parallelism with one-consumer-per-partition semantics. Offsets are the durable cursor — auto-commit is convenient and dangerous; commit explicitly after the side effect succeeds.
- Apache Flink is the reference stateful streaming engine: event-time watermarks, RocksDB-backed keyed state, incremental checkpoints to S3/GCS. The choice for non-trivial joins, sessionization, sub-second SLOs.
- Kafka Streams is a JVM library, not a cluster. Pick it for Kafka-native, JVM-only, simpler topologies; pick Flink for cross-source workloads or Python-led teams using PyFlink.
- CDC with Debezium reads the database transaction log (Postgres logical replication, MySQL binlog) and emits per-row INSERT/UPDATE/DELETE events. The correct way to stream a relational database — never poll, never dual-write.
- Exactly-once is effectively-once end-to-end. The broker may physically deliver twice; idempotent producers, Kafka transactions, and Flink's two-phase-commit sinks make the observable outcome single-application.
- Watermarks govern late data; Kappa replaces Lambda; Materialize and RisingWave operationalize streaming materialized views. AWS Kinesis and Apache Pulsar are the credible alternatives to self-hosted Kafka.
Kafka 101: topics, partitions, consumer groups, offsets
Apache Kafka is a distributed append-only log. The Kafka documentation is the canonical reference; the operational model is worth internalizing before any framework on top of it.
- Topic. A named stream divided into partitions — the unit of parallelism, ordering, and replication. A topic with 12 partitions can be read in parallel by up to 12 consumers in one group.
- Partition. An ordered, immutable sequence of records. Each record has a monotonically-increasing 64-bit offset. Order is guaranteed within a partition only. Partitioning by a stable key (e.g. user_id) ensures per-entity ordering.
- Consumer group. Kafka assigns each partition to exactly one consumer in the group. Partitions are the parallelism ceiling — additional consumers sit idle.
- Offset. The consumer's cursor, stored in the internal __consumer_offsets topic. Auto-commit is the dangerous default; production systems commit explicitly after the side effect succeeds.
- Durability. min.insync.replicas plus producer acks=all is the durability contract — both must be set or the cluster will silently accept writes that can be lost.
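The key-to-partition invariant above is worth seeing concretely. Kafka's default partitioner hashes the key with murmur2; this stdlib sketch substitutes an MD5-based stand-in, so the specific partition numbers differ from a real broker, but the invariant it illustrates (same key, same partition, hence per-entity order) is the same:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Toy stand-in for Kafka's default partitioner (which uses murmur2):
    a stable hash of the key, modulo the partition count."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key always lands on the same partition, preserving per-entity order.
p1 = partition_for(b"user-42", 12)
p2 = partition_for(b"user-42", 12)
assert p1 == p2

# Note: changing num_partitions remaps keys, which is why adding
# partitions to a live topic breaks per-key ordering.
```

The modulo at the end is also why partition counts are sticky: any change to num_partitions reshuffles which partition each key maps to.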
A minimal Python producer and consumer using confluent-kafka-python:
```python
from confluent_kafka import Producer, Consumer

producer = Producer({
    "bootstrap.servers": "broker:9092",
    "acks": "all",
    "enable.idempotence": True,
})
producer.produce("orders", key="user-42", value=b'{"order_id": 1001}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "orders-processor",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["orders"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    process(msg.value())
    consumer.commit(msg)  # commit only after success
```
enable.idempotence=True gives per-partition deduplication on retries via producer-id and sequence numbers. enable.auto.commit=False with explicit commit(msg) after successful processing is the at-least-once primitive most production pipelines build on.
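Explicit-commit-after-side-effect is at-least-once: a crash between the side effect and the commit replays the record. The standard complement is an idempotent sink, where applying the same record twice leaves the same final state. A minimal sketch (the in-memory dict stands in for a keyed store such as a database upsert):

```python
class UpsertSink:
    """Toy idempotent sink: keyed last-write-wins upserts make
    at-least-once replays harmless, because applying the same
    record twice produces the same final state."""
    def __init__(self):
        self.table = {}

    def apply(self, key, value):
        self.table[key] = value  # last-write-wins upsert

sink = UpsertSink()
# A replay after a crash redelivers ("user-42", 100); the duplicate
# apply is a no-op at the level of observable state.
for key, value in [("user-42", 100), ("user-42", 100), ("user-7", 5)]:
    sink.apply(key, value)

assert sink.table == {"user-42": 100, "user-7": 5}
```

This is why at-least-once plus idempotent writes is often good enough for analytics pipelines, as the exactly-once section below also concludes.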
Stateful streaming: Flink vs Kafka Streams
Stateless transforms (filter, map, project) are trivial. Stateful work — joins, windowed aggregations, sessionization, deduplication, top-K — requires per-key local state that survives restarts and rescaling. The two dominant choices are Apache Flink and Kafka Streams.
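To make "per-key local state" concrete before comparing engines, here is a toy, engine-independent tumbling-window count in plain Python. The dict of counters is exactly the state that Flink or Kafka Streams would keep in a fault-tolerant, partitioned state store rather than in process memory:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Toy per-key tumbling-window count. events is an iterable of
    (key, event_time_ms); each event increments the counter for its
    key in the window containing its timestamp."""
    counts = defaultdict(int)  # (key, window_start_ms) -> count
    for key, timestamp_ms in events:
        window_start = (timestamp_ms // window_ms) * window_ms
        counts[(key, window_start)] += 1
    return dict(counts)

events = [("user-42", 1_000), ("user-42", 59_000), ("user-7", 61_000)]
result = tumbling_window_counts(events, window_ms=60_000)
assert result == {("user-42", 0): 2, ("user-7", 60_000): 1}
```

Everything the engines add lives around this loop: surviving restarts (checkpoints or changelog topics), rescaling the keyed state, and deciding when a window is complete (watermarks, covered below).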
Apache Flink is a JVM-based clustered runtime. You compose a DAG of operators; Flink schedules tasks across TaskManagers coordinated by a JobManager. Each keyed operator has local state backed by RocksDB on TaskManager local disk, incrementally checkpointed to S3/GCS/HDFS.
- Best-in-class event-time processing with watermarks, allowed lateness, side outputs, and savepoints for versioned state migration.
- SQL via Flink SQL plus the Table API; PyFlink for Python teams (with UDF overhead through Py4J).
- Sub-second end-to-end latency at scale; production at Stripe, Netflix, Uber, Pinterest at millions of events per second.
- Operational cost is real. Most teams use a managed offering (AWS Managed Service for Apache Flink, Confluent Cloud Flink, Aiven, Ververica).
Kafka Streams is a Java/Scala library, not a cluster. Your application JVM hosts the topology; Kafka itself stores state-store changelogs and coordinates rebalancing. No separate runtime to operate.
- Lower operational footprint — an ordinary JVM service. Scale by running more instances.
- Tightly coupled to Kafka in / Kafka out. Cross-system integration requires Kafka Connect.
- JVM-only. No production-grade Python or Go equivalent.
- Topology complexity scales worse than Flink; recovery is slower because state is rebuilt by replaying the changelog topic.
Decision rule: Flink for multi-source workloads, sub-second SLOs, complex event-time semantics, or Python-led teams. Kafka Streams for Kafka in / Kafka out, JVM-native teams, single-digit-second budgets, when avoiding a separate clustered system has organizational value. Both pass the exactly-once test below; both run at scale at tier-1 companies. Fundamentals of Data Engineering (Reis & Housley, O'Reilly 2022) is the canonical book reference.
Change Data Capture (CDC) with Debezium
The wrong way to stream a relational database is to poll: SELECT * FROM orders WHERE updated_at > ? misses deletes, double-counts updates between polls, hammers the primary, and drifts under load. The right way is CDC: read the database's transaction log and emit one Kafka event per row mutation. Debezium is the open-source reference, deployed as Kafka Connect source connectors.
Debezium reads Postgres logical replication slots (pgoutput), MySQL binlog, SQL Server CDC tables, MongoDB oplog, or Oracle LogMiner. Each mutation becomes a Kafka message with a structured envelope: op (c/u/d/r), before and after images, source metadata (LSN, transaction id), and a tombstone for deletes when log compaction is enabled.
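A consumer of these events typically folds them into downstream state. This sketch applies simplified Debezium-style envelopes (op plus before/after images, as described above) to an in-memory replica; the 'id' primary key and flat row shape are illustrative assumptions, not the full Debezium payload:

```python
def apply_change_event(table: dict, envelope: dict) -> None:
    """Apply one simplified Debezium-style change event to a dict replica.
    op: 'c' create, 'u' update, 'r' snapshot read, 'd' delete."""
    op = envelope["op"]
    if op in ("c", "u", "r"):
        row = envelope["after"]          # new row image
        table[row["id"]] = row
    elif op == "d":
        table.pop(envelope["before"]["id"], None)  # only before exists

replica = {}
apply_change_event(replica, {"op": "c", "before": None, "after": {"id": 1, "total": 30}})
apply_change_event(replica, {"op": "u", "before": {"id": 1, "total": 30}, "after": {"id": 1, "total": 45}})
apply_change_event(replica, {"op": "d", "before": {"id": 1, "total": 45}, "after": None})
assert replica == {}  # created, updated, then deleted
```

This is also why deletes matter: a polling pipeline never sees the 'd' event, while a log-based one replays it deterministically.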
A representative Postgres connector configuration:
```json
{
  "name": "orders-postgres-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/secrets/db.properties:password}",
    "database.dbname": "orders_prod",
    "topic.prefix": "orders_prod",
    "table.include.list": "public.orders,public.order_items",
    "slot.name": "debezium_orders",
    "publication.autocreate.mode": "filtered",
    "snapshot.mode": "initial",
    "tombstones.on.delete": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter"
  }
}
```
Operational notes that catch new teams: replication slots can fill the WAL — if the connector pauses, Postgres retains WAL until the slot advances; monitor pg_replication_slots.confirmed_flush_lsn. snapshot.mode=initial reads every row of every included table as op=r; use schema_only when you have an alternative bulk-load path. Schema evolution is additive-safe via the schema registry; breaking changes require explicit coordination. One topic per table is the default — Flink stream-stream joins denormalize before publishing fact streams.
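The WAL-retention hazard above reduces to one monitoring query. A sketch: the SQL is a standard pg_replication_slots lag query (pg_wal_lsn_diff and pg_current_wal_lsn are built-in Postgres functions); the alert threshold and the helper around it are illustrative choices, and in practice the rows would come from a driver such as psycopg:

```python
# Run against the primary; returns one row per replication slot with
# how many WAL bytes it is holding back.
SLOT_LAG_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;
"""

def slots_over_threshold(rows, max_lag_bytes):
    """rows: (slot_name, lag_bytes) tuples as returned by the query above.
    Returns the slots retaining more WAL than the budget allows."""
    return [name for name, lag in rows if lag is not None and lag > max_lag_bytes]

rows = [("debezium_orders", 50 * 1024**3),   # 50 GiB behind: connector paused
        ("debezium_users", 100 * 1024**2)]   # 100 MiB: healthy
assert slots_over_threshold(rows, max_lag_bytes=10 * 1024**3) == ["debezium_orders"]
```

Note the query deliberately includes inactive slots: a slot whose connector is down is precisely the one that retains WAL indefinitely.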
Exactly-once semantics — what it actually guarantees
Exactly-once is the most-marketed and most-misunderstood feature in distributed streaming. The honest definition: effectively-once end-to-end. The broker may physically deliver a message twice on retry; the system is constructed so the observable downstream effect is identical to single delivery. Three components compose the guarantee in the Kafka + Flink stack.
- Idempotent producer. enable.idempotence=true assigns a producer-id and tracks per-partition sequence numbers; the broker deduplicates retries within a session. Local exactly-once for produce, not end-to-end.
- Transactional producer. transactional.id plus initTransactions(), beginTransaction(), commitTransaction() wraps sends across partitions and topics atomically. Consumers with isolation.level=read_committed see only committed transactions. This is the read-process-write pattern: consume, transform, produce, commit offsets and records together.
- Flink two-phase-commit checkpointing. TwoPhaseCommitSinkFunction coordinates external sinks with checkpoint barriers. Phase 1 pre-commits; phase 2 commits on global checkpoint completion. The Kafka sink is the canonical implementation; JDBC sinks need XA or application-level idempotency.
A minimal Flink Java configuration for exactly-once:
```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

CheckpointConfig cfg = env.getCheckpointConfig();
cfg.setMinPauseBetweenCheckpoints(30_000);
cfg.setCheckpointTimeout(120_000);
cfg.setMaxConcurrentCheckpoints(1);
cfg.setExternalizedCheckpointCleanup(
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
// sink: KafkaSink.builder().setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE) ...
```
What exactly-once does not give you: protection against non-idempotent external side effects outside the checkpoint barrier (emails, credit-card charges, third-party REST calls — those need application-level idempotency keys); improved throughput (transactions add latency); freedom from thinking about replay, savepoints, and schema evolution. Practical rule: turn it on for financial, billing, and audit-grade pipelines; run at-least-once for analytics where downstream tables dedupe by key anyway.
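The idempotency-key pattern for side effects outside the boundary is simple to sketch. The class and names here are illustrative; in production the seen-set would live in a durable store (Redis, Postgres) keyed by a stable event id, not in process memory:

```python
import uuid

class IdempotentCaller:
    """Toy application-level idempotency guard for side effects that
    live outside the checkpoint/transaction boundary (emails, charges,
    third-party REST calls)."""
    def __init__(self, side_effect):
        self.side_effect = side_effect
        self.seen = set()

    def call(self, idempotency_key: str, payload) -> bool:
        if idempotency_key in self.seen:
            return False            # duplicate delivery: skip the effect
        self.side_effect(payload)
        self.seen.add(idempotency_key)
        return True

charges = []
caller = IdempotentCaller(charges.append)
# Derive the key deterministically from the event, so a replay after a
# restart produces the same key.
key = str(uuid.uuid5(uuid.NAMESPACE_URL, "order-1001"))
caller.call(key, {"order_id": 1001, "amount": 45})
caller.call(key, {"order_id": 1001, "amount": 45})  # replayed delivery
assert charges == [{"order_id": 1001, "amount": 45}]  # charged once
```

The key design point is in the comment: the key must be derived from the event, never freshly generated per attempt, or replays defeat the guard.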
Watermarks, Kappa, materialized views, and the Kafka alternatives
Watermarks. Processing-time is easy and wrong for analytics; event-time is correct. A watermark carrying timestamp T asserts that no event with a timestamp at or before T is still expected. Windows emit when the watermark crosses the window end; allowed lateness extends retention; events arriving later than that go to a side output. WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofMinutes(5)) is the practical default — bound from observed delay, not optimism.
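The bounded-out-of-orderness strategy reduces to one line of arithmetic per event; this stdlib sketch mirrors what the Flink strategy computes, with small integer timestamps for readability:

```python
def bounded_out_of_orderness(event_times_ms, max_delay_ms):
    """Toy watermark generator: after each event, the watermark is the
    highest event time seen so far minus the out-of-orderness bound."""
    watermark = float("-inf")
    watermarks = []
    for t in event_times_ms:
        watermark = max(watermark, t - max_delay_ms)
        watermarks.append(watermark)
    return watermarks

# Events arrive out of order; with a bound of 5 the watermark never
# outruns the slowest plausible straggler.
assert bounded_out_of_orderness([10, 7, 15, 12], max_delay_ms=5) == [5, 5, 10, 10]
```

Reading the output: a window covering [0, 10) fires only once the watermark reaches 10, i.e. after the event stamped 15 arrives, so the out-of-order event stamped 7 is still counted on time.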
Kappa. Jay Kreps's 2014 essay introduced Kappa as the alternative to Marz's Lambda. Lambda's complaint is two implementations of the same business logic; Kappa runs one streaming pipeline and replays the log to recompute. Now the default at streaming-first orgs.
Streaming materialized views. Materialize (Timely Dataflow) and RisingWave (Rust, PostgreSQL-wire-compatible) operationalize Kappa: write standard SQL, get incrementally-maintained views over streams. Younger than Flink; appropriate when the workload fits SQL and latency budget is in seconds.
Kafka alternatives. AWS Kinesis is the AWS-native managed option (shards = partitions); Firehose sinks to S3/Redshift; Managed Service for Apache Flink runs jobs. Operationally simpler, lower per-shard throughput, smaller connector ecosystem. Apache Pulsar is the open-source alternative — segmented BookKeeper storage, native multi-tenancy, built-in geo-replication, tiered storage to S3, adopted at Yahoo, Tencent, Splunk. Right when multi-tenancy or geo-replication is first-order; Kafka remains the broader-ecosystem default.
Frequently asked questions
- When should I pick Kafka Streams over Flink?
- Kafka Streams when the workload is Kafka in / Kafka out, the team is JVM-native, latency budgets are single-digit seconds, and avoiding a separate clustered runtime has organizational value. Flink when you have multiple sources (Kafka plus CDC plus REST), need sub-second SLOs, require complex event-time semantics, or the team is Python-led on PyFlink.
- Is exactly-once semantics actually exactly-once?
- No. The accurate name is effectively-once end-to-end. The broker may physically redeliver on retry; the observable downstream effect is single-application via idempotent producers, transactional sends, and Flink's two-phase-commit sinks. Effects outside the transaction boundary — REST calls, emails, charges — still need application-level idempotency keys.
- How do I choose a partition count and partition key?
- Partition count is the parallelism ceiling — you cannot read with more consumers per group than partitions. Start with 2-3x peak expected consumer count, ~10 MB/s per partition; common defaults 12, 24, 48. Adding partitions later breaks per-key ordering, so over-provision modestly. Always pick a stable, high-cardinality partition key (user_id, account_id, device_id) — keyless records go round-robin and break stream-stream joins.
- How long should I retain a Kafka topic for Kappa-style replay?
- The max of your longest expected backfill window and the longest downstream-incident recovery time. Typical production: 7 days for high-volume operational topics, 30-90 days for analytical topics, infinite for source-of-truth event-sourcing logs offloaded via Kafka Tiered Storage or Pulsar's segmented storage to S3.
- What is a sensible Flink checkpoint interval?
- 60 seconds with 30-second min-pause is a reasonable starting point for sub-minute recovery. 10-30 seconds for tight RTO; 5-10 minutes for very large state where checkpoint duration becomes the bottleneck. Always enable incremental checkpoints on RocksDB at non-trivial state size, externalize to durable object storage, retain on cancellation.
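The partition-count rule of thumb from the FAQ above (roughly 10 MB/s per partition, plus a 2-3x headroom multiple over peak consumer count) can be turned into a back-of-envelope calculation. The defaults here are illustrative starting points from this article, not hard limits:

```python
import math

def suggest_partitions(peak_mb_per_s, peak_consumers,
                       mb_per_partition=10, headroom=3):
    """Back-of-envelope partition sizing: enough partitions to carry
    peak throughput at ~mb_per_partition MB/s each, and a headroom
    multiple of the peak consumer count, whichever is larger."""
    for_throughput = math.ceil(peak_mb_per_s / mb_per_partition)
    for_consumers = headroom * peak_consumers
    return max(for_throughput, for_consumers)

# 120 MB/s peak with 8 consumers: max(12, 24) = 24 partitions
assert suggest_partitions(120, 8) == 24
```

Because adding partitions later breaks per-key ordering, rounding this estimate up to the next comfortable number (24, 48) is cheaper than resizing a live topic.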
Sources
- Apache Kafka — official documentation. Topics, partitions, consumer groups, idempotent producers, transactions, replication. The canonical reference.
- Apache Flink — Learn Flink tutorial sequence. Event-time processing, watermarks, keyed state, checkpointing, two-phase-commit sinks. The reference low-latency stateful streaming engine.
- Debezium — official reference documentation. Postgres, MySQL, SQL Server, MongoDB, Oracle connectors. The canonical CDC implementation on Kafka Connect.
- Confluent Blog — long-running engineering content on Kafka design, transactions, streaming SQL, schema registry, ksqlDB, and Flink-on-Confluent. Multiple definitive posts on exactly-once semantics by the original authors.
- AWS Kinesis — Data Streams, Data Firehose, and Managed Service for Apache Flink. The AWS-native managed alternative to self-hosted Kafka with a different operational and cost profile.
- Reis & Housley — Fundamentals of Data Engineering (O'Reilly, 2022). Streaming chapters cover Kafka, Flink, CDC, exactly-once, and the Kappa pattern.
About the author. Blake Crosley founded ResumeGeni and writes about data engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.