Data Engineer Hub

Streaming and Event Processing for Data Engineers: Kafka, Flink, CDC, and Exactly-Once (2026)

In short

Streaming and event processing is the modern alternative to batch ETL. Apache Kafka is the de-facto distributed log: producers append to partitioned topics, consumer groups read with tracked offsets, partition keys preserve per-entity ordering. For stateful work (joins, windowed aggregations, sessionization) Apache Flink is the low-latency reference engine; Kafka Streams is the lighter library when the workload is Kafka-native. CDC via Debezium turns Postgres/MySQL row changes into Kafka events without dual-writes. Exactly-once is effectively-once end-to-end — idempotent producers, Kafka transactions, and Flink two-phase-commit checkpoints close the gap. Watermarks govern late data; Kappa replaces Lambda; Materialize and RisingWave operationalize streaming SQL materialized views.

Key takeaways

  • Apache Kafka is the canonical distributed log. Topics are partitioned for parallelism; order is per-partition, never global. Partition-key choice is the single most important design decision.
  • Consumer groups give parallelism with one-consumer-per-partition semantics. Offsets are the durable cursor — auto-commit is convenient and dangerous; commit explicitly after the side effect succeeds.
  • Apache Flink is the reference stateful streaming engine: event-time watermarks, RocksDB-backed keyed state, incremental checkpoints to S3/GCS. The choice for non-trivial joins, sessionization, sub-second SLOs.
  • Kafka Streams is a JVM library, not a cluster. Pick it for Kafka-native, JVM-only, simpler topologies; pick Flink for cross-source workloads or Python-led teams using PyFlink.
  • CDC with Debezium reads the database transaction log (Postgres logical replication, MySQL binlog) and emits per-row INSERT/UPDATE/DELETE events. The correct way to stream a relational database — never poll, never dual-write.
  • Exactly-once is effectively-once end-to-end. The broker may physically deliver twice; idempotent producers, Kafka transactions, and Flink's two-phase-commit sinks make the observable outcome single-application.
  • Watermarks govern late data; Kappa replaces Lambda; Materialize and RisingWave operationalize streaming materialized views. AWS Kinesis and Apache Pulsar are the credible alternatives to self-hosted Kafka.

Kafka 101: topics, partitions, consumer groups, offsets

Apache Kafka is a distributed append-only log. The Kafka documentation is the canonical reference; the operational model is worth internalizing before any framework on top of it.

  • Topic. A named stream divided into partitions — the unit of parallelism, ordering, and replication. A topic with 12 partitions can be read in parallel by up to 12 consumers in one group.
  • Partition. An ordered, immutable sequence of records. Each record has a monotonically-increasing 64-bit offset. Order is guaranteed within a partition only. Partitioning by a stable key (e.g. user_id) ensures per-entity ordering.
  • Consumer group. Kafka assigns each partition to exactly one consumer in the group. Partitions are the parallelism ceiling — additional consumers sit idle.
  • Offset. The consumer's cursor, stored in the internal __consumer_offsets topic. Auto-commit is the dangerous default; production systems commit explicitly after the side effect succeeds.
  • Durability. min.insync.replicas plus producer acks=all is the durability contract — both must be set or the cluster will silently accept writes that can be lost.

A minimal Python producer and consumer using confluent-kafka-python:

from confluent_kafka import Producer, Consumer

producer = Producer({
    "bootstrap.servers": "broker:9092",
    "acks": "all",
    "enable.idempotence": True,
})
producer.produce("orders", key="user-42", value=b'{"order_id": 1001}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "orders-processor",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["orders"])
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print(f"consumer error: {msg.error()}")  # surface errors; don't silently skip
        continue
    process(msg.value())   # process() is the application's handler
    consumer.commit(msg)   # commit only after success

enable.idempotence=True gives per-partition deduplication on retries via producer-id and sequence numbers. enable.auto.commit=False with explicit commit(msg) after successful processing is the at-least-once primitive most production pipelines build on.

Stateful streaming: Flink vs Kafka Streams

Stateless transforms (filter, map, project) are trivial. Stateful work — joins, windowed aggregations, sessionization, deduplication, top-K — requires per-key local state that survives restarts and rescaling. The two dominant choices are Apache Flink and Kafka Streams.

Apache Flink is a JVM-based clustered runtime. You compose a DAG of operators; Flink schedules tasks across TaskManagers coordinated by a JobManager. Each keyed operator has local state backed by RocksDB on TaskManager local disk, incrementally checkpointed to S3/GCS/HDFS.

  • Best-in-class event-time processing with watermarks, allowed lateness, side outputs, and savepoints for versioned state migration.
  • SQL via Flink SQL plus the Table API; PyFlink for Python teams (with UDF overhead through Py4J).
  • Sub-second end-to-end latency at scale; production at Stripe, Netflix, Uber, Pinterest at millions of events per second.
  • Operational cost is real. Most teams use a managed offering (AWS Managed Service for Apache Flink, Confluent Cloud Flink, Aiven, Ververica).

Kafka Streams is a Java/Scala library, not a cluster. Your application JVM hosts the topology; Kafka itself stores state-store changelogs and coordinates rebalancing. No separate runtime to operate.

  • Lower operational footprint — an ordinary JVM service. Scale by running more instances.
  • Tightly coupled to Kafka in / Kafka out. Cross-system integration requires Kafka Connect.
  • JVM-only. No production-grade Python or Go equivalent.
  • Topology complexity scales worse than Flink; recovery is slower because state is rebuilt by replaying the changelog topic.

Decision rule: Flink for multi-source workloads, sub-second SLOs, complex event-time semantics, or Python-led teams. Kafka Streams for Kafka in / Kafka out, JVM-native teams, single-digit-second budgets, when avoiding a separate clustered system has organizational value. Both pass the exactly-once test below; both run at scale at tier-1 companies. Fundamentals of Data Engineering (Reis & Housley, O'Reilly 2022) is the canonical book reference.
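Whichever engine you pick, the core abstraction is the same: per-key local state keyed a second time by window. A framework-free Python sketch of per-key tumbling-window counting — the class and method names here are illustrative, not any Flink or Kafka Streams API:

```python
from collections import defaultdict

class TumblingCounter:
    """Toy per-key tumbling-window counter: the shape of local state a
    keyed windowed aggregation maintains in Flink or Kafka Streams."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        # state is keyed twice: (entity key, window start) -> count
        self.state = defaultdict(int)

    def add(self, key, event_ts_ms):
        # align the event timestamp to its window's start
        window_start = (event_ts_ms // self.window_ms) * self.window_ms
        self.state[(key, window_start)] += 1

    def result(self, key, window_start):
        return self.state[(key, window_start)]

counter = TumblingCounter(window_ms=60_000)   # 1-minute tumbling windows
for ts in (0, 10_000, 59_999, 60_000):
    counter.add("user-42", ts)

print(counter.result("user-42", 0))        # first minute: 3 events
print(counter.result("user-42", 60_000))   # second minute: 1 event
```

What the real engines add on top of this dict is exactly the hard part: durability (RocksDB plus checkpoints, or changelog topics), rescaling of keyed state, and window firing driven by watermarks.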

Change Data Capture (CDC) with Debezium

The wrong way to stream a relational database is to poll: SELECT * FROM orders WHERE updated_at > ? misses deletes, double-counts updates between polls, hammers the primary, and drifts under load. The right way is CDC: read the database's transaction log and emit one Kafka event per row mutation. Debezium is the open-source reference, deployed as Kafka Connect source connectors.

Debezium reads Postgres logical replication slots (pgoutput), MySQL binlog, SQL Server CDC tables, MongoDB oplog, or Oracle LogMiner. Each mutation becomes a Kafka message with a structured envelope: op (c/u/d/r), before and after images, source metadata (LSN, transaction id), and a tombstone for deletes when log compaction is enabled.

A representative Postgres connector configuration:

{
  "name": "orders-postgres-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/secrets/db.properties:password}",
    "database.dbname": "orders_prod",
    "topic.prefix": "orders_prod",
    "table.include.list": "public.orders,public.order_items",
    "slot.name": "debezium_orders",
    "publication.autocreate.mode": "filtered",
    "snapshot.mode": "initial",
    "tombstones.on.delete": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter"
  }
}

Operational notes that catch new teams:

  • Replication slots can fill the WAL — if the connector pauses, Postgres retains WAL until the slot advances; monitor pg_replication_slots.confirmed_flush_lsn.
  • snapshot.mode=initial reads every row of every included table as op=r; use schema_only when you have an alternative bulk-load path.
  • Schema evolution is additive-safe via the schema registry; breaking changes require explicit coordination.
  • One topic per table is the default — Flink stream-stream joins denormalize before publishing fact streams.
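To make the envelope concrete, here is a Python sketch that applies simplified Debezium-style events to an in-memory table replica. The op/before/after fields follow Debezium's envelope; the events themselves and the single-column primary key are made up for illustration:

```python
def apply_cdc(table, event):
    """Apply one simplified Debezium-style envelope to a dict keyed by
    primary key. op: c=create, u=update, d=delete, r=snapshot read."""
    op = event["op"]
    if op in ("c", "u", "r"):      # upsert the after-image
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                # remove by the before-image key
        table.pop(event["before"]["id"], None)
    return table

replica = {}
events = [
    {"op": "r", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
                "after":  {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]
for e in events:
    apply_cdc(replica, e)

print(replica)   # {1: {'id': 1, 'status': 'paid'}}
```

Replaying the topic from offset zero rebuilds the replica deterministically — which is exactly why a compacted CDC topic can serve as a durable table snapshot.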

Exactly-once semantics — what it actually guarantees

Exactly-once is the most-marketed and most-misunderstood feature in distributed streaming. The honest definition: effectively-once end-to-end. The broker may physically deliver a message twice on retry; the system is constructed so the observable downstream effect is identical to single delivery. Three components compose the guarantee in the Kafka + Flink stack.

  1. Idempotent producer. enable.idempotence=true assigns a producer-id and tracks per-partition sequence numbers; the broker deduplicates retries within a session. Local exactly-once for produce, not end-to-end.
  2. Transactional producer. transactional.id plus initTransactions(), beginTransaction(), commitTransaction() wraps sends across partitions and topics atomically. Consumers with isolation.level=read_committed see only committed transactions. This is the read-process-write pattern: consume, transform, produce, commit offsets and records together.
  3. Flink two-phase-commit checkpointing. TwoPhaseCommitSinkFunction coordinates external sinks with checkpoint barriers. Phase 1 pre-commits; phase 2 commits on global checkpoint completion. The Kafka sink is the canonical implementation; JDBC sinks need XA or application-level idempotency.
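The idempotent-producer mechanism in step 1 is easy to simulate: the broker accepts a record only when its sequence number is the next expected one for that (producer-id, partition) pair. A toy Python model — not the real broker protocol, which also handles batching and a bounded window of in-flight requests:

```python
class Broker:
    """Toy model of per-(producer-id, partition) sequence tracking."""

    def __init__(self):
        self.next_seq = {}   # (producer_id, partition) -> next expected seq
        self.log = []        # accepted records, in append order

    def append(self, producer_id, partition, seq, record):
        key = (producer_id, partition)
        expected = self.next_seq.get(key, 0)
        if seq < expected:        # retry of an already-appended record
            return "duplicate"
        if seq > expected:        # gap: reject, producer must re-send
            return "out_of_sequence"
        self.next_seq[key] = expected + 1
        self.log.append(record)
        return "ok"

broker = Broker()
print(broker.append("p1", 0, 0, "order-1001"))   # ok
print(broker.append("p1", 0, 0, "order-1001"))   # duplicate (network retry)
print(broker.append("p1", 0, 1, "order-1002"))   # ok
print(broker.log)                                # ['order-1001', 'order-1002']
```

This is why the guarantee is per-partition and per-producer-session: the sequence counter resets with a new producer-id, which is the gap transactions and two-phase commit exist to close.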

A minimal Flink Java configuration for exactly-once:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
CheckpointConfig cfg = env.getCheckpointConfig();
cfg.setMinPauseBetweenCheckpoints(30_000);
cfg.setCheckpointTimeout(120_000);
cfg.setMaxConcurrentCheckpoints(1);
cfg.setExternalizedCheckpointCleanup(
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));  // true = incremental checkpoints
// sink: KafkaSink.builder().setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE) ...

What exactly-once does not give you: protection against non-idempotent external side effects outside the checkpoint barrier (emails, credit-card charges, third-party REST calls — those need application-level idempotency keys); improved throughput (transactions add latency); freedom from thinking about replay, savepoints, and schema evolution. Practical rule: turn it on for financial, billing, and audit-grade pipelines; run at-least-once for analytics where downstream tables dedupe by key anyway.
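For side effects outside the checkpoint barrier, the application-level idempotency key mentioned above looks like this in miniature. The store and charge_card function are illustrative stand-ins; production uses a durable store (a database unique constraint, Redis SETNX) rather than an in-memory set:

```python
processed = set()   # stand-in: production needs a durable store with atomic insert
charges = []

def charge_card(order_id, amount_cents):
    charges.append((order_id, amount_cents))   # stand-in for a payment API call

def charge_once(idempotency_key, order_id, amount_cents):
    """Execute the non-idempotent side effect at most once per key,
    even when the stream processor replays the event after recovery."""
    if idempotency_key in processed:
        return "skipped"
    charge_card(order_id, amount_cents)
    processed.add(idempotency_key)
    return "charged"

print(charge_once("order-1001", 1001, 4999))  # charged
print(charge_once("order-1001", 1001, 4999))  # skipped (replay after restart)
print(len(charges))                            # 1
```

Note the check-then-act race: with concurrent consumers the membership test and the insert must be one atomic operation, which is why a unique-constraint insert is the usual production shape.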

Watermarks, Kappa, materialized views, and the Kafka alternatives

Watermarks. Processing-time is easy and wrong for analytics; event-time is correct. A watermark carrying timestamp T asserts that no event with a timestamp earlier than T is expected to arrive. Windows emit when the watermark crosses the window end; allowed lateness extends state retention; events arriving later still go to a side output. WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofMinutes(5)) is the practical default — derive the bound from observed delay, not optimism.
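The bounded-out-of-orderness strategy is simple to model: the watermark trails the maximum observed event timestamp by a fixed delay, and a window fires once the watermark passes its end. A Python sketch — class and method names are illustrative, not the Flink API:

```python
class BoundedOutOfOrderness:
    """Watermark = max observed event timestamp minus the allowed delay."""

    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self.max_ts = float("-inf")

    def observe(self, event_ts_ms):
        # out-of-order events never move the watermark backwards
        self.max_ts = max(self.max_ts, event_ts_ms)

    def watermark(self):
        return self.max_ts - self.max_delay_ms

wm = BoundedOutOfOrderness(max_delay_ms=5 * 60_000)   # 5-minute bound
window_end = 600_000                                   # a [0:00, 10:00) window

wm.observe(700_000)
print(wm.watermark() >= window_end)   # False: window stays open
wm.observe(900_001)
print(wm.watermark() >= window_end)   # True: watermark passed 10:00, fire
```

The trade is explicit: a larger bound admits more late data but delays every window result by the same amount.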

Kappa. Jay Kreps's 2014 essay introduced Kappa as the alternative to Nathan Marz's Lambda. Lambda's complaint is two implementations of the same business logic; Kappa runs one streaming pipeline and replays the log to recompute. Now the default at streaming-first orgs.

Streaming materialized views. Materialize (Timely Dataflow) and RisingWave (Rust, PostgreSQL-wire-compatible) operationalize Kappa: write standard SQL, get incrementally-maintained views over streams. Younger than Flink; appropriate when the workload fits SQL and latency budget is in seconds.

Kafka alternatives. AWS Kinesis is the AWS-native managed option (shards = partitions); Firehose sinks to S3/Redshift; Managed Service for Apache Flink runs jobs. Operationally simpler, lower per-shard throughput, smaller connector ecosystem. Apache Pulsar is the open-source alternative — segmented BookKeeper storage, native multi-tenancy, built-in geo-replication, tiered storage to S3, adopted at Yahoo, Tencent, Splunk. Right when multi-tenancy or geo-replication is first-order; Kafka remains the broader-ecosystem default.

Frequently asked questions

When should I pick Kafka Streams over Flink?
Kafka Streams when the workload is Kafka in / Kafka out, the team is JVM-native, latency budgets are single-digit seconds, and avoiding a separate clustered runtime has organizational value. Flink when you have multiple sources (Kafka plus CDC plus REST), need sub-second SLOs, require complex event-time semantics, or the team is Python-led on PyFlink.
Is exactly-once semantics actually exactly-once?
No. The accurate name is effectively-once end-to-end. The broker may physically redeliver on retry; the observable downstream effect is single-application via idempotent producers, transactional sends, and Flink's two-phase-commit sinks. Effects outside the transaction boundary — REST calls, emails, charges — still need application-level idempotency keys.
How do I choose a partition count and partition key?
Partition count is the parallelism ceiling — you cannot read with more consumers per group than partitions. Start with 2-3x peak expected consumer count, ~10 MB/s per partition; common defaults 12, 24, 48. Adding partitions later breaks per-key ordering, so over-provision modestly. Always pick a stable, high-cardinality partition key (user_id, account_id, device_id) — keyless records go round-robin and break stream-stream joins.
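The repartitioning hazard above can be demonstrated directly: a key's partition is hash(key) mod partition-count, so changing the count remaps keys. Kafka's default partitioner uses murmur2; this sketch substitutes CRC32 purely as a stand-in with the same structural behavior:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Kafka's default partitioner is murmur2-based; CRC32 here is an
    # illustrative stand-in, not the real partitioner.
    return zlib.crc32(key) % num_partitions

keys = [f"user-{i}".encode() for i in range(1000)]
moved = sum(1 for k in keys if partition_for(k, 12) != partition_for(k, 24))

print(moved > 0)   # True: growing 12 -> 24 partitions remaps many keys,
                   # so per-key order is broken across the resize boundary
```

New events for a remapped key land on a different partition than its old events, so any consumer relying on per-key order sees the two interleave — which is why the FAQ advises over-provisioning modestly instead of resizing later.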
How long should I retain a Kafka topic for Kappa-style replay?
The max of your longest expected backfill window and the longest downstream-incident recovery time. Typical production: 7 days for high-volume operational topics, 30-90 days for analytical topics, infinite for source-of-truth event-sourcing logs offloaded via Kafka Tiered Storage or Pulsar's segmented storage to S3.
What is a sensible Flink checkpoint interval?
60 seconds with 30-second min-pause is a reasonable starting point for sub-minute recovery. 10-30 seconds for tight RTO; 5-10 minutes for very large state where checkpoint duration becomes the bottleneck. Always enable incremental checkpoints on RocksDB at non-trivial state size, externalize to durable object storage, retain on cancellation.

Sources

  1. Apache Kafka — official documentation. Topics, partitions, consumer groups, idempotent producers, transactions, replication. The canonical reference.
  2. Debezium — official reference documentation. Postgres, MySQL, SQL Server, MongoDB, Oracle connectors. The canonical CDC implementation on Kafka Connect.
  3. Confluent Blog — long-running engineering content on Kafka design, transactions, streaming SQL, schema registry, ksqlDB, and Flink-on-Confluent. Multiple definitive posts on exactly-once semantics by the original authors.
  4. AWS Kinesis — Data Streams, Data Firehose, and Managed Service for Apache Flink. The AWS-native managed alternative to self-hosted Kafka with a different operational and cost profile.
  5. Reis & Housley — Fundamentals of Data Engineering (O'Reilly, 2022). Streaming chapters cover Kafka, Flink, CDC, exactly-once, and the Kappa pattern.

About the author. Blake Crosley founded ResumeGeni and writes about data engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.