
Distributed Systems for Software Engineers (2026)

In short

Distributed systems knowledge is what separates a senior coding-interview pass from a senior+ system-design pass. The bar in 2026 is concrete: you can walk through Raft leader election step by step, explain why the CAP theorem is more nuanced than 'pick two', and name the partitioning strategies that DynamoDB, Cassandra, and Spanner each use and why. This page works through one CAP scenario end to end, walks through Raft consensus on a 5-node cluster, and contrasts the three production partitioning patterns, with pointers to the DDIA chapters where Kleppmann names the trade-offs.

Key takeaways

  • CAP theorem 'pick two' is misleading — partitions happen, so you're really choosing between consistency-during-partition (CP) and availability-during-partition (AP). DDIA ch. 9, p. 336-342, explains why Brewer himself walked the framing back.
  • Raft is what to learn (raft.github.io/raft.pdf) — Paxos is what to recognize. Raft has production implementations in etcd, Consul, CockroachDB, and TiKV; the algorithm fits in a 14-page paper.
  • DynamoDB uses consistent hashing with virtual nodes (Dynamo paper, sosp.org/2007 — Section 4.2); Cassandra uses the same; Spanner uses range partitioning over Paxos groups with TrueTime atomic clocks (Spanner paper, OSDI 2012, sec. 3).
  • Quorum reads/writes give tunable consistency: with R + W > N, read and write sets overlap and reads see the latest acknowledged write; smaller quorums give eventual consistency. Cassandra defaults to W=ONE for performance, which loses durability if the one acking node fails before replicating (DDIA ch. 5, p. 179-182).
  • Network partitions are not theoretical — Aphyr's Jepsen analyses (jepsen.io/analyses) document real partition behavior on every major distributed database; reading two of them is a senior+ differentiator.
  • Two-phase commit (2PC) is widely deprecated in modern systems; Saga pattern with compensating transactions is the 2026 standard for cross-service consistency (Garcia-Molina & Salem 1987 SAGAS paper, plus Chris Richardson's microservices.io/patterns/data/saga.html).

CAP theorem with a worked example: order-placement service

The 'pick two' framing is the first thing every interview candidate says and the first thing senior+ interviewers mark down. Eric Brewer himself walked the framing back in his 2012 IEEE article 'CAP Twelve Years Later' (computer.org/csdl/magazine/co/2012/02). The honest framing: during a network partition you choose between consistency and availability; in the absence of a partition you can have both, though you still trade consistency against latency (see PACELC below).

Worked scenario: a multi-region order service.

You run an order-placement service in two regions, US-East and US-West, with synchronous replication between them. A network partition isolates US-West for 30 seconds.

Before partition:        After partition (US-West isolated):
[US-East]<==sync==>[US-West]   [US-East]   X   [US-West]
      |                  |             |             |
   [order DB]       [order DB]    [order DB]    [order DB]

CP choice (favor consistency): US-West rejects writes during the partition. Customers in California get 503 errors. The system is correct (no divergence) but unavailable in the West. This is what Spanner does with its Paxos quorums: a write needs a majority of replicas, so a partitioned minority can't write.

AP choice (favor availability): US-West accepts writes locally and reconciles when the partition heals. Two customers may book the last seat on the same flight; you get duplicate orders that need post-hoc resolution (last-writer-wins, or vector clocks plus an application-level merge — sketched below). This is what Dynamo and Cassandra do.
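
To make the reconciliation cost concrete, here is a toy last-writer-wins merge of the two divergent replicas (illustrative data and key format; Dynamo-style systems actually track vector clocks rather than a single timestamp):

# During the partition, each region sold seat 14C to a different customer.
east = {"flight:BA123:seat:14C": {"order_id": "east-001", "ts": 1_700_000_000_123}}
west = {"flight:BA123:seat:14C": {"order_id": "west-042", "ts": 1_700_000_000_456}}

def lww_merge(a, b):
    # Last-writer-wins: keep the version with the larger timestamp.
    # Simple to implement, but it silently drops one customer's order —
    # which is exactly why order placement usually gets the CP treatment.
    merged = dict(a)
    for key, version in b.items():
        if key not in merged or version["ts"] > merged[key]["ts"]:
            merged[key] = version
    return merged

print(lww_merge(east, west))   # west-042 wins; east-001 is silently lost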

The senior+ answer to 'CAP for an order service':

'Order placement is CP. Duplicate orders are worse than 30 seconds of unavailability — we'd rather a customer retry than have to reconcile two oversold orders. Reads of orders (status checks) are AP — slightly stale data is fine. So the system is mixed-consistency: writes route through a leader (CP), reads can hit any replica (AP). This is what CockroachDB and Spanner do; it's the right pattern for transactional workloads.'

Reference: DDIA ch. 9, p. 336-342, on the CAP confusion; Daniel Abadi's PACELC extension (cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf), which adds the 'else' clause: in the absence of a partition, choose between latency and consistency. PACELC is the better mental model.

Raft consensus walked through on a 5-node cluster

Raft (raft.github.io/raft.pdf) is the consensus algorithm to learn deeply. etcd, Consul, CockroachDB, TiKV, and RethinkDB all use Raft in production. Paxos is theoretically equivalent but is what gets cited in academic papers; Raft is what gets implemented.

Setup: 5 nodes (N1..N5). One is leader; others are followers.

Step 1 — Initial state. N1 is leader. All 5 nodes have committed log entries [1: SET x=1, 2: SET y=2]. Term = 4.

N1 (leader, term=4)  [1:SET x=1, 2:SET y=2]
N2 (follower)        [1:SET x=1, 2:SET y=2]
N3 (follower)        [1:SET x=1, 2:SET y=2]
N4 (follower)        [1:SET x=1, 2:SET y=2]
N5 (follower)        [1:SET x=1, 2:SET y=2]

Step 2 — Client write. Client sends SET z=3 to N1.

  1. N1 appends [3: SET z=3] to its local log (uncommitted).
  2. N1 sends AppendEntries RPCs to N2..N5 with the new entry.
  3. N2, N3, N4 ack; N5 is slow but will catch up.
  4. N1 has a majority (itself + 3 = 4 of 5). It commits the entry locally and applies to state machine.
  5. Client gets ack.
  6. On the next AppendEntries (heartbeat), N1 tells followers 'commitIndex = 3'; they apply the entry to their state machines.
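
A compressed sketch of the leader's side of these steps — append locally, count acks, commit once a majority of the cluster holds the entry. Toy code, not the etcd API; RPC plumbing is omitted.

class RaftLeaderLog:
    def __init__(self, cluster_size=5, term=4):
        self.cluster_size = cluster_size
        self.term = term
        self.log = []                  # list of (term, command), e.g. (4, "SET z=3")
        self.commit_index = -1
        self.acks = {}                 # log index -> set of follower ids that acked

    def client_write(self, command):
        self.log.append((self.term, command))    # 1. append locally, uncommitted
        self.acks[len(self.log) - 1] = set()
        return len(self.log) - 1                  # index to replicate via AppendEntries

    def on_append_ack(self, index, follower_id):
        self.acks[index].add(follower_id)
        # 4. commit once the leader plus acking followers form a majority.
        #    (Real Raft also requires the entry to be from the current term.)
        if 1 + len(self.acks[index]) > self.cluster_size // 2:
            self.commit_index = max(self.commit_index, index)

leader = RaftLeaderLog()
idx = leader.client_write("SET z=3")
for follower in ("N2", "N3", "N4"):
    leader.on_append_ack(idx, follower)
assert leader.commit_index == idx    # committed with 4 of 5 copies; N5 catches up later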

Step 3 — Leader fails. N1 crashes. Followers stop receiving heartbeats, and election timeouts fire on N2..N5.

  1. N3's election timer (randomized 150-300ms) fires first. N3 increments term to 5, votes for itself, sends RequestVote RPCs.
  2. N2, N4, N5 all check: is N3's log at least as up-to-date as ours? Yes (everyone has the same log). Grant vote.
  3. N3 has 4 votes (majority). Becomes leader for term 5.
  4. N3 sends heartbeats. New leader is established within ~300-500ms total.
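
A toy version of the vote-granting rule behind steps 1-3 — the randomized timeout plus the 'at least as up-to-date' log check. Not the etcd implementation; the terms and log contents below match the walkthrough.

import random

def election_timeout_ms():
    # Randomized per the Raft paper, so simultaneous candidates rarely tie.
    return random.uniform(150, 300)

class VoterState:
    def __init__(self, term, log):
        self.term = term
        self.log = log               # list of (term, command)
        self.voted_for = None

    def on_request_vote(self, candidate_id, candidate_term, cand_last_term, cand_last_index):
        if candidate_term < self.term:
            return False                                      # stale candidate
        if candidate_term > self.term:
            self.term, self.voted_for = candidate_term, None  # step forward to the new term
        # Grant at most one vote per term, and only to a candidate whose log is
        # at least as up-to-date as ours (compare last term, then last index).
        my_last_term = self.log[-1][0] if self.log else 0
        my_last_index = len(self.log) - 1
        up_to_date = (cand_last_term, cand_last_index) >= (my_last_term, my_last_index)
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id
            return True
        return False

# N3 (now term 5) asks N2, which holds the same two-entry log from term 4:
n2 = VoterState(term=4, log=[(4, "SET x=1"), (4, "SET y=2")])
print(n2.on_request_vote("N3", 5, cand_last_term=4, cand_last_index=1))   # True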

Step 4 — N1 recovers. N1 comes back online with term=4. N3's heartbeat arrives with term=5. N1 sees term > its own; steps down to follower; updates term to 5. Cluster is back to 5 nodes.

The split-vote scenario (why even cluster sizes buy you nothing): in a 4-node cluster after leader failure, two nodes could vote for one candidate and two for another. No majority, no leader; the election retries with fresh randomized timeouts until someone wins. With 4 nodes this can take multiple rounds; with 3 or 5 nodes, randomized timeouts make a first-round majority likely, and an even-sized cluster tolerates no more failures than the next-smaller odd size. Always size Raft clusters at 3, 5, or 7 nodes — odd numbers, never even.
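
The arithmetic behind the odd-size rule, as a quick check (plain Python, nothing Raft-specific):

for n in (3, 4, 5, 6, 7):
    quorum = n // 2 + 1
    print(f"{n} nodes: quorum = {quorum}, tolerates {n - quorum} failures")
# 3 and 4 nodes both tolerate 1 failure; 5 and 6 both tolerate 2.
# The even size adds a machine and a bigger quorum without adding fault tolerance.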

Reference implementation to read: etcd's raft package (github.com/etcd-io/raft). Roughly 4000 lines of well-commented Go; it backs etcd, which Kubernetes uses for cluster state. Reading the step function in raft.go is the highest-leverage hour of distributed-systems study.

Partitioning strategies: Spanner, DynamoDB, Cassandra compared

The three production patterns to know cold. DDIA chapter 6 (p. 199-228) covers all three; here's the senior+ summary and when to reach for each.

                 DynamoDB                         Cassandra                            Spanner
Partitioning     Consistent hash with             Consistent hash with virtual         Range partitioning by
                 virtual nodes                    nodes (token ring)                   primary key prefix
Consistency      Eventual by default; strong      Tunable: ONE / QUORUM / ALL          External consistency via
                 with consistent_read=true                                             TrueTime
Replication      3-way within region;             Configurable replication             Paxos-replicated within zone;
                 cross-region async               factor; multi-DC                     cross-zone via leader
Failure model    AP-leaning                       Tunable; defaults AP                 CP via Paxos
Range queries    Within partition only            Within partition only                Native (the killer feature)
                 (scan with limits)
Use when         Key-value workload,              Write-heavy, multi-DC,               Strong consistency at global
                 simple access patterns           tunable consistency                  scale, range queries

Consistent hashing — the production technique used in DynamoDB and Cassandra. Naive hash partitioning (hash(key) % N) breaks when you add a node — most keys remap. Consistent hashing maps both nodes and keys onto a ring; adding a node only remaps the keys near its position.

# Minimal runnable sketch (illustrative, not production code)
import bisect
import hashlib

def stable_hash(s):
    # Python's built-in hash() is randomized per process; use a stable digest instead.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=150):
        self.ring = {}  # hash position -> physical node
        for node in nodes:
            for v in range(vnodes_per_node):
                self.ring[stable_hash(f"{node}:{v}")] = node
        self.sorted_positions = sorted(self.ring)

    def get_node(self, key):
        pos = stable_hash(key)
        # First ring position clockwise of the key's hash, wrapping around to the start.
        idx = bisect.bisect(self.sorted_positions, pos) % len(self.sorted_positions)
        return self.ring[self.sorted_positions[idx]]
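
A quick way to check the remapping claim with the class above (toy numbers; 150 vnodes per node as in the constructor default):

ring3 = ConsistentHashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring3.get_node(k) for k in keys}

ring4 = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
moved = sum(before[k] != ring4.get_node(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")   # roughly 25% — far less than with hash(key) % N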

Virtual nodes (vnodes) prevent hot-spots when nodes have heterogeneous capacity. The Dynamo paper (Section 4.2, sosp.org/2007/papers/dynamo.pdf) explains why naive consistent hashing fails on heterogeneous clusters and how vnodes fix it.

Range partitioning — the Spanner / Bigtable pattern. Keys are sorted by primary key prefix; tablets (Bigtable) or splits (Spanner) cover contiguous key ranges. Splits are added or merged dynamically based on load.

Hot-spot mitigation:

  • Salt the prefix: if you partition by timestamp, prefix the key with a hash bucket so writes scatter (e.g., {0..9}#{timestamp} instead of {timestamp}); see the sketch after this list.
  • Composite keys: in DynamoDB, use a partition key that distributes well (user_id) and a sort key for ordering (event_time).
  • Tablet splitting: Spanner splits a tablet when it exceeds 4GB (per the Spanner paper); the rebalancer redistributes splits across nodes.
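
A minimal sketch of the salting idea from the first bullet above (the key format and bucket count are illustrative, not a DynamoDB or Spanner API):

import hashlib

NUM_BUCKETS = 10  # illustrative; size to your write throughput

def salted_key(source_id, timestamp_ms):
    # Deterministic bucket per source: one source's rows stay together,
    # while writes from many sources scatter across NUM_BUCKETS partitions.
    bucket = int(hashlib.md5(source_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket}#{timestamp_ms}"

print(salted_key("sensor-17", 1_700_000_000_000))   # e.g. "4#1700000000000"
# The cost: range scans over a time window must fan out across all buckets.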

The Foursquare 2010 outage (blog.foursquare.com/post/1346372637) is the canonical partitioning hot-spot case study. They sharded MongoDB by user id; load skewed onto one shard until its working set outgrew RAM, performance collapsed, and the degradation cascaded across the cluster — roughly an 11-hour outage. Cite it when an interviewer asks about range partitioning and hot-spots.

Replication: leader-follower vs leaderless, with quorum math

Three replication patterns. DDIA chapter 5, p. 152-184, is the canonical reference; this is the operational summary.

1. Leader-follower (Postgres, MySQL, MongoDB replica sets): writes go to the leader; followers replicate (async or sync). Read-after-write consistency is easy on the leader; followers are eventually consistent.

Client writes           -> [Leader] --replicates--> [Follower 1]
                                    --replicates--> [Follower 2]
Client reads (latest)   -> [Leader]
Client reads (eventual) -> any follower

Failure mode: the leader crashes and the followers must elect a new leader. If replication was async, the new leader may be missing the last few acknowledged writes — data loss unless you require synchronous acks.

2. Multi-leader (CouchDB, multi-region active-active MySQL): multiple leaders, each accepting writes. Replication is async; conflicts can occur and need application-level resolution (last-write-wins, CRDTs, or manual reconciliation). Used when low write latency in multiple regions matters more than consistency.

3. Leaderless (Dynamo, Cassandra, Riak): any replica can accept writes; reads query multiple replicas and pick the most recent (with timestamps or vector clocks). Quorum: writes go to W replicas, reads query R replicas; R + W > N guarantees overlap.

# Cassandra-style: N=3 replicas
W=ONE,    R=ONE    -> eventual consistency, low latency
W=QUORUM, R=QUORUM -> W=2, R=2; R + W > N, overlap guaranteed; strong-ish
W=ALL,    R=ONE    -> strong reads, low write availability (any one node down = no writes)
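
A quick sanity check of a quorum configuration (plain arithmetic, no Cassandra API assumed):

def quorum_properties(n, w, r):
    # R + W > N guarantees every read quorum overlaps every write quorum.
    return {
        "reads_see_latest_acked_write": r + w > n,
        "writes_tolerate_down_nodes": n - w,
        "reads_tolerate_down_nodes": n - r,
    }

print(quorum_properties(n=3, w=2, r=2))  # overlap guaranteed; one node can be down
print(quorum_properties(n=3, w=1, r=1))  # fast, but a read can miss the latest write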

The hidden trap: Cassandra defaults to W=ONE for performance. A write that succeeds on one node is lost if that node crashes before replicating. For durability, set W=QUORUM at minimum. Aphyr's Jepsen analysis of Cassandra (jepsen.io/analyses/cassandra) walks through two real failure modes; if you're running Cassandra in production, it's required reading.

Sloppy quorums and hinted handoff (Dynamo's failure-tolerance trick): if a 'home' replica is down, write to a temporary replica; when home recovers, the temp replica hands off the data. Improves availability under partial failure but breaks the R + W > N guarantee briefly.

Cross-service consistency: 2PC, Sagas, and the modern pattern

For cross-service transactions (e.g., 'place order' touches inventory + payments + shipping), two-phase commit (2PC) was the textbook answer for 30 years and is largely deprecated in 2026.

Why 2PC failed in production: the coordinator is a single point of failure. If the coordinator crashes after phase 1 (prepare) but before phase 2 (commit), participants are stuck holding locks indefinitely. DDIA ch. 9, p. 354-360, walks the failure mode in detail.

The Saga pattern (Garcia-Molina & Salem, 1987 — cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf) is the modern replacement. A long-running transaction is decomposed into a sequence of local transactions, each with a compensating action.

# Order placement saga
T1: Reserve inventory   (compensate: release inventory)
T2: Charge payment      (compensate: refund payment)
T3: Ship order          (compensate: cancel shipment)

If T2 fails: run compensate-T1 (release inventory).
If T3 fails: run compensate-T2, then compensate-T1.
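
A minimal orchestration sketch of that saga (in-process calls and stub participants for illustration; a real system would drive this through Temporal or a similar engine):

def run_saga(steps):
    """steps: (action, compensation) pairs. Run actions in order; on failure,
    run the compensations for completed steps in reverse order, then re-raise."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()   # compensations must be idempotent
        raise

# Stub participants standing in for real inventory/payment/shipping services.
def reserve_inventory(): print("inventory reserved")
def release_inventory(): print("inventory released")
def charge_payment():    raise RuntimeError("card declined")   # force the failure path
def refund_payment():    print("payment refunded")
def ship_order():        print("order shipped")
def cancel_shipment():   print("shipment cancelled")

try:
    run_saga([(reserve_inventory, release_inventory),
              (charge_payment, refund_payment),
              (ship_order, cancel_shipment)])
except RuntimeError:
    pass   # T2 failed; compensate-T1 released the inventory, as in the table above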

Two saga implementation patterns:

  1. Choreography (event-driven): each service emits events; downstream services react. No coordinator. Hard to track what stage the saga is in.
  2. Orchestration (workflow engine): a coordinator (Temporal, AWS Step Functions, Camunda) drives the saga. Easier to debug; introduces a coordination layer.

Production reference: Temporal (github.com/temporalio/temporal) is the dominant orchestration engine in 2026. Used by Stripe, Snap, and Uber for cross-service workflows.

The honest senior+ answer when an interviewer asks 'how do you ensure cross-service consistency?':

'Sagas with compensating transactions, orchestrated by Temporal or a similar workflow engine. Idempotent operations on each participant. Outbox pattern (microservices.io/patterns/data/transactional-outbox.html) for the event publishing — write event to local DB transactionally with the state change; an async worker publishes from the outbox. Avoid 2PC; the coordinator-failure case is unsolved.'
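
A minimal sketch of the outbox half of that answer (SQLite and the table layout are illustrative; a real service would use its own database and a message broker for the publish step):

import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id):
    # The state change and the event row commit in one local transaction,
    # so there is never a state change without its event (or vice versa).
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("order.placed", json.dumps({"order_id": order_id})))

def drain_outbox(publish):
    # Run by an async worker. Delivery is at-least-once, so consumers must be idempotent.
    rows = db.execute("SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order("order-123")
drain_outbox(lambda topic, payload: print("publish", topic, payload))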

What to study and in what order

A 12-week curriculum for senior+ distributed-systems interview readiness:

  1. Weeks 1-3 — DDIA chapters 1-7 (Kleppmann). The foundational text; read it twice if you can. Make notes on partitioning (ch. 6), replication (ch. 5), transactions (ch. 7).
  2. Weeks 4-5 — Raft paper + etcd implementation. Read the Raft paper (raft.github.io/raft.pdf, 14 pages); read the step function in etcd's raft.go.
  3. Weeks 6-7 — Three foundational papers. Spanner (research.google/pubs/spanner-googles-globally-distributed-database-2/), Dynamo (sosp.org/2007/papers/dynamo.pdf), Bigtable (research.google/pubs/bigtable). Read for the trade-off articulation, not the implementation detail.
  4. Weeks 8-9 — Aphyr's Jepsen analyses. Read jepsen.io/analyses for at least three databases (Cassandra, MongoDB, etcd). The 'consistency-during-partition' failure modes are what interviews probe.
  5. Weeks 10-12 — Practical implementation. Build a toy distributed key-value store using Raft (etcd's raft library is the easiest path). MIT 6.824 labs (pdos.csail.mit.edu/6.824) are the canonical hands-on; the 4 labs cover MapReduce, Raft, fault-tolerant KV, sharded KV.

The book if you only read one: DDIA. Kleppmann's coverage is what every senior+ FAANG interview implicitly assumes. Reading it twice with notes is the highest-leverage investment.

The course if you take one: MIT 6.824 'Distributed Systems' (pdos.csail.mit.edu/6.824). Free online, video lectures, four substantive labs in Go. Completing lab 2 (Raft) puts you ahead of ~90% of senior candidates.

Frequently asked questions

Do I need to know Paxos as well as Raft?
Conceptually yes; implementation no. Paxos is what older academic papers reference (Lamport 1998 'The Part-Time Parliament'); Raft is what production systems use. You should know that they're equivalent in capability, that Raft has a leader and Paxos doesn't (Multi-Paxos does), and that nobody's implemented basic Paxos in production because of its complexity. If an interviewer wants Paxos depth, ask which variant — Basic Paxos and Multi-Paxos are different beasts.
When is eventual consistency actually safe in production?
When the workload tolerates stale reads and conflicts are infrequent or auto-mergeable. Concrete examples: shopping cart (last-writer-wins on add-to-cart is fine), social-media timeline (a tweet appearing 200ms late is fine), DNS (eventual is the design). Concrete counterexamples: bank balance, inventory reservation, session token. The litmus test: 'what's the worst-case impact of a 5-second stale read?' If it's user-visible incorrectness, you need stronger consistency.
What's the difference between strong consistency and linearizability?
Linearizability is the strongest single-object consistency model: every operation appears to take effect at some single instant between its invocation and response, in a total order consistent with real time. 'Strong consistency' is a colloquial term often meaning linearizability but sometimes meaning 'serializable transactions' (a multi-object property) or just 'no stale reads'. DDIA ch. 9, p. 324-336, distinguishes them carefully. Use 'linearizability' when precision matters.
Is 2PC really fully deprecated?
Not entirely. PostgreSQL's PREPARE TRANSACTION is 2PC; it's used inside single Postgres clusters with reliable infrastructure. What's deprecated is 2PC across heterogeneous services with independent failure domains. The replacement at that layer is sagas with compensating transactions and idempotent operations. If the interviewer says '2PC across microservices', the right answer is 'I'd use a saga'.
How do CRDTs (conflict-free replicated data types) fit in?
CRDTs are the right tool when you need leaderless replication AND auto-merging without application logic. They guarantee that concurrent updates from any nodes will converge to the same state. Used in: Riak's data types, Redis Enterprise's CRDB, Figma's multiplayer, Linear's local-first sync. Trade-off: not all data types fit (counters, sets, maps yes; ordered lists with arbitrary insert/delete are hard). Reference: Marc Shapiro et al, 'A Comprehensive Study of Convergent and Commutative Replicated Data Types' (hal.inria.fr/inria-00555588).
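
A minimal example of the auto-merge property — a grow-only counter (G-Counter), the simplest CRDT. Toy code; systems like Riak data types or Redis CRDB wrap this machinery behind their own APIs.

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}                 # node_id -> that node's local increments

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas increment concurrently, then sync in either order — same result.
a, b = GCounter("a"), GCounter("b")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5
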
What's the production answer for 'distributed lock'?
Three real options. (1) Redis SETNX with a TTL — fast, but Martin Kleppmann documented Redlock's failure modes (martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html); not safe for correctness-critical work. (2) ZooKeeper ephemeral sequential nodes or etcd leases — the canonical solution. (3) A database row lock held by a leader-elected service — works for low-throughput cases. The honest interview answer: 'use etcd or ZooKeeper for correctness-critical locks; Redis for performance-critical locks where an occasional duplicate is acceptable'.
How do I prepare for distributed-systems questions at companies like Databricks or MongoDB?
Beyond DDIA, study the company's own engineering blog. Databricks (databricks.com/blog) has deep Spark-internals posts. The MongoDB engineering blog (mongodb.com/blog) covers their replication and sharding architecture. For Stripe, study their API-stability posts and public incident write-ups. The senior+ bar at distributed-systems-specialty companies is a level deeper than at general SWE roles.
What's the right answer to 'how do you handle clock skew in distributed systems?'
Three layers. (1) Use NTP/PTP for time sync and accept ~1-10ms of skew. (2) For ordering events, use logical clocks (Lamport timestamps) or vector clocks rather than wall-clock time. (3) For Spanner-style external consistency, use TrueTime — atomic clocks plus GPS plus explicit uncertainty intervals. The canonical interview pitfall: using System.currentTimeMillis() for ordering across machines. Reference: the Jepsen analyses document clock-skew failure modes in production systems.
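
A minimal Lamport-timestamp sketch for layer (2) (toy code; message transport is out of scope):

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock.
        self.time += 1
        return self.time

    def send(self):
        # Stamp an outgoing message with the current logical time.
        return self.tick()

    def receive(self, msg_time):
        # Take the max of local and message time, then advance,
        # so the receive event is ordered after the send event.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()            # a -> b
print(b.receive(t))     # b's clock jumps past a's, preserving causal order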

Sources

  1. Designing Data-Intensive Applications (Kleppmann), 2nd ed. — chapters 5 (Replication), 6 (Partitioning), 7 (Transactions), 9 (Consistency).
  2. Ongaro & Ousterhout — 'In Search of an Understandable Consensus Algorithm' (Raft), USENIX ATC 2014.
  3. Corbett et al — 'Spanner: Google's Globally Distributed Database', OSDI 2012.
  4. DeCandia et al — 'Dynamo: Amazon's Highly Available Key-value Store', SOSP 2007.
  5. Aphyr's Jepsen — partition-tolerance analyses of every major distributed database.
  6. MIT 6.824 'Distributed Systems' — free course with Raft / sharded KV labs in Go.
  7. etcd Raft library — production-grade Raft in Go (used by Kubernetes).
  8. Garcia-Molina & Salem — 'SAGAS', SIGMOD 1987 (the original paper for the saga pattern).

About the author. Blake Crosley founded ResumeGeni and writes about product design, hiring technology, and ATS optimization. More writing at blakecrosley.com.