
Distributed Systems and Consistency for Backend Engineers

In short

Distributed-systems fluency is what separates senior backend engineers from mid-level ones. You need to articulate CAP trade-offs with real systems (Spanner is CP, DynamoDB is AP), explain Raft state changes from memory, and pick between two-phase commit and Saga based on coupling and failure tolerance. You also need to reason about ordering with Lamport timestamps and vector clocks, know when CRDTs replace coordination, and clearly distinguish linearizability (single object, real-time order) from serializability (multi-object transactions). Interviewers test all of this.

Key takeaways

  • CAP forces a network-partition choice: Spanner picks CP via TrueTime + Paxos, DynamoDB picks AP via leaderless quorum, CockroachDB picks CP with Raft per range.
  • Raft is consensus made teachable: leader election by term, log replication by AppendEntries, safety through the Election Restriction.
  • Two-phase commit blocks on coordinator failure; Saga trades atomicity for availability via idempotent compensating actions.
  • Lamport timestamps give an ordering consistent with causality without physical clocks (a total order once ties are broken by node ID); vector clocks detect causality and concurrent updates.
  • CRDTs (G-Counter, OR-Set, RGA) let replicas merge without coordination, useful for collaborative editing and offline-first apps.
  • Linearizability is a single-object real-time guarantee; serializability is a multi-object transaction guarantee. Spanner offers both via external consistency.
  • Strong consistency costs latency and availability; eventual consistency costs reasoning effort and bug surface area.

Why distributed-systems fluency dominates senior+ interviews

Backend interviews at senior, staff, and principal levels stop being about CRUD and start being about what happens when the network breaks. The reason is structural: every interesting backend system in 2026 is distributed. Microservices, multi-region databases, message queues, edge caches, and serverless platforms all fail in ways that single-machine systems do not. If you cannot reason about partitions, ordering, and consensus, you cannot design these systems and you cannot debug them.

The shift shows up in interview rubrics. System design rounds for L5 and above explicitly probe consistency model selection, partition behavior, and failure-mode reasoning. The Amazon Builders' Library documents the patterns interviewers expect candidates to discuss: leader election, fencing tokens, and split-brain. Google's Spanner paper is required reading for anyone interviewing for an L5+ infra role. Interviewers treat Designing Data-Intensive Applications Chapter 9 as table stakes.

The deeper reason: distributed-systems thinking is the most expensive skill to develop on the job. You either learn it by studying primary sources, or you learn it the hard way by being on call for an outage caused by a stale read or a network partition. Companies pay for the first kind of learning.

The CAP-theorem trade-offs that actually matter in 2026

Brewer's CAP theorem says that during a network partition, a distributed system can guarantee either consistency (every read sees the latest write) or availability (every request gets a response), but not both. CAP is often misquoted as a steady-state choice; it is really a partition-time choice. Most production systems behave as CP or AP only while a partition is active and provide both consistency and availability the rest of the time.

The three reference points worth memorizing for interviews:

  • Google Spanner: CP, globally. Mechanism: Paxos + TrueTime atomic clocks; commit-wait absorbs clock uncertainty. Use when: financial ledgers, global inventory, anything where stale reads are unacceptable.
  • Amazon DynamoDB: AP, regionally. Mechanism: leaderless replication, sloppy quorum, hinted handoff; eventual consistency by default with optional strong reads. Use when: high-write shopping carts, session stores, IoT telemetry.
  • CockroachDB: CP, globally. Mechanism: Raft per range, hybrid logical clocks (HLC), serializable isolation. Use when: OLTP workloads needing Postgres compatibility plus geo-distribution.

The worked decision matrix interviewers want to hear: if your business cannot tolerate two customers buying the last item, you need CP; the cost is that during a regional partition, the minority side rejects writes. If your business cannot tolerate 503s on the cart endpoint, you need AP; the cost is that you ship code to detect and reconcile concurrent writes (vector clocks, last-writer-wins, or CRDTs). If you say "we'll just use Spanner everywhere," you are wrong: Spanner's commit-wait adds 5-10ms per write even when nothing is partitioned, and your single-region service does not need that.

The practical rule from the Amazon Builders' Library: pick your consistency model from the business requirement, not the database menu. Then pick a database that supports it.

Consensus algorithms in plain text

Consensus is how a group of replicas agrees on a single value despite failures. Paxos is the canonical algorithm; Raft is the version humans can actually implement. Raft decomposes consensus into three subproblems: leader election, log replication, and safety.

Each Raft node is in one of three states: Follower, Candidate, or Leader. Time is divided into terms; each term has at most one leader. A follower that does not hear from a leader within an election timeout becomes a candidate, increments its term, votes for itself, and asks peers for votes. A candidate that wins a majority becomes leader and starts sending AppendEntries RPCs (which double as heartbeats). The Election Restriction guarantees that a candidate must have all committed log entries to win, which preserves safety.

# Raft state-machine pseudocode (simplified)
on election_timeout():
    state = CANDIDATE
    current_term += 1
    voted_for = self.id
    votes = 1                                   # vote for self
    reset_election_timer()                      # randomized timeout, e.g. 150-300ms
    for peer in peers:
        send RequestVote(term=current_term,
                         last_log_index=log.last_index(),
                         last_log_term=log.last_term())

on RequestVote(term, last_log_idx, last_log_term) from candidate:
    if term < current_term: reply False; return
    if term > current_term: current_term = term; state = FOLLOWER; voted_for = None
    if voted_for in (None, candidate) and \
       candidate_log_at_least_as_up_to_date(last_log_idx, last_log_term):
        voted_for = candidate
        reply True
    else:
        reply False

on AppendEntries(term, prev_idx, prev_term, entries, leader_commit):
    if term < current_term: reply False; return
    if term > current_term: current_term = term
    state = FOLLOWER                            # a current leader exists
    reset_election_timer()
    if prev_idx > log.last_index() or log[prev_idx].term != prev_term:
        reply False; return                     # log consistency check
    log.delete_from(prev_idx + 1)               # drop entries that conflict with the leader
    log.append(entries)
    if leader_commit > commit_index:
        commit_index = min(leader_commit, log.last_index())
    apply_committed_entries_to_state_machine()
    reply True

The interview move: when asked "how does Raft pick a leader?" do not recite the paper. Walk through term increments, the split-vote possibility, randomized election timeouts (typically 150-300ms), and why the Election Restriction prevents committed entries from being lost. Mention that etcd, Consul, CockroachDB, TiDB, and MongoDB's replica sets all use Raft variants.

Distributed-transaction patterns

When a transaction spans services or shards, you need a coordination protocol. The two real choices are two-phase commit (2PC) and the Saga pattern.

Two-phase commit uses a coordinator. Phase 1 (prepare): coordinator asks all participants to vote; each writes its intent durably and replies yes or no. Phase 2 (commit): if all voted yes, coordinator tells everyone to commit; otherwise tells everyone to abort. 2PC gives you atomicity, but it is blocking: if the coordinator crashes after participants vote yes but before they hear the decision, participants hold locks indefinitely. This is why 2PC is rare in microservices and common only inside a single database engine (Postgres prepared transactions, XA).
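
To make the blocking window concrete, here is a minimal coordinator sketch; the participant objects and their prepare/commit/abort methods are assumptions for illustration, not any particular library's API.

# Two-phase commit coordinator sketch (illustrative; participants are assumed
# to persist their vote durably before replying)
def two_phase_commit(txn_id, participants):
    # Phase 1: prepare. Every participant must vote yes.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txn_id))   # participant writes intent, then votes
        except Exception:
            votes.append(False)               # unreachable participant counts as "no"
    decision = all(votes)

    # The coordinator must durably log `decision` before phase 2; if it crashes
    # here, participants that voted yes hold their locks until it recovers.

    # Phase 2: broadcast commit or abort.
    for p in participants:
        if decision:
            p.commit(txn_id)
        else:
            p.abort(txn_id)
    return decision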

Sagas trade atomicity for availability. A long-running business transaction is decomposed into local transactions, each with a compensating action that undoes it. If step 4 fails, you run compensations for steps 3, 2, 1 in reverse. Sagas come in two flavors: orchestration (a central service drives the steps; easier to debug) and choreography (services emit events and react; lower coupling but harder to trace).

# Saga orchestrator with idempotent compensating actions
class StepFailed(Exception):
    """Raised by any booking step that cannot complete."""

class BookTripSaga:
    def run(self, trip_id):
        try:
            flight = self.book_flight(trip_id)        # step 1
            hotel  = self.book_hotel(trip_id)         # step 2
            car    = self.book_car(trip_id)           # step 3
            self.charge_payment(trip_id)              # step 4
        except StepFailed:
            # Compensations run in reverse order and must be idempotent
            self.refund_payment(trip_id)              # safe if no charge
            self.cancel_car(trip_id)                  # safe if no booking
            self.cancel_hotel(trip_id)
            self.cancel_flight(trip_id)
            raise

    def cancel_hotel(self, trip_id):
        # Idempotency key prevents double-cancel under retry
        self.hotel_api.cancel(trip_id, idempotency_key=f"cancel-{trip_id}")

The non-negotiable Saga rule: every compensating action must be idempotent, because retries are a fact of distributed life. Use idempotency keys, conditional updates, or version checks. Never rely on "we'll only call it once."
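
One way to get the same guarantee without an idempotency key is a conditional update. A sketch, assuming a hypothetical bookings table with a status column and a db handle whose execute returns the affected row count:

# Idempotent compensation via conditional update (table, column, and db handle
# are hypothetical, for illustration only)
def cancel_hotel_conditional(db, trip_id):
    # Only flips CONFIRMED to CANCELLED; a retry matches zero rows and does nothing
    affected = db.execute(
        "UPDATE bookings SET status = 'CANCELLED' "
        "WHERE trip_id = %s AND status = 'CONFIRMED'",
        (trip_id,),
    )
    return affected  # 0 on retry or if the booking never existed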

Lamport timestamps round out the toolkit. They give you an ordering consistent with causality without synchronized clocks: every event gets a counter; when sending a message, include your counter; on receive, set local = max(local, received) + 1. Breaking ties by node ID turns this into a total order. Vector clocks generalize this to detect concurrency: each node tracks a vector of seen counters per peer, and you can compare two events to determine whether one happens-before the other or they are concurrent.

# Lamport timestamp algorithm
class LamportClock:
    def __init__(self): self.t = 0
    def local_event(self):
        self.t += 1
        return self.t
    def send(self):
        self.t += 1
        return self.t  # piggyback on the message
    def receive(self, msg_t):
        self.t = max(self.t, msg_t) + 1
        return self.t
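
Vector clocks follow the same receive-and-merge pattern with one counter per node. A minimal sketch, assuming a dict-per-node representation; the node IDs and comparison helper are illustrative:

# Vector clock sketch; the dict-per-node layout and node IDs are illustrative
class VectorClock:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.v = {p: 0 for p in set(peers) | {node_id}}  # one counter per node, incl. self

    def local_event(self):
        self.v[self.node_id] += 1
        return dict(self.v)                     # snapshot to piggyback on a message

    def receive(self, msg_v):
        for node, count in msg_v.items():       # element-wise max with the sender's view
            self.v[node] = max(self.v.get(node, 0), count)
        self.v[self.node_id] += 1

def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for two clock snapshots."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a: return 'equal'
    if a_le_b: return 'before'                  # a happens-before b
    if b_le_a: return 'after'                   # b happens-before a
    return 'concurrent'                         # each saw updates the other did not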

CRDTs (Conflict-free Replicated Data Types) sidestep coordination entirely. A G-Counter is a vector of per-node counters; merge takes the element-wise max. An OR-Set tracks adds and removes with unique tags so concurrent add/remove resolves deterministically. CRDTs power Figma's multiplayer cursors, Riak's distributed datatypes, and Automerge-based collaboration tools. They are not a free lunch: metadata grows, and not every operation has a CRDT formulation.
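
A G-Counter is small enough to sketch in full; the per-node dict layout below is an illustrative assumption:

# G-Counter (grow-only counter) CRDT sketch; per-node dict layout is illustrative
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}                         # node_id -> increments made on that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())         # total across all replicas seen so far

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or duplication
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

Two replicas can increment independently, exchange state in either direction, and still converge on the same total; that convergence-by-construction is exactly what lets CRDTs skip coordination.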

Finally, the consistency-model distinction interviewers love. Linearizability is a guarantee about a single object: every operation appears to take effect atomically at some point between its invocation and response, in real-time order. Serializability is a guarantee about transactions: the result is equivalent to some serial execution. Spanner provides both simultaneously and calls the combination external consistency. Postgres at its default isolation level gives you neither across a cluster; you need Serializable isolation on a single-node database, or a system like Spanner or CockroachDB.

Frequently asked questions

What is the difference between linearizability and serializability?
Linearizability is a single-object, real-time guarantee: a read always returns the most recent write per wall-clock order. Serializability is a multi-object transaction guarantee: the outcome equals some serial schedule, but transactions may be reordered. Spanner offers both as external consistency; many SQL databases offer serializability without linearizability across replicas.
When should I use a Saga instead of two-phase commit?
Use Saga whenever the transaction crosses service boundaries or runs longer than a few hundred milliseconds. 2PC blocks under coordinator failure and holds locks across services, which is unacceptable for user-facing flows. Saga keeps each step local, requires idempotent compensations, and tolerates partial failure. Use 2PC only inside a single database engine where the coordinator is the engine itself.
Is Raft really better than Paxos, or is it just easier to teach?
Both achieve the same safety and liveness properties. Raft restricts the design space (strong leader, contiguous log) to make implementation tractable; classic Multi-Paxos allows more parallelism but is notoriously hard to implement correctly. In practice, Raft is the right default for new systems, which is why etcd, Consul, CockroachDB, TiDB, and MongoDB chose it.
Why does Spanner add commit-wait latency even when there is no partition?
TrueTime gives Spanner a bounded clock-uncertainty interval (typically 1-7ms). To guarantee external consistency, a write must wait out that uncertainty before acknowledging, so a later transaction's timestamp is provably greater. The cost is per-write latency; the benefit is that linearizability holds globally without coordination per read.
When are CRDTs the wrong choice?
When you need strong consistency or transactional invariants across multiple objects, CRDTs cannot help; you need consensus or a coordinator. CRDTs also accumulate metadata (tombstones in OR-Sets, version vectors), which grows with operation count. They shine for collaborative editing, presence, counters, and offline-first sync where availability and merge convergence matter more than strict ordering.
How do I pick between strong and eventual consistency?
Ask whether a stale read can produce a wrong business outcome. Inventory, balances, and authorization need strong consistency. Feeds, recommendations, analytics, and presence are fine eventually consistent. The cost of strong consistency is latency and availability under partition; the cost of eventual consistency is reasoning effort and bug surface area in your application code.
What ordering guarantees do Lamport and vector clocks actually provide?
Lamport timestamps give an order consistent with causality: if A happens-before B, then ts(A) < ts(B), and ties between unrelated events can be broken by node ID. They cannot tell you whether two events are concurrent. Vector clocks can: comparing two vectors element-wise tells you whether one precedes the other, or they are concurrent (each has higher values in different positions). Dynamo-style AP systems use vector clocks to detect concurrent writes so they can be surfaced as siblings and reconciled rather than silently overwritten by last-writer-wins.
Can a system be both linearizable and highly available across regions?
Not during a network partition; that is the CAP result. Spanner gets close by using dedicated atomic clocks and private fiber to make partitions rare and brief, but during a true partition the minority side rejects writes. Any system claiming both is either not actually linearizable, not actually multi-region, or papering over partition behavior with stale reads.

Sources

  1. Designing Data-Intensive Applications, Chapter 9: Consistency and Consensus (Kleppmann)
  2. In Search of an Understandable Consensus Algorithm (Ongaro & Ousterhout, 2014)
  3. Spanner: Google's Globally-Distributed Database (Corbett et al., 2012)
  4. Avoiding Fallback in Distributed Systems (Amazon Builders' Library)
  5. Leader Election in Distributed Systems (Amazon Builders' Library)
  6. Database Internals, Part II: Distributed Systems (Petrov)
  7. The USE Method for Performance Analysis (Brendan Gregg)

About the author. Blake Crosley founded ResumeGeni and writes about backend engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.