Seeking validation of reasoning on ISR, replication, and multi-DC resilience

I’d like to kindly ask for your feedback on the correctness of my reasoning regarding ISR (In-Sync Replicas), data durability, and availability in a multi-datacenter (multi-DC) Kafka deployment.

Let’s define ISR (In-Sync Replicas) as the dynamic set of all replicas of a given partition (including its leader) that are fully synchronized with the leader. A key property of ISR is that if the leader fails, any remaining ISR member can be elected as the new leader without data loss, assuming appropriate producer and broker configurations.
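
For concreteness, the current ISR of each partition can be observed with the Java AdminClient. The sketch below is only illustrative: it assumes a reasonably recent kafka-clients dependency (3.1+ for allTopicNames()), a placeholder bootstrap address, and a hypothetical topic named "orders".

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.List;
import java.util.Properties;

public class IsrInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        try (Admin admin = Admin.create(props)) {
            // Fetch metadata for a hypothetical topic named "orders".
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                    .allTopicNames().get().get("orders");
            // Print the leader and the current ISR of every partition.
            for (TopicPartitionInfo p : desc.partitions()) {
                System.out.printf("partition %d: leader=%s isr=%s%n",
                        p.partition(), p.leader(), p.isr());
            }
        }
    }
}
```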

We consider the worst-case failure scenario: the loss of the single datacenter (DC) hosting the largest number of ISR replicas for a partition. If our design satisfies the desired guarantees under this worst case, it also holds for the failure of any other DC, since each of those hosts fewer or equally many ISR replicas for that partition.

With the loss of a single DC, our goal is to simultaneously guarantee, for every partition of every topic:

  • Data durability (no acknowledged message is lost),

  • Write availability (producers can continue writing),

  • Read availability (consumers can continue reading).

We further assume that producers are configured with acks=all, which means the leader acknowledges a write only after it has been replicated to all current ISR members, and rejects the write if the ISR has shrunk below min.insync.replicas; consequently, every acknowledged message resides on at least min.insync.replicas replicas.
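
As an illustration, here is a minimal Java producer configured this way. The bootstrap address and topic name are placeholders, and enabling idempotence is an extra assumption of mine, not something the argument requires.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: the leader acknowledges only after the full current ISR has the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Optional hardening (my assumption, not required by the argument):
        // idempotence implies acks=all and deduplicates retried sends.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // If fewer than min.insync.replicas ISR members are available,
                    // this surfaces as a NotEnoughReplicas(AfterAppend)Exception.
                    exception.printStackTrace();
                }
            });
            producer.flush();
        }
    }
}
```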

Let us denote:

  • r = total number of ISR replicas for a partition (normally equal to replication.factor),

  • x = min.insync.replicas (the value we aim to determine),

  • a = number of ISR replicas located in the failing DC (the one hosting the most ISR replicas for the partition),

  • b = number of ISR replicas remaining in the surviving DC(s), so that r = a + b.

To meet our three goals under the worst-case DC loss, the following inequalities must hold:

  • x > a (data durability): the write quorum must include at least one ISR replica outside the failing DC, so every acknowledged message survives the failure.

  • b ≥ x (write availability): enough ISR replicas remain in the healthy DC(s) to satisfy the producer’s acks=all requirement.

  • b > 0 (read availability): at least one ISR replica remains to be elected leader and serve consumers.
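
To make these conditions easy to check against a concrete placement, here is a small helper of my own (nothing Kafka provides), with r, x, a, b as defined above:

```java
/** Sketch: evaluate the three conditions for a given replica placement. */
public final class IsrPlacementCheck {

    /**
     * r = replication.factor (ISR size in steady state), x = min.insync.replicas,
     * a = ISR replicas in the failing DC (worst case), b = r - a in the surviving DC(s).
     */
    static boolean survivesDcLoss(int r, int x, int a) {
        int b = r - a;
        boolean durability = x > a;          // write quorum reaches outside the failing DC
        boolean writeAvailability = b >= x;  // acks=all can still be satisfied
        boolean readAvailability = b > 0;    // a new leader can be elected from the ISR
        return durability && writeAvailability && readAvailability;
    }

    public static void main(String[] args) {
        System.out.println(survivesDcLoss(3, 2, 1)); // 3 DCs, one replica each -> true
        System.out.println(survivesDcLoss(4, 2, 2)); // 2 DCs, two replicas each -> false
    }
}
```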

Example: 3-DC Deployment

In a trivial 3-DC setup with one broker per DC, replication.factor = 3, and rack-aware replica placement (broker.rack configured per DC), replicas are evenly distributed: a = 1, b = 2. Choosing min.insync.replicas = 2 satisfies all conditions:

  • x = 2 > a = 1

  • b = 2 ≥ x = 2

  • b = 2 > 0

This configuration, combined with unclean.leader.election.enable = false (only ISR members can become leaders, which prevents data loss) and acks = all on producers, provides strong durability and availability guarantees even if an entire DC fails.
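
For illustration, such a topic could be created through the AdminClient as sketched below. The topic name, partition count, and bootstrap address are placeholders; the two settings are shown as topic-level overrides (both can instead be set broker-wide), and broker.rack itself is a broker setting that has to go into each broker's server.properties.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateResilientTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        try (Admin admin = Admin.create(props)) {
            // replication.factor = 3; with broker.rack set per DC, the rack-aware
            // assignor places one replica in each of the three DCs.
            NewTopic topic = new NewTopic("orders", 6, (short) 3) // name and partition count are placeholders
                    .configs(Map.of(
                            "min.insync.replicas", "2",
                            "unclean.leader.election.enable", "false"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```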

Impossibility in a 2-DC Setup

Now consider a 2-DC deployment. In the worst case, the failing DC contains at least as many ISR replicas as the surviving one, i.e.:

  • a ≥ b

Combining this with our earlier conditions:

  • x > a (durability),

  • b ≥ x (write availability)

we derive:

x > a ≥ b ≥ x ⟹ x > x — a contradiction.

Therefore, no natural-number values of r, x, a, and b can simultaneously satisfy all three goals in a 2-DC topology. This suggests that true fault tolerance with both durability and availability is fundamentally unattainable across only two datacenters under Kafka’s current replication model.
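
The same conclusion can be checked by brute force; this small sketch of mine enumerates r and x, takes the 2-DC worst case a = ceil(r/2), and finds no combination that satisfies all three conditions:

```java
/** Sketch: brute-force confirmation that no 2-DC placement meets all three goals. */
public final class TwoDcImpossibility {
    public static void main(String[] args) {
        for (int r = 2; r <= 12; r++) {
            int a = (r + 1) / 2;   // worst case in 2 DCs: the failing DC holds ceil(r/2) ISR replicas
            int b = r - a;
            for (int x = 1; x <= r; x++) {
                if (x > a && b >= x && b > 0) {
                    System.out.printf("counterexample: r=%d x=%d a=%d b=%d%n", r, x, a, b);
                }
            }
        }
        System.out.println("search finished: no feasible (r, x) found");
    }
}
```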

Configuration Summary (3-DC Case)

  • broker.rack: unique value per DC (enables rack-aware replica placement across DCs)

  • replication.factor: 3 (three replicas in total, one per DC thanks to rack awareness)

  • min.insync.replicas: 2 (ensures acknowledged writes survive the loss of one DC)

  • unclean.leader.election.enable: false (leaders are elected from the ISR only)

  • acks: all (the producer requires acknowledgment backed by at least min.insync.replicas replicas)

Could you please confirm whether this reasoning is sound?