The failure mode in the article is not really about planned partition reassignment. It is about what happens after a bad broker is stopped, repaired, and restarted. Once that broker comes back, replica catch-up can saturate the underlying storage on both the recovering follower and the up-to-date leaders serving replication traffic.
That matters because the cluster can look healthy enough to restart the broker, but then degrade again during catch-up. As replicas rejoin the ISR at different times, the recovering broker can start participating in remote acknowledgements before the rest of the workload has stabilized, so the blast radius is wider than a simple change of partition leadership.
When to reach for it
Use replication throttling when a broker restart will trigger heavy replica catch-up and your disks do not have enough spare throughput to absorb that recovery traffic cleanly.
This is especially relevant in cloud environments with network-attached or throughput-limited volumes. The article calls out the classic case: a slow or faulty volume causes cluster pain, you stop the bad broker to let Kafka shrink the ISR and restore availability, then the restarted broker risks recreating the same latency problems during catch-up unless recovery traffic is constrained.
What the article is warning about
The point is not that replication is bad. The point is that replication catch-up consumes real disk and network throughput, and on a saturated or fragile storage profile it can push both the leader and follower back into the same slowdown you were trying to escape.
Jason Taylor's example assumes a balanced cluster with RF=3, min.isr=2, about 50 MB/s of external traffic per broker, and a 250 MB/s volume profile. In that setup, a restarted broker catching up can push total disk demand on peer brokers beyond the safe limit, which is where replication saturation starts feeding back into latency and timeout problems.
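The arithmetic behind that example can be sketched quickly. The 50 MB/s external traffic, RF=3, and 250 MB/s volume profile come from the article; the 3x multiplier is back-of-envelope reasoning for a balanced cluster (each broker leads one third of partitions and follows two thirds), not a measured profile.

```python
# Rough capacity model for the article's example cluster.
MB = 1_000_000

volume_limit = 250 * MB    # per-broker volume throughput profile
external_write = 50 * MB   # foreground produce traffic per broker

# With RF=3 on a balanced cluster, each broker writes its own leader
# traffic plus follower replication for two peers' partitions, so
# steady-state disk writes are roughly 3x the external rate.
steady_state_writes = external_write * 3

# Whatever remains is all the volume can absorb for catch-up reads,
# consumer traffic, and recovery replication combined.
headroom = volume_limit - steady_state_writes
print(f"steady-state writes: {steady_state_writes // MB} MB/s")
print(f"headroom for catch-up: {headroom // MB} MB/s")
```

Once catch-up traffic on a peer broker pushes total disk demand past that headroom, the volume saturates and the original latency problem returns.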
The actual Kafka configs involved
For this use case, the important knobs are the replication quota settings rather than the reassignment workflow itself. Kafka uses broker-level rate caps plus topic-level lists of which replicas should be throttled.
- leader.replication.throttled.rate: broker-level cap for leader-side replication traffic in bytes per second.
- follower.replication.throttled.rate: broker-level cap for follower-side replication traffic in bytes per second.
- leader.replication.throttled.replicas: topic-level list of throttled leader replicas.
- follower.replication.throttled.replicas: topic-level list of throttled follower replicas.
- For this catch-up pattern, the article recommends setting the topic-level replica lists to * so the throttles apply broadly during recovery rather than only to a narrowly enumerated reassignment set.
How to apply it for restart catch-up
The article's example applies the broker-level follower throttle across every broker, then applies follower.replication.throttled.replicas=* across topics so the recovering broker cannot pull catch-up traffic fast enough to saturate peer volumes.
It then applies the same pattern on the leader side with leader.replication.throttled.rate and leader.replication.throttled.replicas=*, because leaders also pay disk and network cost while serving the recovery stream.
- Broker-level follower example from the article: follower.replication.throttled.rate=190000000.
- Broker-level leader example from the article: leader.replication.throttled.rate=175000000.
- Topic-level recovery scope from the article: follower.replication.throttled.replicas=* and leader.replication.throttled.replicas=*.
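A sketch of how these values might be applied with the stock kafka-configs.sh tool. The bootstrap address and topic name are placeholders; --entity-default applies the broker-level caps cluster-wide, and the topic-level command would be repeated per affected topic.

```shell
# Broker-level caps, applied as a cluster-wide default (bytes/sec).
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type brokers --entity-default \
  --add-config 'follower.replication.throttled.rate=190000000,leader.replication.throttled.rate=175000000'

# Topic-level scope: throttle all replicas during recovery.
# "my-topic" is a placeholder; repeat for each topic involved.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config 'leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*'
```

Once lag has closed, the same tool's --delete-config option removes these settings so replication returns to full speed.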
How to think about the numbers
The article's model is capacity-first: leave enough throughput for foreground traffic, then let catch-up use only the remainder. On the leader side that means total available disk throughput minus the foreground produce path. On the follower side it means making sure the recovering broker can still pull faster than the cluster is writing, otherwise catch-up never finishes.
That is why the article explicitly warns that follower throttling cannot be lower than the effective incoming write rate. If the foreground write pressure is higher than the allowed catch-up bandwidth, the restarted broker will not close the gap and you will stay under-replicated indefinitely.
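That constraint is simple to check numerically. The 190 MB/s throttle is the article's example value; the 150 MB/s incoming write rate is an assumption carried over from the balanced RF=3 model above, where the recovering broker's replicas absorb roughly three times the per-broker external write rate.

```python
# Feasibility check: the follower throttle must exceed the effective
# incoming write rate, or the recovering broker never closes its lag.
MB = 1_000_000

follower_throttle = 190 * MB    # follower.replication.throttled.rate
incoming_write_rate = 150 * MB  # assumed ongoing writes to this broker's replicas

net_catch_up = follower_throttle - incoming_write_rate
assert net_catch_up > 0, "throttle below write rate: lag will never shrink"
print(f"net catch-up bandwidth: {net_catch_up // MB} MB/s")
```

With these numbers only about 40 MB/s actually goes toward closing the gap, which is why recovery windows stretch so far under conservative throttles.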
The trade-off you are accepting
This is an availability-versus-risk trade. Throttling protects the live workload and avoids reintroducing the storage saturation problem, but it also lengthens recovery time and leaves partitions under-replicated for longer.
The article gives a concrete example where protecting a 250 MB/s volume profile can stretch catch-up time to around eight hours. If that window is too risky for your environment, the answer is not to remove guardrails blindly. The answer is to provision enough spare throughput that the cluster can recover quickly without hitting the same saturation boundary.
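To see how an eight-hour window can arise, here is a back-of-envelope estimate. The net catch-up bandwidth follows from the article's example throttle minus the assumed ongoing write rate; the 1.15 TB backlog is a purely hypothetical illustration, not a figure from the article.

```python
# Back-of-envelope catch-up duration under the follower throttle.
MB = 1_000_000

net_catch_up = (190 - 150) * MB  # throttle minus assumed ongoing writes
backlog_bytes = 1.15e12          # hypothetical replication backlog

hours = backlog_bytes / net_catch_up / 3600
print(f"estimated catch-up time: {hours:.1f} hours")
```

Doubling the volume's spare throughput roughly halves that window, which is the provisioning argument the article is making.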
Operational checks during recovery
Confirm the throttles directly with kafka-configs.sh --describe at the broker and topic levels so you know the guardrails are really in place.
Watch the restarted broker's lag and the live cluster's latency at the same time. You want lag to shrink without produce and consume latency climbing back into the failure mode that made the broker restart necessary in the first place.
The practical rule from Kafka's operations guide still applies here too: if the effective incoming write rate is higher than the throttle, replication may not make forward progress. So the lag on the recovering broker has to be going down, not just moving sideways.
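The checks above can be run with the stock CLI tools; the bootstrap address and topic name are placeholders.

```shell
# Verify the throttles are actually in place at both levels.
kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type brokers --entity-default
kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type topics --entity-name my-topic

# Watch the under-replicated partition set shrink as catch-up progresses.
kafka-topics.sh --bootstrap-server localhost:9092 --describe \
  --under-replicated-partitions
```

If the under-replicated set stays flat for a sustained period, that is the sideways-lag signal above: raise the throttle or reduce foreground write pressure.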