Hi Kafka community,
I have a 3-node Kafka 3.8.1 cluster and a Kafka Connect cluster, deployed identically at two sites (Site A and Site B).
In Site A, everything works smoothly. In Site B, we observe frequent and long-lasting rebalances, especially under occasional disk latency spikes, even though NVMe disks are used there too.
What I did so far:
- Collected rebalance logs (with “rebalance delay: 30000 ms” etc.)
- Analyzed connector config (Humio HEC sink): small buffer, no timeout/backoff/threads config, errors.tolerance = none, etc.
- Noted that storage in Site B occasionally shows momentary I/O latency increases (despite being the same hardware type).
- Proposed a patch with humio.hec.buffer_size = 1000, timeout.ms = 10000, threads = 3, backoff settings, and a change to errors.tolerance, etc.
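
As a rough sketch, the part of the proposed patch that has explicit values would look like the fragment below. The key names are written exactly as listed above and may not match the connector's actual property names. The backoff and errors.tolerance changes are left out because no values are given for them here.

```properties
# Sketch of the proposed sink connector changes (key names as listed above, unverified)
humio.hec.buffer_size=1000
timeout.ms=10000
threads=3
```
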
Questions:
- Given this context, would the Kafka community consider this a misconfiguration, bug, or expected behavior in edge scenarios?
- Could there be buggy logic that exacerbates the impact of such site-specific latency spikes?
- Are there known best practices or community-backed config suggestions for connector sinks in geographically distinct sites with intermittent latency?
- Would enabling static membership or tweaking scheduled.rebalance.max.delay.ms help significantly in such cases?