Hello,
we are using Spring Kafka together with Kafka 2.5. We have a cluster of 3 brokers, with min.insync.replicas set to 1 and a replication factor of 2.
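For illustration, a minimal sketch of how a topic with this layout could be created via the Java AdminClient; the topic name is taken from the log excerpts below, and the partition count and bootstrap servers are placeholders:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateInvoiceTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");

        try (Admin admin = Admin.create(props)) {
            // Replication factor 2 across the 3 brokers, min.insync.replicas 1 as described above.
            // Topic name from the log excerpts below; partition count is a placeholder.
            NewTopic topic = new NewTopic("invoice-service", 12, (short) 2)
                    .configs(Map.of("min.insync.replicas", "1"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```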
For some time now, we have been seeing the following types of log messages:
[Consumer clientId=consumer-reporting-service-30, groupId=reporting-service] The following partitions still have unstable offsets which are not cleared on the broker side: [invoice-service-1], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log
After researching these INFO logs a bit, we unfortunately did not find much information about unstable offsets, aside from some source code documentation:
/**
* An unstable offset is one which is either undecided (i.e. its ultimate outcome is not yet known),
* or one that is decided, but may not have been replicated (i.e. any transaction which has a COMMIT/ABORT
* marker written at a higher offset than the current high watermark).
*/
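For context, this is the kind of transactional read-process-write flow that produces such undecided offsets; a minimal sketch with the plain clients (Spring Kafka performs the equivalent calls internally when transactions are enabled; the output topic name is only illustrative):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

public class ReadProcessWriteSketch {
    // The producer must be created with transactional.id set and
    // initTransactions() called once before this loop is entered.
    static void pollProcessCommit(KafkaConsumer<String, String> consumer,
                                  KafkaProducer<String, String> producer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        producer.beginTransaction();
        try {
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            records.forEach(record -> {
                producer.send(new ProducerRecord<>("invoice-service", record.key(), record.value()));
                offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
            });
            // These consumer offsets stay "unstable" from the broker's point of view
            // until the transaction's COMMIT/ABORT marker is written (and replicated).
            producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
            producer.commitTransaction();
        } catch (Exception e) {
            producer.abortTransaction();
        }
    }
}
```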
What makes this more interesting is that, in some cases, these log messages are almost exactly transaction.max.timeout.ms (900000 ms) apart (a sketch for reading this broker setting back follows the excerpt):
Oct 14, 2021 @ 11:18:17.637 [Consumer clientId=consumer-coupon-service-16, groupId=coupon-service] Fetching committed offsets for partitions: [invoice-service-8]
Oct 14, 2021 @ 11:18:17.639 [Consumer clientId=consumer-coupon-service-16, groupId=coupon-service] Failed to fetch offset for partition invoice-service-8: There are unstable offsets that need to be cleared.
Oct 14, 2021 @ 11:18:17.639 [Consumer clientId=consumer-coupon-service-16, groupId=coupon-service] The following partitions still have unstable offsets which are not cleared on the broker side: [invoice-service-8], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log
Oct 14, 2021 @ 11:18:17.659 [Consumer clientId=consumer-coupon-service-6, groupId=coupon-service] Committed offset 123345052 for partition invoice-service-8
// 15m later
Oct 14, 2021 @ 11:33:17.637 [Consumer clientId=consumer-coupon-service-16, groupId=coupon-service] The following partitions still have unstable offsets which are not cleared on the broker side: [invoice-service-8], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log
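To double-check the broker-side value behind that 15 minute gap, a minimal sketch of reading transaction.max.timeout.ms back via the AdminClient (broker id and bootstrap server are assumptions; repeat for each broker):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Properties;

public class CheckTransactionTimeout {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (Admin admin = Admin.create(props)) {
            // Broker id "0" is an assumption.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Config config = admin.describeConfigs(List.of(broker)).all().get().get(broker);
            System.out.println("transaction.max.timeout.ms = "
                    + config.get("transaction.max.timeout.ms").value());
        }
    }
}
```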
Going by the following diagram of the LEO and HW:
It seems to me that in this case an unstable offset is indicative of replication lag, i.e. the offsets are not being replicated fast enough in some cases, triggering a timeout. However, in that case I would have expected replica.lag.time.max.ms to kick in:
If a follower hasn’t sent any fetch requests or hasn’t consumed up to the leaders log end offset for at least this time, the leader will remove the follower from isr
We checked the current replication offsets, however, and the ISR seems to be behaving properly. Similarly, the broker logs themselves did not point to any obvious problems.
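For completeness, a minimal sketch of one way to inspect the ISR per partition via the AdminClient (bootstrap server is a placeholder); since committed offsets are stored in __consumer_offsets, its partitions are included as well:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.List;
import java.util.Properties;

public class CheckIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (Admin admin = Admin.create(props)) {
            // A partition whose ISR is smaller than its replica list points to replication lag.
            admin.describeTopics(List.of("invoice-service", "__consumer_offsets"))
                    .all().get()
                    .forEach((topic, description) ->
                            description.partitions().forEach(p ->
                                    System.out.printf("%s-%d leader=%s replicas=%s isr=%s%n",
                                            topic, p.partition(), p.leader(), p.replicas(), p.isr())));
        }
    }
}
```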
Does anyone have an idea as to what we could further investigate in this case?