Hi All,
We are facing an issue with the Kafka consumer poll behavior in our application. We have configured max.poll.records to 3000 and implemented logic such that if a poll returns at least 3000 records, we process the records and commit the offsets; otherwise, we ignore that poll.
This setup has been working fine, but recently, we observed an unusual behavior. For almost 48 hours, the consumer consistently polled the same number of records (1500) in every poll, which is less than the configured threshold of 3000. As a result, we were unable to process the records, leading to a significant business impact. After this period, the issue resolved itself, and the consumer started polling with the expected record count (>=3000).
This behavior has been occurring intermittently for the past month, causing disruptions to our application’s processing.
Below are our consumer configurations; we are using the defaults for everything other than the following.
auto.offset.reset = earliest
enable.auto.commit = false
max.poll.interval.ms = 360000
max.poll.records = 3000
max.partition.fetch.bytes = 2000000
heartbeat.interval.ms = 12000
session.timeout.ms = 120000
Each record is roughly 160 bytes, so 3000 * 160 = 480,000 bytes (~0.48 MB).
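For reference, this is roughly how we build the consumer; the bootstrap servers, group id, and deserializers below are placeholders rather than our actual values:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerSetup {
    public static KafkaConsumer<String, String> buildConsumer() {
        Properties props = new Properties();

        // Placeholder connection details
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Settings listed above; everything else is left at its default
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "360000");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "3000");
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "2000000");
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "12000");
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "120000");

        return new KafkaConsumer<>(props);
    }
}
```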
Has anyone else faced a similar issue or have insights into what might be causing this behavior? Are there any fixes or configurations we should check to prevent this from happening again?
Any suggestions or guidance would be greatly appreciated.
You can look at increasing fetch.min.bytes and fetch.max.wait.ms to influence the behavior. Influence is the key word though: AFAIK there is no way to guarantee preventing what you are seeing, given that multiple parallel fetches might happen under the hood, so successive polls might greedily return fewer records than you want. The crux of the issue is that max.poll.records is strictly an upper bound when you are looking for a lower bound. I will poke around to see if there is another way, but currently I think you will need to either implement the lower-bound threshold client-side or reconsider whether you need a hard lower bound on the record count.
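To make that concrete, here is a minimal sketch of what such tuning could look like, using your ~160 bytes/record estimate; the exact values are purely illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class FetchTuning {
    // Illustrative values only: ask the broker to hold a fetch response until it has
    // roughly 3000 records' worth of data (3000 * 160 bytes ~= 480 KB) or 1 second has
    // passed, whichever comes first. This biases polls toward fuller batches but does
    // not guarantee them.
    public static void applyFetchTuning(Properties props) {
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "480000");
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "1000");
    }
}
```

Keep in mind that fetch.min.bytes applies per fetch request to each broker, so even with this in place a single poll can still return fewer than 3000 records.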
Thank you for your insights on this. We have attempted to increase the fetch.min.bytes and fetch.max.wait.ms settings; however, these changes did not have any impact on the issue. Unfortunately, we are still experiencing the problem intermittently, at least twice a week. Could you kindly share any further suggestions?
That’s not entirely unexpected. There isn’t a way to guarantee this (there is no lower-bound analogue of max.poll.records). A couple of options that you can consider:
- Buffer in your application: call poll multiple times if needed until you have accumulated >= 3000 records (see the sketch at the end of this reply)
- Take a look at Kafka Streams. You can use it to buffer a number of records and optionally apply an upper time bound so a batch is let through even if you haven’t hit 3000. This Stack Overflow post has some ideas and sample code.
I would probably lean toward option 1, though if Kafka Streams could replace the application entirely then option 2 is enticing. IOW, IMO it comes down to how good a fit Kafka Streams is for your application overall.
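To make option 1 concrete, here is a rough sketch of client-side buffering; the topic name, batch-size constant, time bound, and processBatch method are placeholders you would replace with your own logic and error handling:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BufferingLoop {

    private static final int MIN_BATCH_SIZE = 3000;   // your lower-bound threshold
    private static final long MAX_WAIT_MS = 60_000;   // optional time bound so a partial batch is eventually flushed

    // Sketch only: accumulate records across polls until there are at least
    // MIN_BATCH_SIZE of them (or MAX_WAIT_MS has passed), then process and commit.
    public static void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("example-topic")); // placeholder topic

        List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
        long batchStart = System.currentTimeMillis();

        while (true) {
            ConsumerRecords<String, String> polled = consumer.poll(Duration.ofMillis(500));
            polled.forEach(buffer::add);

            boolean enoughRecords = buffer.size() >= MIN_BATCH_SIZE;
            boolean waitedLongEnough = !buffer.isEmpty()
                    && System.currentTimeMillis() - batchStart >= MAX_WAIT_MS;

            if (enoughRecords || waitedLongEnough) {
                processBatch(buffer);     // placeholder for your processing logic
                consumer.commitSync();    // commits offsets for everything polled so far
                buffer.clear();
                batchStart = System.currentTimeMillis();
            }
        }
    }

    private static void processBatch(List<ConsumerRecord<String, String>> batch) {
        // placeholder: your existing batch-processing code goes here
    }
}
```

One thing to keep in mind: since enable.auto.commit is false and offsets are committed only after the buffered batch is processed, a restart before the commit means those records will be re-delivered, so the processing should tolerate reprocessing. Because the loop keeps calling poll while buffering, it also stays within max.poll.interval.ms.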