I have a rather unusual problem to solve.
I have a very low-throughput Kafka topic; messages are produced to it sporadically, often in bursts, from many different producers. A consumer process runs every 24h to process the messages. Sometimes processing fails and has to be re-attempted in the following run, 24h later. The record retention on the topic is 7 days, and the offset retention for the cluster is also set to 7 days.
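For reference, these are the settings involved, as I understand them (values shown are the 7-day equivalents; offsets.retention.minutes is a broker-level setting, retention.ms is set on the topic):

```
# broker configuration (server.properties): committed offsets are kept for 7 days
offsets.retention.minutes=10080

# topic configuration: records are kept for 7 days
retention.ms=604800000
```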
I rely on Prometheus consumer offset lag metrics to determine how many new messages are awaiting processing.
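For context, the lag number comes from a query along these lines (this assumes kafka_exporter's kafka_consumergroup_lag metric; the group and topic names are placeholders):

```
sum by (topic) (
  kafka_consumergroup_lag{consumergroup="daily-processor", topic="my-topic"}
)
```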
The problematic scenario is this:
- the consumer processes all pending messages in the topic, commits offsets
- for 7 consecutive days, there are no new messages - every time it runs, the consumer consumes no messages and commits no offsets
- after 7 days, the last committed offsets are deleted - we lose visibility of the consumer lag
- new messages arrive in the topic, but we can’t see them in our metrics until the consumer runs successfully and commits offsets, at which point the consumer lag would be zero anyway
There is a related problem: previously, the topic’s record retention was longer than the offset retention. This meant that after 7 days of consuming no messages, the committed offsets were deleted and the consumer process started again from the earliest offset, re-processing messages that had already been processed. I “fixed” this by ensuring the offset retention is as long as the topic retention, but it is not ideal.
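I assume the restart-from-earliest behaviour came from the consumer’s auto.offset.reset setting: once the committed offsets were gone, the group had no stored position and fell back to it. Roughly this configuration (confluent-kafka-python shown; broker address and group id are placeholders):

```python
# Sketch of the consumer configuration I believe is in play. With no committed
# offset (e.g. after expiry), the consumer falls back to auto.offset.reset,
# hence the re-processing from the start of the topic.
consumer_config = {
    "bootstrap.servers": "broker:9092",   # placeholder
    "group.id": "daily-processor",        # placeholder
    "enable.auto.commit": False,          # offsets are committed manually after processing
    "auto.offset.reset": "earliest",      # fallback position when no committed offset exists
}
```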
I tried modifying the consumer to always commit offsets, even if no messages were consumed, but that fails with the error “Local: No offset stored”. There seems to be no way for the consumer to keep its committed offsets “alive” and prevent them from being deleted when there are no new messages to consume for a prolonged period.
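For completeness, this is roughly what the failed attempt looks like: a minimal sketch using confluent-kafka-python (the error string comes from librdkafka); broker, group and topic names are placeholders.

```python
from confluent_kafka import Consumer, KafkaException

consumer = Consumer({
    "bootstrap.servers": "broker:9092",   # placeholder
    "group.id": "daily-processor",        # placeholder
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["my-topic"])          # placeholder topic name

msg = consumer.poll(timeout=30.0)
if msg is None:
    # Nothing was consumed this run, so there is no stored offset to commit;
    # the synchronous commit raises KafkaException with
    # "Local: No offset stored" (_NO_OFFSET).
    try:
        consumer.commit(asynchronous=False)
    except KafkaException as e:
        print(f"commit failed: {e}")
else:
    if msg.error():
        raise KafkaException(msg.error())
    # normal path: process the message here, then commit its offset
    consumer.commit(message=msg, asynchronous=False)

consumer.close()
```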
I hope this makes sense, and thanks for reading.