Kafka loses offsets after a period of time (1 week default?)

Hi,

I discovered a slightly surprising behaviour with one of my Kafka clusters today.

I realized that one of the consumer processes has been offline for some considerable time. More than 1 week.

When I restarted it, the Kafka cluster seemed to have lost all memory of the last committed offset.

This was even more surprising because I inspected the consumer group with Conduktor Console before restarting it, and Conduktor Console was reporting that there was a stored offset for this particular consumer group.

After I restarted the process, the Conduktor Console no longer reported any committed offset associated with this consumer group.

I suspect that the consumer group re-registering with the cluster caused the cluster to run its garbage collection process, which then removed the (now expired) consumer group offset data.

I found this in the documentation:

Here are some questions:

  • Am I correct in guessing that the consumer group offset was most likely lost because it had expired (with a default expiry time of 1 week)?
  • Is there a way to prevent this from happening again?

Yes

The two things that come to mind are (1) increasing the offset retention setting (I believe the relevant broker config is offsets.retention.minutes), or (2) storing the offsets outside Kafka. Is this a common pattern? I.e., when you say “I realized that one of the consumer processes has been offline for some considerable time”, was that consumer supposed to be running with no downtime, or are long periods (“days”) of downtime expected?
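To make option (2) a bit more concrete, here is a minimal sketch of keeping offsets in a local JSON file instead of a separate database. `FileOffsetStore` and the file layout are hypothetical names made up for illustration, not part of any Kafka client, and the sketch assumes a single consumer process owns the file.

```python
import json
import os
import tempfile
from pathlib import Path


class FileOffsetStore:
    """Hypothetical helper: persists the next offset to read per (topic, partition)."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        """Return {(topic, partition): offset}, or {} if nothing was saved yet."""
        if not self.path.exists():
            return {}
        raw = json.loads(self.path.read_text())
        return {
            (topic, int(partition)): offset
            for topic, partitions in raw.items()
            for partition, offset in partitions.items()
        }

    def save(self, offsets):
        """Write all offsets atomically so a crash never leaves a half-written file."""
        raw = {}
        for (topic, partition), offset in offsets.items():
            # JSON object keys must be strings, so the partition number is stringified.
            raw.setdefault(topic, {})[str(partition)] = offset
        # Write to a temp file in the same directory, then rename over the target.
        fd, tmp_name = tempfile.mkstemp(dir=str(self.path.parent), suffix=".tmp")
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(raw, tmp_file)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        os.replace(tmp_name, self.path)
```

On startup the consumer would call `load()` and seek each assigned partition to the stored position before polling, then call `save()` after each processed batch. How the seek is done depends on the client library, so that part is left out here.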

Hey thanks for your reply. To answer your question here:

  • This process is intended to run continuously. It stopped running and nobody noticed, because it doesn’t do anything particularly important. The data it has to process is now just backlogged. I didn’t notice it had stopped because it doesn’t have any alerting/monitoring system attached to it.

In this specific context, the only issue with setting the retention to some arbitrarily long value is that it might go offline again, and I might not have as much time as I had in the past to keep an eye on it. It’s possible I might take it offline for a while and then start it up in the future.

It’s a shame “indefinitely long” isn’t a supported mode for the retention of offsets, while it is for regular topics. I don’t have time to add another system like a database or MongoDB to store the offsets in. That shouldn’t really be necessary. I might try to suggest this feature to someone, although I’m not sure who I would contact, or how.

I assume there’s no engineering reason why indefinite retention of topic offsets is not supported?

As a workaround, you could set it so high that it’s effectively “indefinitely long”… (2 billion minutes ≈ 3800 years)
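For anyone wanting to check the arithmetic, here is a small sketch that converts a retention value in minutes to years and formats the matching broker properties line. It assumes the setting in question is offsets.retention.minutes and that it is stored as a 32-bit integer (so the ceiling is 2^31 - 1). The helper names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60
DAYS_PER_YEAR = 365.25  # average year length, including leap years
INT32_MAX = 2**31 - 1   # assumption: the broker stores this setting as a 32-bit int


def minutes_to_years(minutes):
    """Convert a retention period in minutes to (approximate) years."""
    return minutes / (MINUTES_PER_DAY * DAYS_PER_YEAR)


def retention_line(minutes):
    """Format a broker properties line for the given offset retention."""
    if not 0 < minutes <= INT32_MAX:
        raise ValueError("retention must be a positive value that fits in 32 bits")
    return f"offsets.retention.minutes={minutes}"


if __name__ == "__main__":
    # 10,080 minutes is exactly 7 days, i.e. the 1 week discussed in this thread.
    for minutes in (10_080, 2_000_000_000, INT32_MAX):
        print(retention_line(minutes), f"# ~{minutes_to_years(minutes):,.2f} years")
```

The printed line would go into the broker configuration (typically server.properties). As far as I know this is a broker-wide setting rather than a per-group one, so it affects every consumer group on the cluster.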

I’m fairly certain that this would need a KIP since configuration is a public interface. Here’s a blog about the KIP process. I would suggest -1 for indefinite to be consistent with retention.ms (“If set to -1, no time limit is applied”).

None that I know of!