Tombstone messages are not deleted

Hi,

this topic is about the observation that delete marker records (tombstones) are not deleted as expected.

I have the following initial situation (using Kafka 2.6):

1 topic with 1 partition
Offsets: Partition: 0; low: 2; high: 947; offset: 947; #(high-low): 945
Number of non-null records: 5
Number of all records:      270

Topic configuration is this:

cleanup.policy=compact
min.compaction.lag.ms=3600000
min.cleanable.dirty.ratio=0.5
segment.ms=3600000
delete.retention.ms=86400000
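
For reference, the effective topic configuration can be checked with the Java AdminClient, roughly like this (a minimal sketch; the broker address and topic name are placeholders):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ShowTopicConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-compacted-topic"); // placeholder topic

        try (Admin admin = Admin.create(props)) {
            Config config = admin.describeConfigs(Collections.singleton(topic)).all().get().get(topic);
            // Print only the settings that matter for compaction behaviour
            for (String name : new String[]{"cleanup.policy", "min.compaction.lag.ms",
                    "min.cleanable.dirty.ratio", "segment.ms", "delete.retention.ms"}) {
                System.out.println(name + " = " + config.get(name).value());
            }
        }
    }
}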

The log files of this partition are

43029 Nov 21 20:55 00000000000000000000.log
  326 Nov 22 12:14 00000000000000000937.log
  772 Nov 24 15:11 00000000000000000939.log
  561 Nov 24 16:17 00000000000000000941.log
  174 Nov 24 17:17 00000000000000000943.log
  772 Dec 17 11:38 00000000000000000944.log
  174 Dec 17 12:38 00000000000000000946.log

The oldest record (a tombstone) is from 2021-06-21. The latest offsets are (distinguishing between tombstone and data records):

936 tombstone (delete marker)
937 tombstone (delete marker)
938 tombstone (delete marker)
939 data
940 data
941 tombstone (delete marker)
942 data
943 tombstone (delete marker)
944 data
945 data
946 tombstone (delete marker) 2021-12-17

All older records are tombstone records. So there is a huge number of tombstone records that are not deleted, although delete.retention.ms is one day. The *000.log and *937.log files contain only tombstones, and the timestamps of these files are also very old.

My question is: why are the old tombstones not deleted? Trying to find an answer to this question led me to the blog post “Kafka quirks: tombstones that refuse to disappear” and the Kafka issue KAFKA-8522.

But I am not sure whether my observation matches the post and the issue. In particular, changing delete.retention.ms to ‘0’ did not change anything on my system.

Any explanation is really welcome, and hints on how to overcome this situation even more so, as we have topics with high throughput and lots of tombstone messages which seem to accumulate at the beginning of the log.

Thanks and kind regards
Franz

Hi Franz! Tombstones have always been a hassle for me, so hopefully we can get to the bottom of your issue.

First of all, what do your keys look like? It would be very helpful to know the keys of the messages that you posted with offsets. I ask because a compacted topic is guaranteed to keep at least the latest value for each key.


Hi @danicafine,

thanks for reaching out. I don’t want to say that the key values are random, but they change constantly. If it is indeed the case that the last value is always kept even for tombstone messages, that would explain my observation. But then I would have misunderstood the log compaction concept on that point.

Any consumer progressing from the start of the log will see at least the final state of all records in the order they were written. Additionally, all delete markers for deleted records will be seen, provided the consumer reaches the head of the log in a time period less than the topic’s delete.retention.ms setting (the default is 24 hours). In other words: since the removal of delete markers happens concurrently with reads, it is possible for a consumer to miss delete markers if it lags by more than delete.retention.ms.

There would be no need for the configuration property delete.retention.ms if a final tombstone record were kept, because there would then be no danger of missing the tombstone, as seems to be the case now.

Did I really misunderstand the concept here? What would be the use of keeping tombstone messages?

Kind regards
Franz

Compaction definitely does happen on a per-key basis. So if your keys are ever-changing/random, you won’t really get to take advantage of compaction in the typical sense. For example, I would use it to create a sort of look-up table where the latest update for a given key is saved and outdated values are deleted after a certain period of time.
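
As a rough sketch of what I mean by a look-up table (just an in-memory example with a plain Java consumer; the broker address and topic name are made up):

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LookupTable {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("my-compacted-topic", 0); // placeholder topic
        Map<String, String> table = new HashMap<>(); // latest value per key

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));
            long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);

            // Replay the partition once and keep only the latest state per key
            while (consumer.position(tp) < end) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    if (r.value() == null) {
                        table.remove(r.key());          // tombstone: the key was deleted
                    } else {
                        table.put(r.key(), r.value());  // data record: remember the latest value
                    }
                }
            }
            System.out.println("look-up table entries: " + table.size());
        }
    }
}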

Setting log.cleanup.policy = compact means that at least one value of each key will be saved. If you would like both compaction and standard deletion to happen, i.e. you want records to be compacted by key but you also want whole segments older than retention.ms to be deleted, you’d need to set log.cleanup.policy = [compact, delete].

But keep in mind that only inactive file segments can be considered for compaction/deletion, so you may need to play around with the log.segment.bytes to get exactly what you want.
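
If it helps, here’s a rough sketch of how those topic-level settings could be changed with the Java AdminClient (the broker address, topic name, and segment size are just placeholders):

import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetCleanupPolicy {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-compacted-topic"); // placeholder topic

        Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, Arrays.asList(
                // combine compaction with time/size-based deletion
                new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact,delete"), AlterConfigOp.OpType.SET),
                // smaller segments roll sooner and become eligible for cleaning earlier
                new AlterConfigOp(new ConfigEntry("segment.bytes", "104857600"), AlterConfigOp.OpType.SET)));

        try (Admin admin = Admin.create(props)) {
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}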

I think my use case is simple. I want at least the last record of a key to be kept until I delete the key with a tombstone (delete marker) record. In this case I also expect that the tombstone record will be deleted after delete.retention.ms (once other necessary conditions, like inactive file segments and min.cleanable.dirty.ratio, are fulfilled).

In my use case each key will sooner or later be deleted by a tombstone record, and data with new keys will appear. The fact that all data for deleted keys disappears at some point is critical for my use case, because otherwise the topic fills up with delete marker records.
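
Deleting a key simply means producing a record with a null value for it, roughly like this (a minimal sketch; broker address, topic name, and key are placeholders):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DeleteKey {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A record with a null value is the delete marker (tombstone) for this key
            producer.send(new ProducerRecord<>("my-compacted-topic", "some-key", null)); // placeholder topic and key
            producer.flush();
        }
    }
}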

I would like to ask again clearly: does the last tombstone record for a key really remain in a log-compacted topic forever?

The additional option log.cleanup.policy=delete is not an option for me, as I don’t know when data can finally be deleted. Individual records can live for a very long time. I have to keep them until I explicitly delete them with a tombstone message.

I should clarify my last post.

delete.retention.ms does apply to compacted topics, i.e. ones with log.cleanup.policy = compact. delete.retention.ms does not apply to non-null messages in a compacted topic, but a tombstone message for a given key will only be kept around until delete.retention.ms has passed.

Coming back to your use case:

I would like to ask again clearly: does the last tombstone record for a key really remain in a log-compacted topic forever?

No, not by default.

You should be able to achieve what you want by simply using delete.retention.ms. But keep in mind that it’s 100% per key.

Apologies for the roundabout answer. 🙂

Also wanted to clarify.

What would be the use of keeping tombstone messages?

This is helpful in the event that consumers go down. Imagine we have a consumer that reads from a topic and updates some internal state (that is safely stored locally), e.g. a count of the times it has seen a message for a given key. Suppose that when the consumer reads a tombstone message, it should remove that entry from the state, resetting the counter; otherwise, it increments the counter.

Suppose the consumer goes down and, during that time, a tombstone message is sent. Maybe it takes a while for the consumer to come back up… and during that time, log compaction occurs and the tombstone message is purged. In that scenario, when the consumer comes back up, it has no idea that the tombstone message happened, so it continues to increment the counter for that key, which, of course, is incorrect given the intended business logic.

That’s a contrived but very real example. We allow for tombstones to be kept for situations like this. Basically, it gives recovering consumers at least delete.retention.ms time to read tombstones before they’re cleaned up.
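
A minimal sketch of such a counting consumer might look like this (broker address, group id, and topic name are made up):

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CountingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "counting-consumer");           // made-up group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Map<String, Long> counts = new HashMap<>(); // locally stored state: messages seen per key

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-compacted-topic")); // made-up topic name
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    if (r.value() == null) {
                        counts.remove(r.key());               // tombstone: reset the counter for this key
                    } else {
                        counts.merge(r.key(), 1L, Long::sum); // data record: increment the counter
                    }
                }
            }
        }
    }
}

If the tombstone is purged before this consumer gets to read it, the entry for that key is never removed, which is exactly the failure mode described above.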


Thanks @danicafine for the clarification. So the expectation is that tombstone records will be deleted sooner or later (including the last tombstone of a key). This is not the case for me (Kafka 2.6). Possibly this is a bug in Kafka (KAFKA-8522). I think I will have to evaluate it again after a Kafka update.