Eternal data retention, pitfalls if any, and KIP-405

Brief background - We are planning to use Kafka as the core component for building a data distribution exchange, through which domain data from one division of the org will be shared with other divisions internally. The solution will capture and distribute a stream of database updates (using log-based CDC). The authoring databases will not go away or be replaced; they will stay as they are, because all our content ingestion applications will first write to the databases, as they always have.

We want to retain all the data in the Kafka topics and enable log compaction. Data size will be approximately 1 TB to begin with, growing by double-digit GBs per month.

Queries -

  1. Is it an anti-pattern, or simply inadvisable, to have eternal data retention in Kafka? If not, and if we choose to do it, what are some of the design and operational pitfalls that one needs to be aware of?
    I acknowledge that this might be a bit of a loaded question, which is why I have added some contextual background. If more information is needed, please let me know.

  2. Regarding Kafka Tiered Storage (KIP-405): Will this get to the point where we can use an S3-backed topic and be able to read that topic from the beginning as if it were stored in Kafka? When and with which version can this be expected to be available?

  3. As an extension to the above point, we found that tiered storage cannot be enabled for compacted topics. I can probably understand why, but if possible, can someone indicate whether it can be expected in some future version?

Thanks & regards,
Sajal

I think it all kind of boils down to how valuable the individual updates are versus how important it is for consumers to quickly go from start to end and arrive at the correct current state.

Depending on this, there are several options. For example, you can have both a compacted topic and an eternal topic for each CDC source, so consumers can choose. Another solution could be to configure the compacted topic to always keep at least the latest 48 hours of changes, so consumers have 2 days to get all the changes.
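To make those two options concrete, here is a minimal sketch using the Java AdminClient. The topic names, partition counts, and replication factor are made up, and the 48-hour window maps to `min.compaction.lag.ms`, which keeps a record uncompacted for at least that long after it is written:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CdcTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Option 1: an "eternal" topic that keeps every change event forever.
            NewTopic eternal = new NewTopic("orders.cdc.all", 6, (short) 3)
                    .configs(Map.of(
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE,
                            TopicConfig.RETENTION_MS_CONFIG, "-1"));  // -1 = never delete by time

            // Option 2: a compacted topic that keeps the latest value per key, but
            // guarantees every change stays readable for at least 48 hours.
            NewTopic compacted = new NewTopic("orders.cdc.latest", 6, (short) 3)
                    .configs(Map.of(
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
                            TopicConfig.MIN_COMPACTION_LAG_MS_CONFIG,
                            String.valueOf(48L * 60 * 60 * 1000)));

            admin.createTopics(List.of(eternal, compacted)).all().get();
        }
    }
}
```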

In a similar setup we went with just compacted topics, as the other services were only interested in the latest state anyway, but it might be different in your case.

  1. It’s not an anti-pattern to store the entire dataset in Kafka. It is, however, expensive, so you need to ask whether you really need it. Most of the time when people do this, it is because they need the ability to “rehydrate” databases from Kafka quickly. For example, a new app comes online and needs its own $DATABASE_FLAVOR_OF_CHOICE, and you need the ability to do this often without impacting the source database.

  2. KIP-405 is expected to land in AK 3.1.0. However, this is just the framework for tiered storage. [KAFKA-9565] Implementation of Tiered Storage SPI to integrate with S3 - ASF JIRA is the JIRA that tracks the S3 storage layer implementation; it does not have a release assigned to it.

  3. It’s unknown when tiered storage will support compacted topics.

Since your post focuses heavily on compacted topics, it’s worth talking about what compacted topics from CDC sources can get you. Remember that for the MAJORITY of CDC sources, you’re only getting the change; you will not get the entire changed row. Kafka’s compacted topics keep the latest value for a given key, so in your compacted topic you’ll only have the most recent change (ignoring, for the moment, that compaction doesn’t happen in real time). You may want to run some CDC tests and verify that compacted topics are getting you what you expect.
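One quick way to run that kind of test is to simulate what a fully compacted topic would leave behind: read the raw CDC topic from the beginning and keep only the last record per key. This is just a sketch; the topic name `orders.cdc.raw` is made up, and I’m using plain string serdes for illustration:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CompactionPreview {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "compaction-preview");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Only the most recent record per key survives compaction, so a HashMap
        // keyed by the record key mimics the end state of a compacted topic.
        Map<String, String> latestPerKey = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders.cdc.raw"));   // hypothetical raw CDC topic
            for (int i = 0; i < 10; i++) {                   // bounded poll loop for the test
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // If the CDC payload only carries the changed columns, the surviving
                    // record per key is a *partial* row, not the full current state.
                    latestPerKey.put(record.key(), record.value());
                }
            }
        }
        latestPerKey.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```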

I’m with @mitchell-h, as his responses are spot on. I would add just one thing regarding storing your entire dataset in Kafka being an anti-pattern:

→ Sometimes, people disregard the consequences of this design.

Pragmatically speaking, you need to think about how your consumer architecture will be somewhat different because you are forcing consumers to read the data through Kafka’s pub/sub model. For instance, if an app needs to read data from a specific point in time, it needs to write a bunch of boilerplate code using the consumer API to be able to do that, when it should instead just be able to query the data.
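To illustrate the kind of boilerplate I mean, here is roughly what “read from a specific point in time” looks like with the plain consumer API. The topic name and timestamp are made up for the sketch:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ReadFromPointInTime {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        long since = Instant.parse("2021-11-01T00:00:00Z").toEpochMilli();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // 1. Find all partitions of the topic and assign them manually.
            List<TopicPartition> partitions = consumer.partitionsFor("orders.cdc.all").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            // 2. Translate the wall-clock timestamp into an offset per partition.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, since));
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);

            // 3. Seek each partition to that offset (or to the end if nothing newer exists).
            offsets.forEach((tp, oat) -> {
                if (oat != null) {
                    consumer.seek(tp, oat.offset());
                } else {
                    consumer.seekToEnd(List.of(tp));
                }
            });

            // 4. Only now can the app start polling records "as of" that point in time.
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```

All of that, versus a single `SELECT ... WHERE updated_at >= ?` against a database.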

“But I can add a ksqlDB/Flink layer to enable querying of the data streams.”

Yes, you can. But now you’re adding another distributed system just to enable apps to query data. Both ksqlDB and Flink are great, but think about their usage in this scenario: doesn’t this look like a band-aid to you?

I once tried to implement a simple CRUD app in Java using Kafka as my data store. The time I spent making it work, the number of workarounds I needed to write to handle edge cases, the trade-offs I had to accept just because Kafka doesn’t support point-in-time queries, not to mention the amount of hardware resources the app needed to cope with this design (such as a caching layer for the data read off Kafka): IMHO, it is not worth it.

My opinion is → Kafka as the central backbone, yes. As a central data store… hell no!

I think what @riferrei is saying is “Start with your query layer and query patterns FIRST. From that you can determine what processing goes where and what your whole stack will need to look like.”

To add to my earlier story, in line with @mitchell-h’s remarks: in our case we didn’t compact the raw CDC topic, but derived a second topic from the raw CDC data which did contain the full state for each key. Indeed, depending on the application, CDC data can contain modifications where not all the data for that key is included, and in that case compaction will lead to weird results. A rough sketch of that approach is below.
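Here is a minimal Kafka Streams sketch of that “merge partial changes into a full-state topic” idea. The topic names are made up, and the `col=value;col=value` string encoding is purely illustrative; a real CDC payload (e.g. Debezium JSON/Avro) would be merged field by field, and the derived `orders.cdc.full` topic would be created with `cleanup.policy=compact` so it can be compacted safely:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class FullStateTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cdc-full-state-builder");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Raw CDC topic: values are partial change events (only the modified columns).
        builder.stream("orders.cdc.raw", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // Merge each partial change into the previously accumulated full row.
               .aggregate(
                       () -> "",
                       (key, partialChange, fullRow) -> mergeColumns(fullRow, partialChange),
                       Materialized.with(Serdes.String(), Serdes.String()))
               .toStream()
               // This derived topic now holds full rows and is safe to compact.
               .to("orders.cdc.full", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }

    // Naive merge: later values for the same column overwrite earlier ones.
    private static String mergeColumns(String fullRow, String partialChange) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String row : new String[]{fullRow, partialChange}) {
            for (String pair : row.split(";")) {
                if (!pair.isEmpty()) {
                    String[] kv = pair.split("=", 2);
                    columns.put(kv[0], kv.length > 1 ? kv[1] : "");
                }
            }
        }
        return columns.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(";"));
    }
}
```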

Thank you very much @mitchell-h and @riferrei for your detailed responses; all points duly noted.

Firstly, the queries from my initial post on KIP-405 have been adequately answered by @mitchell-h, so those stand resolved. Thank you very much.

Regarding the eternal data retention part of the post and the feedback provided by you all so far, here are my responses -

  1. On @mitchell-h’s point about rehydration of databases: yes, our scenario is similar. New consumers will need to read a topic from offset 0 (or whatever the beginning is) to capture all the data, and thereafter move on to an incremental phase to stay in sync with delta updates (a consumer sketch of this pattern follows after this list).
    And yes, we are trying to be mindful of the cost aspects. In our case, even with eternal data retention, the data size will be in the low single-digit TBs, so storage costs might not be a big factor, but we are definitely accounting for them.

  2. @mitchell-h’s point about running CDC tests on compacted topics for data verification - a valid point, and we are factoring it in.

  3. @riferrei’s point about not having Kafka as a central data store - no, we aren’t planning to do that. And as you highlighted, point-in-time queries or temporal queries aren’t a necessity for us.

    Just to reiterate here, the source DBs wouldn’t go away and all authoring applications would keep writing to the source DBs. Change events captured from the source DBs using CDC will flow via Kafka to downstream consumers. Kafka is being envisaged as a central component for data distribution internally within the org (across domains) and that’s why the eternal retention of data is important.

  4. @riferrei’s point on ksqlDB and adding another distributed system to enable apps to query data - I hear you, and we would want to avoid that.

  5. @mitchell-h’s last point on “Start with your query layer and query patterns first” - yes, we are doing that, taking a query-and-data-first approach and then working outwards from it.
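For completeness, here is the kind of consumer I have in mind for point 1: bootstrap from the beginning, detect when the initial load is done, then keep following incremental updates. The topic name and the `apply()` sink are placeholders for our actual target datastore:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class BootstrapThenFollow {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("orders.cdc.full").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            // Snapshot the current end offsets; reaching them means the bootstrap phase is done.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            boolean bootstrapping = true;
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> apply(r.key(), r.value()));  // load into the new datastore

                if (bootstrapping && partitions.stream()
                        .allMatch(tp -> consumer.position(tp) >= endOffsets.get(tp))) {
                    bootstrapping = false;
                    System.out.println("Bootstrap complete, switching to incremental updates");
                }
            }
        }
    }

    // Placeholder: write the record into the consumer's own database.
    private static void apply(String key, String value) {
        System.out.println(key + " -> " + value);
    }
}
```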

All said, the reason for creating this thread was my apprehension about eternal data retention, without which future consumers will not have the entire dataset to read and access.

I welcome further feedback on this, and thank you again to everyone for your valued and prompt responses.

Thanks & regards,
Sajal
