Performance impact of filtering out unnecessary records from KTable before joining

Hello!

I’m working on a Kafka Streams application that use RocksDB to save its state. I have a question regarding the performance of joins involving a KTable that may contain a large number of unnecessary records.

In my case, I have the opportunity to filter out a significant portion of these unneeded records from the KTable (by returning null in a KTable#mapValues).

I’m wondering how much this might impact performance.

Here are the join scenarios I currently have involving this KTable:

  • It’s on the right side of a FK left join.
  • It’s on the right side of a PK join.
  • It’s on the left side of a PK left join.

An important detail: records in this KTable are expected to be updated very rarely. Most of the time, they are either newly inserted or deleted, not modified.

Additionally, filtering out unnecessary records from the KTable will not affect the number of records in the resulting join tables.

Given that, would pre-filtering this KTable to remove a large number of irrelevant entries (via KTable#mapValues that could return null) yield noticeable performance improvements, or does Kafka Streams handle this efficiently enough on its own during the join process?

Thanks in advance for answers :slight_smile:

That’s hard to answer. For sure, using smaller stores (if it’s a significant amount of data removed) sounds like a good thing to do. Smaller is always better (could give you higher cache hit rate, and more efficient RocksDB compaction). – You should ensure that the correct KTable is materialized via Topolog#describe(); and if necessary, enforce the correct materialization using Materialized parameter.

In general, PK-joins use direct key-lookups, and if RocksDB data was compacted and bloom filters are used, the perf (ie, throughput) difference might not be too huge. For FK-joins it might have larger gain, as FK joins use (expensive) range scans: so “unnecessary” inserts/updates into the left input table, could avoid unnecessary range scans (which don’t return any data/join results), but are still not “for free”.

I am wondering though, why you would use mapValues() instead of filter()?

In the end: the best thing is always to just try it out in some dev environment and measure it :slight_smile:

1 Like

Thank you for the answer :slight_smile:

I am wondering though, why you would use mapValues() instead of filter() ?

I use mapValues instead of filter in this case because I don’t need all the fields in the filtered KTable. This allows me to map it to a smaller object or return a tombstone if the condition is not met.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.