Performance impact of filtering out unnecessary records from KTable before joining

Dawid · 27 May 2025 12:38

Hello!

I’m working on a Kafka Streams application that use RocksDB to save its state. I have a question regarding the performance of joins involving a KTable that may contain a large number of unnecessary records.

In my case, I have the opportunity to filter out a significant portion of these unneeded records from the KTable (by returning null in a KTable#mapValues).

I’m wondering how much this might impact performance.

Here are the join scenarios I currently have involving this KTable:

It’s on the right side of a FK left join.
It’s on the right side of a PK join.
It’s on the left side of a PK left join.

An important detail: records in this KTable are expected to be updated very rarely. Most of the time, they are either newly inserted or deleted, not modified.

Additionally, filtering out unnecessary records from the KTable will not affect the number of records in the resulting join tables.

Given that, would pre-filtering this KTable to remove a large number of irrelevant entries (via KTable#mapValues that could return null) yield noticeable performance improvements, or does Kafka Streams handle this efficiently enough on its own during the join process?

Thanks in advance for answers

mjsax · 27 May 2025 15:56

That’s hard to answer. For sure, using smaller stores (if it’s a significant amount of data removed) sounds like a good thing to do. Smaller is always better (could give you higher cache hit rate, and more efficient RocksDB compaction). – You should ensure that the correct KTable is materialized via Topolog#describe(); and if necessary, enforce the correct materialization using Materialized parameter.

In general, PK-joins use direct key-lookups, and if RocksDB data was compacted and bloom filters are used, the perf (ie, throughput) difference might not be too huge. For FK-joins it might have larger gain, as FK joins use (expensive) range scans: so “unnecessary” inserts/updates into the left input table, could avoid unnecessary range scans (which don’t return any data/join results), but are still not “for free”.

I am wondering though, why you would use mapValues() instead of filter()?

In the end: the best thing is always to just try it out in some dev environment and measure it

Dawid · 28 May 2025 09:43

Thank you for the answer

I am wondering though, why you would use mapValues() instead of filter() ?

I use mapValues instead of filter in this case because I don’t need all the fields in the filtered KTable. This allows me to map it to a smaller object or return a tombstone if the condition is not met.

system · 4 June 2025 09:44

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Very slow KTable-KTable foreign key join performance Kafka Streams	4	3015	25 May 2023
Problem with missing null propagation on inner KTable-KTable join Kafka Streams	7	438	11 March 2025
ksqlDB: Stream to join with Table Stream Processing	0	34	15 September 2024
Missing records in STREAM-TABLE join ksqlDB	6	2364	16 September 2023
Problem with left-join Kafka Streams	1	2526	25 April 2023

Performance impact of filtering out unnecessary records from KTable before joining

Related topics