I’ve got 3 topics:
- topicA: 50k records
- topicB: 500k records
- topicC: 20k records
They’re consumed as KTables and joined via foreign-key joins, where the foreign key is a regular field in topicA’s value and the record key in topicB and topicC:
topicA join topicB leftJoin topicC
These joins result in:
- 50k records in a KTABLE-FK-JOIN-SUBSCRIPTION-REGISTRATION-0000000009-topic
- 2 million records in a KTABLE-FK-JOIN-SUBSCRIPTION-REGISTRATION-0000000009-topic
- 2 million records in a KTABLE-FK-JOIN-SUBSCRIPTION-RESPONSE-0000000017-topic and
- 2 million records in a KTABLE-FK-JOIN-SUBSCRIPTION-RESPONSE-0000000032-topic
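To make the setup concrete, the topology might be sketched roughly like this (class name, serdes, value formats, and the FK extractors are placeholder assumptions, not the real application code):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;

public class FkJoinTopology {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // topicA: keyed by its own id; the value carries the foreign key
        KTable<String, String> tableA =
                builder.table("topicA", Consumed.with(Serdes.String(), Serdes.String()));
        // topicB and topicC: keyed by that foreign key
        KTable<String, String> tableB =
                builder.table("topicB", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> tableC =
                builder.table("topicC", Consumed.with(Serdes.String(), Serdes.String()));

        // Foreign-key inner join A -> B: the extractor pulls the FK out of A's value
        // (here the whole value stands in for the FK, purely as a placeholder)
        KTable<String, String> aJoinB = tableA.join(
                tableB,
                valueA -> valueA,
                (valueA, valueB) -> valueA + "|" + valueB);

        // Foreign-key left join (A join B) -> C, reusing the same FK
        KTable<String, String> joined = aJoinB.leftJoin(
                tableC,
                value -> value.split("\\|")[0],
                (value, valueC) -> value + "|" + valueC);

        joined.toStream().to("output");
        return builder.build();
    }
}
```

Each FK join in this shape materializes the subscription-registration and subscription-response topics listed above.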
Now these aren’t serious volumes for Kafka, yet the streams application still takes 4 hours to complete. Neither CPU nor memory is the bottleneck; the processing rate is a mere few dozen to a few hundred records per second.
This is reproducible and has also been reported by others: Kafka Streams KTable-KTable non key join performance on skewed tables - Stack Overflow.
Other kinds of joins, such as KStream-GlobalKTable or KStream-KTable, don’t exhibit this problem.
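For comparison, the KStream-GlobalKTable variant mentioned above might look like the following sketch (again with placeholder names and serdes). A GlobalKTable is replicated in full to every instance, so the non-key join needs no subscription/repartition topics:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class GlobalTableJoin {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> streamA =
                builder.stream("topicA", Consumed.with(Serdes.String(), Serdes.String()));
        // topicB is materialized locally on every instance
        GlobalKTable<String, String> globalB =
                builder.globalTable("topicB", Consumed.with(Serdes.String(), Serdes.String()));

        streamA
                // key mapper extracts B's key (the FK) from A's record;
                // placeholder: A's value stands in for the FK
                .join(globalB,
                        (keyA, valueA) -> valueA,
                        (valueA, valueB) -> valueA + "|" + valueB)
                .to("output", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```

The trade-off is that the KStream side loses table semantics (no update on the B side triggers a re-join), so it is not a drop-in replacement for the KTable-KTable FK join.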
What can be done to improve the throughput?