We have a 5-node ksqlDB cluster and a stream that reads from a topic:
CREATE OR REPLACE STREAM topic_stream
WITH (
KAFKA_TOPIC='kafka_topic',
VALUE_FORMAT='AVRO'
);
We also have a push query that reads from this ksqlDB stream:
SELECT * FROM topic_stream WHERE session_id = '${sessionId}' EMIT CHANGES;
When the push query is started, does the work get distributed across all 5 servers?
When we run this query during high traffic, we notice that only 1 server maxes out its CPU and the query starts lagging.
How do we parallelize push queries across our cluster? I couldn’t find any documentation on this.
There’s some relevant documentation about this here.
How many partitions does kafka_topic have? This dictates the amount of parallelism you can achieve.
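If you're not sure of the partition count, a quick way to check from the ksqlDB CLI is sketched below. The stream name is the one from your post; I'm assuming you can run these against any node in the cluster.

```sql
-- List the Kafka topics visible to ksqlDB, including their partition counts.
SHOW TOPICS;

-- Show details for the stream, including the backing topic and its partitions.
DESCRIBE topic_stream EXTENDED;
```

The partition count of kafka_topic is the number to look for in either output.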
Also, what is the key? I’m not positive about this, but if session_id is the key and you filter on it like this, ksqlDB / Kafka Streams may be smart about it and not try to distribute execution.
You might also get a sense of how the query will execute by running EXPLAIN against the query.
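For example, something along these lines should print the execution plan and topology for the push query. The session id literal is just a hypothetical placeholder standing in for your ${sessionId} value.

```sql
-- Show how ksqlDB plans to execute the push query.
-- 'some-session-id' is a made-up value used only for illustration.
EXPLAIN SELECT * FROM topic_stream WHERE session_id = 'some-session-id' EMIT CHANGES;
```

I haven't run this against your setup, so treat the output as a starting point for seeing where the query runs rather than a definitive answer.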
Our topic has 99 partitions spread across 9 brokers in our Kafka cluster. session_id isn’t the key for the topic (I don’t believe we’re using any keys when producing records to the topic).