Are ksqlDB push queries distributed across cluster?

DustinKLo · 2 August 2024 22:15

We have a ksqlDB cluster with 5 nodes and a stream created that reads from a topic:

CREATE OR REPLACE STREAM topic_stream 
WITH (
    KAFKA_TOPIC='kafka_topic',
    VALUE_FORMAT='AVRO'
);

We also have a push query that reads from this ksqlDB stream

SELECT * FROM topic_stream WHERE session_id = '${sessionId}' EMIT CHANGES;

When the push query is started does the work get distributed across all 5 servers?

When we run this query during high traffic we noticed only 1 server has max CPU and the query starts lagging.
How do we parallelize push queries across our cluster? I couldn’t find any documentation on this.

Thank you.

dtroiano · 6 August 2024 20:32

There’s some relevant documentation about this here.

How many partitions does kafka_topic have? This dictates the amount of parallelism you can achieve.

Also, what is the key? I’m not positive about this, but it may be the case that if session_id is the key and you filter like this that ksqlDB / Kafka Streams will be smart about it and won’t try to distribute execution

You might also get a sense of how the query will execute by running EXPLAIN against the query.

DustinKLo · 7 August 2024 18:37

Hi thanks for the response.

Our topic has 99 partitions spread across 9 brokers in our kafka cluster.
session_id isn’t the key for the topic (i dont believe we’re using any keys when producing records to the topic)

we set ksql.query.pull.table.scan.enabled to true so we are able to filter on non-key’d fields

However we were able to work around this by creating a persistent query which reads from our kafka topic

CREATE OR REPLACE STREAM persistent_query_stream AS SELECT * FROM topic_stream;

then changed our push query to read from this new persistent query

SELECT * FROM persistent_query_stream WHERE session_id = '${sessionId}' EMIT CHANGES;

according to this confluent article:

To utilize this new scalable codepath, push queries v2 require that a push query select from an existing persistent query

this seemed to solve our lag issues and the CPU usage is more distributed across our ksqlDB cluster, instead of 1 server working on the push query

system · 1 September 2024 22:15

This topic was automatically closed after 30 days. New replies are no longer allowed.

Topic		Replies	Views
Using Modular Topologies in Kafka Streams to Scale ksql’s Persistent Queries Summit	0	3021	22 April 2022
Cache/Query latest value of each key in a topic? ksqlDB	2	3697	2 July 2021
Streaming ksqlDB push queries via HTTP SSE in a Quarkus application ksqlDB	1	21	13 August 2024
Connect single ksqlDB to multiple Kafka clusters ksqlDB	2	734	28 April 2024
Select events from a topic between two dates without emitting changes ksqlDB	2	3450	9 March 2021

Are ksqlDB push queries distributed across cluster?

Related topics