ksqlDB: how to properly chunk data?

I am trying to use ksqlDB via its REST API together with a WSS (a WebSocket server, used to control authorization) that talks to a React web app, but I am not sure I am going about it the right way:

When I query a stream of dogs people brought to the dog park based on the ‘doggos’ topic, I receive a never-ending stream of JSON. This makes sense.

However, say I query a precomputed table of dogs with black fur for all records:

  • I notice that the rows still stream in individually until the end. This is not good because I want the entire table as one chunk first, with individual records coming in afterwards.
    ** Is there a way to get ksqlDB to chunk this data, or do I have to do this myself? I am confused as to why it still chunks the data even though it already knows ahead of time how many rows it has at query time.
    ** I have a hunch that protobuf would help here, but I am not sure whether the data would still arrive chunked, just in protobuf instead.

I would prefer to keep using ksqlDB with the REST API, but I am open to considering other options.

…I notice that the rows still stream in individually until the end.

In ksqlDB, the standard mode of operation for real-time processing is continuous streaming.

When you run a push query on a table, the initial rows you receive are essentially a snapshot of the table’s current state.

After that, the server keeps streaming updates in real time, reflecting changes as they occur.
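
For example, a push query on a table over the REST API behaves exactly this way. Below is a minimal sketch in TypeScript (Node 18+), assuming a local ksqlDB server on port 8088 and a hypothetical table named BLACK_FUR_DOGS; setting ksql.streams.auto.offset.reset to earliest asks the server to replay existing rows before the live updates:

```typescript
// Push query on a table: existing state first, then live changes.
// Server URL and table name are assumptions for this sketch.
async function streamTable(): Promise<void> {
  const res = await fetch("http://localhost:8088/query", {
    method: "POST",
    headers: { "Content-Type": "application/vnd.ksql.v1+json; charset=utf-8" },
    body: JSON.stringify({
      ksql: "SELECT * FROM BLACK_FUR_DOGS EMIT CHANGES;",
      // Start from the beginning of the changelog so current rows
      // are delivered before new updates.
      streamsProperties: { "ksql.streams.auto.offset.reset": "earliest" },
    }),
  });

  // The response is a chunked JSON array: rows arrive incrementally
  // and the connection stays open until the client closes it.
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    console.log(decoder.decode(value, { stream: true }));
  }
}

streamTable().catch(console.error);
```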

…This is not good because I want the entire table as one chunk first, with individual records coming in afterwards.
… Is there a way to get ksqlDB to chunk this data, or do I have to do this myself?

You may have to handle the prefetching process manually.

This implies that before you begin processing the “stream of dogs people brought to the dog park,” you might need to first gather all the records from the “table of dogs with black fur.”

This approach is similar to how Kafka Streams operates its topology graph.

Essentially, Kafka Streams makes sure that its local state stores are filled with the latest state of the tables before it starts processing any stream.
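
A minimal sketch of that ordering against the REST API (same hypothetical server and names as above; note that pull queries that scan a whole table may require ksql.query.pull.table.scan.enabled=true on the server):

```typescript
// Manual prefetch: drain the table with a pull query first, then
// start consuming the stream. All names here are assumptions.
const KSQL_URL = "http://localhost:8088/query";
const HEADERS = { "Content-Type": "application/vnd.ksql.v1+json; charset=utf-8" };

async function main(): Promise<void> {
  // 1. Pull query: finite result; the connection closes once the
  //    table's current state has been returned in full.
  const snapshot = await fetch(KSQL_URL, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ ksql: "SELECT * FROM BLACK_FUR_DOGS;", streamsProperties: {} }),
  });
  console.log("snapshot:", await snapshot.text());

  // 2. Only now open the never-ending push query on the stream.
  const updates = await fetch(KSQL_URL, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ ksql: "SELECT * FROM DOGGOS EMIT CHANGES;", streamsProperties: {} }),
  });
  const reader = updates.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    console.log("update:", decoder.decode(value, { stream: true }));
  }
}

main().catch(console.error);
```

One caveat with this two-step approach: changes that land between the snapshot and the subscription can be missed or seen twice, so the client may need to reconcile the two result sets.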

…I am confused as to why it still chunks the data even though it already knows ahead of time how many rows it has at query time.

You’re asking if the data from the table is delivered in incremental portions, correct?

This happens because the Kafka consumer fetches data in batches, using a technique often referred to as long polling.

The size of each of these fetches is determined by configuration parameters of the Kafka consumer, such as fetch.min.bytes, fetch.max.bytes, and max.poll.records, among others.

These settings essentially control the volume of data fetched by the consumer in each request.
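
To make those knobs concrete, here is a sketch with a plain Kafka consumer using kafkajs (not part of your stack; purely illustrative). Its minBytes, maxBytes, and maxWaitTimeInMs options correspond to fetch.min.bytes, fetch.max.bytes, and fetch.max.wait.ms on the Java client:

```typescript
import { Kafka } from "kafkajs";

// Long polling in action: the broker holds each fetch open until
// minBytes have accumulated or maxWaitTimeInMs elapses, then returns
// one batch. Broker address and group id are assumptions.
async function main(): Promise<void> {
  const kafka = new Kafka({ clientId: "doggos-demo", brokers: ["localhost:9092"] });
  const consumer = kafka.consumer({
    groupId: "doggos-reader",
    minBytes: 1024,        // ~ fetch.min.bytes: wait for at least 1 KiB
    maxBytes: 1048576,     // ~ fetch.max.bytes: cap one fetch at 1 MiB
    maxWaitTimeInMs: 500,  // ~ fetch.max.wait.ms: or stop waiting after 500 ms
  });

  await consumer.connect();
  await consumer.subscribe({ topics: ["doggos"], fromBeginning: true });
  await consumer.run({
    eachBatch: async ({ batch }) => {
      console.log(`one fetch returned ${batch.messages.length} records`);
    },
  });
}

main().catch(console.error);
```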

Could you please tell me the estimated number of clients you expect and which WSS you are using?

Igor! Thank you so much for your thoughtful reply.

That makes a lot of sense. I am glad to hear that I am getting the expected behavior. Right now, I am constrained to use NextJS as my frontend/backend. Thus, I have created an API route using SocketIO (https://socket.io/), which automatically handles reconnecting and can fall back to polling if the WebSocket fails.
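
For anyone reading along, this is the general shape of that route (a simplified sketch using the common res.socket.server pattern; the file name and event name are illustrative, not my exact code):

```typescript
// pages/api/socket.ts — attach one Socket.IO server to the Next.js
// HTTP server and reuse it across requests/hot reloads.
import type { NextApiRequest, NextApiResponse } from "next";
import { Server as SocketIOServer } from "socket.io";

export default function handler(req: NextApiRequest, res: NextApiResponse) {
  const server = (res.socket as any).server;
  if (!server.io) {
    const io = new SocketIOServer(server);
    server.io = io;
    io.on("connection", (socket) => {
      // One place to enforce authorization before relaying
      // ksqlDB query results to the browser.
      socket.emit("ready");
    });
  }
  res.end();
}
```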

I expect a generous maximum of 20 clients at one time. However, the UI I am building would require at least a few ksqlDB REST queries to be running for one dashboard. So the maximum load it would experience is around 80 connections, and that is being generous, assuming all 20 users are using the dashboard at the same time. Is having too many clients a cause for concern?

Seeing as you’re well-versed in the Reactive approach (and I commend you for that), I’m inclined to think you’d opt for push queries rather than pull queries. Let me provide some details on that:

ksqlDB supports a subscription mechanism called push queries that allows matching rows to be queried in exactly this manner. Historically, the number of concurrent queries it could support was limited. Starting with Confluent Cloud and ksqlDB 0.22, it can now support larger-scale use cases with push queries v2. These queries allow the following (a sketch of one such subscription appears after the list):

  • Up to 1,000 concurrent subscriptions per ksqlDB server instance, across numerous clients (this is an estimate and depends on the rate of data production)
  • Lightweight on-the-fly matching using the query WHERE clause
  • Easy subscription lifetime management lasting the life of a query
  • Best effort message delivery
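
A minimal sketch of such a subscription, reusing the earlier setup (hypothetical names throughout; whether the server executes it as a v2 scalable push query depends on your ksqlDB version and server configuration):

```typescript
// One push-query subscription: lightweight WHERE-clause matching on
// the server, living as long as this HTTP connection stays open.
async function subscribeBlackDogs(onChunk: (chunk: string) => void): Promise<void> {
  const res = await fetch("http://localhost:8088/query", {
    method: "POST",
    headers: { "Content-Type": "application/vnd.ksql.v1+json; charset=utf-8" },
    body: JSON.stringify({
      // fur_color is a hypothetical column on the DOGGOS stream.
      ksql: "SELECT * FROM DOGGOS WHERE fur_color = 'black' EMIT CHANGES;",
      streamsProperties: {},
    }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break; // server closed the subscription
    onChunk(decoder.decode(value, { stream: true }));
  }
}

subscribeBlackDogs((chunk) => console.log(chunk)).catch(console.error);
```

At the roughly 80 concurrent connections you estimated, you would be well within that budget.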