Creating a large number of ksqlDB streams on the fly

We have a use case where a large number of events stream into a Kafka topic, almost 100K messages per day. On the consumer side we need to apply a predefined set of rules to the incoming event stream and pass an event downstream only when rule evaluation succeeds. The number of rules could be around 10K in the average case. We want to avoid database lookups, as they can be expensive, so we are exploring ksqlDB, which fits our use case well: each rule maps directly onto the WHERE clause of an ANSI SQL query. Hence we are planning to model it as follows.

  1. First, we will create a standard ksqlDB stream for incoming events, named eventStream.
  2. Second, we will create a stream on the fly for each rule, where the rule's condition is concatenated into the WHERE clause of the SELECT query:
    CREATE STREAM ruleStream_1
        WITH (kafka_topic='xyz',
              value_format='json') AS
        SELECT 1 AS column1, ARRAY[1, 2, 3] AS column2, *
        FROM eventStream
        WHERE (column3=123 AND column4='testValue');

    CREATE STREAM ruleStream_2
        WITH (kafka_topic='xyz',
              value_format='json') AS
        SELECT 1 AS column1, ARRAY[1, 2, 3] AS column2, *
        FROM eventStream
        WHERE (column3=123 AND column4='testValue');
    ....
    ....
    CREATE STREAM ruleStream_n

We are thinking of deploying a ksqlDB server cluster in interactive mode and using the Java client to create and drop streams on the fly whenever the user changes the corresponding rule definitions.
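To make the "create/drop on the fly" part concrete, here is a minimal sketch of generating the per-rule statements that the ksqlDB Java client would then submit with `executeStatement`. The helper names (`buildRuleStream`, `dropRuleStream`) are hypothetical; the stream/topic names follow the question.

```java
// Sketch: building the per-rule DDL strings. In a real application these
// would be passed to io.confluent.ksql.api.client.Client#executeStatement.
public class RuleStreamSql {

    /** Builds the CREATE STREAM ... AS SELECT for one rule; the rule's
     *  condition string becomes the WHERE clause. */
    static String buildRuleStream(int ruleId, String topic, String condition) {
        return "CREATE STREAM ruleStream_" + ruleId + " "
             + "WITH (kafka_topic='" + topic + "', value_format='json') AS "
             + "SELECT * FROM eventStream WHERE (" + condition + ");";
    }

    /** Drops the stream (and its backing topic) when the rule is deleted. */
    static String dropRuleStream(int ruleId) {
        return "DROP STREAM IF EXISTS ruleStream_" + ruleId + " DELETE TOPIC;";
    }

    public static void main(String[] args) {
        // e.g. client.executeStatement(buildRuleStream(1, "xyz", ...)).get();
        System.out.println(buildRuleStream(1, "xyz",
                "column3=123 AND column4='testValue'"));
        System.out.println(dropRuleStream(1));
    }
}
```

Note that each such statement starts a persistent query on the server, which is exactly why the count of rules matters.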

We are planning to do performance benchmarking, but before that we wanted to validate: is it advisable to create 10K streams in ksqlDB, and can it cause performance issues?

Hard to say, but it does not sound efficient. Your load is actually not very high: a single stateless query (depending on your hardware) should be able to process on the order of 100K messages per second, and you only have 100K per day, i.e., roughly one per second.

However, the issue is that each query reads the input topic independently, i.e., if you deploy 10K queries, you read the input data 10,000 times.

It might be more efficient to fall back to Kafka Streams (if writing a Java application is an option for you), as it would allow you to read the data only once and evaluate all rules against each record in a single pass.
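The core of that alternative is a single consumer evaluating every rule per record, instead of one persistent query per rule. A minimal plain-Java sketch of that evaluation step (rule IDs and field names are hypothetical; in a real Kafka Streams application this loop would sit inside a `flatMapValues` or `process` step of the topology):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class SinglePassRules {

    /** A rule: an id plus a predicate over the event's fields. */
    static final class Rule {
        final String id;
        final Predicate<Map<String, Object>> condition;
        Rule(String id, Predicate<Map<String, Object>> condition) {
            this.id = id;
            this.condition = condition;
        }
    }

    /** Evaluates every rule against one event; with 10K rules in memory,
     *  the input topic is still consumed only once. */
    static List<String> matchingRules(List<Rule> rules, Map<String, Object> event) {
        List<String> hits = new ArrayList<>();
        for (Rule r : rules) {
            if (r.condition.test(event)) {
                hits.add(r.id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule("rule_1", e -> Integer.valueOf(123).equals(e.get("column3"))),
            new Rule("rule_2", e -> "testValue".equals(e.get("column4"))));
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("column3", 123);
        event.put("column4", "other");
        System.out.println(matchingRules(rules, event)); // prints [rule_1]
    }
}
```

Matching events (or rule hits) can then be forwarded to the downstream topic from the same topology, so the per-record cost grows with the number of rules while the topic-read cost stays constant.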