Hi everyone,
I have a question. I want to develop an application that reads data records from Kafka topics and checks conditions on them; if a condition matches a data record (or multiple records within a window), I raise an alert for it.
But I have a problem: when a user registers to use the app, they register their own topics so that my application knows about them, and each user has several topics.
How can Kafka Streams handle topics that are added dynamically?
If all users used the same topic, I think it would affect the performance for each user.
1. You could indeed use one topic for all users, partition it by user, and start multiple Kafka Streams clients to scale out and improve performance. Make sure you set the number of partitions of the input topic high enough to be able to scale out sufficiently (see the sketch after this list).
2. You could enforce a naming pattern on the registered input topics and create a stream with a pattern (i.e., StreamsBuilder#stream(final Pattern topicPattern)). This is similar to option 1, but I guess in your setup you would not have control over the number of partitions of the input topics, which might limit scale-out.
3. You could create a topology for each registered user and create and start a KafkaStreams object for each topology. That basically boils down to having a distinct Kafka Streams application for each registered user. You could start each application in its own JVM or use one JVM for a given number of Kafka Streams applications.
4. You could also mix options 1 and 3: you have a Kafka Streams application for each registered user that only writes the records from the user-specified input topics to a common input topic that is managed by you. Then you apply option 1 to this input topic.
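For illustration, here is a minimal sketch of option 1. The topic names (events-by-user, alerts) and the alert condition are placeholders I made up, not part of your setup; the only real assumption is that records are keyed by user ID:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SharedTopicAlertingApp {

    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();

        // One shared input topic for all users; records are keyed by user ID,
        // so each user's records stay in one partition and processing scales
        // out by adding Kafka Streams clients (up to the partition count).
        final KStream<String, String> events = builder.stream("events-by-user");

        events
            .filter((userId, value) -> value.contains("ERROR")) // placeholder condition
            .to("alerts");

        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "alerting-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

For conditions over multiple records within a window, you would replace the stateless filter with a windowed aggregation (e.g., groupByKey().windowedBy(...)).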
The best option depends on your use case and setup.
Hi Mr. Bruno, regarding the 2nd solution:
I will define a pattern for all topics. When a new user registers, I create new topics that follow the pattern, so each topic only needs 1-2 partitions and I can easily scale by adding topics as new users register. Is that OK?
I do not exactly know what you mean by scaling topics.
The issue with option 2 is that the same partitions of each user-defined topic are processed by the same Kafka Streams client. That is, partition 0 of all user-defined topics is processed by one single Kafka Streams client, and the same applies to partitions 1, 2, 3, etc.
In your case, you would continuously add new topics with 1-2 partitions, but the data in those partitions would only be processed by at most two Kafka Streams clients. That might result in a processing bottleneck after a while.
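To make the caveat concrete, here is a minimal sketch of the pattern subscription from option 2, assuming a hypothetical naming convention user-events-<userId>; the topic names and the condition are placeholders:

```java
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PatternSubscriptionAlertingApp {

    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();

        // Subscribes to all existing and future topics matching the pattern.
        // Caveat: partition 0 of every matched topic is assigned to the same
        // task, partition 1 to another, and so on, so parallelism is bounded
        // by the partition count per topic, not by the number of topics.
        final KStream<String, String> events =
            builder.stream(Pattern.compile("user-events-.*"));

        events
            .filter((key, value) -> value.contains("ERROR")) // placeholder condition
            .to("alerts");

        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pattern-alerting-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```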
I think options 1, 3, and 4 are the most promising for avoiding bottlenecks. Regarding option 3, I do not know how many Kafka Streams applications you can run against a single Kafka cluster and how the number of Kafka Streams applications affects performance. Maybe for your use case it does not matter. You need to benchmark it.
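If you do go with option 3, a minimal sketch of managing one KafkaStreams instance per registered user in a single JVM could look like this; the class name, topic names, and condition are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PerUserStreamsManager {

    private final Map<String, KafkaStreams> instancesByUser = new HashMap<>();

    // Called when a new user registers their topics: builds a dedicated
    // topology and starts a KafkaStreams instance with its own
    // application.id (i.e., its own consumer group and state).
    public synchronized void onUserRegistered(final String userId, final List<String> userTopics) {
        final StreamsBuilder builder = new StreamsBuilder();
        final KStream<String, String> events = builder.stream(userTopics);
        events
            .filter((key, value) -> value.contains("ERROR")) // placeholder condition
            .to("alerts-" + userId);

        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "alerting-" + userId);
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        instancesByUser.put(userId, streams);
    }

    // Called when a user deregisters: stop and discard their instance.
    public synchronized void onUserDeregistered(final String userId) {
        final KafkaStreams streams = instancesByUser.remove(userId);
        if (streams != null) {
            streams.close();
        }
    }
}
```

Whether one JVM can host enough such instances is exactly the benchmarking question above: each instance brings its own consumer group, threads, and state.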