I have put this question in the “Stream Processing” category, although I am not totally convinced this is the right place.
I am trying to understand how to architect and build stateful services in the context of exactly once processing.
So far I understand how to perform exactly once processing using transactions in the context of stateless processors. I don’t understand how Kafka is intended to be used to perform exactly once processing when processes must maintain state.
There doesn’t appear to be much documentation on this online, probably because it is a difficult problem.
I have come up with two potential solutions which use Kafka, and a third solution which comes pretty close to providing exactly once guarantees, though not completely.
Allow me to explain -
Consider an example problem of:
- Reading data from an input topic
- Performing some stateful transform on this data
- Producing data to an output topic
- The simple example of calculating a moving average is good enough. The moving average calculation must maintain some state (for example, the running sum and the total number of records seen) to calculate the moving average
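To make the example concrete, here is a minimal sketch of the state such a processor has to carry (plain Python, nothing Kafka-specific; the class name is my own):

```python
# Minimal sketch of the state a moving-average processor must maintain.
# This is exactly the state that has to survive a crash/restart for
# exactly once processing to hold.

class MovingAverageState:
    def __init__(self, total=0.0, count=0):
        self.total = total   # running sum of all values seen so far
        self.count = count   # number of records processed so far

    def update(self, value):
        """Consume one input record and return the new moving average."""
        self.total += value
        self.count += 1
        return self.total / self.count

state = MovingAverageState()
print(state.update(10.0))  # 10.0
print(state.update(20.0))  # 15.0
```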
Here are 3 possible approaches. I don’t know if any of these are along the lines of how Kafka expects me to build such a system. Keep in mind that transactions will be in operation for all of the following options.
Option 1: Create a state topic and enable log compaction. Store the entire application state in a single message and use the same key for every message, so that compaction retains only the most up to date state. In applications which maintain large quantities of state this may become inefficient due to the large volume of data transferred to the state topic.
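A rough sketch of what snapshot and recovery could look like under this approach, assuming the whole state is serialised as JSON under one fixed key (the key name and helper functions are hypothetical, and the actual Kafka producer/consumer plumbing is omitted):

```python
import json

STATE_KEY = "state"  # the single fixed key every snapshot message uses (hypothetical)

def snapshot(state: dict):
    """Serialise the ENTIRE application state into one (key, value) message."""
    return (STATE_KEY, json.dumps(state).encode("utf-8"))

def restore(messages) -> dict:
    """Rebuild state by reading the state topic from the beginning on startup.
    Compaction keeps only the latest value per key, but we may still read a few
    older snapshots before compaction has run, so we simply keep the last one."""
    state = {}
    for key, value in messages:
        if key == STATE_KEY:
            state = json.loads(value.decode("utf-8"))
    return state

# Two successive snapshots of the moving-average state; each one rewrites
# the whole state, which is the source of the inefficiency for large state.
msgs = [snapshot({"total": 10.0, "count": 1}),
        snapshot({"total": 30.0, "count": 2})]
print(restore(msgs))  # {'total': 30.0, 'count': 2}
```

In a real service the snapshot message would be produced inside the same Kafka transaction as the output records and the consumer offsets, so state, output, and progress commit atomically.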
Option 2: Create a state topic, enable log compaction, and only store the “bits” of the state which change. For example, if the process maintains a hash map: every time we add an entry to the hash map, produce a new message on the state topic with that hash key as the message key; every time we update a value, again produce the value in a message keyed by the hash key. To delete an entry from the hash map we would need a special “delete” message, again keyed by the hash key to be deleted. This is much more efficient in that we no longer send the whole state to the state topic, however I tried to build a service like this and found it extremely complicated to write and implement correctly. I believe this approach in general would be exceedingly hard to maintain without introducing bugs.
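The replay side of this approach can be sketched without any Kafka plumbing. Using `None` as the delete marker mirrors how compacted topics treat null-valued messages (tombstones); the function name is my own:

```python
def apply_changelog(messages):
    """Rebuild a hash map by replaying per-entry changelog messages.
    Each message is keyed by the map key; a value of None is the special
    'delete' message (a tombstone, which log compaction also honours)."""
    table = {}
    for key, value in messages:
        if value is None:
            table.pop(key, None)   # tombstone: remove the entry if present
        else:
            table[key] = value     # insert or update the entry
    return table

changelog = [
    ("alice", 1),    # add entry
    ("bob", 2),      # add entry
    ("alice", 5),    # update entry
    ("bob", None),   # delete entry (tombstone)
]
print(apply_changelog(changelog))  # {'alice': 5}
```

The replay itself is simple; the hard part the post describes is producing the right changelog message for every mutation, at the right point in the transaction, for every piece of state the service touches.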
Option 3: Store the state somewhere other than Kafka, for example on disk or in MongoDB. In this case, because Kafka no longer covers every path the data takes, it is not possible to implement exactly once semantics. However we can get pretty close by updating the state in Mongo/on disk just before committing the Kafka transaction. This is expected to fail only in rare circumstances, such as the power going off in the critical section between writing the state to Mongo (etc.) and committing the Kafka transaction. This design will probably be “good enough”.
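A sketch of the commit ordering this option relies on, with the external store and the transactional producer replaced by in-memory stubs so the critical section is visible (all class and method names here are hypothetical; a real implementation would use a MongoDB client and a transactional Kafka producer):

```python
class InMemoryStore:
    """Stands in for MongoDB or a file on disk."""
    def __init__(self):
        self.state = {}
    def save(self, state):
        self.state = dict(state)

class StubTransactionalProducer:
    """Stands in for a transactional Kafka producer."""
    def __init__(self):
        self.committed = []
        self.pending = []
    def begin_transaction(self):
        self.pending = []
    def send(self, topic, value):
        self.pending.append((topic, value))
    def commit_transaction(self):
        self.committed.extend(self.pending)
        self.pending = []

def process_batch(records, state, store, producer):
    """Process a batch of input values through the moving-average transform."""
    producer.begin_transaction()
    for value in records:
        state["total"] = state.get("total", 0.0) + value
        state["count"] = state.get("count", 0) + 1
        producer.send("averages", state["total"] / state["count"])
    # Critical section: if the process dies after save() but before
    # commit_transaction(), the external state already reflects records
    # that Kafka will redeliver on restart. This gap is exactly what
    # keeps the design short of true exactly once semantics.
    store.save(state)
    producer.commit_transaction()

store, producer = InMemoryStore(), StubTransactionalProducer()
process_batch([10.0, 20.0], {}, store, producer)
print(store.state)          # {'total': 30.0, 'count': 2}
print(producer.committed)   # [('averages', 10.0), ('averages', 15.0)]
```

One mitigation worth noting: if the external writes are made idempotent (e.g. keyed by input offset), redelivery after a crash in the critical section becomes harmless, which narrows the gap further.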
My question is: How should I approach the problem of designing stateful services with Kafka?