Hey guys! Does anyone have suggestions for performing Kafka Streams disaster recovery in Confluent Cloud? I am using Cluster Linking to mirror my data. I have read that you should fail over your Streams app and re-process the messages from earliest, but my concern is that the retention period on the Disaster Recovery cluster may have expired older input records, so re-processing from earliest could rebuild different state. Meanwhile, running two applications in two different regions can become a bit hectic to manage.
Does anyone have a good way to achieve a low RTO/RPO when managing Kafka Streams?
Depends on your processing logic. If your operations are “windowed”, you might re-create the oldest windows only partially, but newer windows would be complete, and thus state should be fine. In the end, if you re-process from earliest, you produce duplicate output anyway, right?
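To make the windowed case concrete, here's a small Python sketch (just a simulation of the idea, not Kafka code; the window size, timestamps, and retention cutoff are made-up values): after retention expires the oldest input records, replaying what's left rebuilds the oldest window only partially, while windows that start after the cutoff come out identical.

```python
from collections import Counter

WINDOW_MS = 60_000  # 1-minute tumbling windows (illustrative choice)

def windowed_counts(events):
    """Count events per tumbling window, keyed by (key, window start)."""
    counts = Counter()
    for key, ts in events:
        window_start = ts - (ts % WINDOW_MS)
        counts[(key, window_start)] += 1
    return counts

# Full history as seen on the primary cluster: events across three windows.
full_history = [("a", 10_000), ("a", 50_000), ("a", 70_000), ("b", 130_000)]

# DR cluster after retention expired the oldest records (assumed cutoff).
retained = [(k, ts) for k, ts in full_history if ts >= 40_000]

original = windowed_counts(full_history)
rebuilt = windowed_counts(retained)

# The oldest window is rebuilt only partially...
print(original[("a", 0)], rebuilt[("a", 0)])  # 2 1
# ...but windows entirely after the retention cutoff match exactly.
print(original[("a", 60_000)] == rebuilt[("a", 60_000)])  # True
```

Same intuition as above: the further a window lies past the retention cutoff, the more complete its rebuilt state is.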
If you have non-windowed processing, it's difficult overall to achieve failover without potential data loss, because Cluster Linking is asynchronous:
You could use Cluster Linking to mirror the changelog and repartition topics. However, because the data transfer is async, input topic offsets on the source cluster might be committed before the corresponding data is replicated to the target cluster, and thus, if you fail over and resume from the last committed offset, data might be missing. You could try to compensate for this by rewinding and re-processing some more input data, but it would be difficult to know how far back you would need to go.
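Here's a toy Python simulation of that failure mode (the replication lag of 3 and the safety margin of 5 are illustrative assumptions; in practice the lag at failure time is unknown): the mirrored committed offset says all input was processed, but the mirrored changelog lags behind, so resuming at the committed offset leaves state silently stale. Rewinding the input re-applies the missing updates, assuming the updates are idempotent and the guessed margin actually covers the lag.

```python
def apply(state, records):
    """Apply key -> value updates to a state store (an idempotent 'put')."""
    for key, value in records:
        state[key] = value
    return state

# Input topic: 10 records, all processed on the primary before it failed.
input_log = [(f"k{i % 4}", i) for i in range(10)]
committed_offset = 10  # mirrored committed offset claims "all 10 done"

primary_state = apply({}, input_log)

# Async mirroring: the changelog lagged 3 updates behind at failure time
# (a lag of 3 is an illustrative assumption -- in reality it is unknown).
dr_state = apply({}, input_log[: committed_offset - 3])

# Resuming at the committed offset processes nothing new, so the
# restored state is silently missing the last 3 updates:
print(dr_state == primary_state)  # False

# Rewinding the input by a safety margin re-applies updates; with an
# idempotent store this converges -- but only if the guessed margin
# happens to cover the actual replication lag:
safety_margin = 5
recovered = apply(dict(dr_state), input_log[committed_offset - safety_margin:])
print(recovered == primary_state)  # True, since margin 5 >= lag 3
```

With non-idempotent updates (counts, sums), the same rewind would double-apply the overlapping records, which is the duplicate-output trade-off mentioned above.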
Hope this helps.