How to ensure High Availability for State-heavy Applications running in OpenShift

What is best practice when it comes to configuring your state-heavy Streams application to ensure High Availability (i.e. always process input data and produce output data) when the application is deployed as pods in OpenShift? I have tried numerous configurations, but I am still experiencing severe downtime (i.e. tens of seconds) when I deploy new code to OpenShift. The downtime is a result of the application's need for state restoration by replaying changelog topics. The application is deemed critical and must have a maximum of 10 seconds of downtime on each deployment.

The application is consuming from a topic with a single partition. The application relies heavily on states (i.e. 4 state stores) to compute the output. The application is deployed in OpenShift as 3 pods. Since there is only one partition on the input topic, there is only 1 active pod while the 2 others remain on standby. The Streams config is

Properties().apply {
        put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, appConfig.bootstrapServers)
        put(StreamsConfig.APPLICATION_ID_CONFIG, appConfig.applicationId)
        put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String()::class.java)
        put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2)
        put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 2)
        put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), "latest")
        put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), appConfig.sessionTimeout)
}

where sessionTimeout = 10 seconds. When shutting down a pod, the following shutdown logic is executed

val kafkaCloseOptions = CloseOptions().leaveGroup(true)
stream.close(kafkaCloseOptions)

The OpenShift health and ready-probes are defined as

stream.state() != KafkaStreams.State.ERROR && stream.state() != KafkaStreams.State.NOT_RUNNING

and the OpenShift deployment strategy is RollingUpdate with an initial delay of 180 seconds on the health and readiness probes. This means that a new pod has 180 seconds to start up before the old pod is taken down.
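One thing that may be worth considering (this is a suggestion on my part, not something configured above): the combined check reports "ready" even while an instance is still restoring state in REBALANCING. Splitting the probes could look roughly like this:

```kotlin
import org.apache.kafka.streams.KafkaStreams

// Liveness: only restart the pod on fatal states (same check as the original probe).
fun isLive(stream: KafkaStreams): Boolean {
    val s = stream.state()
    return s != KafkaStreams.State.ERROR && s != KafkaStreams.State.NOT_RUNNING
}

// Readiness: only report ready once the instance has finished
// rebalancing/restoration and is actually able to process records.
fun isReady(stream: KafkaStreams): Boolean =
    stream.state() == KafkaStreams.State.RUNNING
```

With this split, the RollingUpdate would not take an old pod down until the new pod is genuinely processing, rather than merely not-failed.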

As said, the main problem on deployment is that we are experiencing downtime due to state restoration. We had hoped that the standby replicas would minimize the restoration time, but it seems like the replicas have no effect. It seems that the new active pod needs to replay the whole changelog topic to restore the state instead of retrieving it from one of the standby replicas.

Are we configuring something wrong? Are there any best practices for ensuring high availability?

Using CloseOptions to send a leave-group request only works for static group members, but it seems you did not configure a group.instance.id, so no leave-group request would be sent.
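For reference, static membership is enabled by giving each instance a group.instance.id that is unique within the group but stable across restarts. A sketch of how this could be added to the Properties block above (POD_NAME is an assumed environment variable, e.g. injected by Kubernetes; props stands in for the Properties instance):

```kotlin
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.streams.StreamsConfig

// Sketch: a stable, per-pod group.instance.id enables static membership,
// so CloseOptions().leaveGroup(true) can actually send a leave-group request.
props.put(
    StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG),
    System.getenv("POD_NAME") // assumed stable pod identity, e.g. "my-app-0"
)
```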

experiencing downtime due to state restoration

For a clean shutdown, no state restoration should be necessary. Can you confirm (from the logs) that the pod/app is shutting down cleanly? When a shutdown happens, a standby should actually take over. Are you observing a corresponding rebalance? Are standbys caught up enough?
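If it helps, the "caught up enough" question can also be checked programmatically: since Kafka Streams 2.5, KafkaStreams#allLocalStorePartitionLags() reports how far the local stores (including standby stores) lag behind their changelogs. A rough sketch:

```kotlin
import org.apache.kafka.streams.KafkaStreams

// Returns the worst-case changelog lag across all local state stores on this
// instance; a value near zero means its standby tasks are "hot".
fun maxLocalStoreLag(stream: KafkaStreams): Long =
    stream.allLocalStorePartitionLags()   // Map<storeName, Map<partition, LagInfo>>
        .values
        .flatMap { it.values }
        .maxOfOrNull { it.offsetLag() } ?: 0L
```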

Manually killing the active pod in OpenShift works as expected: we almost immediately get a rebalance, and one of the standby pods takes over the load. The standby is also caught up and can start to process the data immediately.

The problem of downtime due to state restoration occurs when we do a new deployment of the application to OpenShift. It seems like the new pods in the deployment need to replay the changelog topics instead of fetching the state from the pods in the old deployment. Then, when the last old pod is taken down (after 3 * 180 seconds), I am left with pods still restoring state.

I am struggling to understand why the new pods can’t fetch the state from the old pods? Or is this not how standby replicas work?

I am not a k8s expert so I am not 100% sure what a deployment is, but I would assume that a new deployment means you spin up new pods (with empty disks attached)? – Is the old deployment still running, or do you first stop the old deployment and later start the new one? I assume it's the second case, because the application.id should stay the same (otherwise, the new and old deployment would basically be a scale-out scenario from a KS POV, and the new pods won't get any active nor any standby tasks assigned). For the second case, you would need to ensure (I don't know if k8s supports this) that the old disks are re-attached to the new deployment instead of getting new/empty disks.

In general, a task always looks at its local disk, and if state is not found, it will restore it from the changelog. Standby tasks also maintain their stores by tailing the changelog.
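Concretely, this means that where state.dir points matters: if it sits on a persistent volume that survives pod restarts, a restarted instance finds its stores on disk and skips the changelog replay. A sketch (the mount path is an assumption, not something from this thread):

```kotlin
import org.apache.kafka.streams.StreamsConfig

// Sketch: point the state directory at a persistent volume mount so that a
// restarted pod can reuse its local store state instead of replaying the
// changelog. "/var/lib/kafka-streams" is an assumed PV mount path.
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams")
```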

What I mean by deployment is deploying pods with new/modified code on OpenShift. The application.id is not changed between deployments. We are using a RollingUpdate deployment strategy, meaning that one new pod gets spun up, and when it is marked as "ready", an old pod is taken down. This continues until all pods have been replaced.

So, if we assume that I normally run three instances/pods of the application, during the deployment phase I will have four pods running:

3 old → 1 new + 3 old → 2 new + 2 old → 3 new + 1 old → 3 new

Based on my deployment strategy, it looks like the first case you presented best describes my situation. I agree that when we have 1 new pod + 3 old pods the new pod is just a scale out, but shouldn’t one of the new pods be assigned standby tasks when we have 2 new + 2 old?

Furthermore, are you suggesting that the only way to ensure high availability is to persist the local disk (i.e. state.dir) between deployments?

Thanks for the details. For your case, the new pods won't get assigned anything in the beginning, because you can only give work to 3 pods (1 active task for the single input partition, plus 2 standby tasks). Also, because we try to avoid state movement, the existing active and standby tasks will stay on the old pods and won't be moved to the new pods.

Only in step 3, when you stop an old pod for the first time, would the corresponding active or standby task be moved. If it's the active task, an existing standby would take over as active. In any case, only now would one more standby be assigned to one of the 3 new pods (and it will need time to rebuild state from the changelog if its local disk is empty).

If you give enough time, it will eventually get “hot” and later be able to also be promoted to an active.

The same happens when you stop the second old pod. The critical step is when you stop the first old pod (which will now be the active most likely). If you stop it too early, and the two standbys are not hot yet, they can of course not take over right away.

However, I am not sure why you are following such a complex upgrade strategy to begin with. It would be much more efficient to deploy a StatefulSet and just bounce instances, re-attaching the local disk to avoid reading from the changelog. For this case, you would not add a new pod, but just stop one, upgrade it, restart it, and repeat for the others. Your standbys would temporarily go down from 2 to 1, but this also kind of happens in your current strategy. On restart, however, the standby won't need to start with an empty disk but can resume from where it left off when it was stopped, staying hot.

Furthermore, are you suggesting that the only way to ensure high availability is to persist the local disk (i.e. state.dir) between deployments?

Not necessarily. As I said, in your case, when you stop an old pod, a standby will be migrated to a new pod. It just takes longer until that standby is hot (compared to re-attaching the local disk). During this phase, you still have the active and one (old) standby, and thus you still have HA. – The issue you see is really only because you move too fast, not giving the standbys enough time to get hot, and thus "losing" HA temporarily.


Thank you for a thorough description and answer :slight_smile:

We have considered using a StatefulSet, and after hearing your opinion it seems to be the way to go!