How to ensure High Availability for State-heavy Applications running in OpenShift

What is best practice when it comes to configuring your state-heavy Streams application to ensure High Availability (i.e. always process input data and produce output data) when the application is deployed as pods in OpenShift? I have tried numerous configurations, but I am still experiencing severe downtime (i.e. tens of seconds) when I deploy new code to OpenShift. The downtime is a result of the application's need for state restoration by replaying changelog topics. The application is deemed critical and must have a maximum of 10 seconds of downtime on each deployment.

The application is consuming from a topic with a single partition. The application relies heavily on states (i.e. 4 state stores) to compute the output. The application is deployed in OpenShift as 3 pods. Since there is only one partition on the input topic, there is only 1 active pod while the 2 others remain on standby. The Streams config is

Properties().apply {
        put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, appConfig.bootstrapServers)
        put(StreamsConfig.APPLICATION_ID_CONFIG, appConfig.applicationId)
        put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String()::class.java)
        put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2)
        put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 2)
        put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), "latest")
        put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), appConfig.sessionTimeout)
}

where sessionTimeout = 10 seconds. When shutting down a pod, the following shutdown logic is executed

val kafkaCloseOptions = CloseOptions().leaveGroup(true)
stream.close(kafkaCloseOptions)

The OpenShift health and ready-probes are defined as

stream.state() != KafkaStreams.State.ERROR && stream.state() != KafkaStreams.State.NOT_RUNNING

and the OpenShift deployment strategy is RollingUpdate with an initial delay of 180 seconds on the health and readiness probes. This means that a new pod has 180 seconds to start up before the old pod is taken down.
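One thing that may be worth considering (this is a suggestion on my part, not something configured above): the combined check reports "ready" even while an instance is still restoring state in REBALANCING. Splitting the probes could look roughly like this:

```kotlin
import org.apache.kafka.streams.KafkaStreams

// Liveness: only restart the pod on fatal states (same check as the original probe).
fun isLive(stream: KafkaStreams): Boolean {
    val s = stream.state()
    return s != KafkaStreams.State.ERROR && s != KafkaStreams.State.NOT_RUNNING
}

// Readiness: only report ready once the instance has finished
// rebalancing/restoration and is actually able to process records.
fun isReady(stream: KafkaStreams): Boolean =
    stream.state() == KafkaStreams.State.RUNNING
```

With this split, the RollingUpdate would not take an old pod down until the new pod is genuinely processing, rather than merely not-failed.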

As said, the main problem on deployment is that we are experiencing downtime due to state restoration. We had hoped that the standby replicas would minimize the restoration time, but it seems like the replicas have no effect. It seems that the new active pod needs to replay the whole changelog topic to restore the state instead of retrieving it from one of the standby replicas.

Are we configuring something wrong? Are there any best practices for ensuring high availability?

Using CloseOptions to send a leave-group request only works for static group members, but it seems you did not configure a group.instance.id, so no leave-group request would be sent.
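For reference, static membership is enabled by giving each instance a group.instance.id that is unique within the group but stable across restarts. A sketch of how this could be added to the Properties block above (POD_NAME is an assumed environment variable, e.g. injected by Kubernetes; props stands in for the Properties instance):

```kotlin
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.streams.StreamsConfig

// Sketch: a stable, per-pod group.instance.id enables static membership,
// so CloseOptions().leaveGroup(true) can actually send a leave-group request.
props.put(
    StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG),
    System.getenv("POD_NAME") // assumed stable pod identity, e.g. "my-app-0"
)
```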

experiencing downtime due to state restoration

For a clean shutdown, no state restoration should be necessary. Can you confirm (from the logs) that the pod/app is shutting down cleanly? When a shutdown happens, a standby should actually take over. Are you observing a corresponding rebalance? Are standbys caught up enough?
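If it helps, the "caught up enough" question can also be checked programmatically: since Kafka Streams 2.5, KafkaStreams#allLocalStorePartitionLags() reports how far the local stores (including standby stores) lag behind their changelogs. A rough sketch:

```kotlin
import org.apache.kafka.streams.KafkaStreams

// Returns the worst-case changelog lag across all local state stores on this
// instance; a value near zero means its standby tasks are "hot".
fun maxLocalStoreLag(stream: KafkaStreams): Long =
    stream.allLocalStorePartitionLags()   // Map<storeName, Map<partition, LagInfo>>
        .values
        .flatMap { it.values }
        .maxOfOrNull { it.offsetLag() } ?: 0L
```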

Manually killing the active pod in OpenShift works as expected: we almost immediately get a rebalance, and one of the standby pods takes over the load. The standby is also caught up and can start to process the data immediately.

The problem of downtime due to state restoration occurs when we do a new deployment of the application to OpenShift. It seems like the new pods in the deployment need to replay the changelog topics instead of fetching the state from the pods in the old deployment. Then, when the last old pod is taken down (after 3 * 180 seconds), I am left with pods still restoring state.

I am struggling to understand why the new pods can’t fetch the state from the old pods? Or is this not how standby replicas work?

I am not a k8s expert so I am not 100% sure what a deployment is, but I would assume that a new deployment means you spin up new pods (with empty disks attached)? – Is the old deployment still running, or do you first stop the old deployment and later start the new one? I assume it's the second case, because the application.id should stay the same (otherwise, the new and old deployment would basically be a scale-out scenario from a KS POV, and the new pods won't get any active nor any standby tasks assigned). For the second case, you would need to ensure (I don't know if k8s supports this) that the old disks are re-attached to the new deployment instead of getting new/empty disks.

In general, a task always looks at its local disk, and if state is not found, it will restore it from the changelog. Standby tasks also maintain their stores by tailing the changelog.
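Concretely, this means that where state.dir points matters: if it sits on a persistent volume that survives pod restarts, a restarted instance finds its stores on disk and skips the changelog replay. A sketch (the mount path is an assumption, not something from this thread):

```kotlin
import org.apache.kafka.streams.StreamsConfig

// Sketch: point the state directory at a persistent volume mount so that a
// restarted pod can reuse its local store state instead of replaying the
// changelog. "/var/lib/kafka-streams" is an assumed PV mount path.
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams")
```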

What I mean by deployment is deploying pods with new/modified code on OpenShift. The application.id is not changed between deployments. We are using a RollingUpdate deployment strategy, meaning that one new pod gets spun up, and when it is marked as "ready", an old pod is taken down. This continues until all pods have been replaced.

So, if we assume that I normally run three instances/pods of the application, during the deployment phase I will have four pods running:

3 old → 1 new + 3 old → 2 new + 2 old → 3 new + 1 old → 3 new

Based on my deployment strategy, it looks like the first case you presented best describes my situation. I agree that when we have 1 new pod + 3 old pods the new pod is just a scale out, but shouldn’t one of the new pods be assigned standby tasks when we have 2 new + 2 old?

Furthermore, are you suggesting that the only way to ensure high availability is to persist the local disk (i.e. state.dir) between deployments?

Thanks for the details. For your case, the new pods won't get assigned anything in the beginning, because you can only give work to 3 pods (1 active task for the single input partition, plus 2 standby tasks). Also, because we try to avoid state movement, the existing active and standby tasks will stay on the old pods and won't be moved to the new pods.

Only in step 3, when you stop an old pod for the first time, would the corresponding active or standby task be moved. If it's the active task, an existing standby would take over as active. In any case, only now would one more standby be assigned to one of the 3 new pods (and it will need time to rebuild state from the changelog if its local disk is empty).

If you give enough time, it will eventually get “hot” and later be able to also be promoted to an active.

The same happens when you stop the second old pod. The critical step is when you stop the first old pod (which will now be the active most likely). If you stop it too early, and the two standbys are not hot yet, they can of course not take over right away.

However, I am not sure why you are following such a complex upgrade strategy to begin with. It would be much more efficient to deploy a StatefulSet and just bounce instances, re-attaching the local disk to avoid reading from the changelog. For this case, you would not add a new pod, but just stop one, upgrade it, restart it, and repeat for the others. Your standbys would temporarily go down from 2 to 1, but this also kind of happens in your current strategy. On restart, however, the standby won't need to start with an empty disk but can resume from where it left off when it was stopped, staying hot.

Furthermore, are you suggesting that the only way to ensure high availability is to persist the local disk (i.e. state.dir) between deployments?

Not necessarily. As I said, in your case, when you stop an old pod, a standby will be migrated to a new pod. It just takes longer until that standby is hot (compared to re-attaching the local disk). During this phase, you still have the active and one (old) standby, and thus you still have HA. – The issue you see is really only because you move too fast, not giving the standbys enough time to get hot, and thus "losing" HA temporarily.


Thank you for a thorough description and answer :slight_smile:

We have considered using a StatefulSet, and after hearing your opinion it seems to be the way to go!