KStreams StatefulSet Unexpected Downtime

HGodal · 19 October 2023 07:29

I have a KStreams application written in Kotlin that’s responsible for calculating an average value for 12 separate entities every 10 seconds. The application receives and stores measurements every second into a state store, and a ContextualProcessor is responsible for calculating the average of these 1s values on a specified interval/schedule for each entity.
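To make the setup concrete, here is a minimal sketch of the averaging step in plain Java, without any Kafka dependency. It is an illustration only, not the poster's code: `EntityAverager`, `record`, and `flush` are hypothetical names, and an in-memory map stands in for the state store. In the real application, `flush` would presumably be called from a punctuator registered through `ProcessorContext.schedule` on the 10-second interval.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the processor's averaging logic.
// An in-memory map replaces the Kafka Streams state store here.
class EntityAverager {
    // Per entity: index 0 holds the running sum, index 1 the sample count.
    private final Map<String, double[]> accumulators = new HashMap<>();

    // Called once per incoming 1s measurement.
    void record(String entityId, double value) {
        double[] acc = accumulators.computeIfAbsent(entityId, k -> new double[2]);
        acc[0] += value;
        acc[1] += 1;
    }

    // Called on the schedule (every 10 seconds in the post):
    // returns the average per entity and resets the window.
    Map<String, Double> flush() {
        Map<String, Double> averages = new HashMap<>();
        for (Map.Entry<String, double[]> e : accumulators.entrySet()) {
            double[] acc = e.getValue();
            averages.put(e.getKey(), acc[0] / acc[1]);
        }
        accumulators.clear();
        return averages;
    }

    public static void main(String[] args) {
        EntityAverager averager = new EntityAverager();
        for (int second = 1; second <= 10; second++) {
            averager.record("entity-1", second); // values 1..10
        }
        System.out.println(averager.flush().get("entity-1")); // prints 5.5
    }
}
```

Keeping only a sum and a count per entity keeps the per-window state small; whether the original application stores the raw 1s values or an aggregate is not stated in the post.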

The application is running as a StatefulSet in OpenShift with 3 pods.

A couple of days ago this application experienced some downtime that I can’t seem to find the cause of. The downtime lasted around 14 minutes and occurred at 1 am, so it was not linked to any code changes or other updates.

The console log, which is sent to Splunk, contained the following errors:

As the log messages imply, one solution might be to increase max.poll.interval.ms.
I have received this type of error in the past. The attempted solution that time was to lower max.poll.interval.ms from the default value (which I believe is 5 minutes) down to 2 minutes. The goal here was to allow the pods to “crash” and restart earlier, which would result in shorter downtime. Apparently, this did not work.

As it is now, with a downtime of around 14 minutes and max.poll.interval.ms set to 2 minutes, I can’t see the correlation between the two, or how increasing max.poll.interval.ms would fix anything.
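For reference, a minimal sketch of how the 2-minute setting described above might be expressed, written in plain Java with string keys so that it runs without the Kafka client on the classpath. The class name `PollIntervalConfigSketch` and the `application.id` and `bootstrap.servers` values are placeholders, not taken from the post; only the `max.poll.interval.ms` key and the 2-minute value come from the description.

```java
import java.time.Duration;
import java.util.Properties;

// Hypothetical helper that builds the configuration described in the post.
class PollIntervalConfigSketch {
    static Properties build(Duration maxPollInterval) {
        Properties props = new Properties();
        props.put("application.id", "averaging-app");      // placeholder value
        props.put("bootstrap.servers", "localhost:9092");  // placeholder value
        // The consumer default is 300000 ms (5 minutes); the post lowers it to 2 minutes.
        props.put("max.poll.interval.ms", String.valueOf(maxPollInterval.toMillis()));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build(Duration.ofMinutes(2));
        System.out.println(props.getProperty("max.poll.interval.ms")); // prints 120000
    }
}
```

In an actual Kafka Streams application these properties would be handed to the `KafkaStreams` constructor together with the topology.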

All pods have approximately the same CPU usage graph in OpenShift: a small spike around the incident, but well under the configured CPU limit. The memory usage graph was relatively flat during the incident.

Has anyone experienced similar issues in the past, and can you suggest a potential fix or improvement?

mjsax · 20 October 2023 03:29

Hard to say. I guess you need to understand the internal architecture of Kafka Streams a little bit more, to be able to reason about it, and to understand what your Topology properties are, and what configs might be related.

There is a recent talk about the internal architecture of Kafka Streams: Agenda | Current 2023

HTH.
