I have a KStreams application written in Kotlin that’s responsible for calculating an average value for 12 separate entities every 10 seconds. The application receives a measurement every second and stores it in a state store, and a ContextualProcessor calculates the average of these 1-second values for each entity on a fixed schedule.
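To make the setup concrete, here is a minimal sketch of the averaging step in plain Kotlin. It is not the real processor: `MeasurementBuffer`, `add`, and `drainAverages` are hypothetical names, and an in-memory map stands in for the state store, so no Kafka Streams API is involved.

```kotlin
// Hypothetical stand-in for the state store plus the scheduled averaging step.
// In the real application the samples live in a Kafka Streams state store and
// drainAverages() corresponds to what the scheduled callback does.
class MeasurementBuffer {
    // entity id -> 1-second measurements received since the last flush
    private val samples = mutableMapOf<String, MutableList<Double>>()

    // Called once per incoming measurement.
    fun add(entityId: String, value: Double) {
        samples.getOrPut(entityId) { mutableListOf() }.add(value)
    }

    // Called on the schedule (every 10 seconds in my case): returns one
    // average per entity and clears the buffered samples.
    fun drainAverages(): Map<String, Double> {
        val averages = samples
            .filterValues { it.isNotEmpty() }
            .mapValues { (_, values) -> values.average() }
        samples.clear()
        return averages
    }
}
```

Feeding ten 1-second values for an entity and then calling `drainAverages()` yields a single average for that entity and leaves the buffer empty for the next interval.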
The application is running as a StatefulSet in OpenShift with 3 pods.
A couple of days ago this application experienced some downtime that I can’t seem to find the cause of. The downtime lasted around 14 minutes and started at 1 am, so it was not linked to any code changes or other updates.
The console log, being sent to Splunk, contained the following errors:
As the log-messages imply, one solution might be to increase the
I have received this type of error in the past. The attempted solution that time was to lower max.poll.interval.ms from the default value (which I believe is 5 minutes) down to 2 minutes. The goal here was to allow the pods to “crash” and restart earlier, which would result in shorter downtime. Apparently, this did not work.
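For reference, the earlier change amounts to something like the sketch below. The function name `pollIntervalOverrides` is hypothetical, and how these properties get merged into the rest of the Streams configuration is an assumption; only the config key and the 2-minute value come from my setup.

```kotlin
import java.util.Properties

// Hypothetical helper: builds the consumer override described above.
// "max.poll.interval.ms" is the standard Kafka consumer config key.
fun pollIntervalOverrides(maxPollIntervalMs: Long): Properties {
    require(maxPollIntervalMs > 0) { "max.poll.interval.ms must be positive" }
    val props = Properties()
    props.setProperty("max.poll.interval.ms", maxPollIntervalMs.toString())
    return props
}

// 2 minutes, lowered from what I believe is the 5-minute (300000 ms) default.
val currentOverrides: Properties = pollIntervalOverrides(120_000L)
```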
As it is now, with a downtime of around 14 minutes and max.poll.interval.ms set to 2 minutes, I can’t see the correlation between the two, or how increasing max.poll.interval.ms would fix anything.
All pods have approximately the same CPU usage graph in OpenShift: a small spike around the incident, but well under the configured CPU limit. The memory usage graph was relatively flat during the issue.
Has anyone experienced similar issues in the past, and can you suggest a potential fix or improvement?