Hi,
We experienced a multi-broker outage on July 27th. Our application topology recovered, or so it seemed…
After it recovered, the internal state store of one of our application replicas started exposing data that had been removed a few weeks earlier, on July 6th. The RocksDB content was completely off.
Either recovery for this instance did only a partial restore from the changelog and still managed to reach the RUNNING state,
or
the internal RocksDB database ended up corrupted.
Respawning the application brought the store back to its intended state.
But for a while our app was applying live updates to a store that was a few weeks out of date, so we had to replay input data from before the outage.
We are trying to identify that issue and prevent it from recurring.
I have included the log of the faulty instance. If you see anything unusual, I would be happy to submit a ticket or dig up more info.
Do we have to perform cleanup of some sort within the internals of the local RocksDB database? (We are in the process of exposing RocksDB metrics to our Grafana dashboard, hoping to better detect the issue next time.)
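For context on the metrics part: as far as I understand the Kafka Streams documentation, the statistics-based RocksDB metrics are only recorded when `metrics.recording.level` is set to `DEBUG`. The sketch below shows just that one setting, built with plain `java.util.Properties`. The class and method names are hypothetical, and a real configuration would also need `application.id`, `bootstrap.servers`, and so on.

```java
import java.util.Properties;

public class RocksDbMetricsConfig {

    /**
     * Returns only the metrics-related part of a Streams configuration.
     * Assumption: the remaining settings are added by the caller before
     * the Properties object is handed to the KafkaStreams constructor.
     */
    public static Properties metricsProps() {
        Properties props = new Properties();
        // Default level is INFO; DEBUG also records the statistics-based
        // RocksDB metrics, at the price of some overhead.
        props.put("metrics.recording.level", "DEBUG");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(metricsProps());
    }
}
```

The recorded metrics are then exposed through the usual Kafka metrics reporters (JMX by default), from where they can be scraped into a dashboard.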
That application runs a separate scheduled data store dump outside of the topology/punctuator. It just loops over the data with store.all(), using a reference to the stream and the store name. Is it safe to do that? (I am trying to find potential bad practices that may have led to that issue.)
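To make the question concrete, here is a minimal sketch of that kind of dump loop. It is an illustration under assumptions, not the actual application code: `StoreDump`, `dumpAll` and `FakeIterator` are hypothetical names, and the in-memory iterator only stands in for the `KeyValueIterator` that `store.all()` returns. The one point it is meant to show is that the iterator is both an `Iterator` and a `Closeable`, and that the Kafka Streams Javadoc asks callers to close it, which try-with-resources does even if the dump fails midway.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class StoreDump {

    /**
     * Drains the iterator returned by 'open' into 'sink' and always closes it.
     * With Kafka Streams, 'open' could be store::all on a read-only store
     * (assumption: the store handle was obtained via interactive queries).
     */
    public static <T, I extends Iterator<T> & AutoCloseable> long dumpAll(
            Supplier<I> open, Consumer<T> sink) throws Exception {
        long count = 0;
        // try-with-resources closes the iterator on success and on failure
        try (I it = open.get()) {
            while (it.hasNext()) {
                sink.accept(it.next());
                count++;
            }
        }
        return count;
    }

    /** In-memory stand-in for a store iterator, used only for this demo. */
    static final class FakeIterator implements Iterator<String>, AutoCloseable {
        private final Iterator<String> delegate;
        boolean closed = false;

        FakeIterator(List<String> data) { this.delegate = data.iterator(); }

        @Override public boolean hasNext() { return delegate.hasNext(); }
        @Override public String next() { return delegate.next(); }
        @Override public void close() { closed = true; }
    }

    public static void main(String[] args) throws Exception {
        FakeIterator it = new FakeIterator(List.of("k1=v1", "k2=v2"));
        List<String> dumped = new ArrayList<>();
        long n = dumpAll(() -> it, dumped::add);
        System.out.println(n + " entries dumped, iterator closed: " + it.closed);
    }
}
```

Whether a loop like this is related to the stale data is exactly what is being asked; the sketch only rules out one possible bad practice, leaving iterators open.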
Thanks
Francois
Kafka 3.1
Kafka Streams 3.1
I put the log here:
2022-07-27 14:28:27,033 INFO stream-thread [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] Processed 1444 total records, ran 0 punctuators, and committed 4 total tasks since the last update [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] (org.apache.kafka.streams.processor.internals.StreamThread)
2022-07-27 14:30:27,049 INFO stream-thread [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] Processed 1318 total records, ran 0 punctuators, and committed 4 total tasks since the last update [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] (org.apache.kafka.streams.processor.internals.StreamThread)
2022-07-27 14:32:35,757 INFO [Consumer clientId=snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1-consumer, groupId=snapshot-processor] Disconnecting from node 2147482641 due to request timeout. [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] (org.apache.kafka.clients.NetworkClient)
2022-07-27 14:32:35,757 INFO [Consumer clientId=snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1-consumer, groupId=snapshot-processor] Cancelled in-flight OFFSET_COMMIT request with correlation id 15357196 due to node 2147482641 being disconnected (elapsed time since creation: 30638ms, elapsed time since send: 30638ms, request timeout: 30000ms) [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] (org.apache.kafka.clients.NetworkClient)
2022-07-27 14:32:35,757 INFO [Consumer clientId=snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1-consumer, groupId=snapshot-processor] Cancelled in-flight HEARTBEAT request with correlation id 15357197 due to node 2147482641 being disconnected (elapsed time since creation: 15018ms, elapsed time since send: 15018ms, request timeout: 30000ms) [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] (org.apache.kafka.clients.NetworkClient)
2022-07-27 14:32:35,757 INFO [Consumer clientId=snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1-consumer, groupId=snapshot-processor] Group coordinator kafka-6.kafka-headless.kafka.svc.cluster.local:9092 (id: 2147482641 rack: null) is unavailable or invalid due to cause: coordinator unavailable.isDisconnected: true. Rediscovery will be attempted. [kafka-coordinator-heartbeat-thread | snapshot-processor] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
2022-07-27 14:32:42,929 INFO [AdminClient clientId=snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-admin] Node 1005 disconnected. [kafka-admin-client-thread | snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-admin] (org.apache.kafka.clients.NetworkClient)
2022-07-27 14:33:05,129 ERROR stream-thread [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] Committing task(s) 0_5 failed. [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] (org.apache.kafka.streams.processor.internals.TaskManager)
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {instruction-events-5=OffsetAndMetadata{offset=177512535, leaderEpoch=null, metadata='AQAAAYJAEmfQ'}}
2022-07-27 14:33:05,279 INFO stream-thread [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] Processed 1079 total records, ran 0 punctuators, and committed 3 total tasks since the last update [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] (org.apache.kafka.streams.processor.internals.StreamThread)
2022-07-27 14:33:05,782 INFO [Consumer clientId=snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1-consumer, groupId=snapshot-processor] Disconnecting from node 1002 due to request timeout. [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] (org.apache.kafka.clients.NetworkClient)
2022-07-27 14:33:05,782 INFO [Consumer clientId=snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1-consumer, groupId=snapshot-processor] Cancelled in-flight METADATA request with correlation id 15357199 due to node 1002 being disconnected (elapsed time since creation: 30023ms, elapsed time since send: 30023ms, request timeout: 30000ms) [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] (org.apache.kafka.clients.NetworkClient)
2022-07-27 14:33:05,782 INFO [Consumer clientId=snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1-consumer, groupId=snapshot-processor] Cancelled in-flight FIND_COORDINATOR request with correlation id 15357198 due to node 1002 being disconnected (elapsed time since creation: 30023ms, elapsed time since send: 30023ms, request timeout: 30000ms) [snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThread-1] (org.apache.kafka.clients.NetworkClient)
2022-07-27 14:33:05,782 INFO [Consumer clientId=snapshot-processor-365e45d3-7ca3-4abb-9459-bb771df80f97-StreamThr