Hi,
We are having a problem with Kafka 2.8.2. Due to some unknown error, producer offsets jump to something like 4 billion in a new cluster with newly created topics.
We had been running a cluster of 3 v0.9 brokers for years without any problem or bug that couldn't be resolved by simply restarting the brokers.
We created a new cluster with version 2.8 (we also updated our application libraries and pointed them to the new cluster) with an almost default config; the topics were newly created, and everything went well for a few minutes. The setup is as follows:
Brokers:
perm01 - id 10
perm02 - id 11
docu01 - id 12
Config: the defaults, changing only the broker id, the listeners, and these values:
default.replication.factor=2
min.insync.replicas=1
replica.fetch.wait.max.ms=5000
replica.lag.time.max.ms=30000
replica.socket.timeout.ms=60000
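In case it helps, this is roughly how we double-check the topic assignments and config from one of the brokers (the port and the exact flags are from memory, so adjust as needed; docu01 is broker id 12):
$ bin/kafka-topics.sh --bootstrap-server perm01:9092 --describe --topic audit
$ bin/kafka-configs.sh --bootstrap-server perm01:9092 --describe --entity-type topics --entity-name audit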
Then suddenly, offsets in some partitions jumped from around 1000 to more than 4000000000. All of these topics had docu01 as leader. The logs below are from a broker that was a replica of partition audit-0 at the time.
Audits were being consumed without problems until 01:28; then at 01:35 the following happened. As you can see, there is an offset jump from 1204 to 4261767024:
[2023-03-02 01:35:44,319] INFO [ProducerStateManager partition=audit-0] Writing producer snapshot at offset 4261767024 (kafka.log.ProducerStateManager)
[2023-03-02 01:35:44,327] INFO [Log partition=audit-0, dir=/var/log/kafka2/topics] Rolled new log segment at offset 4261767024 in 9 ms. (kafka.log.Log)
[2023-03-02 01:35:44,329] INFO The cleaning for partition audit-0 is aborted and paused (kafka.log.LogManager)
[2023-03-02 01:35:44,329] INFO [Log partition=audit-0, dir=/var/log/kafka2/topics] Truncating to offset 1204 (kafka.log.Log)
[2023-03-02 01:35:44,330] INFO [Log partition=audit-0, dir=/var/log/kafka2/topics] Deleting segments as part of log truncation: LogSegment(baseOffset=4261767024, size=199, lastModifiedTime=1677717344000, largestRecordTimestamp=Some(1677717344276)) (kafka.log.Log)
[2023-03-02 01:35:44,331] INFO [Log partition=audit-0, dir=/var/log/kafka2/topics] Loading producer state till offset 1204 with message format version 2 (kafka.log.Log)
[2023-03-02 01:35:44,332] INFO [ProducerStateManager partition=audit-0] Writing producer snapshot at offset 1204 (kafka.log.ProducerStateManager)
[2023-03-02 01:35:44,336] INFO Cleaning for partition audit-0 is resumed (kafka.log.LogManager)
[2023-03-02 01:35:47,354] ERROR [ReplicaFetcher replicaId=10, leaderId=12, fetcherId=0] Unexpected error occurred while processing data for partition audit-0 at offset 1204 (kafka.server.ReplicaFetcherThread)
kafka.common.OffsetsOutOfOrderException: Out of order offsets found in append to audit-0: ArrayBuffer(4261767024, 4260664912, 4257444208, 4254469712, 4249029224, 4245726928, 4292105504, 4292764624, 4293840640, 4287736024, 4283938960, 4284397184, 4282781528, 4284262896, 4284440184, 4284152856, 4281486352, 4281755032, 4280343536, 4280463696)
at kafka.log.Log.append(Log.scala:1249)
at kafka.log.Log.appendAsFollower(Log.scala:1155)
at kafka.cluster.Partition.doAppendRecordsToFollowerOrFutureReplica(Partition.scala:1023)
at kafka.cluster.Partition.appendRecordsToFollowerOrFutureReplica(Partition.scala:1030)
at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:178)
at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:356)
at scala.Option.foreach(Option.scala:437)
at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6(AbstractFetcherThread.scala:345)
at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6$adapted(AbstractFetcherThread.scala:344)
at kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62)
at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scala:359)
at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:355)
at scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:309)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:344)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:141)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:140)
at scala.Option.foreach(Option.scala:437)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:140)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:123)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2023-03-02 01:35:47,355] WARN [ReplicaFetcher replicaId=10, leaderId=12, fetcherId=0] Partition audit-0 marked as failed (kafka.server.ReplicaFetcherThread)
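For what it's worth, this is how I would dump the remaining segments of audit-0 on the follower to look at the raw batch offsets (the segment file name here is just the usual first segment and may differ on disk):
$ bin/kafka-dump-log.sh --deep-iteration --print-data-log --files /var/log/kafka2/topics/audit-0/00000000000000000000.log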
Sadly I have lost the logs from docu01: after I restarted the broker it started logging the following (there are tens of thousands of lines like this per second, which is how the earlier logs were lost):
[2023-03-02 02:58:48,775] INFO Loaded member MemberMetadata(memberId=consumer-billing-service-operator-consumer-1-e1610a46-e6cc-41eb-8085-fd30bf5c5fa6, groupInstanceId=None, clientId=consumer-billing-service-operator-consumer-1, clientHost=/172.16.1.10, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, supportedProtocols=List(range), ) in group billing-service-operator-consumer with generation 1. (kafka.coordinator.group.GroupMetadata$)
[2023-03-02 02:58:48,775] INFO Loaded member MemberMetadata(memberId=consumer-audit-service-game-consumer-2-df28351e-c0da-4df1-9d39-8e50f39f7923, groupInstanceId=None, clientId=consumer-audit-service-game-consumer-2, clientHost=/172.16.1.10, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, supportedProtocols=List(range), ) in group audit-service-game-consumer with generation 1. (kafka.coordinator.group.GroupMetadata$)
[2023-03-02 02:58:48,775] INFO Loaded member MemberMetadata(memberId=consumer-billing-service-operator-consumer-1-e1610a46-e6cc-41eb-8085-fd30bf5c5fa6, groupInstanceId=None, clientId=consumer-billing-service-operator-consumer-1, clientHost=/172.16.1.10, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, supportedProtocols=List(range), ) in group billing-service-operator-consumer with generation 1. (kafka.coordinator.group.GroupMetadata$)
After restarting the brokers again, some partitions were deleted automatically. Other partitions with problems belong to __consumer_offsets; here is what the folder of one of them looks like:
-rw-r--r-- 1 kafka kafka 0 mar 2 01:42 00000000000000000000.index
-rw-r--r-- 1 kafka kafka 0 mar 2 01:42 00000000000000000000.log
-rw-r--r-- 1 kafka kafka 0 mar 2 01:42 00000000000000000000.timeindex
-rw-r--r-- 1 kafka kafka 0 mar 2 01:42 00000000004254710584.index
-rw-r--r-- 1 kafka kafka 105 mar 2 01:42 00000000004254710584.log
-rw-r--r-- 1 kafka kafka 10 mar 2 02:36 00000000004254710584.snapshot
-rw-r--r-- 1 kafka kafka 12 mar 2 01:42 00000000004254710584.timeindex
-rw-r--r-- 1 kafka kafka 10M mar 2 02:37 00000000004277661384.index
-rw-r--r-- 1 kafka kafka 138 mar 2 01:42 00000000004277661384.log
-rw-r--r-- 1 kafka kafka 10 mar 2 02:36 00000000004277661384.snapshot
-rw-r--r-- 1 kafka kafka 10M mar 2 02:37 00000000004277661384.timeindex
-rw-r--r-- 1 kafka kafka 10 mar 2 02:36 00000000004277661385.snapshot
-rw-r--r-- 1 kafka kafka 21 mar 2 02:37 leader-epoch-checkpoint
$cat leader-epoch-checkpoint
0
2
0 0
2 4277661385
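If I am reading the leader-epoch-checkpoint format correctly (version, number of entries, then one "epoch startOffset" pair per line), this says leader epoch 0 started at offset 0 and leader epoch 2 starts at 4277661385, so the huge offsets made it into the epoch history too. To see what actually landed in those segments I would try the offsets decoder of kafka-dump-log.sh (the partition number XX is just a placeholder):
$ bin/kafka-dump-log.sh --offsets-decoder --files /var/log/kafka2/topics/__consumer_offsets-XX/00000000004254710584.log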
It seems like a problem with docu01, but I have no clue what could have happened. Not only did the offsets jump from around 1000 to over 4 billion, but the offsets in the messages that followed are not monotonically increasing either; sometimes they are not even increasing at all, as you can see in the error message:
ArrayBuffer(4261767024, 4260664912, 4257444208, 4254469712, 4249029224, 4245726928, 4292105504, 4292764624, 4293840640, 4287736024, 4283938960, 4284397184, 4282781528, 4284262896, 4284440184, 4284152856, 4281486352, 4281755032, 4280343536, 4280463696)
Any ideas about what happened here? Any help would be really appreciated.
Thank you