Error downgrading to community

Hi,

I’m downgrading from Confluent Platform to the Community version (same version on both, 7.7.1) and I’m getting this error from KRaft:

[2024-11-04 09:34:59,294] ERROR Exiting Kafka due to fatal exception (kafka.Kafka$)
java.lang.RuntimeException: No FeatureLevelRecord for metadata.version was found in the bootstrap metadata from the binary bootstrap metadata file: /opt/kafka/kraft/bootstrap.checkpoint
       at org.apache.kafka.metadata.bootstrap.BootstrapMetadata.fromRecords(BootstrapMetadata.java:60)
       at org.apache.kafka.metadata.bootstrap.BootstrapDirectory.readFromBinaryFile(BootstrapDirectory.java:107)
       at org.apache.kafka.metadata.bootstrap.BootstrapDirectory.read(BootstrapDirectory.java:79)
       at kafka.server.KafkaRaftServer$.initializeLogDirs(KafkaRaftServer.scala:198)
       at kafka.server.KafkaRaftServer.<init>(KafkaRaftServer.scala:60)
       at kafka.Kafka$.buildServer(Kafka.scala:82)
       at kafka.Kafka$.main(Kafka.scala:90)
       at kafka.Kafka.main(Kafka.scala)

Does anyone know what this is about and what I should do to downgrade?

Thanks in advance

Hi @jnpmarques-alpt

Just to be sure my understanding is correct: you switched binaries from CP 7.7.1 to Confluent Community 7.7.1?

You could check the current status with
kafka-metadata-quorum --bootstrap-controller <controller_host>:<controller_port> describe --status
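
Note that describe --status won’t show the metadata version itself; for that, kafka-features should report metadata.version and its finalized level (the address below is just a placeholder, point it at one of your brokers):

kafka-features --bootstrap-server <broker_host>:<broker_port> describe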

Hi,

Yes, that is right. I’m trying to switch from CP to Community Version both 7.7.1.

I’ve already done that, but nothing in the output points to the version.

kafka-metadata-quorum --bootstrap-controller localhost:24175 describe --status
ClusterId:              ANjcq6rmRWi-vtwu9mqIgA
LeaderId:               4
LeaderEpoch:            220
HighWatermark:          2320751
MaxFollowerLag:         0
MaxFollowerLagTimeMs:   0
CurrentVoters:          [4,5,6]
CurrentObservers:       [1,3]

The binaries that are currently running report this version:

kafka-topics --version
7.7.1-ce

and the new ones:

./kafka-topics --version
7.7.1-ccs

So the replacement version is the same as the one that is running.

OK, and were you already on KRaft, or are you switching from ZK?

Full story is:
I recently migrated from ZK, and because of a miscommunication with the previous cluster owner I switched to CP instead of keeping it Community. It somehow worked fine for two weeks, but now, after a restart, one of the brokers doesn’t start (as expected) due to licensing issues. I’m now trying to go back to the Community version, as it should have been in the first place.

I see.
Is it the same issue on all brokers or not?

If I try to do a rolling restart, the brokers fail to start. The first exception in the logs is:

ERROR Encountered metadata loading fault: Unhandled fault in MetadataLoader#handleLoadSnapshot. Snapshot offset was 2355281 (org.apache.kafka.server.fault.LoggingFaultHandler)
org.apache.kafka.server.common.serialization.MetadataParseException: org.apache.kafka.common.errors.UnsupportedVersionException: Unknown metadata id 10005
        at org.apache.kafka.server.common.serialization.AbstractApiMessageSerde.read(AbstractApiMessageSerde.java:99)
        at org.apache.kafka.server.common.serialization.AbstractApiMessageSerde.read(AbstractApiMessageSerde.java:43)
        at org.apache.kafka.raft.internals.RecordsIterator.decodeDataRecord(RecordsIterator.java:340)
        at org.apache.kafka.raft.internals.RecordsIterator.readRecord(RecordsIterator.java:312)
        at org.apache.kafka.raft.internals.RecordsIterator.readBatch(RecordsIterator.java:230)
        at org.apache.kafka.raft.internals.RecordsIterator.nextBatch(RecordsIterator.java:194)
        at org.apache.kafka.raft.internals.RecordsIterator.hasNext(RecordsIterator.java:89)
        at org.apache.kafka.snapshot.RecordsSnapshotReader.nextBatch(RecordsSnapshotReader.java:126)
        at org.apache.kafka.snapshot.RecordsSnapshotReader.hasNext(RecordsSnapshotReader.java:86)
        at org.apache.kafka.image.loader.MetadataLoader.loadSnapshot(MetadataLoader.java:422)
        at org.apache.kafka.image.loader.MetadataLoader.lambda$handleLoadSnapshot$2(MetadataLoader.java:388)
        at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
        at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
        at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.kafka.common.errors.UnsupportedVersionException: Unknown metadata id 10005

Then I get:

[2024-11-05 13:26:42,674] ERROR [BrokerLifecycleManager id=2] Shutting down because we were unable to register with the controller quorum. (kafka.server.BrokerLifecycleManager)
[2024-11-05 13:26:42,675] INFO [BrokerLifecycleManager id=2] registrationTimeout: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue)
[2024-11-05 13:26:42,675] INFO [BrokerLifecycleManager id=2] Transitioning from STARTING to SHUTTING_DOWN. (kafka.server.BrokerLifecycleManager)
[2024-11-05 13:26:42,678] ERROR [BrokerServer id=2] Received a fatal error while waiting for the controller to acknowledge that we are caught up (kafka.server.BrokerServer)
java.util.concurrent.CancellationException
        at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478)
        at kafka.server.BrokerLifecycleManager$ShutdownEvent.run(BrokerLifecycleManager.scala:586)
        at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:186)
        at java.base/java.lang.Thread.run(Thread.java:833)
[2024-11-05 13:26:42,679] INFO [broker-2-to-controller-heartbeat-channel-manager]: Shutting down (kafka.server.NodeToControllerRequestThread)
[2024-11-05 13:26:42,679] INFO [broker-2-to-controller-heartbeat-channel-manager]: Stopped (kafka.server.NodeToControllerRequestThread)
[2024-11-05 13:26:42,680] INFO [BrokerServer id=2] Transition from STARTING to STARTED (kafka.server.BrokerServer)
[2024-11-05 13:26:42,680] INFO [broker-2-to-controller-heartbeat-channel-manager]: Shutdown completed (kafka.server.NodeToControllerRequestThread)
[2024-11-05 13:26:42,690] ERROR [BrokerServer id=2] Fatal error during broker startup. Prepare to shutdown (kafka.server.BrokerServer)
java.lang.RuntimeException: Received a fatal error while waiting for the controller to acknowledge that we are caught up
        at org.apache.kafka.server.util.FutureUtils.waitWithLogging(FutureUtils.java:68)
        at kafka.server.BrokerServer.startup(BrokerServer.scala:500)
        at kafka.server.KafkaRaftServer.$anonfun$startup$2(KafkaRaftServer.scala:99)
        at kafka.server.KafkaRaftServer.$anonfun$startup$2$adapted(KafkaRaftServer.scala:99)
        at scala.Option.foreach(Option.scala:437)
        at kafka.server.KafkaRaftServer.startup(KafkaRaftServer.scala:99)
        at kafka.Kafka$.main(Kafka.scala:112)
        at kafka.Kafka.main(Kafka.scala)
Caused by: java.util.concurrent.CancellationException
        at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478)
        at kafka.server.BrokerLifecycleManager$ShutdownEvent.run(BrokerLifecycleManager.scala:586)
        at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:186)
        at java.base/java.lang.Thread.run(Thread.java:833)
[2024-11-05 13:26:42,690] INFO [BrokerServer id=2] Transition from STARTED to SHUTTING_DOWN (kafka.server.BrokerServer)
[2024-11-05 13:26:42,691] INFO [BrokerServer id=2] shutting down (kafka.server.BrokerServer)

OK, I need to check or try it myself.

KRaft is running fine, I assume?
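
If you want to see what that snapshot actually contains, kafka-dump-log can decode cluster metadata; the path below is only an example, use the snapshot file from your own __cluster_metadata-0 directory:

kafka-dump-log --cluster-metadata-decoder --files <metadata_log_dir>/__cluster_metadata-0/<snapshot>.checkpoint

The “Unknown metadata id 10005” suggests the snapshot contains a record type the Community binaries don’t recognize.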

Yes, KRaft is working fine.

Update: since I can’t change the controllers to the Community version, I tried a full restart of the broker cluster on Community, and some brokers logged:

[2024-11-06 11:55:40,562] INFO [BrokerLifecycleManager id=1] Unable to register broker 1 because the controller returned error UNSUPPORTED_VERSION (kafka.server.BrokerLifecycleManager)

Since that failed, I reverted back to the paid version, and somehow the broker that had problems started.

But now I have two brokers constantly rebalancing groups with a NOT_COORDINATOR error.

[2024-11-06 12:41:47,283] INFO [GroupCoordinator 1]: Preparing to rebalance group clu-nprd.consumer in state PreparingRebalance with old generation 10319 (__consumer_offsets-40) (reason: Error NOT_COORDINATOR when storing group assignment during SyncGroup (member: ckpnxs70-2-0-61499373-e198-46c9-9c0d-62846b3404ad)) (kafka.coordinator.group.GroupCoordinator)

From the client side (Kafka Connect in this case) it is logging:

[2024-11-06 12:48:04,484] INFO [Worker clientId=connect-0.0.0.0:25083, groupId=connect-cluster] Request joining group due to: rebalance failed due to 'This is not the correct coordinator.' (NotCoordinatorException) (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:1102)
[2024-11-06 12:48:04,584] INFO [Worker clientId=connect-0.0.0.0:25083, groupId=connect-cluster] Client requested disconnect from node 2147483645 (org.apache.kafka.clients.NetworkClient:397)
[2024-11-06 12:48:04,585] INFO [Worker clientId=connect-0.0.0.0:25083, groupId=connect-cluster] Discovered group coordinator 10.114.106.5:25170 (id: 2147483645 rack: null) (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:936)
[2024-11-06 12:48:04,585] INFO [Worker clientId=connect-0.0.0.0:25083, groupId=connect-cluster] Group coordinator 10.114.106.5:25170 (id: 2147483645 rack: null) is unavailable or invalid due to cause: coordinator unavailable. isDisconnected: false. Rediscovery will be attempted. (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:999)
[2024-11-06 12:48:04,585] INFO [Worker clientId=connect-0.0.0.0:25083, groupId=connect-cluster] Requesting disconnect from last known coordinator 10.114.106.5:25170 (id: 2147483645 rack: null) (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:1012)

Another update:

I’ve noticed that some log dirs on the brokers with issues now have a -stray suffix appended. What is that about? I can’t find any documentation on it.

Are the brokers not starting up, or are they up and ready to work?

They start up and respond, but the rebalance issue keeps happening, and I suspect they can’t access the partitions correctly.

No errors in the logs?

Just the rebalance ones.

Then the cluster should basically be fine.

Are you aware of this consumer group?

clu-nprd.consumer
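
If you want to check which broker that group currently resolves to and what state it is in, something like this should work (host and port are placeholders):

kafka-consumer-groups --bootstrap-server <broker_host>:<broker_port> --describe --state --group clu-nprd.consumer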

Yes, I am aware of it, and of multiple others with the same error. But with two brokers stuck rebalancing, a third that is fine, and topic consumption being intermittent (mostly not working), I can’t say the cluster is working fine.

And I still have the problem that I can’t downgrade to Community because of the KRaft FeatureLevelRecord issue. Even if I could get past that, I still don’t know whether I could downgrade the brokers themselves to Community.

I am trying to rename all the -stray directories back to their original names, and I seem to be making some progress.

Yep,

It seems that renaming all the *-stray log directories back to their original names did the trick and restored the cluster’s functionality. I still don’t know whether I lost any data in the process.

I still don’t know which process did the renaming, or why it did it. Do you have any clues about that? I need to investigate further so it doesn’t happen again.
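
In case it helps anyone else, the rename can be done with a loop along these lines (the log dir path is an assumption, substitute whatever your log.dirs points to, and I would only run it with the broker stopped):

# rename every partition directory with a -stray suffix back to its original name
for d in /var/lib/kafka/data/*-stray; do
  mv "$d" "${d%-stray}"
done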

It seems to be related to this:

https://issues.apache.org/jira/browse/KAFKA-13972

Just saw that issue. Well, I could replicate what happened. I don’t know what causes it, but:

  1. Have a CP 7.7.1 broker running.
  2. Stop the broker and switch to Confluent Community 7.7.1 (keep the controllers on CP, since they can’t go back because of the FeatureLevelRecord metadata issue).
    2.1 In the KRaft logs you will see:
[2024-11-07 15:09:57,441] INFO [BrokerLifecycleManager id=1] Unable to register broker 1 because the controller returned error UNSUPPORTED_VERSION (kafka.server.BrokerLifecycleManager)
  3. Revert back to CP 7.7.1, and all the partitions will be marked as stray.

So right now it seems you can’t switch back to Community once you’ve upgraded to CP.