Error downgrading to community

Hi,

I’m downgrading from Confluent Platform to the Community version (same version on both, 7.7.1) and I’m getting this error from KRaft:

[2024-11-04 09:34:59,294] ERROR Exiting Kafka due to fatal exception (kafka.Kafka$)
java.lang.RuntimeException: No FeatureLevelRecord for metadata.version was found in the bootstrap metadata from the binary bootstrap metadata file: /opt/kafka/kraft/bootstrap.checkpoint
       at org.apache.kafka.metadata.bootstrap.BootstrapMetadata.fromRecords(BootstrapMetadata.java:60)
       at org.apache.kafka.metadata.bootstrap.BootstrapDirectory.readFromBinaryFile(BootstrapDirectory.java:107)
       at org.apache.kafka.metadata.bootstrap.BootstrapDirectory.read(BootstrapDirectory.java:79)
       at kafka.server.KafkaRaftServer$.initializeLogDirs(KafkaRaftServer.scala:198)
       at kafka.server.KafkaRaftServer.<init>(KafkaRaftServer.scala:60)
       at kafka.Kafka$.buildServer(Kafka.scala:82)
       at kafka.Kafka$.main(Kafka.scala:90)
       at kafka.Kafka.main(Kafka.scala)

Does anyone know what this is about and what I should do to downgrade?

Thanks in advance

Hi @jnpmarques-alpt

Just to be sure my understanding is correct: you switched binaries from CP 7.7.1 to Confluent Community 7.7.1?

You could check the current status with
kafka-metadata-quorum --bootstrap-controller <controller_host>:<controller_port> describe --status
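
Note that describe --status won’t show the metadata version itself; for that, kafka-features should report metadata.version and its finalized level (the address below is just a placeholder, point it at one of your brokers):

kafka-features --bootstrap-server <broker_host>:<broker_port> describe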

Hi,

Yes, that is right. I’m trying to switch from CP to Community Version both 7.7.1.

I’ve already done that, but nothing in the output points to the version.

kafka-metadata-quorum --bootstrap-controller localhost:24175 describe --status
ClusterId:              ANjcq6rmRWi-vtwu9mqIgA
LeaderId:               4
LeaderEpoch:            220
HighWatermark:          2320751
MaxFollowerLag:         0
MaxFollowerLagTimeMs:   0
CurrentVoters:          [4,5,6]
CurrentObservers:       [1,3]

The binaries that are currently running report this version:

kafka-topics --version
7.7.1-ce

and the new ones:

./kafka-topics --version
7.7.1-ccs

So the replacement version is the same as the one that is running.

OK, and were you already on KRaft, or are you switching from ZK?

Full story is:
I recently migrated from ZK, and because of a miscommunication with the previous cluster owner I switched to CP instead of keeping it Community. It somehow worked fine for two weeks, but now, after a restart, one of the brokers doesn’t start (as expected) due to licensing issues. I’m now trying to go back to the Community version, as it should have been in the first place.

I see.
Is it the same issue on all brokers or not?

If I try to do a rolling restart, the brokers fail to start. The first exception in the logs is:

ERROR Encountered metadata loading fault: Unhandled fault in MetadataLoader#handleLoadSnapshot. Snapshot offset was 2355281 (org.apache.kafka.server.fault.LoggingFaultHandler)
org.apache.kafka.server.common.serialization.MetadataParseException: org.apache.kafka.common.errors.UnsupportedVersionException: Unknown metadata id 10005
        at org.apache.kafka.server.common.serialization.AbstractApiMessageSerde.read(AbstractApiMessageSerde.java:99)
        at org.apache.kafka.server.common.serialization.AbstractApiMessageSerde.read(AbstractApiMessageSerde.java:43)
        at org.apache.kafka.raft.internals.RecordsIterator.decodeDataRecord(RecordsIterator.java:340)
        at org.apache.kafka.raft.internals.RecordsIterator.readRecord(RecordsIterator.java:312)
        at org.apache.kafka.raft.internals.RecordsIterator.readBatch(RecordsIterator.java:230)
        at org.apache.kafka.raft.internals.RecordsIterator.nextBatch(RecordsIterator.java:194)
        at org.apache.kafka.raft.internals.RecordsIterator.hasNext(RecordsIterator.java:89)
        at org.apache.kafka.snapshot.RecordsSnapshotReader.nextBatch(RecordsSnapshotReader.java:126)
        at org.apache.kafka.snapshot.RecordsSnapshotReader.hasNext(RecordsSnapshotReader.java:86)
        at org.apache.kafka.image.loader.MetadataLoader.loadSnapshot(MetadataLoader.java:422)
        at org.apache.kafka.image.loader.MetadataLoader.lambda$handleLoadSnapshot$2(MetadataLoader.java:388)
        at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
        at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
        at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.kafka.common.errors.UnsupportedVersionException: Unknown metadata id 10005

Then I get:

[2024-11-05 13:26:42,674] ERROR [BrokerLifecycleManager id=2] Shutting down because we were unable to register with the controller quorum. (kafka.server.BrokerLifecycleManager)
[2024-11-05 13:26:42,675] INFO [BrokerLifecycleManager id=2] registrationTimeout: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue)
[2024-11-05 13:26:42,675] INFO [BrokerLifecycleManager id=2] Transitioning from STARTING to SHUTTING_DOWN. (kafka.server.BrokerLifecycleManager)
[2024-11-05 13:26:42,678] ERROR [BrokerServer id=2] Received a fatal error while waiting for the controller to acknowledge that we are caught up (kafka.server.BrokerServer)
java.util.concurrent.CancellationException
        at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478)
        at kafka.server.BrokerLifecycleManager$ShutdownEvent.run(BrokerLifecycleManager.scala:586)
        at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:186)
        at java.base/java.lang.Thread.run(Thread.java:833)
[2024-11-05 13:26:42,679] INFO [broker-2-to-controller-heartbeat-channel-manager]: Shutting down (kafka.server.NodeToControllerRequestThread)
[2024-11-05 13:26:42,679] INFO [broker-2-to-controller-heartbeat-channel-manager]: Stopped (kafka.server.NodeToControllerRequestThread)
[2024-11-05 13:26:42,680] INFO [BrokerServer id=2] Transition from STARTING to STARTED (kafka.server.BrokerServer)
[2024-11-05 13:26:42,680] INFO [broker-2-to-controller-heartbeat-channel-manager]: Shutdown completed (kafka.server.NodeToControllerRequestThread)
[2024-11-05 13:26:42,690] ERROR [BrokerServer id=2] Fatal error during broker startup. Prepare to shutdown (kafka.server.BrokerServer)
java.lang.RuntimeException: Received a fatal error while waiting for the controller to acknowledge that we are caught up
        at org.apache.kafka.server.util.FutureUtils.waitWithLogging(FutureUtils.java:68)
        at kafka.server.BrokerServer.startup(BrokerServer.scala:500)
        at kafka.server.KafkaRaftServer.$anonfun$startup$2(KafkaRaftServer.scala:99)
        at kafka.server.KafkaRaftServer.$anonfun$startup$2$adapted(KafkaRaftServer.scala:99)
        at scala.Option.foreach(Option.scala:437)
        at kafka.server.KafkaRaftServer.startup(KafkaRaftServer.scala:99)
        at kafka.Kafka$.main(Kafka.scala:112)
        at kafka.Kafka.main(Kafka.scala)
Caused by: java.util.concurrent.CancellationException
        at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478)
        at kafka.server.BrokerLifecycleManager$ShutdownEvent.run(BrokerLifecycleManager.scala:586)
        at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:186)
        at java.base/java.lang.Thread.run(Thread.java:833)
[2024-11-05 13:26:42,690] INFO [BrokerServer id=2] Transition from STARTED to SHUTTING_DOWN (kafka.server.BrokerServer)
[2024-11-05 13:26:42,691] INFO [BrokerServer id=2] shutting down (kafka.server.BrokerServer)

OK, I need to check or try it myself.

KRaft is running fine, I assume?
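
If you want to see what that snapshot actually contains, kafka-dump-log can decode cluster metadata; the path below is only an example, use the snapshot file from your own __cluster_metadata-0 directory:

kafka-dump-log --cluster-metadata-decoder --files <metadata_log_dir>/__cluster_metadata-0/<snapshot>.checkpoint

The “Unknown metadata id 10005” suggests the snapshot contains a record type the Community binaries don’t recognize.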

Yes, KRaft is working fine.

Update: since I can’t change the controllers to the Community version, I tried a full restart of the broker cluster on Community, and some brokers logged:

[2024-11-06 11:55:40,562] INFO [BrokerLifecycleManager id=1] Unable to register broker 1 because the controller returned error UNSUPPORTED_VERSION (kafka.server.BrokerLifecycleManager)

Since that failed, I reverted back to the paid version, and somehow the broker that had problems started.

But now I have two brokers constantly rebalancing groups with a NOT_COORDINATOR error.

[2024-11-06 12:41:47,283] INFO [GroupCoordinator 1]: Preparing to rebalance group clu-nprd.consumer in state PreparingRebalance with old generation 10319 (__consumer_offsets-40) (reason: Error NOT_COORDINATOR when storing group assignment during SyncGroup (member: ckpnxs70-2-0-61499373-e198-46c9-9c0d-62846b3404ad)) (kafka.coordinator.group.GroupCoordinator)

From the client side (Kafka Connect in this case) it is logging:

[2024-11-06 12:48:04,484] INFO [Worker clientId=connect-0.0.0.0:25083, groupId=connect-cluster] Request joining group due to: rebalance failed due to 'This is not the correct coordinator.' (NotCoordinatorException) (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:1102)
[2024-11-06 12:48:04,584] INFO [Worker clientId=connect-0.0.0.0:25083, groupId=connect-cluster] Client requested disconnect from node 2147483645 (org.apache.kafka.clients.NetworkClient:397)
[2024-11-06 12:48:04,585] INFO [Worker clientId=connect-0.0.0.0:25083, groupId=connect-cluster] Discovered group coordinator 10.114.106.5:25170 (id: 2147483645 rack: null) (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:936)
[2024-11-06 12:48:04,585] INFO [Worker clientId=connect-0.0.0.0:25083, groupId=connect-cluster] Group coordinator 10.114.106.5:25170 (id: 2147483645 rack: null) is unavailable or invalid due to cause: coordinator unavailable. isDisconnected: false. Rediscovery will be attempted. (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:999)
[2024-11-06 12:48:04,585] INFO [Worker clientId=connect-0.0.0.0:25083, groupId=connect-cluster] Requesting disconnect from last known coordinator 10.114.106.5:25170 (id: 2147483645 rack: null) (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:1012)

Another update:

I’ve noticed that some log dirs on the brokers with issues now have a -stray suffix appended. What is that about? I can’t find any documentation on it.

Are the brokers not starting up, or are they up and ready to work?

They start up and respond, but the rebalance issue keeps happening, and I suspect they can’t access the partitions correctly.

No errors in the logs?

Just the rebalance ones.

Then the cluster should basically be fine.

Are you aware of this consumer group?

clu-nprd.consumer
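
If you want to check which broker that group currently resolves to and what state it is in, something like this should work (host and port are placeholders):

kafka-consumer-groups --bootstrap-server <broker_host>:<broker_port> --describe --state --group clu-nprd.consumer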

Yes, I am aware of it, and of multiple others with the same error. But with two brokers stuck rebalancing, a third that is fine, and topic consumption being intermittent (mostly not working), I can’t say the cluster is working fine.

And I still have the problem that I can’t downgrade to Community because of the KRaft FeatureLevelRecord issue. Even if I could get past that, I still don’t know whether I could downgrade the brokers themselves to Community.

I am trying to rename all the -stray directories back to their original names, and I seem to be making some progress.

Yep,

It seems that renaming all the *-stray log directories back to their original names did the trick and restored the cluster’s functionality. I still don’t know whether I lost any data in the process.

I still don’t know which process did the renaming, or why it did it. Do you have any clues about that? I need to investigate further so it doesn’t happen again.
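
In case it helps anyone else, the rename can be done with a loop along these lines (the log dir path is an assumption, substitute whatever your log.dirs points to, and I would only run it with the broker stopped):

# rename every partition directory with a -stray suffix back to its original name
for d in /var/lib/kafka/data/*-stray; do
  mv "$d" "${d%-stray}"
done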

It seems to be related to this:

https://issues.apache.org/jira/browse/KAFKA-13972

Just saw that issue. Well, I could replicate what happened. I don’t know what causes it, but:

  1. Have a CP 7.7.1 broker running.
  2. Stop the broker and switch to Confluent Community 7.7.1 (keep the controllers on CP, since they can’t go back because of the FeatureLevelRecord metadata issue).
    2.1 In the KRaft logs you will see:
[2024-11-07 15:09:57,441] INFO [BrokerLifecycleManager id=1] Unable to register broker 1 because the controller returned error UNSUPPORTED_VERSION (kafka.server.BrokerLifecycleManager)
  3. Revert back to CP 7.7.1, and all the partitions will be marked as stray.

So right now it seems you can’t switch back to Community once you’ve upgraded to CP.