Kafka brokers crash after about 7 days

Hi @rmoff
I deployed my Kafka cluster on Kubernetes (Linux nodes); before that it was running on a Linux Docker VM. A few days after I migrated the cluster to Kubernetes it started failing. My Kubernetes cluster is on Azure Kubernetes Service (AKS) and I am using Azure Files as the persistent volume. Here is the error from the Kafka brokers:

[2023-01-17 18:30:03,433] ERROR Failed to clean up log for ovx_connect_configs-0 in dir /var/lib/kafka/data due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: /var/lib/kafka/data/ovx_connect_configs-0/00000000000000000000.log.cleaned: Operation not permitted
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
    at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.setTimes(UnixFileAttributeViews.java:125)
    at java.base/java.nio.file.Files.setLastModifiedTime(Files.java:2355)
    at kafka.log.LogSegment.lastModified_$eq(LogSegment.scala:651)
    at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:610)
    at kafka.log.Cleaner.$anonfun$doClean$6(LogCleaner.scala:539)
    at kafka.log.Cleaner.doClean(LogCleaner.scala:538)
    at kafka.log.Cleaner.clean(LogCleaner.scala:512)
    at kafka.log.LogCleaner$CleanerThread.cleanLog(LogCleaner.scala:381)
    at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:353)
    at kafka.log.LogCleaner$CleanerThread.tryCleanFilthiestLog(LogCleaner.scala:333)
    at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:322)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2023-01-17 18:30:03,839] ERROR Failed to clean up log for ovx_connect_configs-0 in dir /var/lib/kafka/data due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.NoSuchFileException: /var/lib/kafka/data/ovx_connect_configs-0/00000000000000000000.log.cleaned
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
    at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182)
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)
    at org.apache.kafka.common.record.FileRecords.openChannel(FileRecords.java:451)
    at org.apache.kafka.common.record.FileRecords.open(FileRecords.java:414)
    at kafka.log.LogSegment$.open(LogSegment.scala:664)
    at kafka.log.LogCleaner$.createNewCleanedSegment(LogCleaner.scala:457)
    at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:567)
    at kafka.log.Cleaner.$anonfun$doClean$6(LogCleaner.scala:539)
    at kafka.log.Cleaner.doClean(LogCleaner.scala:538)
    at kafka.log.Cleaner.clean(LogCleaner.scala:512)
    at kafka.log.LogCleaner$CleanerThread.cleanLog(LogCleaner.scala:381)
    at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:353)
    at kafka.log.LogCleaner$CleanerThread.tryCleanFilthiestLog(LogCleaner.scala:333)
    at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:322)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2023-01-17 18:30:03,864] INFO [LogDirFailureHandler]: Starting (kafka.server.ReplicaManager$LogDirFailureHandler)
[2023-01-17 18:30:03,876] WARN [ReplicaManager broker=1] Stopping serving replicas in dir /var/lib/kafka/data (kafka.server.ReplicaManager)
[2023-01-17 18:30:03,899] INFO [broker-1-to-controller-send-thread]: Starting (kafka.server.BrokerToControllerRequestThread)
[2023-01-17 18:30:03,935] WARN [ReplicaManager broker=1] Broker 1 stopped fetcher for partitions and stopped moving logs for partitions because they are in the failed log directory /var/lib/kafka/data. (kafka.server.ReplicaManager)
[2023-01-17 18:30:03,935] WARN Stopping serving logs in dir /var/lib/kafka/data (kafka.log.LogManager)
[2023-01-17 18:30:03,946] ERROR Shutdown broker because all log dirs in /var/lib/kafka/data have failed (kafka.log.LogManager)
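
Digging into the first stack trace, the cleaner fails in `Files.setLastModifiedTime` with "Operation not permitted", which seems to point at the Azure Files mount rather than at Kafka itself; as far as I understand, SMB-backed Azure Files shares don't support all of the POSIX-style file operations the log cleaner relies on. To confirm that, I'm thinking of running something like this minimal probe from inside the broker pod against the data mount (this is just a hypothetical check, not Kafka's own code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Instant;

// Minimal probe: exercise the file operations the log cleaner uses
// against the Kafka data mount, e.g. `java FsCheck.java /var/lib/kafka/data`.
public class FsCheck {
    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args.length > 0 ? args[0] : "/var/lib/kafka/data");
        Path probe = dir.resolve("fs-check.tmp");

        Files.writeString(probe, "probe");                               // create a small file
        Files.setLastModifiedTime(probe, FileTime.from(Instant.now()));  // the call that fails in the stack trace
        Files.move(probe, dir.resolve("fs-check.renamed"));              // rename, also used when swapping in cleaned segments
        Files.delete(dir.resolve("fs-check.renamed"));                   // clean up

        System.out.println("mtime update, rename, and delete all succeeded on " + dir);
    }
}
```

If the mtime update is the call that fails there too, I guess the real fix is to move the PersistentVolume from Azure Files to block storage such as Azure Disk.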

I ran into the same issue after a broker failed because of a disk problem (no space left on the disk). After a restart it recovered.

Wondering if there's any way to mitigate these kinds of scenarios.
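
One idea for limiting the blast radius is to cap retention on the busiest topics so a single topic can't fill the disk. A rough sketch with the Kafka AdminClient (the bootstrap address, topic name, and limits below are just placeholders to adapt):

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

// Cap size- and time-based retention on one topic so its log cannot grow unbounded.
public class BoundTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder bootstrap address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-busy-topic"); // placeholder topic
            Collection<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("retention.bytes", "10737418240"), AlterConfigOp.OpType.SET), // ~10 GiB per partition
                new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET)       // 7 days
            );
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```

That wouldn't replace disk-usage alerts on the volume, but it should make a hard full-disk stop much less likely.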