NoSuchFileException in Log Cleanup Operation

Hi,
We had a critical error in our Kafka cluster yesterday. As a result, the cluster shut down and we had to restart the Kafka service immediately.
Everything has been working fine since then.

After that we tried to investigate the problem but couldn't figure it out.
I trimmed the log file and attached it below. As you can see, it reports java.nio.file.NoSuchFileException: /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log

We didn't see any problem with the disk storage. After we restarted the Kafka service, it showed: [2022-03-18 17:49:17,573] INFO [Log partition=__consumer_offsets-40, dir=/opt/data/kafka] Found file /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log.swap from interrupted swap operation. (kafka.log.Log)

So it seems that, for some reason, it couldn't write the data, but after the restart it found the interrupted operation and reported that it could fix it. We still couldn't find what caused this problem.

Our Kafka broker runs on Debian 11 in a VM, with temurin-11-jdk-amd64. The Kafka version is 2.2.1.

Any ideas?

[2022-03-18 17:32:29,482] ERROR Failed to clean up log for __consumer_offsets-40 in dir /opt/data/kafka due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.NoSuchFileException: /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
        at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:430)
        at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
        at java.base/java.nio.file.Files.move(Files.java:1422)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:806)
        at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:224)
        at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:489)
        at kafka.log.Log.asyncDeleteSegment(Log.scala:1960)
        at kafka.log.Log.$anonfun$replaceSegments$6(Log.scala:2023)
        at kafka.log.Log.$anonfun$replaceSegments$6$adapted(Log.scala:2018)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Log.replaceSegments(Log.scala:2018)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:582)
        at kafka.log.Cleaner.$anonfun$doClean$6(LogCleaner.scala:512)
        at kafka.log.Cleaner.$anonfun$doClean$6$adapted(LogCleaner.scala:511)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:511)
        at kafka.log.Cleaner.clean(LogCleaner.scala:489)
        at kafka.log.LogCleaner$CleanerThread.cleanLog(LogCleaner.scala:350)
        at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:319)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:300)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
        Suppressed: java.nio.file.NoSuchFileException: /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log -> /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log.deleted
                at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
                at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
                at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
                at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
                at java.base/java.nio.file.Files.move(Files.java:1422)
                at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:803)
                ... 17 more
[2022-03-18 17:32:29,534] INFO [ReplicaManager broker=24] Stopping serving replicas in dir /opt/data/kafka (kafka.server.ReplicaManager)
[2022-03-18 17:32:29,617] INFO [ReplicaFetcherManager on broker 24] Removed fetcher for partitions Set(__consumer_offsets-22, and many more )
[2022-03-18 17:32:29,643] ERROR Failed to clean up log for __consumer_offsets-40 in dir /opt/data/kafka due to IOException (kafka.server.LogDirFailureChannel)

[2022-03-18 17:32:29,643] ERROR Failed to clean up log for __consumer_offsets-40 in dir /opt/data/kafka due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.NoSuchFileException: /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log.swap
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
        at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:430)
        at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
        at java.base/java.nio.file.Files.move(Files.java:1422)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:806)
        at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:224)
        at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:489)
        at kafka.log.Log.$anonfun$replaceSegments$7(Log.scala:2026)
        at kafka.log.Log.$anonfun$replaceSegments$7$adapted(Log.scala:2026)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Log.replaceSegments(Log.scala:2026)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:582)
        at kafka.log.Cleaner.$anonfun$doClean$6(LogCleaner.scala:512)
        at kafka.log.Cleaner.$anonfun$doClean$6$adapted(LogCleaner.scala:511)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:511)
        at kafka.log.Cleaner.clean(LogCleaner.scala:489)
        at kafka.log.LogCleaner$CleanerThread.cleanLog(LogCleaner.scala:350)
        at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:319)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:300)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
        Suppressed: java.nio.file.NoSuchFileException: /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log.swap -> /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log
                at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
                at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
                at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
                at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
                at java.base/java.nio.file.Files.move(Files.java:1422)
                at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:803)
                ... 16 more

---------
        [2022-03-18 17:49:17,573] INFO [Log partition=__consumer_offsets-40, dir=/opt/data/kafka] Found file /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log.swap from interrupted swap operation. (kafka.log.Log)

----------

[2022-03-18 17:50:45,024] INFO Registered broker 24 at path /brokers/ids/24 with addresses: ArrayBuffer(EndPoint(10.xxx.xxx.24,9092,ListenerName(PLAINTEXT),PLAINTEXT)), czxid (broker epoch): 25769845019 (kafka.zk.KafkaZkClient)
[2022-03-18 17:50:45,026] WARN No meta.properties file under dir /opt/data/kafka/meta.properties (kafka.server.BrokerMetadataCheckpoint)
[2022-03-18 17:50:45,087] INFO [ExpirationReaper-24-topic]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2022-03-18 17:50:45,095] INFO [ExpirationReaper-24-Heartbeat]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2022-03-18 17:50:45,095] INFO [ExpirationReaper-24-Rebalance]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2022-03-18 17:50:45,130] INFO [GroupCoordinator 24]: Starting up. (kafka.coordinator.group.GroupCoordinator)
[2022-03-18 17:50:45,131] INFO [GroupCoordinator 24]: Startup complete. (kafka.coordinator.group.GroupCoordinator)
[2022-03-18 17:50:45,135] INFO [GroupMetadataManager brokerId=24] Removed 0 expired offsets in 4 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2022-03-18 17:50:45,154] INFO [ProducerId Manager 24]: Acquired new producerId block (brokerId:24,blockStartProducerId:22000,blockEndProducerId:22999) by writing to Zk with path version 23 (kafka.coordinator.transaction.ProducerIdManager)
[2022-03-18 17:50:45,181] INFO [TransactionCoordinator id=24] Starting up. (kafka.coordinator.transaction.TransactionCoordinator)
[2022-03-18 17:50:45,183] INFO [TransactionCoordinator id=24] Startup complete. (kafka.coordinator.transaction.TransactionCoordinator)
[2022-03-18 17:50:45,184] INFO [Transaction Marker Channel Manager 24]: Starting (kafka.coordinator.transaction.TransactionMarkerChannelManager)
[2022-03-18 17:50:45,228] INFO [/config/changes-event-process-thread]: Starting (kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread)
[2022-03-18 17:50:45,238] WARN [Log partition=xxxxxxx-0, dir=/opt/data/kafka] retention.ms for topic xxxxxxx is set to 432000000. It is smaller than message.timestamp.difference.max.ms's value 9223372036854775807. This may result in frequent log rolling. (kafka.log.Log)
[2022-03-18 17:50:45,252] INFO [SocketServer brokerId=24] Started data-plane processors for 1 acceptors (kafka.network.SocketServer)
[2022-03-18 17:50:45,265] INFO Kafka version: 2.2.1 (org.apache.kafka.common.utils.AppInfoParser)
[2022-03-18 17:50:45,265] INFO Kafka commitId: 55783d3133a5a49a (org.apache.kafka.common.utils.AppInfoParser)
[2022-03-18 17:50:45,267] INFO [KafkaServer id=24] started (kafka.server.KafkaServer)

---------

[2022-03-18 17:51:44,785] INFO [Log partition=__consumer_offsets-40, dir=/opt/data/kafka] Deleting segment 0 (kafka.log.Log)
[2022-03-18 17:51:44,787] INFO Deleted log /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log.deleted. (kafka.log.LogSegment)
[2022-03-18 17:51:44,792] INFO Deleted offset index /opt/data/kafka/__consumer_offsets-40/00000000000000000000.index.deleted. (kafka.log.LogSegment)
[2022-03-18 17:51:44,793] INFO Deleted time index /opt/data/kafka/__consumer_offsets-40/00000000000000000000.timeindex.deleted. (kafka.log.LogSegment)

Hi @Cahit

welcome to the forum :slight_smile:

silly first question:
was the offset file still available during the error?

there is a similar topic on Stack Overflow

So it may be related to your retention settings.
what are the configured values of log.retention.*?

best,
michael

Hi @mmuehlbeyer,

That's a fair question. When the Kafka cluster shut down, we restarted the service immediately, so we didn't have time to check whether the file existed.

I had checked that Stack Overflow topic before, but it didn't reveal a root cause. We've already worked around the problem by restarting the Kafka service, but we're trying to understand the root cause so we don't run into it again.

What could be the problem with the log retention settings? And even if the file doesn't exist, I don't understand why Kafka tries to delete a file that doesn't exist.

log.retention.bytes = -1
log.retention.check.interval.ms = 300000
log.retention.hours = 168
log.retention.minutes = null
log.retention.ms = null

Kafka tries to delete the file when it reaches the configured retention, 168 hours in your case.

So if some process or whatever deletes the file outside of Kafka, Kafka isn't aware of that and still tries to delete it after the defined retention.
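
if you want to double-check what's actually in effect for the offsets topic itself (topic-level overrides win over the broker defaults, and note that __consumer_offsets is compacted by default, so the cleaner settings matter too), something like the following should work on 2.2.x. the install path and ZooKeeper address here are just assumptions from a standard setup:

    # show topic-level overrides for __consumer_offsets (ZooKeeper address assumed)
    /opt/kafka/bin/kafka-configs.sh --zookeeper localhost:2181 \
      --entity-type topics --entity-name __consumer_offsets --describe

    # broker-level defaults for comparison
    grep -E '^log\.retention|^log\.cleaner' /opt/kafka/config/server.properties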

any cronjobs on your system which might lead to such behaviour?

best
michael

Yes, you're right about that. We considered that case too, but couldn't find anything that might have deleted the file.

This log makes me suspicious:
[2022-03-18 17:49:17,573] INFO [Log partition=__consumer_offsets-40, dir=/opt/data/kafka] Found file /opt/data/kafka/__consumer_offsets-40/00000000000000000000.log.swap from interrupted swap operation. (kafka.log.Log)

We got this log line after restarting the Kafka service. It makes me think that the file only turns up after the restart.

mmh, this info might be related to the description in the following Jira issue

https://issues.apache.org/jira/browse/KAFKA-2118

I'm not that deep in the Kafka code base, but it seems the swap file is created during shutdown or by a cleanup process
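
looking at the stack traces you posted (changeFileSuffixes called from replaceSegments), the cleaner's swap is basically a sequence of file renames. a rough sketch in shell terms, just to illustrate the mechanism; the file names are taken from your log, the .log.cleaned staging name is from my reading of the code base, not from your output:

    cd /opt/data/kafka/__consumer_offsets-40
    # 1. the newly compacted segment is staged, then marked as a swap file
    mv 00000000000000000000.log.cleaned 00000000000000000000.log.swap
    # 2. the old segment is renamed for async deletion
    #    (your first NoSuchFileException fires here: .log -> .log.deleted)
    mv 00000000000000000000.log 00000000000000000000.log.deleted
    # 3. the swap file is promoted to the live segment
    #    (your second NoSuchFileException fires here: .log.swap -> .log)
    mv 00000000000000000000.log.swap 00000000000000000000.log

so if the .log file disappears between those steps, the operation aborts half-way, and on restart Kafka finds the leftover .log.swap and completes the swap itself, which would match the "interrupted swap operation" message you saw.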

for a better understanding of the internals, @jaceklaskowski's gitbook about Kafka internals might help
https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-log-Log.html

best,
michael

It seems this bug was already fixed in earlier versions, and we're on 2.2.1.
We still couldn't find anything that could cause this problem :frowning:

mmh, is there anything stopping you from upgrading to a newer Kafka version?

Nothing :slight_smile: I think we’re going to upgrade as soon as possible.

sounds like a plan :slight_smile:

one thing to try if you’d like to dig a bit deeper:

use auditctl to check who modifies or deletes the file :wink:
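
for example something along these lines; the watch path is taken from your log, the rule key is just an arbitrary label:

    # watch the partition directory; deletes and renames inside it show up as writes
    auditctl -w /opt/data/kafka/__consumer_offsets-40 -p wa -k kafka-segments

    # later, check which process touched the files (interpreted output)
    ausearch -k kafka-segments -i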

best,
michael