Could not find offset index file

Hi,

Often when we run a rolling restart of our clusters, we see the following error in the log

Could not find offset index file corresponding to log file /var/lib/kafka/data/mytopic-0/0000000000000123.log, recovering segment and rebuilding index files…

I have two questions regarding this:

  • What is causing the .index file not to be created in the first place? Meaning, how can we fix this?
  • How is this affecting the cluster and its performance?

Clusters are on-prem, running on VMs using RHEL8.

hey @chromy96

welcome to the forum :slight_smile:

could you provide some details about your clusters?
e.g. on-prem or cloud? Kubernetes or Docker?

it may be related to an unclean shutdown, e.g. if systemd does not wait for a graceful shutdown of the brokers.

best,
michael

Hi @mmuehlbeyer,

Thanks for the quick response. I’ve updated the original post with more details. I’ll check the system logs to see if there’s a forced shutdown.


hi @chromy96

thanks for the update.
in this case systemd might be the root cause.

a quick thing to test is to increase

TimeoutStopSec

in your systemd config file to, let’s say, 300 or 600 seconds.
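
a minimal sketch of a systemd drop-in for that (the unit name and path are assumptions, adjust to your setup):

# /etc/systemd/system/confluent-kafka.service.d/override.conf
# give the broker up to 10 minutes to finish its controlled shutdown
[Service]
TimeoutStopSec=600

after adding it, run systemctl daemon-reload so the change is picked up before the next restart.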

best,
michael

Based on the logs, it doesn’t look like this is caused by an eager systemd. In Kafka logs I see that the broker is stopped cleanly before it is started again (and before we get the “Could not find … file” error):

An example:

[2022-05-25 14:44:20,954] INFO [KafkaServer id=3] shut down completed (kafka.server.KafkaServer)
...
[2022-05-25 14:44:26,365] ERROR [LogLoader partition=mytopic-0, dir=/var/lib/kafka/data] Could not find offset index file corresponding to log file /var/lib/kafka/data/mytopic-0/00000000000000000757.log, recovering segment and rebuilding index files... (kafka.log.LogLoader$)

ok I see
the broker start is successful though, I guess?

and could you check the log directory after stopping the broker
to see which files exist?
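
something like this should do (the partition path is taken from your error above, the service name is an assumption):

# stop the broker, then list what is left in the affected partition directory
systemctl stop confluent-kafka
ls -l /var/lib/kafka/data/mytopic-0/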

best,
michael

Yes, brokers start successfully and data is available. The only things we see are errors in the logs.

Here’s an example of one of the folders reporting errors.
First, after the broker is stopped but before start:

├── mytopic-0
│   ├── 00000000000000059749.snapshot
│   ├── 00000000000000060394.log
│   ├── 00000000000000060394.snapshot
│   ├── 00000000000000060394.timeindex
│   ├── leader-epoch-checkpoint
│   └── partition.metadata

After start:

├── mytopic-0
│   ├── 00000000000000060394.index
│   ├── 00000000000000060394.log
│   ├── 00000000000000060394.timeindex
│   ├── leader-epoch-checkpoint
│   └── partition.metadata

ok thx
and which file was the broker complaining about now?

In the last example, it would be the index file for mytopic-0/00000000000000060394.log; see the error below:

Could not find offset index file corresponding to log file /var/lib/kafka/data/mytopic-0/00000000000000060394.log, recovering segment and rebuilding index files...

As seen in the “before start” listing above, the file 00000000000000060394.index is missing after the broker is stopped. It is recreated by the broker after it starts again.

hmm ok
the behaviour is described in
Log · The Internals of Apache Kafka → Loading Log Files

though I’m not 100% sure whether there is something going on during shutdown.
how did you stop the cluster?

with

kafka-server-stop.sh ?

could you check whether the following parameter is in place?

controlled.shutdown.enable=true
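
to verify, you could grep the broker properties file, for example (the path is an assumption based on the usual Confluent Platform layout):

# check whether the parameter is set explicitly in the broker config
grep controlled.shutdown /etc/kafka/server.properties

if nothing is set there, the broker falls back to its built-in default.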

best,
michael

We’re using cp-ansible to handle provisioning and rolling restarts. From what I see in the code, it’s using systemd to restart the brokers.

I also tried stopping them using systemctl stop confluent-kafka, which causes the same error on startup.

The broker config does not have the controlled.shutdown.enable=true parameter set by default in cp-ansible. I can try setting it and will report back.


Checked now:
controlled.shutdown.enable is already true by default, so setting it explicitly didn’t change anything. The problem still persists.

ok I see

tried to reproduce it locally but wasn’t able to.

out of curiosity which version are you using? latest?
could you share more details of your logfiles?

and are any producers or consumers connected during the restart?

and another question: how many brokers exist in your cluster? 3?

best,
michael

We’re using Confluent Platform 7.1.1, but I saw this log back when we used 6.1 and 6.2 as well. This is happening in all our clusters (dev and prod), where we have producers and consumers connected during restart.

We have between 3 and 6 brokers in our clusters.

Regarding logs, I can share them, but I’m not sure which part is interesting to you.

[2022-06-02 09:54:34,919] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2022-06-02 09:54:34,926] INFO [KafkaServer id=1] shutting down (kafka.server.KafkaServer)
[2022-06-02 09:54:34,927] INFO [KafkaServer id=1] Starting controlled shutdown (kafka.server.KafkaServer)
[2022-06-02 09:54:35,063] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions HashSet(mytopic-6,__consumer_offsets-6) (kafka.server.ReplicaFetcherManager)
...
[2022-06-02 09:54:36,046] INFO [Controller id=1, targetBrokerId=1] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient)
[2022-06-02 09:54:36,046] WARN [Controller id=1, targetBrokerId=1] Connection to node 1 (brd81.example.com/1.2.3.4:9091) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2022-06-02 09:54:36,046] INFO [Controller id=1, targetBrokerId=1] Client requested connection close from node 1 (org.apache.kafka.clients.NetworkClient)
...
[2022-06-02 09:54:36,090] INFO [ProducerStateManager partition=mytopic-2] Wrote producer snapshot at offset 34699 with 0 producer ids in 0 ms. (kafka.log.ProducerStateManager)
...
[2022-06-02 09:54:39,892] INFO Cruise Control metrics reporter exited. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
[2022-06-02 09:54:39,894] INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-06-02 09:54:39,894] INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics)
[2022-06-02 09:54:39,894] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-06-02 09:54:39,894] INFO App info kafka.producer for CruiseControlMetricsReporter unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-06-02 09:54:39,894] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-06-02 09:54:39,895] INFO Broker and topic stats closed (kafka.server.BrokerTopicStats)
[2022-06-02 09:54:39,895] INFO App info kafka.server for 1 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-06-02 09:54:39,895] INFO [KafkaServer id=1] shut down completed (kafka.server.KafkaServer)

...

[2022-06-02 09:54:40,788] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
[2022-06-02 09:54:41,118] INFO Registered signal handlers for TERM, INT, HUP (org.apache.kafka.common.utils.LoggingSignalHandler)
[2022-06-02 09:54:41,121] INFO starting (kafka.server.KafkaServer)
...
[2022-06-02 09:54:41,702] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient)
[2022-06-02 09:54:41,766] INFO [feature-zk-node-event-process-thread]: Starting (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread)
...
[2022-06-02 09:54:41,978] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2022-06-02 09:54:41,979] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2022-06-02 09:54:41,980] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2022-06-02 09:54:41,981] INFO [ThrottledChannelReaper-ControllerMutation]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2022-06-02 09:54:42,038] INFO Loading logs from log dirs ArraySeq(/var/lib/kafka/data) (kafka.log.LogManager)
[2022-06-02 09:54:42,041] INFO Skipping recovery for all logs in /var/lib/kafka/data since clean shutdown file was found (kafka.log.LogManager)
...
[2022-06-02 09:54:42,267] ERROR [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Could not find offset index file corresponding to log file /var/lib/kafka/data/mytopic-6/00000000000000033668.log, recovering segment and rebuilding index files... (kafka.log.LogLoader$)
[2022-06-02 09:54:42,268] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Loading producer state till offset 33668 with message format version 2 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,268] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Reloading from producer snapshot and rebuilding producer state from offset 33668 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,268] INFO [ProducerStateManager partition=mytopic-6] Loading producer state from snapshot file 'SnapshotFile(/var/lib/kafka/data/mytopic-6/00000000000000033668.snapshot,33668)' (kafka.log.ProducerStateManager)
[2022-06-02 09:54:42,268] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Producer state recovery took 0ms for snapshot load and 0ms for segment recovery from offset 33668 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,273] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Loading producer state till offset 33668 with message format version 2 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,273] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Reloading from producer snapshot and rebuilding producer state from offset 33668 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,273] INFO Deleted producer state snapshot /var/lib/kafka/data/mytopic-6/00000000000000033668.snapshot (kafka.log.SnapshotFile)
[2022-06-02 09:54:42,273] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Producer state recovery took 0ms for snapshot load and 0ms for segment recovery from offset 33668 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,274] INFO Completed load of Log(dir=/var/lib/kafka/data/mytopic-6, topicId=..., topic=mytopic, partition=6, highWatermark=33668, lastStableOffset=33668, logStartOffset=33668, logEndOffset=33668) with 1 segments in 7ms (44/127 loaded in /var/lib/kafka/data) (kafka.log.LogManager)

hi @chromy96

ok I see.

so afaik the issue is related to the fact that consumers/producers are connected during restart.

the *.snapshot files contain snapshots of the producer state (the sequence IDs used to avoid duplicate records).

and if you stop the broker, they have to be cleaned up during the restart.

i think the only way to overcome this is to drop all producer/consumer connections before restarting the cluster

hth,
michael

Is there a recommended way to do this? My (obviously wrong :blush: ) assumption was that the brokers would handle that themselves.
Besides the standard producers, we also have e.g. Cruise Control’s metrics reporter, which runs as a broker plugin.

One more question: Since I don’t quite understand the connection between snapshot files and index files, could you point me to the documentation explaining this?

basically stop all producers and consumers and then shut down the brokers

see

https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-log-Log.html#recoverSegment

an issue I stumbled over:
https://issues.apache.org/jira/browse/KAFKA-5829

let me know if you need more info :slight_smile:

hth,
michael

Thanks for the links, Michael. However, we’re not doing a full shutdown of the cluster; we only ever perform rolling restarts when updating configs or upgrading the cluster.

The whole point of a rolling restart is not to affect our producers and consumers, so stopping them before the restart is not feasible.

ok I see

in this case I would go with
https://docs.confluent.io/platform/current/kafka/post-deployment.html#rolling-restart
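
i.e. restart one broker at a time and wait until the cluster is healthy again before moving on. a rough sketch (the service name, bootstrap address and port are assumptions, adjust to your setup):

# on each broker, one at a time:
systemctl restart confluent-kafka
# before restarting the next broker, wait until there are no under-replicated partitions
kafka-topics --bootstrap-server localhost:9092 --describe --under-replicated-partitions

cp-ansible should already follow a similar pattern during its rolling restart.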

best,
michael