Could not find offset index file

Hi,

Often when we run a rolling restart of our clusters, we see the following error in the log

Could not find offset index file corresponding to log file /var/lib/kafka/data/mytopic-0/0000000000000123.log, recovering segment and rebuilding index files…

I have two questions regarding this:

  • What is causing the .index file not to be created in the first place? Meaning, how can we fix this?
  • How is this affecting the cluster and its performance?

Clusters are on-prem, running on VMs using RHEL8.

hey @chromy96

welcome to the forum :slight_smile:

could you provide some details about your clusters?
e.g. on-prem or cloud? Kubernetes or Docker?

it may be related to an unclean shutdown, e.g. if systemd does not wait for a graceful shutdown of the brokers.

best,
michael

Hi @mmuehlbeyer,

Thanks for the quick response. I’ve updated the original post with more details. I’ll check the system logs to see if there’s a forced shutdown.


hi @chromy96

thanks for the update.
in this case systemd might be the root cause.

a quick thing to test is to increase

TimeoutStopSec

in your systemd config file to, let’s say, 300 or 600 seconds.
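
a minimal sketch of a systemd drop-in for that (the unit name and path are assumptions, adjust to your setup):

# /etc/systemd/system/confluent-kafka.service.d/override.conf
# give the broker up to 10 minutes to finish its controlled shutdown
[Service]
TimeoutStopSec=600

after adding it, run systemctl daemon-reload so the change is picked up before the next restart.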

best,
michael

Based on the logs, it doesn’t look like this is caused by an eager systemd. In Kafka logs I see that the broker is stopped cleanly before it is started again (and before we get the “Could not find … file” error):

An example:

[2022-05-25 14:44:20,954] INFO [KafkaServer id=3] shut down completed (kafka.server.KafkaServer)
...
[2022-05-25 14:44:26,365] ERROR [LogLoader partition=mytopic-0, dir=/var/lib/kafka/data] Could not find offset index file corresponding to log file /var/lib/kafka/data/mytopic-0/00000000000000000757.log, recovering segment and rebuilding index files... (kafka.log.LogLoader$)

ok I see
the broker start is successful though, I guess?

and could you check the log directory after stopping the broker
to see which files exist?
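
something like this should do (the partition path is taken from your error above, the service name is an assumption):

# stop the broker, then list what is left in the affected partition directory
systemctl stop confluent-kafka
ls -l /var/lib/kafka/data/mytopic-0/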

best,
michael

Yes, brokers start successfully and data is available. The only things we see are errors in the logs.

Here’s an example of one of the folders reporting errors.
First, after the broker is stopped but before start:

├── mytopic-0
│   ├── 00000000000000059749.snapshot
│   ├── 00000000000000060394.log
│   ├── 00000000000000060394.snapshot
│   ├── 00000000000000060394.timeindex
│   ├── leader-epoch-checkpoint
│   └── partition.metadata

After start:

├── mytopic-0
│   ├── 00000000000000060394.index
│   ├── 00000000000000060394.log
│   ├── 00000000000000060394.timeindex
│   ├── leader-epoch-checkpoint
│   └── partition.metadata

ok thx
and which file was the broker complaining about now?

In the last example, it would be the index file for mytopic-0/00000000000000060394.log; see the error below:

Could not find offset index file corresponding to log file /var/lib/kafka/data/mytopic-0/00000000000000060394.log, recovering segment and rebuilding index files...

As seen in the “before start” listing above, the file 00000000000000060394.index is missing after the broker is stopped. It is recreated by the broker after it starts again.

hmm ok
the behaviour is described in
Log · The Internals of Apache Kafka → Loading Log Files

though I’m not 100% sure whether there is something going on during shutdown.
how did you stop the cluster?

with

kafka-server-stop.sh ?

could you check whether the following parameter is in place?

controlled.shutdown.enable=true
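
to verify, you could grep the broker properties file, for example (the path is an assumption based on the usual Confluent Platform layout):

# check whether the parameter is set explicitly in the broker config
grep controlled.shutdown /etc/kafka/server.properties

if nothing is set there, the broker falls back to its built-in default.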

best,
michael

We’re using cp-ansible to handle provisioning and rolling restarts. From what I see in the code, it’s using systemd to restart the brokers.

I also tried stopping them using systemctl stop confluent-kafka, which causes the same error on startup.

The broker config does not have the controlled.shutdown.enable=true parameter set by default in cp-ansible. I can try setting it and will report back.


Checked now:
controlled.shutdown.enable is already true by default, so setting it explicitly didn’t change anything. The problem still persists.

ok I see

tried to reproduce it locally but wasn’t able to.

out of curiosity which version are you using? latest?
could you share more details of your logfiles?

and are any producers or consumers connected during the restart?

and another question: how many brokers exist in your cluster? 3?

best,
michael

We’re using Confluent Platform 7.1.1, but I saw this log back when we used 6.1 and 6.2 as well. This is happening in all our clusters (dev and prod), where we have producers and consumers connected during restart.

We have between 3 and 6 brokers in our clusters.

Regarding logs, I can share them, but I’m not sure which part is interesting to you.

[2022-06-02 09:54:34,919] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2022-06-02 09:54:34,926] INFO [KafkaServer id=1] shutting down (kafka.server.KafkaServer)
[2022-06-02 09:54:34,927] INFO [KafkaServer id=1] Starting controlled shutdown (kafka.server.KafkaServer)
[2022-06-02 09:54:35,063] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions HashSet(mytopic-6,__consumer_offsets-6) (kafka.server.ReplicaFetcherManager)
...
[2022-06-02 09:54:36,046] INFO [Controller id=1, targetBrokerId=1] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient)
[2022-06-02 09:54:36,046] WARN [Controller id=1, targetBrokerId=1] Connection to node 1 (brd81.example.com/1.2.3.4:9091) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2022-06-02 09:54:36,046] INFO [Controller id=1, targetBrokerId=1] Client requested connection close from node 1 (org.apache.kafka.clients.NetworkClient)
...
[2022-06-02 09:54:36,090] INFO [ProducerStateManager partition=mytopic-2] Wrote producer snapshot at offset 34699 with 0 producer ids in 0 ms. (kafka.log.ProducerStateManager)
...
[2022-06-02 09:54:39,892] INFO Cruise Control metrics reporter exited. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
[2022-06-02 09:54:39,894] INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-06-02 09:54:39,894] INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics)
[2022-06-02 09:54:39,894] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-06-02 09:54:39,894] INFO App info kafka.producer for CruiseControlMetricsReporter unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-06-02 09:54:39,894] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-06-02 09:54:39,895] INFO Broker and topic stats closed (kafka.server.BrokerTopicStats)
[2022-06-02 09:54:39,895] INFO App info kafka.server for 1 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-06-02 09:54:39,895] INFO [KafkaServer id=1] shut down completed (kafka.server.KafkaServer)

...

[2022-06-02 09:54:40,788] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
[2022-06-02 09:54:41,118] INFO Registered signal handlers for TERM, INT, HUP (org.apache.kafka.common.utils.LoggingSignalHandler)
[2022-06-02 09:54:41,121] INFO starting (kafka.server.KafkaServer)
...
[2022-06-02 09:54:41,702] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient)
[2022-06-02 09:54:41,766] INFO [feature-zk-node-event-process-thread]: Starting (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread)
...
[2022-06-02 09:54:41,978] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2022-06-02 09:54:41,979] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2022-06-02 09:54:41,980] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2022-06-02 09:54:41,981] INFO [ThrottledChannelReaper-ControllerMutation]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2022-06-02 09:54:42,038] INFO Loading logs from log dirs ArraySeq(/var/lib/kafka/data) (kafka.log.LogManager)
[2022-06-02 09:54:42,041] INFO Skipping recovery for all logs in /var/lib/kafka/data since clean shutdown file was found (kafka.log.LogManager)
...
[2022-06-02 09:54:42,267] ERROR [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Could not find offset index file corresponding to log file /var/lib/kafka/data/mytopic-6/00000000000000033668.log, recovering segment and rebuilding index files... (kafka.log.LogLoader$)
[2022-06-02 09:54:42,268] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Loading producer state till offset 33668 with message format version 2 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,268] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Reloading from producer snapshot and rebuilding producer state from offset 33668 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,268] INFO [ProducerStateManager partition=mytopic-6] Loading producer state from snapshot file 'SnapshotFile(/var/lib/kafka/data/mytopic-6/00000000000000033668.snapshot,33668)' (kafka.log.ProducerStateManager)
[2022-06-02 09:54:42,268] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Producer state recovery took 0ms for snapshot load and 0ms for segment recovery from offset 33668 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,273] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Loading producer state till offset 33668 with message format version 2 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,273] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Reloading from producer snapshot and rebuilding producer state from offset 33668 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,273] INFO Deleted producer state snapshot /var/lib/kafka/data/mytopic-6/00000000000000033668.snapshot (kafka.log.SnapshotFile)
[2022-06-02 09:54:42,273] INFO [LogLoader partition=mytopic-6, dir=/var/lib/kafka/data] Producer state recovery took 0ms for snapshot load and 0ms for segment recovery from offset 33668 (kafka.log.UnifiedLog$)
[2022-06-02 09:54:42,274] INFO Completed load of Log(dir=/var/lib/kafka/data/mytopic-6, topicId=..., topic=mytopic, partition=6, highWatermark=33668, lastStableOffset=33668, logStartOffset=33668, logEndOffset=33668) with 1 segments in 7ms (44/127 loaded in /var/lib/kafka/data) (kafka.log.LogManager)

hi @chromy96

ok I see.

so afaik the issue is related to the fact that consumers/producers are connected during restart.

the *.snapshot files contain snapshots of the producer state (the sequence IDs used to avoid duplicate records).

and if you stop the broker, they have to be cleaned up during the restart.

i think the only way to overcome this is to drop all producer/consumer connections before restarting the cluster

hth,
michael

Is there a recommended way to do this? My (obviously wrong :blush: ) assumption was that the brokers would handle that themselves.
Besides the standard producers, we also have e.g. Cruise Control’s metrics reporter, which runs as a broker plugin.

One more question: Since I don’t quite understand the connection between snapshot files and index files, could you point me to the documentation explaining this?

basically stop all producers and consumers and then shut down the brokers

see

https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-log-Log.html#recoverSegment

an issue I stumbled over:
https://issues.apache.org/jira/browse/KAFKA-5829

let me know if you need more info :slight_smile:

hth,
michael

Thanks for the links, Michael. However, we’re not doing a full shutdown of the cluster; we only ever perform rolling restarts when updating configs or upgrading the cluster.

The whole point of a rolling restart is not to affect our producers and consumers, so stopping them before the restart is not feasible.

ok I see

in this case I would go with
https://docs.confluent.io/platform/current/kafka/post-deployment.html#rolling-restart
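
i.e. restart one broker at a time and wait until the cluster is healthy again before moving on. a rough sketch (the service name, bootstrap address and port are assumptions, adjust to your setup):

# on each broker, one at a time:
systemctl restart confluent-kafka
# before restarting the next broker, wait until there are no under-replicated partitions
kafka-topics --bootstrap-server localhost:9092 --describe --under-replicated-partitions

cp-ansible should already follow a similar pattern during its rolling restart.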

best,
michael