Hello!
We are running a couple of Kafka Streams processors on top of AWS ECS, and since some of them handle a lot of state (tens of GBs), we are keen to optimise restoration times after restarts/rebalancing. To that end we provisioned an Elastic File System (EFS) volume for the state directories. Unfortunately, right after applying the change we started receiving "Caused by: java.io.IOException: Stale file handle" errors.
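For context, we essentially just point the Kafka Streams state directory at the EFS mount. A minimal sketch of the relevant configuration, with illustrative application id, broker address and mount path:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StateDirConfig {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "xxx-processor");  // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");  // illustrative
        // RocksDB state stores live under the EFS mount inside the ECS task
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/mnt/efs/kafka-streams"); // illustrative mount path
        return props;
    }
}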
Full trace:
org.apache.kafka.streams.errors.StreamsException: stream-thread [XXX.internal-StreamThread-1] task [0_0] Fatal error while trying to lock the state directory for task 0_0
at org.apache.kafka.streams.processor.internals.StateManagerUtil.registerStateStores(StateManagerUtil.java:95)
at org.apache.kafka.streams.processor.internals.StreamTask.initializeIfNeeded(StreamTask.java:209)
at org.apache.kafka.streams.processor.internals.TaskManager.tryToCompleteRestoration(TaskManager.java:473)
at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:728)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:625)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:553)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:512)
Caused by: java.io.IOException: Stale file handle
at java.base/sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
at java.base/sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:96)
at java.base/sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1161)
at java.base/java.nio.channels.FileChannel.tryLock(FileChannel.java:1165)
at org.apache.kafka.streams.processor.internals.StateDirectory.tryLock(StateDirectory.java:446)
at org.apache.kafka.streams.processor.internals.StateDirectory.lock(StateDirectory.java:213)
at org.apache.kafka.streams.processor.internals.StateManagerUtil.registerStateStores(StateManagerUtil.java:90)
... 6 common frames omitted
This happens regardless of the EFS performanceMode; we have tested both General Purpose and Max I/O. It looks like Kafka Streams fails to obtain the state directory lock during rebalancing. Right after that log event we can observe the whole local state being wiped and pulled again from the changelog topics. Since we would ultimately like to enable autoscaling for these services, a rebalance taking 10 minutes or more is definitely something we want to avoid.
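As far as we can tell from the trace, Kafka Streams guards each task's state directory with a FileChannel.tryLock() on a lock file, which is exactly the kind of operation that misbehaves on EFS/NFS when the underlying inode disappears under a concurrent instance. A minimal standalone sketch of that mechanism, assuming an illustrative lock-file path:

import java.io.File;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.StandardOpenOption;

public class EfsLockCheck {
    public static void main(String[] args) throws Exception {
        // Roughly what StateDirectory does: an exclusive lock on a lock file
        // inside the task's state directory. Path is illustrative.
        File lockFile = new File("/mnt/efs/kafka-streams/0_0/.lock");
        lockFile.getParentFile().mkdirs();
        try (FileChannel channel = FileChannel.open(lockFile.toPath(),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // Returns null if another process holds the lock; on NFS-backed
            // storage this call can instead throw "IOException: Stale file handle"
            // when the file was removed/recreated by another instance.
            FileLock lock = channel.tryLock();
            System.out.println(lock != null ? "locked" : "held elsewhere");
            if (lock != null) {
                lock.release();
            }
        }
    }
}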
Does anyone have experience with, or a battle-tested pattern for, deploying Kafka Streams services with persistent storage attached on ECS? I presume EBS might be an option here, but I haven't tested it yet. In another project I used Kubernetes with persistent volumes, which worked like a charm, and I would like to replicate that experience on ECS.