Persistent storage for state stores on AWS ECS

Hello!

We are running a couple of Kafka Streams processors on top of AWS ECS, and as some of them deal with a lot of state (tens of GBs), we are keen to optimise restoration times in case of restarts or rebalancing. To make that happen, we provisioned an Elastic File System (EFS) volume for the state directories. Unfortunately, right after applying the change we started receiving Caused by: java.io.IOException: Stale file handle errors.
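
For context, a minimal sketch of the kind of configuration involved - the application id, broker address and EFS mount path below are placeholders, not our real values:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StateDirConfigSketch {
    public static Properties baseConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stateful-processor");  // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");       // placeholder
        // Point the RocksDB state directory at the EFS mount so the stores
        // survive container restarts instead of being rebuilt from changelogs.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/mnt/efs/kafka-streams");   // placeholder mount path
        return props;
    }
}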

Full trace

org.apache.kafka.streams.errors.StreamsException: stream-thread [XXX.internal-StreamThread-1] task [0_0] Fatal error while trying to lock the state directory for task 0_0
	at org.apache.kafka.streams.processor.internals.StateManagerUtil.registerStateStores(StateManagerUtil.java:95)
	at org.apache.kafka.streams.processor.internals.StreamTask.initializeIfNeeded(StreamTask.java:209)
	at org.apache.kafka.streams.processor.internals.TaskManager.tryToCompleteRestoration(TaskManager.java:473)
	at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:728)
	at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:625)
	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:553)
	at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:512)
	Caused by: java.io.IOException: Stale file handle
	at java.base/sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
	at java.base/sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:96)
	at java.base/sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1161)
	at java.base/java.nio.channels.FileChannel.tryLock(FileChannel.java:1165)
	at org.apache.kafka.streams.processor.internals.StateDirectory.tryLock(StateDirectory.java:446)
	at org.apache.kafka.streams.processor.internals.StateDirectory.lock(StateDirectory.java:213)
	at org.apache.kafka.streams.processor.internals.StateManagerUtil.registerStateStores(StateManagerUtil.java:90)
	... 6 common frames omitted

This happens regardless of the EFS performanceMode - we have tested both general purpose and maxIO. It seems there is an issue obtaining the lock during rebalancing, and just after that log event we can observe the whole state being wiped and pulled again from the changelog topics. As we would ultimately like to enable autoscaling for our services, rebalances taking 10 minutes or more are definitely something we want to avoid.

Does anyone have experience or a battle-tested pattern for deploying Kafka Streams services with persistent storage attached on ECS? I presume using EBS might be an option here, but I haven't tested it yet. In another project I used Kubernetes and persistent volumes, which worked like a charm, and I would like to mimic the same experience on ECS.


Hey Daniel,

Sorry to say, but this looks like an issue with EFS, not Kafka Streams. Are you hitting this consistently, or just from time to time?

The stale handle does seem like it should be recoverable, and we could just refresh and retry. But we're actually in the middle of ripping out all of these task directory locks anyway, which should render this irrelevant. We'll still retain a filesystem lock, but it will be acquired once and held for the lifetime of the Streams application, as opposed to the task-level locks, which are frequently locked and unlocked, so you shouldn't be hit by these IOExceptions nearly as much. You can keep track of this over at KAFKA-12288.
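
To illustrate the difference (this is only an illustrative sketch, not the actual StateDirectory code): today each task-level lock is re-acquired via FileChannel.tryLock on every rebalance, which is exactly where the stale handle surfaces on an NFS-backed mount like EFS. After the change, the state directory would be locked once at startup and held until shutdown, roughly along these lines:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: a single process-level lock acquired once and held for
// the lifetime of the application, instead of per-task lock/unlock cycles.
public class ProcessLevelLockSketch implements AutoCloseable {
    private final FileChannel channel;
    private final FileLock lock;

    ProcessLevelLockSketch(Path stateDir) throws IOException {
        channel = FileChannel.open(stateDir.resolve(".lock"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        lock = channel.tryLock();   // acquired once at startup
        if (lock == null) {
            throw new IOException("state directory is already locked by another process");
        }
    }

    @Override
    public void close() throws IOException {
        lock.release();             // released only on shutdown
        channel.close();
    }
}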

Just after that log event we can observe the whole state being wiped and pulled again from the changelog topics.

I take it your application is using EOS? Unfortunately, there's nothing you can do at the moment if your application has an unclean shutdown and needs to wipe and restore the state stores from scratch. I did write up a quick proposal for improving this situation by falling back on standby tasks over on KAFKA-12486.
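
Not a fix for the EOS wipe itself, but since standby tasks came up: configuring standby replicas is the usual way to keep a warm copy of the stores on another instance, so a fail-over doesn't have to rebuild tens of GB from the changelog topics. A minimal sketch:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StandbyReplicasSketch {
    public static Properties withStandbys(Properties props) {
        // One standby replica maintains a shadow copy of each state store on
        // another instance, shortening recovery after an instance is lost.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        return props;
    }
}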

Neither of these improvements is available in any released version at this point, but I hope to get them in sometime soon. We may be able to get KAFKA-12288 into 2.8.0 at least, which is currently nearing the end of its release process. It should also be available in 2.7.1 when that comes out. If you're still running into this problem, I'd recommend trying one of the versions with this fix when you get the chance.


Hi Sophie!
Thank you for your thorough response. Yes, we are hitting it consistently, and it's nice to hear that the locking is about to be revisited.

Out of curiosity - what is the reason for this one?

… you actually can’t run multiple instances of a Kafka Streams application on the same physical state directory

I assume that with the most recent changes to rebalancing this is indeed impossible, as a newly assigned consumer catches up before it serves the workload (and thus would write to the same directory), but what about stop-the-world rebalancing? As no stream thread should be writing to the state store once partitions are revoked, in theory it could work.