Running Kafka Streams on k8s

Conversation from Confluent Community slack:

Andreas Evers asked:

Hi folks! I’ve been wondering about ways to run Kafka Streams with big state stores on Kubernetes. The obvious choice would seem StatefulSets that can persist state store data to disk, but that comes with a lot of operational headaches. Using stateless pods with ephemeral state would most likely cause too high restoration durations on newly started pods. Does NFS make sense to keep using Deployments? Would you need some sort of assignment table or something to keep track of which NFS folders with specific partitions should be mounted to which pods? And what about perhaps persisting the state store somewhere else than in mounted drives? I’ve watched the Kafka Summit talk on Kafka Streams + k8s by @gwenshap and @mjsax (https://www.confluent.io/kafka-summit-sf18/deploying-kafka-streams-applications/), and was wondering about any opinions and experiences in the community, or learnings from Confluent.

Matthias J Sax:

Using stateless pods with ephemeral state would most likely cause too high restoration durations on newly started pods

Standby tasks could help here.

Does NFS make sense to keep using Deployments?

I would not recommend it. Network file system often call troubles.

And what about perhaps persisting the state store somewhere else than in mounted drives?

You would need to write custom code. It might work to some extend, but there are cases in which KS relies that state is local and those “features” would break.

I’ve watched the Kafka Summit talk on Kafka Streams + k8s by @gwenshap and @mjsax (https://www.confluent.io/kafka-summit-sf18/deploying-kafka-streams-applications/),

Hope it was helpful.

The obvious choice would seem StatefulSets that can persist state store data to disk, but that comes with a lot of operational headaches.

It’s the recommended way to deploy. Not sure what you mean by “operational headaches” (I have to admit, that I am not a Kubernetes expert, so maybe my question is naive…

Nick Telford:
I’d also like to hear about these “operational headaches”, as my team is about to embark on deploying Kafka Streams (and Kafka itself) to Kubernetes via StatefulSets.

Would you need some sort of assignment table or something to keep track of which NFS folders with specific partitions should be mounted to which pods?

This sounds like manually re-creating what Kubernetes does with StatefulSets…

Thanks for sharing that talk btw, I hadn’t seen it and it’s very relevant to my team!

Vijay Nadkarni:
This is something I’ve wondered about. Thank you for posting the video, @Andreas Evers. Checking it out.

@mjsax,

I would not recommend it. Network file system often call troubles.

High latency issues of NFS?

psolomin:
indeed ephemeral volumes will cause downtime on deploy / restart. how big - can be 2 minutes, can be 20, depending on state size, pods & k8s disk, network, cpu, kafka resources. what do you mean by “big”? roughly, 300GB state can be feasible to restore in 20 min

Andreas Evers :
That restoration time sounds pretty fast actually. I didn’t realize it could be that quick.

Operational issues with StatefulSets are often related to outages and how to recover from them. Kubernetes wasn’t actually intended for stateful workloads initially, support for those got added later on. A couple of headaches that come to mind:

  • Issues with single replicas can block the entire statefulset
  • Can become problematic when nodes get added or removed
  • Autoscaling is hard to do
  • Recovering from outages can take much longer than ideal

I’m not a Kubernetes expert though, so best to crosscheck these with other opinions.

Matthias J Sax:

High latency issues of NFS?

No. Locking issues. Kafka Streams relies of locks on the file system, but NFS does not provide proper locks leading to crashs.

Andreas Evers:
This is extremely insightful. Thank you all for sharing your experiences. It sounds like ephemeral is the way to go for us, until our state stores grow too big. We’ve got a low throughput, high retention use case, which allows some delay in our asynchronous processing. If anyone has more insights, always looking to hear more :slightly_smiling_face:

Stanislav Savulchik :

@ Andreas Evers

Issues with single replicas can block the entire statefulset

If something happens with a single pod in you StatefulSet the rebalance will redistribute tasks to other pods.

Can become problematic when nodes get added or removed

Actually the same rebalance process will take care of redistributing tasks of the affected pods. In order to deal with Pending pods you will have to delete their Persistent Volume Claims in order to create them on other k8s nodes.

Autoscaling is hard to do

In my experience scaling kafka streams StatefulSet up/down is not a problem.
I successfully use the following combo:

  1. StatefulSet
  2. Local Persistent Volumes for RocksDB
  3. Kafka Streams Static Membership (group.instance.id = ${HOSTNAME})
  4. Pod Scheduling Priority = high

Andreas Evers :
Thanks for sharing @ Stanislav Savulchik. What do you mean with “Local Persistent Volumes” exactly?

Stanislav Savulchik :

Stanislav Savulchik :
It is a partition of a locally attached SSD disk to a k8s node.

Andreas Evers :
Ah like that, thanks! Never used local before, always resorted to HostPath instead.