Custom Assignment for High-Volume, Multi-Tenant Data Pipeline

Here’s a great thread on Slack that I thought would be awesome to capture here. @mitchell-h and @AlexeiZenin discuss a custom assignment strategy based on consumer lag.


Scott Woods
Hey all. We are using Kafka for a high-volume, multi-tenant data pipeline. Tenant data is isolated by topic: each tenant has their own topics. As a result, the throughput per partition can vary significantly (despite our efforts to keep them uniform). This means that when we scale out our application, there can often be a heavily skewed distribution of work. This makes it difficult to scale our services predictably, and also leads to lower efficiency and higher costs. To combat this, we’ve started working on a custom assignment strategy that assigns partitions to consumers based on lag: it attempts to balance lag across consumers as evenly as possible, and thus in theory will minimize lag and improve the balance of resource utilization. Is this sort of assignment strategy (or anything similar based on some sort of throughput heuristic) something you all at Confluent (or others here in the community) have ever considered/tried? We’re frankly a bit surprised this doesn’t exist, and are somewhat worried we’re missing something and are going to run into a dead end somewhere.
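(Editor’s note: the lag-balancing idea Scott describes boils down to greedy bin balancing. Here is a minimal sketch of that heuristic; the partition names and lag values are illustrative, and a real implementation would live inside a custom Kafka partition assignor rather than a standalone function.)

```python
def assign_by_lag(partition_lags, consumers):
    """Greedily assign partitions so total lag per consumer is as even as possible.

    partition_lags: dict of partition id -> current consumer lag (messages behind)
    consumers: list of consumer member ids
    Returns a dict of consumer id -> list of assigned partitions.
    """
    assignment = {c: [] for c in consumers}
    load = {c: 0 for c in consumers}
    # Largest lags first, each to the currently least-loaded consumer
    # (the classic longest-processing-time greedy heuristic).
    for partition, lag in sorted(partition_lags.items(), key=lambda kv: -kv[1]):
        target = min(consumers, key=lambda c: load[c])
        assignment[target].append(partition)
        load[target] += lag
    return assignment

# Illustrative run: two consumers, four partitions with skewed lag.
plan = assign_by_lag({"t-0": 100, "t-1": 60, "t-2": 50, "t-3": 10}, ["c1", "c2"])
```

With the lags above, each consumer ends up owning 110 messages of total lag, which is the kind of even split the strategy is after.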

Mitch Henderson
Interesting question! I’ve been part of teams that have tried this. The first problem we hit was that “lag” doesn’t necessarily mean a higher workload or throughput. Depending on what the client is doing (for example, calling an external API), this outside influence could very well affect your balancing to such a degree that it turns out you’re hurting yourself more than helping.

Mitch Henderson
Also, kafka clients do a whole lot of work to NOT rebalance. This has changed a good bit with incremental rebalancing, but it still is a thing.

Alexei Zenin
Interesting problem; I had something similar be an issue when processing many topics while trying to provide a unified service for them. Never wrote anything more advanced than throwing more consumers at the job, though. Another approach would be to pause consuming from certain partitions to allow service time to go to other partitions. This would be much more efficient than doing regular group rebalances. (edited)
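(Editor’s note: the pause/resume approach Alexei mentions can be sketched as a selection policy run between `poll()` calls — pause the caught-up partitions so poll time goes to the laggards, with no group rebalance at all. The function name and the `max_active` knob are made up for illustration.)

```python
def select_active_partitions(partition_lags, max_active):
    """Keep only the max_active most-lagged partitions active; pause the rest.

    partition_lags: dict of partition id -> current lag (messages behind)
    Returns (active, paused) sets. In a real consumer loop these sets would
    feed the client's pause()/resume() calls between polls, which is far
    cheaper than triggering a consumer group rebalance.
    """
    by_lag = sorted(partition_lags, key=partition_lags.get, reverse=True)
    return set(by_lag[:max_active]), set(by_lag[max_active:])

# Illustrative run: focus poll time on the two most-lagged partitions.
active, paused = select_active_partitions({"a-0": 5, "b-0": 100, "c-0": 50}, 2)
```

Recomputing the sets periodically lets paused partitions resume once they build up lag, so nothing starves indefinitely.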

Mitch Henderson
Also, kafka CG coordinators don’t have a whole lot of information about what’s going on in the cluster.

Mitch Henderson
you almost have to call out to an external service, like Prometheus, to get information about throughput. So static assignment with a coordinator external to the Kafka consumer group makes some sense.
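(Editor’s note: a sketch of the external-metrics half of this idea. A Prometheus instant query returns a JSON vector of labeled samples, which an external assignor could fold into per-partition throughput. The `topic`/`partition` label names are assumptions about how the consumer metrics happen to be exported.)

```python
import json


def throughput_by_partition(prom_response_json, topic_label="topic",
                            partition_label="partition"):
    """Parse a Prometheus instant-query (vector) response into a map of
    (topic, partition) -> throughput, for use by an external assignor.

    Per the Prometheus HTTP API, each sample in data.result carries a
    "metric" dict of labels and a [timestamp, "value"] pair.
    """
    payload = json.loads(prom_response_json)
    throughput = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        key = (labels[topic_label], int(labels[partition_label]))
        throughput[key] = float(sample["value"][1])
    return throughput
```

An external coordinator could combine this with the lag-based heuristic, then push out a static assignment.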

Scott Woods

“lag” doesn’t necessarily mean a higher work load or throughput

Yeah, this is a good point. We went this route because it was the best proxy we had for throughput due to the limitations you mentioned above, and figured it would get us closer than round robin, at least. What we’ve noticed in our deployments is that at steady state, higher throughput topics typically have higher lag than lower ones. It’s also beneficial in that another one of our goals is to remain “caught up” (possibly even more important than workload balance), which is something I didn’t mention in the original comment.

Mitch Henderson
a simpler solution might be to over-partition and over-provision consumers.

Scott Woods

Also, kafka clients do a whole lot of work to NOT rebalance

We have cluster autoscaling that will force periodic rebalances anyway, and our thought was that we wouldn’t really need to trigger any additional rebalances beyond these.

a simpler solution might be to over-partition and over-provision consumers.

Yeah, we have sort of already taken this approach to an extent. But we’re operating at a scale where partition proliferation can be problematic and force us to upsize our Kafka clusters to support it, which has cost implications for us.

Mitch Henderson
you mentioned in #kafka-streams that each tenant is fairly standalone. I HIGHLY suggest that if partition limits are being hit, you start looking at moving to a multi-cluster solution.

Mitch Henderson
so that you’re better isolating clusters

Scott Woods
You mean a cluster per tenant? We have already sub-divided tenants into groups that get their own deployments of the apps and brokers.

Scott Woods
But, we’re hoping to make each of those groups as large as possible to limit the cost and operational overhead

Mitch Henderson
No. Something like bucket each cluster by SLA

Mitch Henderson
I understand the motivations, but oftentimes running a single large cluster is much more overhead than 3 smaller ones.

Scott Woods
Yeah gotcha. Our SLAs are currently the same across the board, but we do have some high-value customers that get their own cluster or are part of a smaller group.

but oftentimes running a single large cluster is much more overhead than 3 smaller ones

That’s interesting, but makes sense. For us, it’s more about the overhead of managing 1 vs. n deployments of the application rather than the cluster itself.

Mitch Henderson
You’d only deploy the application against one cluster. Each cluster would hold a subset of the data universe.

Mitch Henderson
if you needed an entire view you would have that in only the clusters that needed it

Scott Woods
Right, I think we’re saying the same thing. Some set of tenants has a single deployment of the application that runs on a single Kafka cluster. Another set of tenants has another app instance running on a different cluster. This is the state we’re in today.

Mitch Henderson
Yes

Mitch Henderson
I’ll just quote Ryanne.

Rebooting a really large Kafka cluster can take a long time. So long, that it's actually faster to stand up a completely new cluster and mirror everything over to it. I dunno what to do with this information.

— Ryanne Dolan (@DolanRyanne) July 9, 2021

Mitch Henderson
same thing applies for almost all operational tasks with clusters >10 nodes

Scott Woods
Heh, funny you mention that. Cluster failover is typically our go-to approach for any of these tasks that require a reboot.
