In our Kafka Streams + Kubernetes architecture we often run into this problem: the StickyAssignor spreads partitions evenly, but across all topics, not per topic. Since our topics have drastically different data volumes, this effectively disables half of our pods at times, and I've never understood why the StickyAssignor was built this way, or why the behavior can't be overridden.
So, let's say we have a Streams app that needs to merge two topics: one fairly low volume (call it A), and one carrying the "main" business data at a much higher volume, >100x more (call it B). Give them 6 and 9 partitions respectively, and run 3 pods p0, p1, p2. Topic A is materialized as an in-memory global state store, which is then queried while processing the records coming in on topic B.
One would imagine that you would get an assignment like such:
p0: [A0, A1, B0, B1, B2]
p1: [A2, A3, B3, B4, B5]
p2: [A4, A5, B6, B7, B8]
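For illustration, this per-topic layout can be reproduced with a tiny assignment sketch (plain Python; `assign_per_topic` is a hypothetical helper, not part of any Kafka API): each topic's partitions are split into contiguous, near-equal ranges, one range per pod, independently of the other topics.

```python
def assign_per_topic(pods, topics):
    """Split each topic's partitions into contiguous near-equal ranges,
    one range per pod -- so every pod gets its fair share of *every*
    topic, regardless of the other topics' partition counts."""
    assignment = {pod: [] for pod in pods}
    k = len(pods)
    for topic, n in topics.items():
        for i, pod in enumerate(pods):
            lo, hi = i * n // k, (i + 1) * n // k
            assignment[pod] += [f"{topic}{p}" for p in range(lo, hi)]
    return assignment

pods = ["p0", "p1", "p2"]
print(assign_per_topic(pods, {"A": 6, "B": 9}))
```

Every pod ends up with 2 of A's 6 partitions and 3 of B's 9, so the heavy topic is shared by all pods no matter how the light topic is laid out.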
And we do get that, initially. But after a couple of rounds of pod migration and rebalancing (this happens occasionally, since we run on AWS spot instances), we sometimes end up with something more like this:
p0: [A0, A1, A2, A3, A4]
p1: [A5, B0, B1, B2, B3]
p2: [B4, B5, B6, B7, B8]
which is "balanced" according to the StickyAssignor, but not really, because all the load is on topic B: p0 is now only building the state store and distributing its changelog to the others (basically no work at all), while p1 and p2 do all the heavy lifting. That leaves us at ~66.6% utilization, or 50% if both topics had the same number of partitions.
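A quick sanity check on that skewed layout (plain Python, just counting; topic B's nine partitions are B0 through B8): every pod holds the same number of partitions, so the global count looks balanced, yet only 2 of the 3 pods hold any B partitions at all.

```python
# The skewed assignment after a few rebalances: globally "balanced"
# (5 partitions per pod), but topic B sits on only 2 of 3 pods.
skewed = {
    "p0": ["A0", "A1", "A2", "A3", "A4"],
    "p1": ["A5", "B0", "B1", "B2", "B3"],
    "p2": ["B4", "B5", "B6", "B7", "B8"],
}

# Per-pod partition counts: identical, hence "balanced" by global count.
sizes = {pod: len(parts) for pod, parts in skewed.items()}

# Pods that hold at least one B partition, i.e. pods doing real work
# when B carries essentially all the load.
pods_with_b = [pod for pod, parts in skewed.items()
               if any(p.startswith("B") for p in parts)]

print(sizes)
print(len(pods_with_b) / len(skewed))  # fraction of pods doing real work
```

The count-based metric is blind to which topic a partition belongs to, which is exactly the gap this question is about.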
This also happens with repartitioning prior to the actual processing. If we get a lot of data at once (e.g. after an offset reset, ~10M records), the original input topic is drained in about a minute, but the repartition topic (where all the work happens) takes around an hour to process, because only half the pods are assigned to it. It could have been 30 minutes if the StickyAssignor had done its thing per topic instead of globally.
So, the main question: why does the StickyAssignor balance partitions globally and not per topic? Does Kafka just assume all topics have equal load, or that we would configure the partition count to reflect the data volume? Partition count may have no bearing on how much load a consumer experiences: one microservice might do a lot of work on topic A and another on topic B. It makes no sense to me, and I find it incredibly frustrating. I can't split the topics across different consumer groups either, because the whole thing has to be one topology. Does anyone have a clue why the StickyAssignor is like this, or whether we are making some fundamental mistake? I really don't see how a couple of GB of changelog restoration once per week should outweigh having a balanced assignment on a topic that sees 5M records per day.
question from:
https://stackoverflow.com/questions/65940255/why-does-kafka-stream-sticky-assignor-not-work-per-topic