Performance issues with Confluent Kafka vs. older cluster

Hi,

While trying to identify the source of our current production issues, I ran kafka-producer-perf-test to see if I could find the root cause.

The test results are abysmal on the new cluster, and I wonder what the problem could be.

cp-kafka 7.8.0-83

(Abbreviated)
kafka-producer-perf-test --topic perf-test --throughput -1 --num-records 3000000 --record-size 1024 --producer-props acks=all bootstrap.servers=xyz

First run:
24465 records sent, 4883.2 records/sec (4.77 MB/sec), 7136.4 ms avg latency, 38990.0 ms max latency.
20850 records sent, 4150.9 records/sec (4.05 MB/sec), 1471.3 ms avg latency, 39879.0 ms max latency.
44925 records sent, 8920.8 records/sec (8.71 MB/sec), 11301.4 ms avg latency, 44340.0 ms max latency.
[…]
26100 records sent, 5195.1 records/sec (5.07 MB/sec), 5650.0 ms avg latency, 34524.0 ms max latency.
14012 records sent, 2776.9 records/sec (2.71 MB/sec), 1974.8 ms avg latency, 25993.0 ms max latency.
1500 records sent, 179.4 records/sec (0.18 MB/sec), 22930.8 ms avg latency, 35995.0 ms max latency.
[2025-02-10 08:08:37,224] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 39883 on topic-partition perf-test-9, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT. Error Message: Disconnected from node 5 due to timeout (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 08:08:37,227] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 39883 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT. Error Message: Disconnected from node 5 due to timeout (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 08:08:37,228] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 39883 on topic-partition perf-test-15, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT. Error Message: Disconnected from node 5 due to timeout (org.apache.kafka.clients.producer.internals.Sender)
[…]
[2025-02-10 08:08:37,230] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 41022 on topic-partition perf-test-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT. Error Message: Disconnected from node 5 due to timeout (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 08:08:37,230] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 41022 on topic-partition perf-test-12, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT. Error Message: Disconnected from node 5 due to timeout (org.apache.kafka.clients.producer.internals.Sender)
3840 records sent, 590.9 records/sec (0.58 MB/sec), 26160.8 ms avg latency, 35987.0 ms max latency.
6450 records sent, 1290.0 records/sec (1.26 MB/sec), 44149.3 ms avg latency, 46324.0 ms max latency.
1800 records sent, 360.0 records/sec (0.35 MB/sec), 28511.1 ms avg latency, 30733.0 ms max latency.
1695 records sent, 338.9 records/sec (0.33 MB/sec), 41356.0 ms avg latency, 55843.0 ms max latency.
450 records sent, 90.0 records/sec (0.09 MB/sec), 35325.4 ms avg latency, 40219.0 ms max latency.
6750 records sent, 710.5 records/sec (0.69 MB/sec), 54548.6 ms avg latency, 65638.0 ms max latency.
3000000 records sent, 4289.807988 records/sec (4.19 MB/sec), 6925.13 ms avg latency, 90370.00 ms max latency, 1103 ms 50th, 45097 ms 95th, 65680 ms 99th, 86804 ms 99.9th.

Second run:
11040 records sent, 1051.8 records/sec (1.03 MB/sec), 59511.6 ms avg latency, 62459.0 ms max latency.
6720 records sent, 639.8 records/sec (0.62 MB/sec), 48488.2 ms avg latency, 52053.0 ms max latency.
240 records sent, 24.0 records/sec (0.02 MB/sec), 44462.7 ms avg latency, 51950.0 ms max latency.
2640 records sent, 251.4 records/sec (0.25 MB/sec), 52263.8 ms avg latency, 62129.0 ms max latency.
1920 records sent, 191.8 records/sec (0.19 MB/sec), 62162.5 ms avg latency, 71824.0 ms max latency.
11280 records sent, 1026.7 records/sec (1.00 MB/sec), 64941.7 ms avg latency, 72138.0 ms max latency.
8400 records sent, 800.0 records/sec (0.78 MB/sec), 61670.5 ms avg latency, 62434.0 ms max latency.
4800 records sent, 457.1 records/sec (0.45 MB/sec), 62630.5 ms avg latency, 72458.0 ms max latency.
720 records sent, 142.7 records/sec (0.14 MB/sec), 72430.5 ms avg latency, 77435.0 ms max latency.
4800 records sent, 482.1 records/sec (0.47 MB/sec), 58416.1 ms avg latency, 77536.0 ms max latency.
[2025-02-10 09:02:35,733] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 877 on topic-partition perf-test-4, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,736] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-4 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,737] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 877 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[…]
[2025-02-10 09:02:35,740] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-12 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,740] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 877 on topic-partition perf-test-6, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,740] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-6 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-5, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-9, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-7, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-15, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-12, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 878 on topic-partition perf-test-4, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,778] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-4 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,778] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 878 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[…]
[2025-02-10 09:02:35,778] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-12 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,778] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 878 on topic-partition perf-test-6, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,778] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-6 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,780] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,781] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-5, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,781] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-9, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,781] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-7, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,781] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-15, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,781] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-12, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[…]
[2025-02-10 09:02:35,791] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 889 on topic-partition perf-test-6, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,793] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 879 on topic-partition perf-test-4, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,793] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-4 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,793] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 879 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[…]
[2025-02-10 09:02:35,793] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-13 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,793] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 879 on topic-partition perf-test-5, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[…]
59610 records sent, 11829.7 records/sec (11.55 MB/sec), 18874.0 ms avg latency, 57004.0 ms max latency.
32325 records sent, 6448.2 records/sec (6.30 MB/sec), 2746.6 ms avg latency, 6770.0 ms max latency.
28605 records sent, 5689.1 records/sec (5.56 MB/sec), 2810.9 ms avg latency, 11445.0 ms max latency.
27450 records sent, 5486.7 records/sec (5.36 MB/sec), 1875.1 ms avg latency, 16713.0 ms max latency.
21555 records sent, 2905.0 records/sec (2.84 MB/sec), 5430.8 ms avg latency, 25933.0 ms max latency.
6705 records sent, 1218.9 records/sec (1.19 MB/sec), 16136.1 ms avg latency, 30170.0 ms max latency.
21450 records sent, 4265.3 records/sec (4.17 MB/sec), 3751.1 ms avg latency, 33900.0 ms max latency.
24705 records sent, 4512.3 records/sec (4.41 MB/sec), 5282.0 ms avg latency, 33946.0 ms max latency.

This is the old cluster (recreated for this test):

kafka_2.13-3.7.0

/products/kafka_2.13-3.7.0/bin # ./kafka-producer-perf-test.sh --topic perf-test --throughput -1 --num-records 3000000 --record-size 1024 --producer-props acks=all bootstrap.servers=simal80:19092,simal81:19092,simal82:19092
211763 records sent, 42352.6 records/sec (41.36 MB/sec), 603.2 ms avg latency, 1104.0 ms max latency.
388785 records sent, 77757.0 records/sec (75.93 MB/sec), 399.8 ms avg latency, 698.0 ms max latency.
449550 records sent, 89892.0 records/sec (87.79 MB/sec), 342.4 ms avg latency, 578.0 ms max latency.
481155 records sent, 96211.8 records/sec (93.96 MB/sec), 319.2 ms avg latency, 590.0 ms max latency.
441480 records sent, 88296.0 records/sec (86.23 MB/sec), 345.7 ms avg latency, 792.0 ms max latency.
449085 records sent, 89799.0 records/sec (87.69 MB/sec), 340.8 ms avg latency, 688.0 ms max latency.
435885 records sent, 87177.0 records/sec (85.13 MB/sec), 351.6 ms avg latency, 1035.0 ms max latency.

The old version has about 10 to 15 times the throughput; cp-kafka eventually errors out with timeouts or OUT_OF_ORDER_SEQUENCE_NUMBER errors, which is basically exactly what we see in production.
I reran the old cluster test with a significantly larger sample size to make sure the timeouts are not an external network issue that was merely masked by the fast test completion; it performed flawlessly and consistently.
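
As far as I understand, the "2147483646 attempts left" in the warnings is just the producer default of retries=Integer.MAX_VALUE; the retry loop is actually bounded by delivery.timeout.ms, which defaults to 120000 ms, so a record can sit in retries for up to two minutes before the producer gives up. To make failures surface faster while debugging, the timeouts can be tightened explicitly; the values below are only an example, not a recommendation:

kafka-producer-perf-test --topic perf-test --throughput -1 --num-records 3000000 --record-size 1024 --producer-props acks=all request.timeout.ms=15000 delivery.timeout.ms=30000 bootstrap.servers=xyz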

The question now is: why does this happen with the Confluent Kafka cluster?
I am using the exact same hardware here; both Kafka clusters run in parallel on different ports on the same 3 boxes.

Any idea what the cause could be here?
Thanks

Hi @Rand

Hmm, looks suspicious.
The config is the same for both clusters, I assume?

Are any metrics available for the cluster and nodes?

best,
michael

Hi,

Yes, the topics are application-managed; otherwise I am just using the defaults.
I don't have metrics yet; I haven't gotten around to setting up the JMX exporter. That's the next step after migrating production back to the old cluster.
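
For when I get to the exporter: the usual pattern, as far as I know, is to attach the Prometheus JMX exporter as a javaagent on the broker JVM; the jar path, port and config file name below are assumptions, not something I have verified against the cp-kafka image.

# hypothetical javaagent setup for the broker container; paths and port are placeholders
KAFKA_OPTS="-javaagent:/opt/jmx/jmx_prometheus_javaagent.jar=7071:/opt/jmx/kafka_config.yml"
# the metrics should then be scrapeable at http://<broker-host>:7071/metrics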

The only thing I have are regular CPU/network metrics, and they clearly show that the nodes are not using the available resources.

Blue is the old cluster perf test, red is the new cluster. Network looks similar.

Anything in particular I should look for?

Hmm, the old cluster was running on OSS Kafka, right?

Did you compare the server.properties files?

best,
michael

I am not entirely sure what the old Kafka was based on (it was provided with the application), but yes, I think they used regular OSS Apache Kafka.

We use your defaults for the new cluster, and this is the old config file:

broker.id=3
log.dirs=/data/kafka-data
zookeeper.connect=host1:2281,host2:2281,host3:2281
listeners=SSL://host1:9093,PLAINTEXT://host1:9092
replica.fetch.max.bytes=104857600
message.max.bytes=104857600
compression.type=producer
num.partitions=2
log.retention.hours=48
log.retention.check.interval.ms=300000
unclean.leader.election.enable=false
broker.id.generation.enable=false
auto.create.topics.enable=false
[SSL Stuff]

The only performance-relevant parameters are the max.bytes settings and compression.type, but I cannot imagine those causing a 10-fold difference in performance?
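
One more thing I could do to compare beyond server.properties (a sketch; broker id, host and port are placeholders): dump the effective configuration, including defaults, from both clusters and diff the output, since the built-in defaults may differ between the two builds.

kafka-configs --bootstrap-server host0:9094 --entity-type brokers --entity-name 1 --describe --all
kafka-configs --bootstrap-server host0:9094 --entity-type topics --entity-name perf-test --describe --all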

OK, I see.
The old platform was also Docker-based, right?
Or was it a native deployment?

Podman actually, but yes:
ZK and Kafka in two containers on the same host,
exactly like the new one (the very same hosts, for an in-place replacement).

I see, pretty strange.
Is the OS config the same?
Did you try a write test with dd or similar directly against the disk?

As I said, it's the same host; at some point I ran the old and new clusters at the same time on different ports, with the same behavior.

They utilize the same disk for storage; here's a dd result:

dd if=/dev/zero of=./test.dd bs=1k count=2048000
2048000+0 records in
2048000+0 records out
2097152000 bytes (2.1 GB, 2.0 GiB) copied, 2.71327 s, 773 MB/s

dd if=/dev/urandom of=./test.dd bs=1k count=2048000
2048000+0 records in
2048000+0 records out
2097152000 bytes (2.1 GB, 2.0 GiB) copied, 9.55606 s, 219 MB/s
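
One caveat with these numbers: dd from /dev/zero without a sync or direct-I/O flag largely measures the page cache rather than the disk itself. A variant like the following (standard dd flags, arbitrary sizes) should be closer to sustained write throughput:

dd if=/dev/zero of=./test.dd bs=1M count=2000 conv=fdatasync
dd if=/dev/zero of=./test.dd bs=1M count=2000 oflag=direct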

Thanks.

You started dd from inside the container, right?

Nah, that one was the disk performance outside the container :)

Inside:

sh-4.4$ dd if=/dev/zero of=./test.dd bs=1k count=2048000
2048000+0 records in
2048000+0 records out
2097152000 bytes (2.1 GB, 2.0 GiB) copied, 2.79187 s, 751 MB/s
sh-4.4$ dd if=/dev/urandom of=./test.dd bs=1k count=2048000
2048000+0 records in
2048000+0 records out
2097152000 bytes (2.1 GB, 2.0 GiB) copied, 9.62602 s, 218 MB/s

OK, basically not that bad.

Nevertheless, it's a bit hard to diagnose without proper metrics.

Another thing: did you check the client setting for
max.in.flight.requests.per.connection?
I assume it did not change from old to new?
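
One way to take the client side out of the equation entirely (just a sketch; the values shown are the Kafka 3.x producer defaults, and the host names are placeholders) would be to pin those settings explicitly when running the perf test against both clusters:

kafka-producer-perf-test --topic perf-test --throughput -1 --num-records 3000000 --record-size 1024 --producer-props acks=all enable.idempotence=true max.in.flight.requests.per.connection=5 linger.ms=0 batch.size=16384 bootstrap.servers=host0:9094,host1:9094,host2:9094

If the numbers still differ with identical, explicit client settings, the difference has to be on the broker side.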

No, the client application was not touched at all for this, so whatever was set for the old cluster was also in effect during the first connect to the new cluster.

I am working on metrics (at least for the brokers; for anything else I probably need a firewall change, which might take a while).

The first thing that clearly stands out, though, is the huge latency during the perf test. I can't replicate it at the moment due to the issue I just mentioned in the JMX post.

I started adjusting the Angular dashboard so we have something to look at, but before I walk through all of those metrics I want to make sure they are actually helpful for this issue.

Edit: OK, it looks like the metrics I get from the broker are not the ones depicted here. Not sure if that dashboard relies on client or controller data :frowning:

Edit 2:
Your kafka_config.yaml exposes 197 metrics. Which ones do we need to identify the issue? :slight_smile:

Broker metrics would be fine,
though some OS metrics might also be nice to have.
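
For the produce-latency question specifically, a handful of broker MBeans usually tell most of the story. The exporter metric names depend on the rules in your kafka_config.yaml, so the grep below is only a sketch against a hypothetical exporter endpoint on port 7071:

# MBeans of interest:
# kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce   (total produce handling time on the broker)
# kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce  (time spent waiting for follower acks)
# kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
# kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
# kafka.network:type=RequestChannel,name=RequestQueueSize
curl -s http://host0:7071/metrics | grep -iE 'totaltimems.*produce|remotetimems.*produce|underreplicatedpartitions|maxlag|requestqueuesize'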

If you can list some core ones you need to see, based on your YAML, that would be helpful :slight_smile:

We do have OS metrics from outside the containers, as I don't think the JMX exporter provides them.

Hi,

I found a dashboard that I could adapt to your metrics (Kafka Metrics | Grafana Labs), so here you go:




This is during this test:

3000000 records sent, 4571.205671 records/sec (4.46 MB/sec), 6466.38 ms avg latency, 75577.00 ms max latency, 1156 ms 50th, 37137 ms 95th, 55930 ms 99th, 70684 ms 99.9th.

kafka-producer-perf-test --topic perf-test --throughput -1 --num-records 3000000 --record-size 1024 --producer-props acks=all bootstrap.servers=host0:9094,host1:9094,host2:9094
46611 records sent, 9275.8 records/sec (9.06 MB/sec), 702.5 ms avg latency, 1396.0 ms max latency.
22350 records sent, 4434.5 records/sec (4.33 MB/sec), 1253.0 ms avg latency, 5676.0 ms max latency.
22440 records sent, 4486.2 records/sec (4.38 MB/sec), 2073.6 ms avg latency, 11675.0 ms max latency.
21495 records sent, 4287.9 records/sec (4.19 MB/sec), 2111.2 ms avg latency, 16665.0 ms max latency.
23850 records sent, 4750.0 records/sec (4.64 MB/sec), 1877.1 ms avg latency, 20576.0 ms max latency.
27915 records sent, 5569.6 records/sec (5.44 MB/sec), 2055.8 ms avg latency, 26724.0 ms max latency.
29816 records sent, 5915.9 records/sec (5.78 MB/sec), 6740.6 ms avg latency, 31684.0 ms max latency.
28019 records sent, 5592.6 records/sec (5.46 MB/sec), 8002.0 ms avg latency, 36664.0 ms max latency.
20100 records sent, 4008.8 records/sec (3.91 MB/sec), 4925.2 ms avg latency, 41546.0 ms max latency.
[110 records removed]
450 records sent, 90.0 records/sec (0.09 MB/sec), 27560.9 ms avg latency, 31706.0 ms max latency.
5535 records sent, 1056.1 records/sec (1.03 MB/sec), 30595.5 ms avg latency, 36251.0 ms max latency.
2775 records sent, 481.9 records/sec (0.47 MB/sec), 36403.2 ms avg latency, 41477.0 ms max latency.
5265 records sent, 1046.7 records/sec (1.02 MB/sec), 39494.6 ms avg latency, 46406.0 ms max latency.
525 records sent, 104.6 records/sec (0.10 MB/sec), 46444.5 ms avg latency, 50580.0 ms max latency.
2220 records sent, 443.9 records/sec (0.43 MB/sec), 42812.4 ms avg latency, 54113.0 ms max latency.
1875 records sent, 374.9 records/sec (0.37 MB/sec), 47841.5 ms avg latency, 55964.0 ms max latency.
3000000 records sent, 4571.205671 records/sec (4.46 MB/sec), 6466.38 ms avg latency, 75577.00 ms max latency, 1156 ms 50th, 37137 ms 95th, 55930 ms 99th, 70684 ms 99.9th.

Of course it was still running, so the charts only show a partial result, but in the second image you can see previous runs (red dots).

The immediate cause (very high latency while producing data) is pretty obvious, but the question is why.
We performed another test with no replication; that one is fast:

sh-4.4$ kafka-producer-perf-test --topic perf-test-r1 --throughput -1 --num-records 3000000 --record-size 1024 --producer-props acks=all bootstrap.servers=host0:9094,host1:9094,host2:9094
904921 records sent, 180984.2 records/sec (176.74 MB/sec), 16.9 ms avg latency, 386.0 ms max latency.
997615 records sent, 199523.0 records/sec (194.85 MB/sec), 9.0 ms avg latency, 137.0 ms max latency.
1060492 records sent, 212056.0 records/sec (207.09 MB/sec), 4.0 ms avg latency, 75.0 ms max latency.
3000000 records sent, 197537.367485 records/sec (192.91 MB/sec), 9.54 ms avg latency, 386.00 ms max latency, 2 ms 50th, 55 ms 95th, 94 ms 99th, 137 ms 99.9th.

We also performed a consume test, and that one is OK too:

kafka-consumer-perf-test --topic perf-test -fetch-size 1024 --messages 3000000 --bootstrap-server host0:9094,host1:9094,host2:9094
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2025-02-13 11:46:12:282, 2025-02-13 11:46:52:632, 2929.6875, 72.6069, 3000000, 74349.4424, 3114, 37236, 78.6789, 80567.1930

So what is not working is the produce path, due to the large latency, but why?

We moved all 3 VMs onto a single host with little to no CPU load to rule that out, and as expected it did not change anything.

What config setting could cause this kind of latency when pushing data to a topic?
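
One thing I could still try to narrow that down (same tool as before, just with acks=1 so the producer does not wait for follower acknowledgement; host names are placeholders):

kafka-producer-perf-test --topic perf-test --throughput -1 --num-records 300000 --record-size 1024 --producer-props acks=1 bootstrap.servers=host0:9094,host1:9094,host2:9094

If that is fast while acks=all on the same topic is slow, the time is being spent waiting for replication rather than in the local write path.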

Edit:
@mmuehlbeyer
Ran some more tests with the following setup:
Old: Kafka 3.7, ZK-based, 3 VMs, 1 Kafka and 1 ZK container each
New: cp-kafka 7.8 (Kafka 3.8), KRaft-based, 3 VMs, 1 broker and 1 controller container each

4 containers per host (running old and new in parallel on the same hardware), 3 VMs

Old -
Repl=1, isr=1:
3000000 records sent, 179619.207281 records/sec (175.41 MB/sec), 10.28 ms avg latency, 575.00 ms max latency, 1 ms 50th, 55 ms 95th, 130 ms 99th, 533 ms 99.9th.

Repl=3, isr=2:
3000000 records sent, 86680.150246 records/sec (84.65 MB/sec), 343.65 ms avg latency, 1264.00 ms max latency, 308 ms 50th, 580 ms 95th, 908 ms 99th, 1137 ms 99.9th.

New -
Repl=1, isr=1:
3000000 records sent, 190222.560396 records/sec (185.76 MB/sec), 27.68 ms avg latency, 916.00 ms max latency, 2 ms 50th, 140 ms 95th, 469 ms 99th, 633 ms 99.9th.

Repl=3, isr=2:
3000000 records sent, 4343.620958 records/sec (4.24 MB/sec), 6938.76 ms avg latency, 93952.00 ms max latency, 1330 ms 50th, 39340 ms 95th, 61819 ms 99th, 90100 ms 99.9th.

So the problem is replication latency within the new cluster… but why?
And how to dig deeper?
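
My current plan for digging deeper (a sketch; broker id, hosts and ports are placeholders): check whether partitions drop out of the ISR during the test, and compare the replication-related broker settings between the two clusters.

kafka-topics --bootstrap-server host0:9094 --describe --under-replicated-partitions
kafka-configs --bootstrap-server host0:9094 --entity-type brokers --entity-name 1 --describe --all | grep -E 'replica.fetch|num.replica.fetchers|replica.lag|replica.socket'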

I gave updating to the latest build a try too; no change.

Can I safely migrate to an older image, e.g. a 3.7 build? Or is there no turning back without recreating the cluster? The client app is compatible with a wide range of versions, but I don't know whether it uses any newer features when initializing the topics…

You could especially watch the network metrics
and the controller metrics.

I don't see any breaking changes between 3.7 and 3.8, though I would highly recommend double-checking.
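
Regarding going back to a 3.7 image: in a KRaft cluster the deciding factor is the metadata.version feature level rather than the container tag, and downgrading metadata.version is only supported in limited cases, so it would be worth checking what the cluster is currently running first (host and port are placeholders):

kafka-features --bootstrap-server host0:9094 describe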

Well, without knowing what to look for this is not helping much.
I do see lots of errors, e.g.

but why that happens? No idea. Why it happens on this Kafka version and not on the 3.7 build? Even fewer ideas.

I don't see anything that looks wild, but then I have no idea what normal looks like…

At this point, I am not sure what else to look for.

I will try to downgrade next, and if that does not help I'll try Apache Kafka instead.
At least that will tell me whether it's a Confluent-related problem or not.