Hi,
while trying to identify the source of our current production issues I ran kafka-producer-perf-test to see if I can find the root cause.
The tests results are abysmal on the new cluster and I wonder what the problem can be.
cp-kafka 7.8.0-83
(Abbreviated)
kafka-producer-perf-test --topic perf-test --throughput -1 --num-records 3000000 --record-size 1024 --producer-props acks=all bootstrap.servers=xyz
First run:
24465 records sent, 4883.2 records/sec (4.77 MB/sec), 7136.4 ms avg latency, 38990.0 ms max latency.
20850 records sent, 4150.9 records/sec (4.05 MB/sec), 1471.3 ms avg latency, 39879.0 ms max latency.
44925 records sent, 8920.8 records/sec (8.71 MB/sec), 11301.4 ms avg latency, 44340.0 ms max latency.
[…]
26100 records sent, 5195.1 records/sec (5.07 MB/sec), 5650.0 ms avg latency, 34524.0 ms max latency.
14012 records sent, 2776.9 records/sec (2.71 MB/sec), 1974.8 ms avg latency, 25993.0 ms max latency.
1500 records sent, 179.4 records/sec (0.18 MB/sec), 22930.8 ms avg latency, 35995.0 ms max latency.
[2025-02-10 08:08:37,224] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 39883 on topic-partition perf-test-9, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT. Error Message: Disconnected from node 5 due to timeout (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 08:08:37,227] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 39883 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT. Error Message: Disconnected from node 5 due to timeout (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 08:08:37,228] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 39883 on topic-partition perf-test-15, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT. Error Message: Disconnected from node 5 due to timeout (org.apache.kafka.clients.producer.internals.Sender)
[…]
[2025-02-10 08:08:37,230] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 41022 on topic-partition perf-test-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT. Error Message: Disconnected from node 5 due to timeout (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 08:08:37,230] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 41022 on topic-partition perf-test-12, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT. Error Message: Disconnected from node 5 due to timeout (org.apache.kafka.clients.producer.internals.Sender)
3840 records sent, 590.9 records/sec (0.58 MB/sec), 26160.8 ms avg latency, 35987.0 ms max latency.
6450 records sent, 1290.0 records/sec (1.26 MB/sec), 44149.3 ms avg latency, 46324.0 ms max latency.
1800 records sent, 360.0 records/sec (0.35 MB/sec), 28511.1 ms avg latency, 30733.0 ms max latency.
1695 records sent, 338.9 records/sec (0.33 MB/sec), 41356.0 ms avg latency, 55843.0 ms max latency.
450 records sent, 90.0 records/sec (0.09 MB/sec), 35325.4 ms avg latency, 40219.0 ms max latency.
6750 records sent, 710.5 records/sec (0.69 MB/sec), 54548.6 ms avg latency, 65638.0 ms max latency.
3000000 records sent, 4289.807988 records/sec (4.19 MB/sec), 6925.13 ms avg latency, 90370.00 ms max latency, 1103 ms 50th, 45097 ms 95th, 65680 ms 99th, 86804 ms 99.9th.
Second run:
11040 records sent, 1051.8 records/sec (1.03 MB/sec), 59511.6 ms avg latency, 62459.0 ms max latency.
6720 records sent, 639.8 records/sec (0.62 MB/sec), 48488.2 ms avg latency, 52053.0 ms max latency.
240 records sent, 24.0 records/sec (0.02 MB/sec), 44462.7 ms avg latency, 51950.0 ms max latency.
2640 records sent, 251.4 records/sec (0.25 MB/sec), 52263.8 ms avg latency, 62129.0 ms max latency.
1920 records sent, 191.8 records/sec (0.19 MB/sec), 62162.5 ms avg latency, 71824.0 ms max latency.
11280 records sent, 1026.7 records/sec (1.00 MB/sec), 64941.7 ms avg latency, 72138.0 ms max latency.
8400 records sent, 800.0 records/sec (0.78 MB/sec), 61670.5 ms avg latency, 62434.0 ms max latency.
4800 records sent, 457.1 records/sec (0.45 MB/sec), 62630.5 ms avg latency, 72458.0 ms max latency.
720 records sent, 142.7 records/sec (0.14 MB/sec), 72430.5 ms avg latency, 77435.0 ms max latency.
4800 records sent, 482.1 records/sec (0.47 MB/sec), 58416.1 ms avg latency, 77536.0 ms max latency.
[2025-02-10 09:02:35,733] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 877 on topic-partition perf-test-4, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,736] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-4 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,737] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 877 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[…]
[2025-02-10 09:02:35,740] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-12 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,740] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 877 on topic-partition perf-test-6, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,740] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-6 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-5, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-9, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-7, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-15, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 891 on topic-partition perf-test-12, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,777] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 878 on topic-partition perf-test-4, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,778] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-4 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,778] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 878 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[…]
[2025-02-10 09:02:35,778] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-12 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,778] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 878 on topic-partition perf-test-6, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,778] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-6 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,780] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,781] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-5, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,781] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-9, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,781] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-7, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,781] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-15, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,781] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 892 on topic-partition perf-test-12, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[…]
[2025-02-10 09:02:35,791] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 889 on topic-partition perf-test-6, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,793] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 879 on topic-partition perf-test-4, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,793] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-4 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,793] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 879 on topic-partition perf-test-0, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[…]
[2025-02-10 09:02:35,793] WARN [Producer clientId=perf-producer-client] Received invalid metadata error in produce request on partition perf-test-13 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2025-02-10 09:02:35,793] WARN [Producer clientId=perf-producer-client] Got error produce response with correlation id 879 on topic-partition perf-test-5, retrying (2147483646 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[…]
59610 records sent, 11829.7 records/sec (11.55 MB/sec), 18874.0 ms avg latency, 57004.0 ms max latency.
32325 records sent, 6448.2 records/sec (6.30 MB/sec), 2746.6 ms avg latency, 6770.0 ms max latency.
28605 records sent, 5689.1 records/sec (5.56 MB/sec), 2810.9 ms avg latency, 11445.0 ms max latency.
27450 records sent, 5486.7 records/sec (5.36 MB/sec), 1875.1 ms avg latency, 16713.0 ms max latency.
21555 records sent, 2905.0 records/sec (2.84 MB/sec), 5430.8 ms avg latency, 25933.0 ms max latency.
6705 records sent, 1218.9 records/sec (1.19 MB/sec), 16136.1 ms avg latency, 30170.0 ms max latency.
21450 records sent, 4265.3 records/sec (4.17 MB/sec), 3751.1 ms avg latency, 33900.0 ms max latency.
24705 records sent, 4512.3 records/sec (4.41 MB/sec), 5282.0 ms avg latency, 33946.0 ms max latency.
This is the old cluster (recreated for this test):
kafka_2.13-3.7.0
/products/kafka_2.13-3.7.0/bin # ./kafka-producer-perf-test.sh --topic perf-test --throughput -1 --num-records 3000000 --record-size 1024 --producer-props acks=all bootstrap.servers=simal80:19092,simal81:19092,simal82:19092
211763 records sent, 42352.6 records/sec (41.36 MB/sec), 603.2 ms avg latency, 1104.0 ms max latency.
388785 records sent, 77757.0 records/sec (75.93 MB/sec), 399.8 ms avg latency, 698.0 ms max latency.
449550 records sent, 89892.0 records/sec (87.79 MB/sec), 342.4 ms avg latency, 578.0 ms max latency.
481155 records sent, 96211.8 records/sec (93.96 MB/sec), 319.2 ms avg latency, 590.0 ms max latency.
441480 records sent, 88296.0 records/sec (86.23 MB/sec), 345.7 ms avg latency, 792.0 ms max latency.
449085 records sent, 89799.0 records/sec (87.69 MB/sec), 340.8 ms avg latency, 688.0 ms max latency.
435885 records sent, 87177.0 records/sec (85.13 MB/sec), 351.6 ms avg latency, 1035.0 ms max latency.
The old version has about 10 to 15 times the performance, cp-kafka eventually errors out with timeouts or OUT_OF_ORDER_SEQUENCE_NUMBER errors, so basically exactly what we see in production.
I reran the old cluster test with a significantly larger sample size to make sure the timeouts are not an external network issue that was only circumvented due to fast test completion - it performed flawlessly and consistently.
The question now is why this would happen with the Conflent Kafka cluster?
I am using the exact same box here, both Kafka cluster run in parallel with different ports on the same 3 boxes.
Any idea what can be the cause here?
Thanks