Producer Performance - Differences

I am using kafka-producer-perf-test.sh to do some performance testing. My topics are hosted locally (using Docker) and also on Confluent Cloud. I am trying to compare the two setups (mostly the network latency and the side effects it has on the other metrics).

Oddly, I am running into a problem I was not expecting, regarding the batch size of each request being sent.

My setup is as follows (in both cases):

  1. topic with 3 partitions
  2. acks=1
  3. max.in.flight.requests.per.connection=1
  4. batch.size = 16384 (default batch size)

On running the performance test, I see this difference in the batch size:

Local Cluster
batch-size-avg : 46,539.622 (bytes) - roughly batch size * 3 (number of partitions)

Confluent Cluster
batch-size-avg : 15,555.225 (bytes) - roughly batch size * 1 (1 of the 3 partitions)
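
As a quick check on the "roughly" in those two figures, here is the arithmetic of batch-size-avg divided by the configured batch.size. This is only a sketch; the function name ratio is a made-up helper, not part of any Kafka tooling.

```shell
# ratio is a hypothetical helper: prints avg / size rounded to two decimals.
ratio() {
    awk -v avg="$1" -v size="$2" 'BEGIN { printf "%.2f\n", avg / size }'
}

ratio 46539.622 16384   # local cluster, prints 2.84
ratio 15555.225 16384   # Confluent Cloud, prints 0.95
```

So the local average is a little under 3x batch.size, and the Confluent Cloud average is a little under 1x.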

What could be the reason for the decreased throughput on Confluent Cloud? It looks like each request is batching records for only 1 partition (instead of 3 partitions, as seen with the local cluster).

I should mention that in both cases the records are generated with null keys. Below are the scripts used to run the tests.

Local Cluster
/path/to/kafka-producer-perf-test.sh \
    --topic perf_test_1_replica_3_partition \
    --num-records 100000 \
    --record-size 1024 \
    --throughput -1 \
    --producer-props \
        bootstrap.servers=localhost:9092 \
        acks=1 \
        max.in.flight.requests.per.connection=1 \
        batch.size=16384 \
    --print-metrics


Confluent Cluster
/path/to/kafka-producer-perf-test.sh \
    --topic perf_test_3_replica_3_partition \
    --num-records 100000 \
    --record-size 1024 \
    --throughput -1 \
    --producer.config /path/to//producer.config \
    --producer-props bootstrap.servers=***-*****.centralus.azure.confluent.cloud:9092 \
        acks=1 \
        max.in.flight.requests.per.connection=1 \
        batch.size=16384 \
    --print-metrics

Thanks
Gautam

Hi Gautam,

One question to understand your environment better:
I guess your Confluent Cloud cluster is a Basic one, right?

One thing to experiment with is the linger.ms setting.

Another thing I would recommend is testing with some compression (lz4):
https://docs.confluent.io/cloud/current/client-apps/optimizing/throughput.html#compression
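
As a rough sketch of how both suggestions could be tried together, the loop below only prints the perf-test commands for a few linger.ms and compression.type combinations (a dry run; remove the echo to actually execute them). The linger.ms values 0, 5 and 20, as well as the names print_sweep, PERF_TEST, PRODUCER_CONFIG and BOOTSTRAP, are placeholders chosen for illustration, and I am assuming your credentials live in the producer.config file.

```shell
# Placeholders: adjust these to your environment.
PERF_TEST=/path/to/kafka-producer-perf-test.sh
PRODUCER_CONFIG=/path/to/producer.config
BOOTSTRAP='<your-bootstrap-server>:9092'

# Dry run: prints one command per linger.ms / compression.type pair.
# Remove "echo" to actually run the tests.
print_sweep() {
    for linger in 0 5 20; do              # illustrative values only
        for compression in none lz4; do
            echo "$PERF_TEST" \
                --topic perf_test_3_replica_3_partition \
                --num-records 100000 \
                --record-size 1024 \
                --throughput -1 \
                --producer.config "$PRODUCER_CONFIG" \
                --producer-props \
                "bootstrap.servers=$BOOTSTRAP" \
                acks=1 \
                max.in.flight.requests.per.connection=1 \
                batch.size=16384 \
                "linger.ms=$linger" \
                "compression.type=$compression" \
                --print-metrics
        done
    done
}

print_sweep
```

Comparing batch-size-avg and the reported throughput across the runs should show whether either setting changes the batching behaviour you are seeing.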

There are also some nice blog posts on this topic:

Benchmarking Dedicated cloud cluster

Best
Michael