High CPU utilization is often related to context switching, where a CPU core must keep switching between different tasks and, in doing so, burns CPU cycles. I would start by investigating how much idle time each user task is actually getting.
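As a rough starting point, and assuming psutil is available and you know the worker's PID (both assumptions on my side), something like this samples how often the worker process gets switched out versus how much CPU it actually uses:

```python
import time
import psutil

# Hypothetical PID of the Kafka Connect worker JVM; find it with `jps` or psutil.process_iter().
WORKER_PID = 12345

proc = psutil.Process(WORKER_PID)

ctx_before = proc.num_ctx_switches()
cpu_before = proc.cpu_times()
time.sleep(10)  # sample window
ctx_after = proc.num_ctx_switches()
cpu_after = proc.cpu_times()

print("voluntary ctx switches:  ", ctx_after.voluntary - ctx_before.voluntary)
print("involuntary ctx switches:", ctx_after.involuntary - ctx_before.involuntary)
busy = (cpu_after.user - cpu_before.user) + (cpu_after.system - cpu_before.system)
print(f"CPU seconds used in the 10s window: {busy:.2f}")
```

A pile of voluntary switches with very little CPU time means the tasks are mostly sitting idle waiting for data; lots of involuntary switches points to the cores being oversubscribed.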
On a machine with 8 cores, you are creating some context switching by spreading ~25 tasks across 8 cores. Ideally, the number of tasks should be equal to the number of topics to read from, so that a dedicated thread can be created for each task in one of the workers. To double-check this thesis, reduce the number of topics to be read to 8 and see what happens.
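If it helps, here is a minimal sketch of how you could resize the connector for that test through the Kafka Connect REST API; the worker URL, connector name, and topic list are placeholders for your environment:

```python
import requests

# Placeholders: adjust to your environment.
CONNECT_URL = "http://localhost:8083"   # Kafka Connect worker REST endpoint
CONNECTOR = "elasticsearch-sink"        # your connector's name

# Fetch the current config so only the relevant keys are touched.
config = requests.get(f"{CONNECT_URL}/connectors/{CONNECTOR}/config").json()

# Match the work to the hardware: 8 topics / 8 tasks on an 8-core worker.
config["topics"] = "topic-1,topic-2,topic-3,topic-4,topic-5,topic-6,topic-7,topic-8"
config["tasks.max"] = "8"

# PUT replaces the connector config in place; Connect restarts the tasks for you.
resp = requests.put(f"{CONNECT_URL}/connectors/{CONNECTOR}/config", json=config)
resp.raise_for_status()
print(resp.json()["tasks"])
```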
Other common sources of high CPU usage are:
Frequent rebalances (remember that sink connectors create Kafka consumers)
Retries. The connector might have no new records to pull, but some records may still be retried
Physical cores versus hardware threads. Is it 8 physical CPU cores, or 4 cores with Hyper-Threading?
Virtualization overhead. Is the machine bare metal, or running on a hypervisor?
Container overhead. Is the machine bare metal, or a container with CPU limits? (A quick way to check these last three points is sketched below.)
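For those last three points, a quick sanity check of what the worker process actually sees can save a lot of guessing. A rough sketch (psutil assumed; the cgroup path is for cgroup v2 and differs on cgroup v1 hosts):

```python
import os
import psutil

# Logical vs. physical cores: 8 logical / 4 physical means Hyper-Threading is in play.
print("logical cores: ", psutil.cpu_count(logical=True))
print("physical cores:", psutil.cpu_count(logical=False))

# Cores this process is actually allowed to run on (affinity / container cpusets). Linux only.
print("usable cores:  ", len(os.sched_getaffinity(0)))

# cgroup v2 CPU quota, if running in a container.
try:
    with open("/sys/fs/cgroup/cpu.max") as f:
        quota, period = f.read().split()
        if quota != "max":
            print("cgroup CPU limit:", int(quota) / int(period), "cores")
except FileNotFoundError:
    print("no cgroup v2 cpu.max found (bare metal or cgroup v1)")
```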
It might also be worth revisiting things like compression and SSL, as they cause the clients to burn additional CPU.
If none of this works, you might need to roll up your sleeves and do some lone-wolf debugging: a single topic, with a single task, a single ack, a single partition, etc., and a debugger attached, pointing to the same code version. FWIW, here is the path to the Elasticsearch connector code:
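For that isolated run, something along these lines could create the single-partition test topic (this assumes the confluent-kafka Python client and a reachable broker; the names are placeholders):

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder broker address and topic name for the isolated test.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

futures = admin.create_topics(
    [NewTopic("es-sink-debug", num_partitions=1, replication_factor=1)]
)
for topic, future in futures.items():
    future.result()  # raises if creation failed
    print(f"created {topic} with a single partition")
```

Point the connector at just that topic with tasks.max set to 1, then attach the debugger to the worker JVM.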
Finally, don’t underestimate the effect that poor network connectivity has on this type of architecture. If the Kafka Connect workers and the Kafka clusters are separate, look at how many idle network sockets are created between them and how frequently data packets are actually being transferred. This can be a huge bummer over poor network links (e.g., VPNs, gateways, etc.).
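To get a feel for the socket count without extra tooling, here is a rough sketch (assumes psutil, brokers on the default port 9092, and enough privileges to see other processes' sockets):

```python
from collections import Counter
import psutil

BROKER_PORT = 9092  # adjust if your brokers listen elsewhere

# Count TCP connections from this host to the brokers, grouped by state.
states = Counter(
    conn.status
    for conn in psutil.net_connections(kind="tcp")
    if conn.raddr and conn.raddr.port == BROKER_PORT
)
print(states)  # e.g. Counter({'ESTABLISHED': 30, 'TIME_WAIT': 4})
```

Lots of connections cycling through TIME_WAIT or CLOSE_WAIT between samples suggests sockets being torn down and re-created rather than reused.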
The number of tasks was based on the number of partitions for a given topic, following the general recommendation. In my case it looks like some timer is expiring very frequently. Setting linger.ms to a higher value has brought down the CPU utilization. This setting controls how long the task should wait for more records (to build a batch) before committing data to Elasticsearch. I am not sure why this timer fires even though there are no records to process. The default value was 1 ms; I changed it to 300 ms and the CPU utilization has come down, but it is still around 5% even when there are no records. I don't want to increase the timer further, as it might drive up memory utilization when there is too much data.
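For reference, this is roughly the shape of the batching/timeout part of my sink config; the endpoint and every value other than linger.ms are placeholders, and the setting names are from the Confluent Elasticsearch sink, so they may differ between versions:

```python
# Batching / timeout knobs on the Elasticsearch sink that I am looking at.
es_sink_config = {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "connection.url": "http://localhost:9200",  # placeholder
    "topics": "my-topic",                       # placeholder
    "tasks.max": "1",                           # placeholder
    "linger.ms": "300",          # was the 1 ms default; raised to cut idle CPU
    "batch.size": "2000",        # records per bulk request; bigger batches = more memory
    "flush.timeout.ms": "10000", # how long to wait for pending requests on flush
    "read.timeout.ms": "3000",   # Elasticsearch request read timeout
}
```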
Any recommendations regarding suitable timeouts would help.