Elasticsearch sink connector showing high CPU usage

I am using the Elasticsearch sink connector in distributed mode (2 instances), with tasks.max set to 8 and about 20 to 25 topics to be sunk into Elasticsearch.

Even when there are no records to sink, the worker Java process shows 100% CPU usage.

End-to-end transfer of records is happening properly, but the high CPU usage is a concern.

My settings:

Connector config:
{
  "name": "elasticsearch-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "8",
    "topics.regex": "(mytopics_\d+$)",
    "key.ignore": "true",
    "schema.ignore": "true",
    "connection.url": "http://eshost:esport",
    "type.name": "kafka-connect"
  }
}

Worker settings:

bootstrap.servers=localhost:9094,localhost:9095
group.id=test-cluster
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.topic=connect-offsets
offset.storage.replication.factor=3
config.storage.topic=connect-configs
config.storage.replication.factor=3
status.storage.topic=connect-status
status.storage.replication.factor=3
status.storage.partitions=8
rest.port=9034
plugin.path=/pluginpath
log4j.rootLogger=DEBUG, stdout

I'm trying this on a server-grade setup (64 GB RAM and 8 CPU cores), and there is good connectivity to Kafka as well as to the Elasticsearch server.

Any pointers will help.

Thanks in advance.

High CPU utilization is often related to context switching, where the CPU core has to keep switching between different tasks and burns CPU cycles doing so. I would start by investigating how much idle time each task has.

On a machine with 8 cores, you are creating some context switching by spreading ~25 topics over 8 tasks. Ideally, the number of tasks should be equal to the number of topics being read, so that a dedicated thread can be created in one of the workers. To double-check this thesis, reduce the number of topics being read to 8 and see what happens; a minimal config sketch for that experiment is below.
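To make that test concrete, here is a minimal sketch of a connector config for the experiment, assuming the regex is swapped for an explicit list of eight topics (the topic names below are placeholders in the style of the original config, not real topic names):

{
  "name": "elasticsearch-sink-test",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "8",
    "topics": "mytopics_1,mytopics_2,mytopics_3,mytopics_4,mytopics_5,mytopics_6,mytopics_7,mytopics_8",
    "key.ignore": "true",
    "schema.ignore": "true",
    "connection.url": "http://eshost:esport",
    "type.name": "kafka-connect"
  }
}

With the topic count matching tasks.max, the scheduling picture is much simpler, so any remaining CPU burn is easier to attribute to the consumers themselves rather than to task switching.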

Other common sources of high CPU usage are:

  • Frequent rebalances (remember that sinks create Kafka Consumers)
  • Retries. The connector might not have new records to pull, but previous batches may still be retried (see the retry settings sketched after this list)
  • Physical versus virtual threads. Is it 8 CPU cores or 4 with Hyper-Threading?
  • Virtualization overhead. Is the machine bare metal or running on a hypervisor?
  • Container overhead. Is the machine bare metal or a container with CPU slots?
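
On the retries bullet, the Elasticsearch sink exposes two knobs that control how aggressively failed batches are retried; here is a small sketch with illustrative values only (max.retries and retry.backoff.ms are the connector options, the numbers are just placeholders to experiment with while debugging):

  "max.retries": "3",
  "retry.backoff.ms": "1000"

Adding these to the "config" block above makes it easier to tell whether retry traffic, rather than new records, is what keeps the tasks busy.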

It might also be worth revisiting things like compression and SSL, as they cause the clients to use additional CPU.

If none of this works, you might need to roll up your sleeves and do some lone-wolf debugging: a single topic, a single task, a single partition, single acks, etc., with a debugger attached and pointing to the same code version. FWIW, here is the path to the Elasticsearch connector code:

Finally, don't underestimate the effect that poor network connectivity has on this type of architecture. If the Kafka Connect workers and the Kafka clusters are separate, look at how many idle network sockets are created between them and how frequently data packets are being transferred. This can be a huge problem over poor network connections (e.g. VPNs, gateways, etc.).

— Ricardo


Thanks, Riferrei, for the guidelines.

The number of tasks was based on the number of partitions per topic, as a general recommendation. In my case it looks like some timer is expiring very frequently: raising the linger.ms setting has brought the CPU utilization down. This setting controls how long a task waits for more records (to build a batch) before committing data to Elasticsearch. I'm not sure why this timer is in action even when there are no records to process. The default value was 1 ms; changing it to 300 ms brought CPU utilization down, but it is still around 5% even with no records. I don't want to increase the timer much further, since it could drive up memory utilization when there is too much data.
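
For reference, linger.ms is one of a small family of batching/flush options on the Elasticsearch sink connector. Here is a minimal sketch with the 300 ms value mentioned above and otherwise purely illustrative numbers (not tuned recommendations):

  "linger.ms": "300",
  "batch.size": "2000",
  "max.buffered.records": "20000",
  "flush.timeout.ms": "10000"

linger.ms and batch.size bound how long and how much a batch accumulates before it is sent, while max.buffered.records caps the memory side of that trade-off, which is the concern raised above about holding too much data.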

Any recommendations regarding suitable timeouts would help.
