Kafka Connect Becomes Unhealthy Due to OOM

Hi Folks,

We have a simple single-node setup to replicate data from SQL Server to Snowflake using the Debezium source and Snowflake sink connectors. Occasionally the sink connectors failed with an out-of-memory error. These failures were easily resolved by restarting the connectors, but we decided to increase the heap size using KAFKA_HEAP_OPTS="-Xms256M -Xmx16G".

The Kafka Connect container was re-created when the heap size was increased. Everything ran fine for a few hours, and then the connect worker became unhealthy. Logs show that pretty much all tasks are failing with out-of-memory errors. Restarting the connect worker helps for about 10 minutes, and then we see the same behavior all over again.

In addition to connectors failing, we noticed the following errors in the logs.

ERROR Unexpected exception in Thread[KafkaBasedLog Work Thread - ConnectConfigs,5,main] (org.apache.kafka.connect.util.KafkaBasedLog)
java.lang.OutOfMemoryError: Java heap space

ERROR Unexpected exception in Thread[KafkaBasedLog Work Thread - ConnectOffsets,5,main] (org.apache.kafka.connect.util.KafkaBasedLog)
java.lang.OutOfMemoryError: Java heap space

WARN Could not stop task (org.apache.kafka.connect.runtime.WorkerSourceTask)
java.lang.OutOfMemoryError: Java heap space
Exception in thread "KafkaBasedLog Work Thread - ConnectStatus" java.lang.OutOfMemoryError: Java heap space

Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder) java.lang.OutOfMemoryError: Java heap space

Connectors failing due to heap space issues was something we could easily handle, but since we increased the heap space the connect worker itself is failing. Restarting the worker, or increasing or decreasing the heap size, does not help at all. The REST API is unresponsive when the connect worker becomes unhealthy.

We are in a POC stage working with 300+ tables. Things were running great with a daily load of 40 million events until we made this change.

confluent version: 6.2.0

connect-01:
  container_name: connect-01
  image: confluentinc/cp-kafka-connect:${KAFKA_VERSION}
  networks:
    - pipeline
  restart: unless-stopped
  depends_on:
    - kafka
    - zookeeper
  ports:
    - 8083:8083
  environment:
    CONNECT_GROUP_ID: 1
    CONNECT_REST_PORT: 8083
    CONNECT_BOOTSTRAP_SERVERS: 'kafka:9092'
    CONNECT_REST_ADVERTISED_HOST_NAME: connect-01
    CONNECT_STATUS_STORAGE_TOPIC: ConnectStatus
    CONNECT_OFFSET_STORAGE_TOPIC: ConnectOffsets
    CONNECT_CONFIG_STORAGE_TOPIC: ConnectConfigs
    CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
    CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
    CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
    CONNECT_KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
    CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
    CONNECT_INTERNAL_KEY_CONVERTER: 'org.apache.kafka.connect.json.JsonConverter'
    CONNECT_INTERNAL_VALUE_CONVERTER: 'org.apache.kafka.connect.json.JsonConverter'
    CONNECT_PLUGIN_PATH: /usr/share/java,/etc/kafka-connect/jars/
    KAFKA_HEAP_OPTS: ${CONNECT_HEAP_SETTINGS}
  volumes:
    - $PWD/plugins/debezium/$DBZ_VERSION:/etc/kafka-connect/jars/debezium
    - $PWD/plugins/snowflake/$SF_VERSION:/etc/kafka-connect/jars/snowflake

I would highly appreciate any advice on resolving this issue. Thanks for your time.

-Shiva

I would need to see more metrics from the connect cluster. But in general this issue is due to SQL Server sending CDC events to the connect workers faster than they can get the data out to the Kafka brokers. You should look at CPU utilization on the connect workers, as well as the io-ratio and io-wait-ratio metrics.

Hi Mitchell,

Thanks for your response. What you said matches what we are seeing. There's not much user activity on weekends, and for two weeks in a row the connect worker did not crash over the weekend.

However, the connect worker crashes every 30-50 minutes on weekdays. The odd thing is that we scheduled the SQL Server capture job to run for only 4 hours a day, but the connect worker keeps crashing throughout the day. When there's no data, the logs look like this.

[2021-09-21 18:54:11,374] INFO WorkerSourceTask{id=Source_PRDB02_fiNext_PG_24-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2021-09-21 18:54:11,384] INFO WorkerSourceTask{id=Source_PRDB02_fiNext_PG_19-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2021-09-21 18:54:11,384] INFO WorkerSourceTask{id=Source_PRDB02_fiNext_PG_20-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)

The logs indicate that there's no incoming data and everything seems to be running smoothly, and then we see the following logs.

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "kafka-producer-network-thread | PRDB02_fiNext-dbhistory"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "kafka-producer-network-thread | PRDB02_fiNext-dbhistory"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "kafka-producer-network-thread | PRDB02_fiNext-dbhistory"

After this point the REST API becomes unresponsive. CPU usage jumps from an average of less than 10% across all 16 cores to over 90%. We set up a job to restart the connect worker when this happens.
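
A restart job of this kind could look roughly like the sketch below. It is only an illustration, not the actual job: it assumes the REST API is reachable on localhost:8083 (the port mapped in the compose file), that the container is named connect-01, and that the docker CLI is available on the host. The function names and the 10-second timeout are made up for the example.

```python
import subprocess
import urllib.error
import urllib.request

# Assumptions for this sketch: port and container name taken from the compose file.
CONNECT_URL = "http://localhost:8083/connectors"
CONTAINER = "connect-01"


def is_responsive(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if the Connect REST API answers with HTTP 200 in time."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, etc.
        return False


def check_and_restart(url=CONNECT_URL, container=CONTAINER,
                      probe=is_responsive, run=subprocess.run):
    """Restart the container if the REST API does not respond.

    Returns True when a restart was issued, False when the worker was healthy.
    """
    if probe(url):
        return False
    run(["docker", "restart", container], check=True)
    return True
```

Calling check_and_restart() from a scheduler every minute or so gives the behavior described above. The probe and run parameters exist only so the logic can be exercised without a real worker or a real docker daemon.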

We observed the following in the container stats from the moment the connect worker is restarted up until the point it becomes unresponsive. Note the change in NET I/O, even though the logs don't show any indication of data being fetched from anywhere.

CONTAINER ID   NAME         CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O   PIDS
6edbb374c785   connect-01   0.64%     2.065GiB / 62.82GiB   3.29%     21.7MB / 900kB   0B / 41kB   292
.
.
.
CONTAINER ID   NAME         CPU %      MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O   PIDS
6edbb374c785   connect-01   1255.66%   17.38GiB / 62.82GiB   27.66%    17.1GB / 142MB   0B / 41kB   314

The connect worker crashes when NET I/O reaches ~17GB/140MB, as seen above. This is with the maximum heap set to 16GB. When we set it to 8GB, the worker becomes unresponsive when the stats reach the following figures.

CONTAINER ID   NAME         CPU %      MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
67dd18c9ca98   connect-01   1148.50%   9.003GiB / 62.82GiB   14.33%    8.49GB / 70.9MB   1.12MB / 36.9kB   309
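
To make the relationship between the two runs explicit, the received bytes at the point of failure can be compared with the configured maximum heap. The short sketch below does that arithmetic. It assumes that docker stats reports NET I/O in decimal units (GB = 10^9 bytes) and that -Xmx16G / -Xmx8G mean 16 and 8 GiB; the helper names are made up for the example.

```python
import re

# Decimal and binary size units. Treating "GB" as 10**9 and "GiB" as 2**30
# is an assumption about how the figures were reported.
UNITS = {
    "B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9,
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30,
}


def to_bytes(text):
    """Convert a size string such as '17.1GB' or '16GiB' to a byte count."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([A-Za-z]+)\s*", text)
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def received_to_heap_ratio(net_in, heap_max):
    """Ratio of bytes received over the network to the maximum heap size."""
    return to_bytes(net_in) / to_bytes(heap_max)


# (NET I/O received at failure, max heap) for the two runs shown above.
samples = [("17.1GB", "16GiB"), ("8.49GB", "8GiB")]
for net_in, heap_max in samples:
    ratio = received_to_heap_ratio(net_in, heap_max)
    print(f"received {net_in} with max heap {heap_max}: ratio {ratio:.2f}")
```

In both runs the ratio comes out just under 1, meaning the worker fails at about the point where the total data received since the restart equals the maximum heap size.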

I hope this information is helpful. Please let me know what other data I can gather in order to find the root cause of this issue.

-Shiva