Kafka Connect Distributed Worker

Currently I have a distributed worker running on Linux, sending JSON to Elasticsearch using the ES sink connector.
If I wanted to make this highly available, I would create another Linux server running the distributed worker. But if I run the distributed worker on both machines, with an ES connector set up on both, then when I send a message both workers send the same message to ES, and ES deletes the documents that are the same.

Is this how it is expected to work, or is there some configuration that needs to be done?

It sounds like you’ve not formed the workers into a Kafka Connect group. When configured correctly, the work is split across tasks, which execute on one worker or the other, not both (unless there is more than one task).

Have a look at “Common mistakes made when configuring multiple Kafka Connect workers” and https://docs.confluent.io/home/connect/userguide.html#distributed-mode
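To make that concrete, here’s a minimal sketch of the distributed worker properties that form a group; the group ID, topic names, and replication factors below are illustrative assumptions, not values from your setup. Every worker that should join the same cluster needs the same group.id and the same three internal topics:

```properties
# Sketch of a distributed worker config; group.id and topic names are
# illustrative placeholders. All workers in the same Connect cluster
# must use identical values for these settings.
bootstrap.servers=broker1:9092
group.id=es-connect-cluster

# Internal topics shared by every worker in the group
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3

key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
```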


Great. Thank you for that information.

I seem to have got a lot further now. I now see the leader URL, with the IP of one of the workers, in both workers’ logs.

When I send data now it does not duplicate, but it seems like only one of the workers is doing the sending.

For rest.advertised.host.name in the worker config files I am just putting in the IP address of each Linux machine. I assume the leader URL is just whichever worker was picked as leader.
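For reference, a sketch of what each worker’s properties file has (the IP below is just a placeholder for that machine’s address):

```properties
# Placeholder IP; each worker advertises its own machine's address
rest.advertised.host.name=192.168.1.10
rest.port=8083
```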

What would cause it not to round robin?

I think I got it. The tasks.max was set to 1. Should this be set to a specific number?

In a Kafka Connect sink, the tasks are essentially consumer threads that receive partitions to read from. If you have 10 partitions and tasks.max is set to 5, each task will receive 2 partitions to read from and will track their offsets. If you have configured tasks.max to a number above the partition count, Connect will launch a number of tasks equal to the partition count of the topics it’s reading.

If you change the partition count of the topic, you’ll have to relaunch your Connect task; if tasks.max is still greater than the new partition count, Connect will again start one task per partition.

If there are multiple Connect workers, Connect will attempt to distribute the tasks across the workers.

(adapted from a SO answer I made a few years back)
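If it helps, here’s a sketch of bumping tasks.max via the Connect REST API; the connector name (es-sink), topic, Elasticsearch URL, and port below are placeholders, not values from your setup:

```bash
# Create or update the connector config via the Connect REST API.
# Connector name, topic, URL, and tasks.max are illustrative placeholders.
curl -X PUT http://localhost:8083/connectors/es-sink/config \
  -H "Content-Type: application/json" \
  -d '{
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "my-topic",
        "connection.url": "http://elasticsearch:9200",
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "tasks.max": "2"
      }'
```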


Thank you.
So if I have two workers, what is the recommended number of tasks to set? Is it better to have more tasks than partitions?

My understanding is that there’s no point setting more tasks than partitions.


Hi,
when running workers in distributed mode, the group.id setting is an important one: it must match the first worker’s, and it is what determines the Connect cluster that the worker will be part of.
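A quick way to verify the group has formed and to see where the tasks landed, assuming the default REST port 8083 and a connector named es-sink (both placeholders):

```bash
# worker_id in the status output shows which worker runs each task
curl http://localhost:8083/connectors
curl http://localhost:8083/connectors/es-sink/status
```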

Great. Thanks everyone. I think I am good now. I will continue to do some more testing. If I run into any issues I will let you know.


