Kafka Connect Distributed Worker

robbyd · 10 February 2021 16:12

Currently I have a distributed worker running on Linux and sending JSON to ES using the ES connector.
If I wanted to make this highly available, I would create another Linux server running the distributed worker. Now, if I run the distributed worker on both machines with an ES connector set up on both, then when I send a message both workers will send the same message to ES, and ES will delete documents that are the same.

Is this how it is expected to work or is there a configuration that needs to be done?

rmoff · 10 February 2021 16:19

It sounds like you’ve not formed the workers into a Kafka Connect group. When configured correctly the work is split across tasks which execute on one, or the other, worker - not both (unless there is more than one task).

Have a look at Common mistakes made when configuring multiple Kafka Connect workers and https://docs.confluent.io/home/connect/userguide.html#distributed-mode
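
To make the "same Kafka Connect group" point concrete, here is a minimal sketch (not taken from the linked posts) that renders the properties for two distributed-mode workers. The broker address, IP addresses, group id and topic names are made-up example values; the property keys are standard worker settings. Other required settings, such as the converters, are left out for brevity.

```python
# Sketch: two distributed-mode workers meant to join the same Connect cluster.
# All values below (hosts, IPs, group id, topic names) are hypothetical examples.

SHARED = {
    "bootstrap.servers": "kafka-1:9092",
    "group.id": "connect-cluster-a",            # must be identical on every worker
    "config.storage.topic": "connect-configs",  # shared internal topics
    "offset.storage.topic": "connect-offsets",
    "status.storage.topic": "connect-status",
}


def worker_properties(advertised_host):
    """Render a worker properties file; only the advertised host is per-worker."""
    settings = dict(SHARED)
    settings["rest.advertised.host.name"] = advertised_host
    return "\n".join(f"{key}={value}" for key, value in settings.items())


worker_a = worker_properties("10.0.0.11")
worker_b = worker_properties("10.0.0.12")

print(worker_a)
print()
print(worker_b)
```

The only line that differs between the two files is rest.advertised.host.name; everything that identifies the cluster is shared.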

robbyd · 10 February 2021 18:21

Great. Thank you for that information.

I seem to have gotten a lot further now. I now see the leader URL, with the IP of one of the workers, in the logs of both workers.

When I send data, though, it is not duplicated, but it seems like it is only sent to one of the workers.

For rest.advertised.host.name in the worker config files, I am just putting the IP address of the Linux machine. I assume that the leader URL is just one worker picked as a master.

What would cause it not to round robin?

robbyd · 10 February 2021 21:12

I think I got it. The tasks.max was set to 1. Should this be set to a specific number?

chris · 10 February 2021 22:01

In a Kafka Connect sink, the tasks are essentially consumer threads and receive partitions to read from. If you have 10 partitions and have tasks.max set to 5, each task will receive 2 partitions to read from and track the offsets. If you have configured tasks.max to a number above the partition count Connect will launch a number of tasks equal to the partitions of the topics it’s reading.

If you change the partition count of the topic, you’ll have to relaunch your Connect task. If tasks.max is still greater than the partition count, Connect will start that many tasks.

If there are multiple Connect workers, Connect will attempt to distribute the tasks across the workers.

(adapted from a SO answer I made a few years back)
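
The arithmetic above can be sketched in a few lines of Python. This is an illustrative model only, not Connect's actual code: the function name plan_sink_tasks is hypothetical, and the round-robin placement of partitions onto tasks and of tasks onto workers is an assumption made to keep the example simple. The real assignment is decided by the consumer group protocol and Connect's rebalancing.

```python
def plan_sink_tasks(partitions, tasks_max, workers):
    """Model which tasks get partitions and where those tasks might run.

    Assumes tasks_max >= 1 and at least one worker. Round-robin placement is
    a simplification, not Connect's real assignment logic.
    """
    # Only this many tasks can have a partition to read from.
    active_tasks = min(tasks_max, partitions)

    # Spread partitions over the active tasks.
    task_partitions = {task: [] for task in range(active_tasks)}
    for partition in range(partitions):
        task_partitions[partition % active_tasks].append(partition)

    # Spread the active tasks over the workers.
    task_worker = {task: workers[task % len(workers)] for task in range(active_tasks)}
    return task_partitions, task_worker


workers = ["worker-a", "worker-b"]

# 10 partitions, tasks.max=5: each task reads 2 partitions, both workers are used.
print(plan_sink_tasks(10, 5, workers))

# tasks.max=1: a single task, so only one worker ever does any work.
print(plan_sink_tasks(10, 1, workers))
```

The second call mirrors the situation earlier in the thread: with tasks.max set to 1 there is only one task, so there is nothing to spread across the second worker.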

robbyd · 10 February 2021 22:41

Thank you.
So if I have two workers, what is the recommended number of tasks to set? Is it better to have more tasks than partitions?

rmoff · 11 February 2021 11:11

My understanding is that there’s no point setting more tasks than partitions.

waqasdilawar · 11 February 2021 11:16

Hi,
When running workers in distributed mode, GROUP_ID: 4 is an important one: it should match the first worker's value, and it is required in order to determine the cluster that the worker will be part of.
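
A quick way to catch a mismatch is to compare the settings of a new worker against the first one before starting it. The sketch below is a hypothetical helper, not part of Kafka Connect. It assumes the workers are configured through properties files, where the group setting is the group.id property (GROUP_ID above looks like the environment-variable form used by some container images), and it also compares the three internal storage topics, which workers in one cluster share as well.

```python
# Settings that every worker in one Connect cluster is expected to share.
MUST_MATCH = (
    "group.id",
    "config.storage.topic",
    "offset.storage.topic",
    "status.storage.topic",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def mismatched_settings(first_worker, other_worker):
    """Return the shared settings whose values differ between two workers."""
    first = parse_properties(first_worker)
    other = parse_properties(other_worker)
    return [key for key in MUST_MATCH if first.get(key) != other.get(key)]


# Example values only; the second worker has a different group.id by mistake.
first = "group.id=4\nconfig.storage.topic=connect-configs\noffset.storage.topic=connect-offsets\nstatus.storage.topic=connect-status"
second = first.replace("group.id=4", "group.id=5")

print(mismatched_settings(first, second))
```

An empty list means the two workers agree on the settings that place them in the same cluster.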

robbyd · 11 February 2021 12:56

Great. Thanks everyone. I think I am good now. I will continue to do some more testing. If I run into any issues I will let you know.

rmoff · 11 February 2021 19:37

A post was split to a new topic: Kafka Connect Elasticsearch sink stops sending records

