Getting ready to deploy Kafka for POC

After testing Kafka Connect (source connectors) on local VMs (Windows 10 WSL), a client of mine is now ready to start provisioning Kafka in the cloud.
We will be starting with 3 source connectors (these will be self-managed), and Kafka will need to be connected to a Spark environment on Azure HDInsight.

So the question is:

  1. Do we go with Confluent Cloud, with Azure VMs running Kafka Connect (in distributed mode), and connect Azure HDInsight Spark to Confluent Cloud?
  2. Provision the Kafka environment (workers, ZooKeeper nodes, etc.) entirely through Azure HDInsight, as we already need Azure for the self-managed connectors and Spark?
  3. Provision my own Confluent Cloud on Azure VMs, with Azure VMs running Kafka Connect in distributed mode?

The environment has to reside on Azure. It will not have a high data throughput load due to the nature of the data, and provisioning cost needs to be managed closely (i.e. kept as cost-effective as possible), as this is for a POC.
The connectors to be used are SFTP, HTTP REST and SpoolDir.

Any advice?

Thank you,

Can you clarify the difference between options 1 and 3 on your list? In option 3, did you have in mind a self-managed Confluent deployment?


Thank you for the response.
Option 1 is using the current Confluent Cloud environment, whereas for me Option 3 was downloading the Confluent Community edition and installing it in the customer's Azure environment. But looking at the licensing structure, I do not believe this is possible. So, let's disregard Option 3 completely.

I am currently reading your excellent blog post "Running a self-managed Kafka Connect worker for Confluent Cloud".
This is basically my Option 1 and looks the most interesting, but because one of the connectors we will be using (SpoolDir CSV) has to pull data from a remote server using either sshfs or CIFS, I believe we will not be able to use Docker. Also, I would like to run it as a distributed Kafka Connect cluster for my source connectors (I hope this makes sense).


OK, I understand better now. So you’ve got:

  1. Use Confluent Cloud
  2. Use HDInsight
  3. Self-manage Confluent Platform

You’ll need to do your own evaluation of 1 vs 2 - this might help in terms of what to look at and evaluate.

In terms of deploying Kafka Connect as a distributed cluster, you can do this using Docker if you want. Quite how you’d do it depends on your target runtime environment, though.
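
As a rough illustration of what a distributed worker on a plain VM (no Docker) needs in order to talk to Confluent Cloud, here is a minimal Python sketch that renders a worker properties file. The property keys are standard Kafka Connect worker settings; the bootstrap address, API key and secret, group name, topic names and plugin path are all placeholders I have made up for the example, and the replication factor of 3 is what I believe Confluent Cloud expects for the internal topics. Treat it as a starting point, not a tested deployment.

```python
def connect_worker_properties(bootstrap_servers, api_key, api_secret,
                              group_id="poc-connect-cluster",
                              replication_factor=3):
    """Render the text of a distributed Kafka Connect worker properties file.

    All argument values are placeholders supplied by the caller; nothing
    here contacts a cluster.
    """
    # SASL/PLAIN over TLS, authenticating with an API key and secret.
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{api_key}" password="{api_secret}";'
    )
    security = {
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.jaas.config": jaas,
    }

    props = {
        "bootstrap.servers": bootstrap_servers,
        # Workers sharing a group.id (and the three topics below) form one cluster.
        "group.id": group_id,
        "config.storage.topic": f"{group_id}-configs",
        "offset.storage.topic": f"{group_id}-offsets",
        "status.storage.topic": f"{group_id}-status",
        "config.storage.replication.factor": str(replication_factor),
        "offset.storage.replication.factor": str(replication_factor),
        "status.storage.replication.factor": str(replication_factor),
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
        # Hypothetical directory holding the SFTP, HTTP and SpoolDir plugins.
        "plugin.path": "/opt/connect-plugins",
    }
    props.update(security)

    # The worker's embedded producer (used by source connectors) and
    # consumer need the same credentials, under their own prefixes.
    for prefix in ("producer.", "consumer."):
        for key, value in security.items():
            props[prefix + key] = value

    return "\n".join(f"{k}={v}" for k, v in props.items()) + "\n"


if __name__ == "__main__":
    # Placeholder values only; substitute the real cluster details.
    print(connect_worker_properties("BOOTSTRAP_HOST:9092", "API_KEY", "API_SECRET"))
```

Writing the same output to a properties file on each Azure VM and starting the worker with the distributed-mode startup script should give you one Connect cluster, since the workers find each other through the shared `group.id` and storage topics. Directories mounted via sshfs or CIFS for SpoolDir would then need to be visible at the same path on every worker.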


Thank you. Interesting reading, and it addresses my concerns regarding sizing the correct hardware and managing a Kafka environment.

