Reading files into Kafka using regex for filename

In Kafka Connect, FileStreamSourceConnector reads data from a single, specific file name. Is there a connector that can match file names with a regular expression, or read every file under a specific directory?

Check out streamthoughts/kafka-connect-file-pulse on GitHub, a multipurpose Kafka Connect connector that makes it easy to parse, transform and stream any file, in any format, into Apache Kafka, and see if it is right for you. The config file.filter.regex.pattern looks like it aligns with what you are looking for.
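To make "match file names with a regex under a directory" concrete, here is a minimal Python sketch of the idea. It is not FilePulse code and says nothing about how file.filter.regex.pattern is evaluated internally; the helper name matching_files, the sample file names, and the choice of full-name matching are all assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path


def matching_files(directory, pattern):
    """Return the sorted names of regular files in `directory`
    whose entire name matches the regular expression `pattern`.

    Hypothetical helper for illustration only; a real connector
    may apply different matching rules.
    """
    regex = re.compile(pattern)
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if entry.is_file() and regex.fullmatch(entry.name)
    )


def demo():
    # Build a throwaway directory with made-up file names, then filter it.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("orders-001.csv", "orders-002.csv", "notes.txt"):
            (Path(tmp) / name).write_text("sample\n")
        return matching_files(tmp, r"orders-\d+\.csv")


if __name__ == "__main__":
    print(demo())
```

A pattern such as orders-\d+\.csv picks up the two CSV files and skips notes.txt, which is roughly the behaviour the question asks for, as opposed to FileStreamSourceConnector's single fixed file name.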

https://dev.to/fhussonnois/streaming-data-into-kafka-s01-e03-loading-json-file-3d76

Check this other discussion thread here:

TL;DR: avoid using Kafka Connect for such relatively simple tasks like loading files into Kafka. Simple things should be kept simple: use Filebeat instead.

@riferrei


OK, I’ll bite :wink: I think there’s a bit of subtlety missing from this reply.

avoid using Kafka Connect for such relatively simple tasks like loading files into Kafka

Filebeat is great, but I think absolute statements like this miss the nuance and can be misleading for people new to the area.

If I already have a Kafka Connect cluster, and am familiar with Kafka Connect, I’m going to go and use one of the excellent production-ready file system connectors, such as SpoolDir or FilePulse.

If I’ve never used Kafka Connect, then for sure, Filebeat is a great option to go for.

Sorry @rmoff, but I will have to disagree with you here. :wink:

In my book, an engineering team that keeps using a complex distributed system such as Kafka Connect “just because” they already know it is being irresponsible. You see, the point should be to minimize the number of distributed systems to manage, not increase it. This is even more true with shared Kafka Connect clusters, where decisions about sizing, tenants, partitioning, and fault tolerance may affect more essential tasks, such as getting data out of databases and into Kafka. Furthermore, you know how people argue that streaming ETL systems based on Kafka are incredibly complicated? I guess the point should always be to fight against that notion, not to reinforce it.

To be clear, my point here is not necessarily to defend the use of Filebeat (which is an Elastic technology): I would also recommend Fluentd for things like this. It is similarly simple, and not yet another distributed system to be managed. :hugs:

I guess we agree to disagree on this one :slight_smile:

yup