Kafka Connect is a high-performance, scalable solution for transferring data to and from Kafka. However, getting started can be a bit confusing: the online documentation quickly gets bogged down in details and lacks a basic overview to help you find your way around. This article explains the basic concepts that will allow you to quickly create a simple Connect setup. The documentation for Kafka Connect can be found here.
What is Confluent Kafka Connect?
As mentioned above, Kafka Connect is a framework for transferring data to and from Kafka; it ships with Apache Kafka and is a core part of the Confluent Platform. Connect is purely configuration-driven: it is not necessary to program a single line of code. Through its Connectors, Connect provides connectivity to systems such as all major databases, messaging systems, REST interfaces, cloud storage and much more. The list of connectable systems currently contains more than 80 entries and is still growing (see here). In addition, there are further features for transforming and enriching the data or performing logical operations, which are not discussed in this article.
Why should I use Kafka Connect?
Most pragmatic developers like me will initially think that it is very easy to write a small service that connects to Kafka, for example via a REST interface. So why bother learning Kafka Connect? Experience shows that this is true at first, but as the project progresses, new requirements appear: resuming reads at the correct position after a restart, scalability, security aspects, and so on, and the small service quickly turns into a larger project. Kafka Connect already takes all of this into account, so such features can be added in minutes with simple configuration, saving a lot of time and effort.
Basic concepts
To get started with Connect, you need to understand the basic architecture. There are so-called Workers, Tasks and Connectors.
Workers
Workers can be thought of as the execution environment in which the data streaming takes place. There are Standalone Workers and Distributed Workers. Standalone Workers are not recommended by Confluent for production use, only for testing in a local environment, so we will not discuss them further here. Distributed Workers run as separate processes that connect to a Kafka cluster on the Confluent platform. One or more Workers form a Connect Cluster; they coordinate over the network and divide the work among themselves automatically. This work is organized into Tasks.
Tasks
Tasks are the work that needs to be done, such as reading data from the source system, transforming data, and writing data to the target system. Tasks can be executed sequentially or in parallel. Connectors are used to connect to the source/target systems.
Connectors
A connector is a plugin that a Worker loads at startup from a configured location (or that you download from the Internet, for example from Confluent Hub) to read data from a source or write data to a target. There are Source Connectors for connecting to source systems and Sink Connectors for connecting to target systems. Some connectors support both directions and can be configured as either a source or a sink.
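For illustration, a connector plugin can be installed on a Worker host with the Confluent Hub client; the connector used below is only an example, and the Worker's plugin.path must include the installation directory:

```
# Install a connector plugin (the JDBC connector is just an example)
# into the Worker's plugin path; the Worker picks it up at startup.
confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:latest
```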
Configuration
First the workers must be configured. To do this, 5 entries in the configuration are sufficient:
bootstrap.servers
group.id
config.storage.topic
offset.storage.topic
status.storage.topic
bootstrap.servers simply refers to the Kafka cluster the Workers connect to.
group.id is the consumer group id that the Workers should use. All Workers that form a group in the Connect Cluster are defined by the same group.id. There is a small pitfall here that occasionally causes headaches: the group.id must not be reused anywhere else on the Confluent platform, for example for a normal Kafka consumer that reads from a different topic; otherwise the Workers will not start.
The three *.storage.topic configurations must point to existing Kafka topics where the Connector configuration, read offsets, and status are stored.
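As a minimal sketch, a distributed Worker configuration file could look roughly like this (hosts, group id and topic names are placeholders; converters are commonly set as well, even though they are not discussed above):

```
# connect-distributed.properties (all values are placeholders)
bootstrap.servers=kafka-1:9092,kafka-2:9092
group.id=my-connect-cluster

# Topics in which Connect stores connector configs, offsets and status
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status

# Converters for keys and values; JsonConverter is a common choice
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
```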
The next step is to configure the Tasks and Connectors. The Connector configurations vary greatly depending on the system you are connecting to. Only the following settings are common to all:
name
connector.class
tasks.max
Here, name is an arbitrary but unique name for the Connector, connector.class identifies the connector implementation to use, and tasks.max is the maximum number of Tasks the Connector may use.
These settings are packed into a JSON document and sent via HTTP POST to the REST API of the Connect Cluster (see here). This JSON is stored in the config.storage.topic mentioned above, so that on subsequent restarts, or when new Workers are added, the configuration is taken from the topic.
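A rough sketch of such a request, assuming a JDBC source connector and a Worker reachable at connect-worker:8083 (both are placeholder examples; a real connector needs additional connector-specific settings such as connection details inside "config"):

```
# POST the connector configuration to the Connect REST API
# (host, connector name and class are placeholders)
curl -X POST http://connect-worker:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
        "name": "my-jdbc-source",
        "config": {
          "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
          "tasks.max": "1"
        }
      }'
```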
Since this procedure is manual, a much better approach is to store the JSON in a Kubernetes Secret, for example, and have an automated step apply the Secret's contents so that the Connector is configured from the Secret rather than by hand, as sketched below.
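A minimal sketch of this idea, assuming the connector JSON has been saved to a file connector.json and a Kubernetes cluster is available (all names are placeholders; a Job, init container or CI step would perform the POST):

```
# Store the connector configuration in a Secret (names are placeholders)
kubectl create secret generic my-connector-config --from-file=connector.json

# A Job or init container that mounts the Secret under /secrets can then
# push the configuration to the Connect REST API:
curl -X POST http://connect-worker:8083/connectors \
  -H "Content-Type: application/json" \
  -d @/secrets/connector.json
```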
Conclusion
And that’s it! With just a few configuration entries, you can set up a simple Kafka Connect cluster with Workers, Tasks and Connectors that is ready for production use, scalable, resilient and restartable.
Further features and information
Security settings and advanced features such as data transformation are not covered here. Please refer to the docs on security and transforms.
Credits
Title image by baranozdemir on Getty Images