This post is the result of my work with Apache Kafka and my experimentation with Kafka Connect, I tried to document my experience with the framework and hopefully, it can also act as a gentle introductory guide on how to build a Kafka Connect Source Connector.
It got a bit lengthy so if you don’t want to read everything you can jump directly to:
The necessary introduction…
Kafka Connect was introduced recently as a feature of Apache Kafka 0.9+ with the narrow (although very important) scope of copying streaming data from and to a Kafka cluster. As someone who’s working with Kafka, I found the concept really interesting and decided to experiment with Kafka Connect to see the concepts it introduced. After all when you work with a messaging system you spend quite some time thinking about how data is getting in and how it is delivered. So this post is mainly a gentle introduction to Kafka Connect including what I’ve learned through a small project of creating a Source Connector, so the information will mainly be about connectors but most of them also apply when you build Sinks.
Kafka Connect is about interacting with other data systems and move data between them and a Kafka Cluster. Many of the connectors that are available are focusing to systems that are managed by the owner of the Kafka Cluster, e.g. RDBMS systems that hold transactional data, trying to turn these systems into a stream of data. Motivated by some very interesting concepts like that of turning the database inside out these connectors are trying to get a stream out of a database either by emulating this by polling or by listening to the change events of the database as it is implemented by projects like Debezium. It makes complete sense to have such an approach because both Kafka and Kafka Connect are mainly focusing on streams of data so even the connectors that are interacting with external services are focusing on services that are delivering streams of data, e.g. Twitter. I wanted to check how I would use Kafka Connect in a use case a bit outside its original scope. So, ideally, I’d like to get data from an external data system from which I have access to and which does not deliver data as a stream. Why I find such a scenario interesting is because although streams are great, not everything is delivered as one while we don’t always have enough control over the data system to turn it into a streaming service.
So, I decided to use Mixpanel and its API for extracting event data as the source for my connector. Mixpanel is ideal as it does not make much sense to pull data from it more often than once per day due the design of the API while it has a quite simple interface to use, as we only need to reach one endpoint to get our data in JSON.
There are two main building blocks that you work with when creating a connector. The first one is … the SourceConnector J and the second one is the SourceTask. Your starting point is always the SourceConnector, which is responsible for ensuring that your connector has all the information needed to do the job, the configuration part, and for deploying the tasks that will do the actual work. So when you decide your connect the first thing you have to define is what information is needed for accessing your external data system, which of course will be part of the configuration and then decide how the actual importing work will be split into different tasks (if needed). In the simplest scenario you will have just one task to be performed and of course there are cases where you can split your work in multiple tasks or keep just one. A good example of how work can be split is the JDBC connector where a task is defined for each table that we want to pull data from. Now, it is important to understand this design choice of having multiple tasks because apart from better organizing your project it also acts as a way of parallelizing your jobs. The number of tasks that your connector will run is defined as part of your configuration but it is something that you have to consider from the beginning of your project while building your connector.
Kafka Connect can run either in a stand-alone mode or as a cluster. Ideally, you would like to go for the second choice in a production environment as it offers some nice things like scalability and fault tolerance, for development though you can stick with the stand-alone mode. Of course, you can always find much more information by reading the documentation here. If you decide to write your own connector it helps enormously to check the actual implementation of an example connector or you can even use a skeleton project for a connector.
It would help to consider the SourceConnector as the part of your code that is responsible for orchestrating the process you are defining. Its interface includes the following methods that you should implement:
From these methods inside start() happens everything related to the initialization of our connector, for example here we’ll check if we get all the information we need from the configuration file and the methods taskConfigs() is responsible for creating the configurations needed by the SourceTasks that we have implemented as part of our connector. For example, let’s consider the case of connecting to a database, we need some global configuration details like the database to connect to and the credentials needed, this work will be performed inside the start() method. As we would like to fetch data from each table in parallel, we will create one task per database table, so our SourceConnector will create one configuration per table/task inside the taskConfigs() method. Another important responsibility of the SourceConnector is to listen for changes that should trigger a reconfiguration of the running tasks, that’s another reason why it helps to consider it as an orchestrator of your tasks. What is important to remember when it comes to reconfiguring your tasks is that the implementation of how the SourceConnector will monitor and notify the framework is left up to the connector implementation, for example, you might need to implement a separate thread that will do the monitoring of the source. A good example of this concept can be found on the JDBC Connector and how it monitors the database for changes like the addition or removal of the table.
The SourceTask has the following methods that you will implement:
It’s obvious that the interfaces of SourceTask and SourceConnector are similar, they both have a start() and stop() method, probably related to the fact that both of them are threads. The start method takes as a parameter a map that contains information on how to configure the task, actually if we compare the signature of the taskConfigs method from the SourceConnector with the signature of the start method in SourceTask we can notice the similarities. The result of the taskConfig method is what is passed as a parameter to the start method of the SourceTask. Finally, we have the poll() method, in this method we should implement anything that has to do with the actual fetching of the data from the source. The runtime of Kafka Connect will call this method from each worker, get the results and ensure that everything will reach their destination inside the Kafka Cluster.
Finally, the SourceRecord represents a chunk of data as we would like to insert in Kafka, together with all the metadata needed in order to reach its proper destination. Apart from the data that will be included, for example, it might contain one row of a database table, the record also contains information about the schema, the offset and the Kafka topic where the data should be pushed. It’s important to understand that the offset here is for tracking the source and it’s not related to the offset of the destination topic. This means that we need to come up with a way of keeping an offset for tracking our position in the source, something that is case specific. For example, if we read a file the offset might be the line position in the file. Coming up with a proper offset scheme for our source is important for the consistency of our data and the performance of our connector. If we are tailing a huge log file and we have to restart our connector we’d like to avoid starting from the beginning but at the same time we’d like to know from which line to start reading again to avoid missing some data.
As we said at the beginning the purpose of our connector,Mixpanel, which can be found here, is to fetch data from Mixpanel. Ideally, we’d like to get all the events our users are generated and stored in Mixpanel, put it into Kafka and from there move it to another service. Of course, we’d like to do that as soon as possible, but getting the data as a stream is not exactly an option because of the way the Mixpanel API works. A few key design choices:
- We’ll poll the API once per day and get the data for the last 24h hours.
- As we might be operating Mixpanel for a while, we most probably have historical data there, so the very first time that our connector start operates, we’d also like to get all the data from an arbitrary date in the past until now. After that, the connector should poll for new data once per day.
- We want to make the connector as robust as possible, if for example we want to pull data from Mixpanel for the past two years we definitely don’t want to do that in one call to the API as we might have some difficulties handling the data. For this reason, we should split a date interval into days and fetch the data for each individual date in a serial way. It should be possible to be able to handle the data for a single day on a modest server, if not then my friend your product is so successful that you will have other ways of dealing with such a use case J
- We’ll keep the implementation as simple as possible focusing mainly on how Kafka Connect is architected, so we’ll use a simple Schema for our data without taking into consideration the actual schema of the Mixpanel data.
- We will not listen for any configuration changes, mainly because I couldn’t think of any obvious reason to do that for the Mixpanel case (if you have any ideas let me know).
Is quite simple in our case. As we don’t have to listen for any configuration changes, its main job is to read the configuration parameters and generate “configs” for the tasks. The configurations that we are passing to the SourceConnector are the following:
- TOPIC_NAME: the Kafka topic where we would like our data to be pushed.
- API_KEY: the api key of our Mixpanel project
- API_SECRET: again the api secret as it is taken from our Mixpanel project
- FROM_DATE: ok this is actual a design choice J this field contains the first date from which we’d like to pull data from. If we put here the current date, our connector will get the data for the last 24h, if we put a different date in the past it will start pulling data from that day until today.
The above information is stored in a properties file that is passed together with our code to Kafka Connect when we submit a new connector.
Regarding the task configuration, our connector will always create only one task and it will pass to it exactly the same parameters as the ones we mentioned above. As we will be monitoring only one endpoint of the Mixpanel API, it doesn’t really make any sense to create more than one tasks. At the beginning, I thought that it would be a good idea to create multiple tasks to parallelize the pulling of historical data by creating one task per day but then I decided that it doesn’t make much sense to follow this approach for the following reasons:
- Because of the Kafka Connect architecture, it would add a lot of complexity to the project. Tasks are designed as long lasting processes that when created keep polling for data, fetching historical data is a one-time process so we would end up with tasks that should be terminated or reconfigured somehow.
- Each task is a thread, ending up with 10s of them doesn’t make sense.
- We should always respect and not abuse the service we are using J
So, although it would be interesting to see how we could coordinate more than one tasks, it really didn’t make any sense to try it in this case. Maybe in the next one.
Now, here is where all the magic happens. After the task receives the configuration information inside the start method and we do any initializations we might need, the actual work happens inside the poll method. The “magic” includes the following steps:
- We check if we have a stored offset, if yes then we know that we are resuming from a previous date and check to see if we actually have to fetch new data or wait. If we have to fetch data:
- We figure out how many individual days we should fetch data for and pass it to a new client that will do the work, as a separate thread.
- The client will fetch the data for each day until it reaches the last one. When that happens it notifies the Task for the completion of the job.
- The SourceTask thread will get data for each day, from the client and return it to the Kafka Connect runtime, by creating SourceRecord instances and making sure that it always updates the offset. In this way, we try to make the connector robust.
- When the client ends, the Task will sleep until it has to repeat the process again.
A few important implementation considerations.
- Whatever you do inside the SourceTask, make sure that it does not block. To do that I used a non-blocking http client together with Java 8 futures.
- The SourceTask and Client threads are communicating through a thread-safe queue which is a quite common pattern to Kafka Connect connectors.
- If you have to wait, you should do that inside the poll method of the SourceTask. If you don’t do it, then the Kafka Connect runtime will keep polling your data service. So an important decision to make when you design a new connector is to figure out when it should interact with your service, how often and what events should interrupt this interaction.
- Make sure that you understand how concurrency works and be ready to have some very awkward moments staring at your screen trying to figure out what your connector is doing. Also, building a connector is an awesome way of learning about concurrency, so do it J
About the data…
As we said at the beginning, in order to keep things simple I didn’t do any fancy work with the schema of the Mixpanel data. Every event from Mixpanel is stored as a string inside Kafka. As I couldn’t use any information from the data to enrich the offset that I was tracking, I used only the date from the latest pull of data stored as an offset. In this way, the connector can check to see if enough time has passed before calling the Mixpanel API again to pull new data.
That’s it. We now have a Source connector that will pull historical event data from Mixpanel and then keep pulling our new data every day. We can either run it in a standalone mode (mainly for testing) or submit it to be executed inside the Kafka Connect cluster.
As someone who works with data and who’s using Kafka, I have to admit that I enjoyed building a connector with Kafka Connect and I found its design elegant. The connector that I built was quite simple, especially if you compare it with the JDBC connector but it helped understand how Connect is designed and some of the difficulties that you are facing when you want to build something that moves data between different systems. Some remarks and thoughts below:
Building systems that deal with data that has to be moved between different systems at scale is hard. Even if you didn’t have to worry about the scale, ensuring the consistency of the data alone is a difficult problem. Kafka Connect makes your life easier as it does a great job in abstracting the communication with your Kafka cluster. You can build a connector with truly minimal knowledge about the internals of Kafka.
Working with Kafka Connect is all about concurrency. You need to make sure that everything you write does not block and ensure the proper communication between your threads. Concurrent programming is hard and although you will not avoid its difficulties with Kafka Connect, its design will help you to avoid some common mistakes. Be ready though to have your moments in front of your screen J
Some additional abstraction on the framework would probably help. For example, the pattern where you use the blocking queue for communication between the two threads is pretty common. The same with monitoring threads. I feel that there’s space for building some pre-defined patterns that would make the life of developers much easier. Something similar to what Apache Curator did for Zookeeper, where common “recipes” are implemented.
Access your offset through your context only once at the beginning. Offset management happens concurrently by Kafka Connect and relying on your context to access the latest data is not safe. I found out about this the hard way but through that I also realized how well commented the code of Kafka Connect is. Also, I noticed that in standalone mode where offsets are stored in a file, it is possible to end up without updating correctly the offset. It might happen if you kill your connector really fast after it completes your task.
Understanding your data and what you want to achieve is important. What delivery semantics I need? How do I achieve idempotency and do I really need it? All these questions require from you to understand well your data and what you want to achieve. In my current implementation of the Mixpanel connector if something goes wrong with the offset that I’m trying it will be really difficult to avoid pushing duplicate records inside my Kafka topic. Kafka Connect offers some tools to deal with these problems but again it’s mainly up to you.
Kafka Connect was built with streams in mind, my use case with Mixpanel is not a stream of data and as I said at the beginning I did this on purpose. Nevertheless, I managed to achieve the behavior I wanted with Connect, although I’m not sure yet if having threads sleeping for 24h is better than scheduling a job to run every 24h.
Testing a connector is challenging, as it is very well put in the documentation Connect interacts with two systems that are difficult to mock and you might end up doing integration tests at the end. This was especially true with Mixpanel as it is an external service. Writing tests for the Task is quite challenging and I’m still trying to figure out the best way to do it.
I am co-counder and CEO of Blendo. A platform to help you connect and remix data from multiple sources – databases, CRM, email campaigns, analytics and more – into business value.