Kafka Connect MongoDB Sink Connector

Hello 👋,

In this article we’re going to build a data pipeline that connects Kafka to MongoDB.

In short, we’re going to add a MongoDB Sink connector to a Kafka Connect cluster and run a MongoDB instance in Docker to test the connector.

By reading this article I hope that you will learn

  • How to install the MongoDB connector in Kafka Connect
  • How to configure the MongoDB connector
  • How to create topics in Kafka using Confluent Tools
  • How to monitor Kafka Connect using JConsole.

Let’s get started!

Running MongoDB with Docker Compose 🚢

Confluent provides us with a docker-compose file that already contains everything we need, except for some minor tweaks.

Please download the following file and open it in your favorite editor: https://github.com/confluentinc/cp-all-in-one/blob/6.2.0-post/cp-all-in-one-community/docker-compose.yml.

Apply the following edits to the file, you can replace the connect block and append the mongodb block at the end of the file.

  
  connect:
    image: cnfldemos/kafka-connect-datagen:0.5.0-6.2.0
    hostname: connect
    container_name: connect
    depends_on:
      - broker
      - schema-registry
    ports:
      - "8083:8083"
      - "9102:9102"
    environment:
      CONNECT_BOOTSTRAP_SERVERS: 'broker:29092'
      CONNECT_REST_ADVERTISED_HOST_NAME: connect
      CONNECT_REST_PORT: 8083
      CONNECT_GROUP_ID: compose-connect-group
      CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
      CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter
      CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.storage.StringConverter
      CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081
      CONNECT_PLUGIN_PATH: "/usr/share/java,/usr/share/confluent-hub-components"
      CONNECT_LOG4J_LOGGERS: org.apache.zookeeper=ERROR,org.I0Itec.zkclient=ERROR,org.reflections=ERROR
      KAFKA_JMX_PORT: 9102
      KAFKA_JMX_HOSTNAME: localhost

  mongodb:
    image: mongo:4.2-rc-bionic
    hostname: mongodb
    container_name: mongodb
    depends_on:
      - broker
      - connect
    ports:
      - 27017:27017

Run docker-compose up to start all services and verify that everything is running with docker ps:

CONTAINER ID   IMAGE                                         COMMAND                  CREATED          STATUS
         PORTS
                       NAMES
95165f0156f4   confluentinc/cp-ksqldb-cli:6.2.0              "/bin/sh"                37 minutes ago   Up 37 minutes    
                       ksqldb-cli
ecc4cde0f30b   confluentinc/ksqldb-examples:6.2.0            "bash -c 'echo Waiti…"   37 minutes ago   Up 37 minutes    
                       ksql-datagen
962204b34543   mongo:4.2-rc-bionic                           "docker-entrypoint.s…"   37 minutes ago   Up 37 minutes             0.0.0.0:27017->27017/tcp, :::27017->27017/tcp
                       mongodb
c950f33f501a   confluentinc/cp-ksqldb-server:6.2.0           "/etc/confluent/dock…"   37 minutes ago   Up 37 minutes             0.0.0.0:8088->8088/tcp, :::8088->8088/tcp
                       ksqldb-server
3527577701d3   confluentinc/cp-kafka-rest:6.2.0              "/etc/confluent/dock…"   37 minutes ago   Up 37 minutes             0.0.0.0:8082->8082/tcp, :::8082->8082/tcp
                       rest-proxy
ca69f204f4bb   cnfldemos/kafka-connect-datagen:0.5.0-6.2.0   "/etc/confluent/dock…"   37 minutes ago   Up 31 minutes (healthy)   0.0.0.0:8083->8083/tcp, :::8083->8083/tcp, 0.0.0.0:9102->9102/tcp, :::9102->9102/tcp, 9092/tcp
                       connect
aeaea67059c3   confluentinc/cp-schema-registry:6.2.0         "/etc/confluent/dock…"   37 minutes ago   Up 37 minutes             0.0.0.0:8081->8081/tcp, :::8081->8081/tcp
                       schema-registry
b9a761b98a49   confluentinc/cp-kafka:6.2.0                   "/etc/confluent/dock…"   37 minutes ago   Up 37 minutes             0.0.0.0:9092->9092/tcp, :::9092->9092/tcp, 0.0.0.0:9101->9101/tcp, :::9101->9101/tcp, 0.0.0.0:29092->29092/tcp, :::29092->29092/tcp   broker
ca63570b60d4   confluentinc/cp-zookeeper:6.2.0               "/etc/confluent/dock…"   37 minutes ago   Up 37 minutes             2888/tcp, 0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 3888/tcp
                       zookeeper

Installing the MongoDB Sink Connector on Kafka Connect 🌠

You may download the connector directly from Github mongodb/mongo-kafka/releases/tag/r1.6.0.

Click on mongodb-kafka-connect-mongodb-1.6.0.zip then unzip it and copy the directory into the plugin path /usr/share/java as defined in the CONNECT_PLUGIN_PATH: “/usr/share/java,/usr/share/confluent-hub-components” environment variable.

To copy it you can run:

docker cp .\mongodb-kafka-connect-mongodb-1.6.0\ connect:/usr/share/java/
 docker restart connect
connect

Connect needs to be restarted to pick-up the newly installed plugin. Verify that the connector plugin has been successfully installed:

➜  bin curl -s -X GET http://localhost:8083/connector-plugins | jq | head -n 20
[
  {
    "class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "type": "sink",
    "version": "1.6.0"
  },
  {
    "class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "type": "source",
    "version": "1.6.0"
  },

Note: If you don’t have jq installed you can omit it.

Creating the topics

Before starting the connector, let’s create the Kafka Topics events and events.deadletter, they will be used them in the connector.

To create the topics, we will need to download Confluent tools and run kafka-topics.

curl -s -O http://packages.confluent.io/archive/6.2/confluent-community-6.2.0.tar.gz
tar -xzf .\confluent-community-6.2.0.tar.gz
cd .\confluent-6.2.0\bin\

 ./kafka-topics --bootstrap-server localhost:9092 --list
__consumer_offsets
__transaction_state
_confluent-ksql-default__command_topic
_schemas
default_ksql_processing_log
docker-connect-configs
docker-connect-offsets
docker-connect-status

./kafka-topics --bootstrap-server localhost:9092 --create --topic events --partitions 3
Created topic events.

./kafka-topics --bootstrap-server localhost:9092 --create --topic events.deadletter --partitions 3
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues, it is best to use either, but not both.
Created topic events.deadletter.

Note: You will need Java to run the Confluent tools if you’re on Ubuntu you can type sudo apt install openjdk-8-jdk.

Starting the connector 🚙

To start the connector, it is enough to do a single post request to the connector’s API with the connector’s configuration.

The configuration that we will use is going to be:

curl --request POST \
  --url http://localhost:8083/connectors \
  --header 'Content-Type: application/json' \
  --data '{
	"name": "mongo-sink-connector",
	"config": {
		"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
		"tasks.max": "1",
		"topics": "events",
		"connection.uri": "mongodb://mongodb:27017/my_events",
		"database": "my_events",
		"collection": "kafka_events",
		"max.num.retries": 5,
		"mongo.errors.tolerance": "all",
		"mongo.errors.log.enable": true,
		"errors.log.include.messages": true,
		"errors.deadletterqueue.topic.name": "events.deadletter",
		"errors.deadletterqueue.context.headers.enable": true,
	}
}'

In short, this POST will create a new connector named mongo-sink-connector using the com.mongodb.kafka.connect.MongoSinkConnector java class, run a single connector task that will get all the messages from the events topic and put them into the Mongo found at mongodb://mongodb:27017/my_events, database named my_events and collection named kafka_events. The records which will fail to be written into the database will be placed on a dead letter topic named events.deadletter, in my opinion this is better than discarding them, since we can inspect the topic to see what went wrong.

To verify that the connector is running, you can retrieve its first tasks status with:

➜  bin curl -s -X GET http://localhost:8083/connectors/mongo-sink-connector/tasks/0/status | jq
{
  "id": 0,
  "state": "RUNNING",
  "worker_id": "connect:8083"
}

Querying the Database 🗃

Now that our Kafka Connect cluster is running and is configured, all that’s left to do is POST some dummy data into Kafka and check for it in the database.

curl --request POST \
  --url http://localhost:8082/topics/events \
  --header 'Content-Type: application/vnd.kafka.json.v2+json' \
  --data '{
	"records": [
		{
			"key": "somekey",
			"value": {
				"glossary": {
					"title": "example glossary",
					"GlossDiv": {
						"title": "S",
						"GlossList": {
							"GlossEntry": {
								"ID": "SGML",
								"SortAs": "SGML",
								"GlossTerm": "Standard Generalized Markup Language",
								"Acronym": "SGML",
								"Abbrev": "ISO 8879:1986",
								"GlossDef": {
									"para": "A meta-markup language, used to create markup languages such as DocBook.",
									"GlossSeeAlso": [
										"GML",
										"XML"
									]
								},
								"GlossSee": "markup"
							}
						}
					}
				}
			}
		}
	]
}'

That’s all! 🎉If we now connect to the database using mongosh or any other client, we can query the data.

mongosh
> use my_events
switched to db my_events
> db.kafka_events.findOne()
{
  _id: ObjectId("6147242856623b0098fc756d"),
  glossary: {
    title: 'example glossary',
    GlossDiv: {
      title: 'S',
      GlossList: {
        GlossEntry: {
          ID: 'SGML',
          SortAs: 'SGML',
          GlossTerm: 'Standard Generalized Markup Language',
          Acronym: 'SGML',
          Abbrev: 'ISO 8879:1986',
          GlossDef: {
            para: 'A meta-markup language, used to create markup languages such as DocBook.',
            GlossSeeAlso: [ 'GML', 'XML' ]
          },
          GlossSee: 'markup'
        }
      }
    }
  }
}

Viewing Kafka Connect JMX Metrics

JConsole is a tool that can be used to view JMX metrics exposed by Kafka Connect, if you installed openjdk-8 it should come with it

Start JConsole and connect to localhost:9102. If you get a warning about an insecure connection, accept the connection, and ignore it.

After you’re connected click the MBeans tab and explore 🦹‍♀️

Summary

Getting into Kafka and Kafka Connect can be a bit overwhelming at first. I hope that this tutorial has provided you with the necessary basics so you can continue to play and explore on your own.

Spinning up a playground for Kafka and Connect using docker-compose isn’t that complicated, you can start from the confluent-cp-community repo, it will give you everything you need to get started. With some little modifications to the docker-compose file, we’ve spawned a MongoDB instance and exposed the JMX metrics in Kafka Connect.

Next, we’ve installed and configured the MongoDB connector and confirmed that it works as expected.

If you have any questions let me know in the comments.

Until next time! 🍻

Improving the throughput of a Producer ✈

Hello 👋,

In this article I will give you some tips on how to improve the throughput of a message producer.

I had to write a Golang based application which would consume messages from Apache Kafka and send them into a sink using HTTP JSON / HTTP Protocol Buffers.

To see if my idea works, I started using a naïve approach in which I polled Kafka for messages and then send each message into the sink, one at a time. This worked, but it was slow.

To better understand the system, a colleague has setup Grafana and a dashboard for monitoring, using Prometheus metrics provided by the Sink. This allowed us to test various versions of the producer and observe it’s behavior.

Let’s explore what we can do to improve the throughput.

Request batching 📪

A very important improvement is request batching.

Instead of sending one message at a time in a single HTTP request, try to send more, if the sink allows it.

As you can see in the image, this simple idea improved the throughput from 100msg/sec to ~4000msg/sec.

Batching is tricky, if your batches are large the receiver might be overwhelmed, or the producer might have a tough time building them. If your batches contain a few items you might not see an improvement. Try to choose a batch number which isn’t too high and not to low either.

Fast JSON libraries ⏩

If you’re using HTTP and JSON then it’s a good idea to replace the standard JSON library.

There are lots of open-source JSON libraries that provide much higher performance compared to standard JSON libraries that are built in the language.

See:

The improvements will be visible.

Partitioning 🖇

There are several partitioning strategies that you can implement. It depends on your tech stack.

Kafka allows you to assign one consumer to one partition, if you have 3 partitions in a topic then you can run 3 consumer instances in parallel from that topic, in the same consumer group, this is called replication, I did not use this as the Sink does not allow it, only one instance of the Producer is running at a time.

If you have multiple topics that you want to consume from, you can partition on the topic name or topic name pattern by subscribing to multiple topics at once using regex. You can have 3 consumers consuming from highspeed.* and 3 consumer consuming from other.*. If each topic has 3 partitions.

Note: The standard build of librdkafka doesn’t support negative lookahead regex expressions, if that’s what you need you will need to build the library from source. See issues/2769. It’s easy to do and the confluent-kafka-go client supports custom builds of librdkafka.

Negative lookahead expressions allow you to ignore some patterns, see this example for a better understanding: regex101.com/r/jZ9AEz/1

Source Parallelization 🕊

The confluent-kafka-go client allows you to poll Kafka for messages. Since polling is thread safe, it can be done in multiple goroutines or threads for a performance improvement.

import (
	"fmt"
	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
    var wg sync.WaitGroup
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost",
		"group.id":          "myGroup",
		"auto.offset.reset": "earliest",
	})

	if err != nil {
		panic(err)
	}
   
	c.SubscribeTopics([]string{"myTopic", "^aRegex.*[Tt]opic"}, nil)
    for i := 0; i < 5; i++ {
        go func() {
            wg.Add(1)
            defer wg.Done()
	        for {
	    	    msg, err := c.ReadMessage(-1)
		        if err == nil {
		    	    fmt.Printf("Message on %s: %s\n", msg.TopicPartition, string(msg.Value))
                    // TODO: Send data through a channel to be processed by another goroutine.
    		    } else {
			        // The client will automatically try to recover from all errors.
			        fmt.Printf("Consumer error: %v (%v)\n", err, msg)
		        }
	        }
        }()
    }
    wg.Wait()

	c.Close()
}

Buffered Golang channels can also be used in this scenario in order to improve the performance.

Protocol Buffers 🔷

Finally, I saw a huge performance improvement when replacing the JSON body of the request with Protocol Buffers encoded and snappy compressed data.

If your Sink supports receiving protocol buffers, then it is a good idea to try sending it instead of JSON.

Honorable Mention: GZIP Compressed JSON 📚

The Sink supported receiving GZIP compressed JSON, but in my case I didn’t see any notable performance improvements.

I’ve compared the RAM and CPU usage of the Producer, the number of bytes sent over the network and the message throughput. While there were some improvements in some areas, I decided not to implement GZIP compression.

It’s all about trade-offs and needs.

Conclusion

As you could see, there are several things you can do to your producers in order to make them more efficient.

  • Request Batching
  • Fast JSON Libraries
  • Partitioning
  • Source Parallelization & Buffering
  • Protocol Buffers
  • Compression

I hope you’ve enjoyed this article and learned something! If you have some ideas, please let me know in the comments.

Thanks for reading! 😀

Kafka Connect GitHub source connector

Hello!

In this article we will discuss how to quickly get started with Kafka and Kafka Connect to grab all the commits from a Github repository.

This is a practical tutorial which saves you some time browsing the Kafka’s documentation.

Environment

Kafka is bit difficult to setup, you will need Kafka, Zookeper and Kafka Connect at least. To simplify the setup we’re going to use Docker and docker-compose.

Once you have Docker and docker-compose installed on your system, you can run a single command and get a Kafka environment for developing.

Let’s start Kafka by using the Confluent Platform repository, make sure you have Docker, Docker-Compose and Git installed before proceding.

git clone https://github.com/confluentinc/cp-all-in-one
cd .\cp-all-in-one\cp-all-in-one-community
docker-compose up

After the docker-compose up command finishes, Kafka and it’s related components should be up and running. To make sure all the components are running properly please start a new terminal window and compare your docker ps output with mine:

CONTAINER ID        IMAGE                                         COMMAND                  CREATED             STATUS                            PORTS                                                                      NAMES
fc95d4828397        confluentinc/ksqldb-examples:5.5.1            "bash -c 'echo Waiti…"   6 minutes ago       Up 6 minutes
                 ksql-datagen
28e67d645888        confluentinc/cp-ksqldb-cli:5.5.1              "/bin/sh"                6 minutes ago       Up 6 minutes
                 ksqldb-cli
371fd742ad1f        confluentinc/cp-ksqldb-server:5.5.1           "/etc/confluent/dock…"   6 minutes ago       Up 6 minutes (healthy)            0.0.0.0:8088->8088/tcp
                 ksqldb-server
caeb86c71308        cnfldemos/kafka-connect-datagen:0.3.2-5.5.0   "/etc/confluent/dock…"   6 minutes ago       Up 6 minutes (health: starting)   0.0.0.0:8083->8083/tcp, 9092/tcp
                 connect
c4d90361c677        confluentinc/cp-kafka-rest:5.5.1              "/etc/confluent/dock…"   6 minutes ago       Up 6 minutes                      0.0.0.0:8082->8082/tcp
                 rest-proxy
c3752a0c1487        confluentinc/cp-schema-registry:5.5.1         "/etc/confluent/dock…"   6 minutes ago       Up 6 minutes                      0.0.0.0:8081->8081/tcp
                 schema-registry
14bfbef71822        confluentinc/cp-kafka:5.5.1                   "/etc/confluent/dock…"   6 minutes ago       Up 6 minutes                      0.0.0.0:9092->9092/tcp, 0.0.0.0:9101->9101/tcp, 0.0.0.0:29092->29092/tcp   broker
f2ddb8971efa        confluentinc/cp-zookeeper:5.5.1               "/etc/confluent/dock…"   6 minutes ago       Up 6 minutes                      2888/tcp, 0.0.0.0:2181->2181/tcp, 3888/tcp
                 zookeeper

If it looks similar then you’re ready to start configuring Kafka Connect.

Kafka Connect

Kafka Connect is a great too for building data pipelines. We’re going to use it to get data from Github into Kafka.

We could write a simple python producer in order to do that, query Github’s API and produce a record for Kafka using a client library, but, Kafka Connect comes with additional benefits. It’s like Docker for Producers and Consumers, in fact, in Kafka Connect, the producers are called source connectors and the consumers sink connectors.

Installing the Github connector

To install the Github connector which is supported by Confluent, you have to download it from here. (click download zip)

After the download has finished, unzip the files.

Before installing it, if you visit: http://localhost:8083/connector-plugins you will see the currently installed connectors, there’s currently no Github connector.

To install it, we need to copy it’s folder into the Kafka Connect container, and restart it:

docker cp ./confluentinc-kafka-connect-github-1.0.1/. connect:/usr/share/java
docker restart connect

If we visit http://localhost:8083/connector-plugins again, we’ll see that the connector has been installed successfully:

{"class":"io.confluent.connect.github.GithubSourceConnector","type":"source","version":"1.0.1"},

Configuring the connector

There’s not configuration for the connector thus the connector isn’t running. To configured a connector you need to make a post request to the configuration endpoint, you can have as many configuration as you like for the connector, each configuration represents a worker in it’s own, the catch is that you need to use a different name.

To view the already configured connectors we can visit: localhost:8083/connectors/. As expected, an empty JSON should be returned.

To configure the Github connector, I took the sample config from it’s docummentation and adapted it into json format.

I’ve also chosen to use the JsonConverter instead of the AvroConverter for simplicity, and instead of getting the stargazers I’m getting the commits of the repo instead.

Here’s the config:

{
"name":"MyGithubConnector",
"config":{
"connector.class":"io.confluent.connect.github.GithubSourceConnector",
"confluent.topic.bootstrap.servers":"broker:29092",
"tasks.max":"1",
"confluent.topic.replication.factor":"1",
"github.service.url":"https://api.github.com",
"github.repositories":"apache/kafka",
"github.tables":"commits",
"github.since":"2019-01-01",
"github.access.token":"YOUR_PERSONAL_ACCESS_TOKEN",
"topic.name.pattern":"github-avro-${entityName}",
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter":" org.apache.kafka.connect.json.JsonConverter"
}
}

Now, we need to upload the config to the Kafka Connect, I saved it as a file named github-connector-config.json and used curl to upload it. If the upload is successful the server should echo back your config:

➜ cp-all-in-one-community git:(5.5.1-post) ✗ curl -X POST -H "Content-Type: application/json" --data @github-connector-config.json http://localhost:8083/connectors

{"name":"MyGithubConnector","config":{"connector.class":"io.confluent.connect.github.GithubSourceConnector","confluent.topic.bootstrap.servers":"broker:29092","tasks.max":"1","confluent.topic.replication.factor":"1","github.service.url":"https://api.github.com","github.repositories":"apache/kafka","github.tables":"commits","github.since":"2019-01-01","github.access.token":"xxx","topic.name.pattern":"github-avro-${entityName}","key.converter":"org.apache.kafka.connect.json.JsonConverter","value.converter":" org.apache.kafka.connect.json.JsonConverter","name":"MyGithubConnector"},"tasks":[],"type":"source"}%

Visiting http://localhost:8083/connectors/ now returns our config for our Github connector:

["MyGithubConnector"]

If we check it’s status (http://localhost:8083/connectors/MyGithubConnector/status) we should see that the connector is already running:

{"name":"MyGithubConnector","connector":{"state":"RUNNING","worker_id":"connect:8083"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"connect:8083"}],"type":"source"}

The connector will pull all the commits it can get, in 5 minutes as I let it ran, It pulled over 2.5k in Github commits.

I will stop it in order to preserve disk space:

curl -X PUT localhost:8083/connectors/MyGithubConnector/pause

Inspecting the data in Kafka

To inspect the data I use a tool called kafkacat.

Running kafkacat -C -b localhost:9092 -t github-avro-commits should show you the data collected from Github:

...
"commit":{ "author":{ "name":"hejiefang", "email":"xxxx", "date":1546437651000 }, "committer":{ "name":"Manikumar Reddy", "email":"xxxx", "date":1546437651000 }, "message":"MINOR: Fix doc format in upgrade notes\n\nAuthor: hejiefang <he.jiefang@zte.com.cn>\n\nReviewers: Srinivas <xxxx>, Manikumar Reddy <xxxx>\n\nCloses #6076 from hejiefang/modifynotable", "tree":{ "sha":"91db6646f829d6636011d09fdc8643e36280716b", "url":"https://api.github.com/repos/apache/kafka/git/trees/91db6646f829d6636011d09fdc8643e36280716b" }, "url":"https://api.github.com/repos/apache/kafka/git/commits/ffd6f2a2e8a573695d0c1c98e663f0b8198b1b6d", "comment_count":0, "verification":{ "verified":false, "reason":"unsigned", "signature":null, "payload":null } },
...

Summary

We’ve successfully setup a simple Kafka playground with docker-compose and we installed a SOURCE connector for Github. It was fairly easy and Kafka Connect’s rest API is quite powerful and user friendly.

In the next article we will take the commits data and index it in ElasticSearch using a SINK connector.

Thanks for reading!

Bonus

You can debug the connector by modifying the logging level:

// Logging set
curl -s -X PUT -H "Content-Type:application/json" \
http://localhost:8083/admin/loggers/io.confluent.connect.github.GithubSourceConnector \
-d '{"level": "TRACE"}' \
| jq '.'