Improving the throughput of a Producer βœˆ

Hello πŸ‘‹,

In this article I will give you some tips on how to improve the throughput of a message producer.

I had to write a Golang based application which would consume messages from Apache Kafka and send them into a sink using HTTP JSON / HTTP Protocol Buffers.

To see if my idea works, I started using a naΓ―ve approach in which I polled Kafka for messages and then send each message into the sink, one at a time. This worked, but it was slow.

To better understand the system, a colleague has setup Grafana and a dashboard for monitoring, using Prometheus metrics provided by the Sink. This allowed us to test various versions of the producer and observe it’s behavior.

Let’s explore what we can do to improve the throughput.

Request batching πŸ“ͺ

A very important improvement is request batching.

Instead of sending one message at a time in a single HTTP request, try to send more, if the sink allows it.

As you can see in the image, this simple idea improved the throughput from 100msg/sec to ~4000msg/sec.

Batching is tricky, if your batches are large the receiver might be overwhelmed, or the producer might have a tough time building them. If your batches contain a few items you might not see an improvement. Try to choose a batch number which isn’t too high and not to low either.

Fast JSON libraries ⏩

If you’re using HTTP and JSON then it’s a good idea to replace the standard JSON library.

There are lots of open-source JSON libraries that provide much higher performance compared to standard JSON libraries that are built in the language.

See:

The improvements will be visible.

Partitioning πŸ–‡

There are several partitioning strategies that you can implement. It depends on your tech stack.

Kafka allows you to assign one consumer to one partition, if you have 3 partitions in a topic then you can run 3 consumer instances in parallel from that topic, in the same consumer group, this is called replication, I did not use this as the Sink does not allow it, only one instance of the Producer is running at a time.

If you have multiple topics that you want to consume from, you can partition on the topic name or topic name pattern by subscribing to multiple topics at once using regex. You can have 3 consumers consuming from highspeed.* and 3 consumer consuming from other.*. If each topic has 3 partitions.

Note: The standard build of librdkafka doesn’t support negative lookahead regex expressions, if that’s what you need you will need to build the library from source. See issues/2769. It’s easy to do and the confluent-kafka-go client supports custom builds of librdkafka.

Negative lookahead expressions allow you to ignore some patterns, see this example for a better understanding: regex101.com/r/jZ9AEz/1

Source Parallelization πŸ•Š

The confluent-kafka-go client allows you to poll Kafka for messages. Since polling is thread safe, it can be done in multiple goroutines or threads for a performance improvement.

import (
	"fmt"
	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
    var wg sync.WaitGroup
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost",
		"group.id":          "myGroup",
		"auto.offset.reset": "earliest",
	})

	if err != nil {
		panic(err)
	}
   
	c.SubscribeTopics([]string{"myTopic", "^aRegex.*[Tt]opic"}, nil)
    for i := 0; i < 5; i++ {
        go func() {
            wg.Add(1)
            defer wg.Done()
	        for {
	    	    msg, err := c.ReadMessage(-1)
		        if err == nil {
		    	    fmt.Printf("Message on %s: %s\n", msg.TopicPartition, string(msg.Value))
                    // TODO: Send data through a channel to be processed by another goroutine.
    		    } else {
			        // The client will automatically try to recover from all errors.
			        fmt.Printf("Consumer error: %v (%v)\n", err, msg)
		        }
	        }
        }()
    }
    wg.Wait()

	c.Close()
}

Buffered Golang channels can also be used in this scenario in order to improve the performance.

Protocol Buffers πŸ”·

Finally, I saw a huge performance improvement when replacing the JSON body of the request with Protocol Buffers encoded and snappy compressed data.

If your Sink supports receiving protocol buffers, then it is a good idea to try sending it instead of JSON.

Honorable Mention: GZIP Compressed JSON πŸ“š

The Sink supported receiving GZIP compressed JSON, but in my case I didn’t see any notable performance improvements.

I’ve compared the RAM and CPU usage of the Producer, the number of bytes sent over the network and the message throughput. While there were some improvements in some areas, I decided not to implement GZIP compression.

It’s all about trade-offs and needs.

Conclusion

As you could see, there are several things you can do to your producers in order to make them more efficient.

  • Request Batching
  • Fast JSON Libraries
  • Partitioning
  • Source Parallelization & Buffering
  • Protocol Buffers
  • Compression

I hope you’ve enjoyed this article and learned something! If you have some ideas, please let me know in the comments.

Thanks for reading! πŸ˜€

AutoFixture in ASP.Net Core πŸŽ₯

Hello, πŸ‘‹

This is my first video blog post in which I try to explain AutoFixture.

Here’s the test case referenced in the video and the code repository for the Retroactiune project.

I used the following packages in my project.

    <ItemGroup>
        <PackageReference Include="AutoFixture" Version="4.17.0" />
        <PackageReference Include="AutoFixture.Xunit2" Version="4.17.0" />
    </ItemGroup>

Thanks for watching! πŸ™‚

Computer Speakers πŸ”Š

Hello,

Yes, this post title is weird, but I want to take a moment to acknowledge that I’m very grateful for dad for buying these computer speakers and much more.

In ~2002 my dad bought me my first computer, and at the same time a man approached us asking if we want to buy his second-hand speakers. I was about 5-6 years old at that time and I still remember the moment.

I’m 25 years old now and these are the only computer speakers that I’ve used, they are almost 20 years old!

Thank you dad! πŸ”‰

Sharding MongoDB using Range strategy

Hi πŸ‘‹πŸ‘‹

In this article I will explore the topic of sharding a Mongo Database that runs on Kubernetes. Before we get started, if you want to follow along, please install the tools listed in the prerequisites section, and if you want to learn more about sharding, check out this fantastic article Sharding Pattern.

Prerequisites

Introduction

Let’s install a MongoDB instance on the Kubernetes cluster using helm.

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-mongo bitnami/mongodb-sharded

After the installation completes, save the database’s root password and replica set key. While doing this the first time I messed up and didn’t save them properly.

Run the following commands to print the password and replica set key on the command line. If you’re on Windows I have provided you with a Powershell function for base64 and if you’re on Unix don’t forget to pass –decode to base64.

kubectl get secret --namespace default my-release-mongodb-sharded -o jsonpath="{.data.mongodb-root-password}" | base64
kubectl get secret --namespace default my-release-mongodb-sharded -o jsonpath="{.data.mongodb-replica-set-key}" | base64

Sharding the Database

Verify that all your pods are running and start a shell connection to the mongos server.

	@denis ➜ ~ kubectl get pods
	NAME                                              READY   STATUS    RESTARTS   AGE
	my-mongo-mongodb-sharded-configsvr-0              1/1     Running   0          3m8s
	my-mongo-mongodb-sharded-configsvr-1              1/1     Running   0          116s
	my-mongo-mongodb-sharded-mongos-c4dd66768-dqlbv   1/1     Running   0          3m8s
	my-mongo-mongodb-sharded-shard0-data-0            1/1     Running   0          3m8s
	my-mongo-mongodb-sharded-shard0-data-1            1/1     Running   0          103s
	my-mongo-mongodb-sharded-shard1-data-0            1/1     Running   0          3m8s
my-mongo-mongodb-sharded-shard1-data-1            1/1     Running   0          93s
kubectl port-forward --namespace default svc/my-mongo-mongodb-sharded 27017:27017
# and in another terminal:
mongosh --host 127.0.0.1 --authenticationDatabase admin -u root -p $MONGODB_ROOT_PASSWORD

By running sh.status() you should get an output which contains two mongo shards:

shards
[
  {
    _id: 'my-mongo-mongodb-sharded-shard-0',
    host: 'my-mongo-mongodb-sharded-shard-0/my-mongo-mongodb-sharded-shard0-data-0.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017,my-mongo-mongodb-sharded-shard0-data-1.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017',
    state: 1
  },
  {
    _id: 'my-mongo-mongodb-sharded-shard-1',
    host: 'my-mongo-mongodb-sharded-shard-1/my-mongo-mongodb-sharded-shard1-data-0.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017,my-mongo-mongodb-sharded-shard1-data-1.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017',
    state: 1
  }
]

To enable sharding on the database and collection, I’m going to insert some dummy data in my_data database and my_users collections. The script used to insert the data is attached at the end of this blog post.

[direct: mongos]> sh.enableSharding("my_data")
{
  ok: 1,
  operationTime: Timestamp(3, 1628345449),
  '$clusterTime': {
    clusterTime: Timestamp(3, 1628345449),
    signature: {
      hash: Binary(Buffer.from("e57c8c37047f7aa170fb59f6b11e22aa65159a30", "hex"), 0),
      keyId: Long("6993682727694237708")
    }
  }
}

[direct: mongos]> db.my_users.createIndex({"t": 1})
[direct: mongos]> sh.shardCollection("my_data.my_users", { "t": 1 })

sh.addShardToZone("my-mongo-mongodb-sharded-shard-1", "TSR1")
sh.addShardToZone("my-mongo-mongodb-sharded-shard-0", "TSR2")

If you’ve made it this far, congrats, you’ve enabled sharding, now let’s define some rules.

Since we’re going to use a range sharding strategy based on the key t, and I have two shards available I would like my data to be distributed in the following way:

 sh.updateZoneKeyRange("my_data.my_users", {t: 46}, {t: MaxKey()}, "TSR2")
 sh.updateZoneKeyRange("my_data.my_users", {t: MinKey()}, {t: 46}, "TSR1")

Note: The label on the TSR2 Zone is wrong, the correct value is: 46 ≀ t < 1000

Running sh.status() should now yield the following output.

    collections: {
      'my_data.my_users': {
        shardKey: { t: 1 },
        unique: false,
        balancing: true,
        chunkMetadata: { shard: 'my-mongo-mongodb-sharded-shard-1', nChunks: 3 },
        chunks: [
          {
            min: { t: MinKey() },
            max: { t: 45 },
            'on shard': 'my-mongo-mongodb-sharded-shard-1',
            'last modified': Timestamp(2, 1)
          },
          {
            min: { t: 46 },
            max: { t: MaxKey() },
            'on shard': 'my-mongo-mongodb-sharded-shard-0',
            'last modified': Timestamp(0, 2)
          }
        ],
        tags: [
          { tag: 'TSR1', min: { t: MinKey() }, max: { t: 46} },
          { tag: 'TSR2', min: { t: 46 }, max: { t: MaxKey() } }
        ]
      }

To test the rules, use the provided python script, modify the times variable and run it with various values.

You can run db.my_users.getShardDistribution() to view the data distribution on the shards.

[direct: mongos]> db.my_users.getShardDistribution()

Shard my-mongo-mongodb-sharded-shard-0 at my-mongo-mongodb-sharded-shard-0/my-mongo-mongodb-sharded-shard0-data-0.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017,my-mongo-mongodb-sharded-shard0-data-1.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017
{
  data: '144KiB',
  docs: 1667,
  chunks: 1,
  'estimated data per chunk': '144KiB',
  'estimated docs per chunk': 1667
}

Shard my-mongo-mongodb-sharded-shard-1 at my-mongo-mongodb-sharded-shard-1/my-mongo-mongodb-sharded-shard1-data-0.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017,my-mongo-mongodb-sharded-shard1-data-1.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017
{
  data: '195KiB',
  docs: 2336,
  chunks: 3,
  'estimated data per chunk': '65KiB',
  'estimated docs per chunk': 778
}

Adding More Shards

To add more shards to the cluster all we need to do is run helm upgrade, if you don’t mess up the replica set key like I did it should work on the first run.

helm upgrade my-mongo bitnami/mongodb-sharded --set shards=3,configsvr.replicas=2,shardsvr.dataNode.replicas=2,mongodbRootPassword=tcDMM5sqNC,replicaSetKey=D6BGM2ixd3

If you mess up the key πŸ˜…, then to solve the issue and bring your cluster back online follow these steps.

  1. downgrade the cluster back to 2 shards
  2. SSH into an old working shard shard1 or shard0, and grab the credentials from the environment variables.

The kubernetes secret and mongos pod’s credential have been overridden by the upgrade and they are wrong!

MONGODB_ROOT_PASSWORD=tcDMM5sqNC
MONGODB_ENABLE_DIRECTORY_PER_DB=no
MONGODB_SYSTEM_LOG_VERBOSITY=0
MY_MONGO_MONGODB_SHARDED_SERVICE_PORT=27017
KUBERNETES_SERVICE_HOST=10.245.0.1
MONGODB_REPLICA_SET_KEY=D6BGM2ixd3

After you save the correct password and replica set key, search for the volumes that belong to the shards which have the wrong replica set key and delete them. In my case I only delete the volumes which belong to the 3rd shard that I’ve added, since counting starts from 0, I’m looking for shard2 in the name.

@denis ➜ Downloads kubectl get persistentvolumeclaims
NAME                                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
datadir-my-mongo-mongodb-sharded-configsvr-0       Bound    pvc-8e7fa303-9198-419e-a6c1-8de3e6d89962   8Gi        RWO            do-block-storage   132m
datadir-my-mongo-mongodb-sharded-configsvr-1       Bound    pvc-6e3bc70f-83a8-4e80-b856-c44a4295be35   8Gi        RWO            do-block-storage   131m
datadir-my-mongo-mongodb-sharded-shard0-data-0     Bound    pvc-f66647bc-ee3b-4820-b466-a11b197fde74   8Gi        RWO            do-block-storage   132m
datadir-my-mongo-mongodb-sharded-shard0-data-1     Bound    pvc-62257e91-d461-4ddb-af37-4876d2431703   8Gi        RWO            do-block-storage   131m
datadir-my-mongo-mongodb-sharded-shard1-data-0     Bound    pvc-9a062ba5-f320-49c9-ae15-d75e8e5f2cf8   8Gi        RWO            do-block-storage   132m
datadir-my-mongo-mongodb-sharded-shard1-data-1     Bound    pvc-068b04bd-8875-40d7-b47c-40092ceb7973   8Gi        RWO            do-block-storage   130m
datadir-my-mongo-mongodb-sharded-shard2-data-0     Bound    pvc-93d9a238-ae36-49e1-b0b6-f320baf89373   8Gi        RWO            do-block-storage   73m
datadir-my-mongo-mongodb-sharded-shard2-data-1     Bound    pvc-b09a8d0d-5012-4f23-8096-a713f3025521   8Gi        RWO            do-block-storage   50m
@denis ➜ Downloads kubectl get persistentvolumes
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                      STORAGECLASS       REASON   AGE
pvc-068b04bd-8875-40d7-b47c-40092ceb7973   8Gi        RWO            Delete           Bound    default/datadir-my-mongo-mongodb-sharded-shard1-data-1     do-block-storage            131m
pvc-321136d8-8a27-45cb-8ed1-8d636c530859   8Gi        RWO            Delete           Bound    default/datadir-my-release-mongodb-sharded-shard2-data-1   do-block-storage            143m
pvc-42dd7167-5836-4e94-bf42-473c6cea49a4   8Gi        RWO            Delete           Bound    default/datadir-my-release-mongodb-sharded-shard2-data-0   do-block-storage            145m
pvc-48714777-97b3-4acc-8562-7b69a8e3b488   8Gi        RWO            Delete           Bound    default/datadir-my-release-mongodb-sharded-shard1-data-1   do-block-storage            143m
pvc-499797e9-a5df-4c7b-a1fb-482c3dca36a6   8Gi        RWO            Delete           Bound    default/datadir-my-release-mongodb-sharded-shard3-data-1   do-block-storage            143m
pvc-61ec9e04-1bad-4312-ba16-fb24c12efb4b   8Gi        RWO            Delete           Bound    default/datadir-my-release-
...

After that’s done, run the helm upgrade command again and if everything is working get a mongosh connection πŸ˜€.

Running sh.status() will now show the 3rd shard.

[
  {
    _id: 'my-mongo-mongodb-sharded-shard-0',
    host: 'my-mongo-mongodb-sharded-shard-0/my-mongo-mongodb-sharded-shard0-data-0.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017,my-mongo-mongodb-sharded-shard0-data-1.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017',
    state: 1,
    tags: [ 'TSR2' ]
  },
  {
    _id: 'my-mongo-mongodb-sharded-shard-1',
    host: 'my-mongo-mongodb-sharded-shard-1/my-mongo-mongodb-sharded-shard1-data-0.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017,my-mongo-mongodb-sharded-shard1-data-1.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017',
    state: 1,
    tags: [ 'TSR1' ]
  },
  {
    _id: 'my-mongo-mongodb-sharded-shard-2',
    host: 'my-mongo-mongodb-sharded-shard-2/my-mongo-mongodb-sharded-shard2-data-0.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017,my-mongo-mongodb-sharded-shard2-data-1.my-mongo-mongodb-sharded-headless.default.svc.cluster.local:27017',
    state: 1
  }
]

Next, update the sharding rules and everything will be working as in the diagram.

sh.addShardToZone("my-mongo-mongodb-sharded-shard-2", "TSR3")
sh.removeRangeFromZone("my_data.my_users", {t: 46}, {t: MaxKey()}, "TSR2")
sh.updateZoneKeyRange("my_data.my_users", {t: 46}, {t 1000}, "TSR2")
sh.updateZoneKeyRange("my_data.my_users", {t: 1000}, {t: MaxKey()}, "TSR3")

sh.status() should show something like

        chunks: [
          {
            min: { t: MinKey() },
            max: { t: 46 },
            'on shard': 'my-mongo-mongodb-sharded-shard-1',
            'last modified': Timestamp(0, 5)
          },
          {
            min: { t: 46 },
            max: { t: 1000 },
            'on shard': 'my-mongo-mongodb-sharded-shard-0',
            'last modified': Timestamp(3, 4)
          },
          {
            min: { t: 1000 },
            max: { t: MaxKey() },
            'on shard': 'my-mongo-mongodb-sharded-shard-2',
            'last modified': Timestamp(1, 5)
          }
        ],
        tags: [
          { tag: 'TSR1', min: { t: MinKey() }, max: { t: 46 } },
          { tag: 'TSR2', min: { t: 46 }, max: { t: 1000 } },
          { tag: 'TSR3', min: { t: 1000 }, max: { t: MaxKey() } }
        ]
      }

Conclusions

Shading a MongoDB can seem intimidating at first, but with some practice in advance you can do it! If sharding doesn’t work out for you, you can Convert Sharded Cluster to Replica Set, but, be prepared with some backups.

Thanks for reading πŸ“š and happy hacking! πŸ”©πŸ”¨

Base64 Powershell Function
function global:Convert-From-Base64 {
  [CmdletBinding()]
  [Alias('base64')]
  param (
    [parameter(ValueFromPipeline,Mandatory=$True,Position=0)]
    [string] $EncodedText
  )
  process {
    [System.Text.Encoding]::ASCII.GetString([System.Convert]::FromBase64String($EncodedText))
  }
}

Python Script

import random

import pymongo


def do_stuff():
    client = pymongo.MongoClient("mongodb://root:tcDMM5sqNC@127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000")
    col = client.my_data.my_users

    usernames = ["dovahkiin", "rey", "dey", "see", "mee", "rollin", "they", "hating"]
    hobbies = ["coding", "recording", "streaming", "batman", "footbal", "sports", "mathematics"]
    ages = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]
    # times = [12, 14, 15, 23, 45, 32, 20]
    times = [47, 80, 93, 49, 96, 43]

    buffer = []
    for _ in range(1_000):
        first = random.choice(usernames).capitalize()
        mid = random.choice(usernames).capitalize()
        last = random.choice(usernames).capitalize()
        buffer.append(pymongo.InsertOne({
            "name": f"{first} '{mid}' {last}",
            "age": random.choice(ages),
            "hobbies": random.choice(hobbies),
            "t": random.choice(times)
        }))
    col.bulk_write(buffer)


if __name__ == '__main__':
    do_stuff()

References

https://bitnami.com/stack/mongodb-sharded/helm

https://docs.microsoft.com/en-us/azure/architecture/patterns/sharding

https://docs.mongodb.com/manual/core/zone-sharding/

https://docs.mongodb.com/manual/core/ranged-sharding/

https://docs.mongodb.com/manual/reference/method/sh.updateZoneKeyRange/

https://docs.mongodb.com/v5.0/core/sharding-choose-a-shard-key/

Blue vector created by starline – www.freepik.com