Data Pipelines at Tealium: Scaling a Message Broker System for Half the Cost

Tealium Employee

Tealium is a data hub platform that processes events at a large scale. We've seen tremendous growth in the amount of incoming customer data, where a single account might receive up to 30k events per second. Each of those events is then processed through a real-time data pipeline that requires a massive event-driven system. This has forced us to evaluate which message broker platform will serve us best in the long term: RabbitMQ or Amazon Managed Streaming for Apache Kafka (Amazon MSK).

We are currently using RabbitMQ, but are considering alternative solutions for the following reasons:

  1. Cost - As event volume increases, the number and size of the nodes also increase, causing a significant increase in Amazon EC2 cost.
  2. Scale - We manage our own RabbitMQ cluster, which involves monitoring both the health of the nodes and the software running on them. This extra work incurs its own cost in time and effort, and brings the added risk of message loss: all things we want to minimize.

Our goal is to build a system that handles at least 50k messages per second across multiple queues and processes them in real time, while being cost-effective.

We chose to evaluate Amazon MSK because it is maintained by Amazon and because Apache Kafka is a widely adopted solution with a great community backing. As we began investigating a new solution with Amazon MSK, these were the questions that we wanted to answer:

  1. What topic and producer configurations best fit our requirements?
  2. How many messages per second can a producer send to 1, 5, or 10 topics that reside in one MSK cluster given our desired configurations?
    • Does the number of topics on the cluster affect the traffic?
  3. How is the CPU usage of the cluster and the individual brokers affected by large volumes of data?
  4. What is the cost of running Amazon MSK at the desired scale?

Spoiler alert: not only did Amazon MSK meet our expectations, but we estimated that switching to Amazon MSK would cut our costs in half!

In the remainder of this post, I will walk through each of these questions and explain how the answers brought us to the final decision of switching from RabbitMQ to Amazon MSK.

Topic and Producer Config

Question: What topic and producer configurations best fit our requirements?

Amazon MSK has a significant number of configuration settings for topics, producers, and consumers, but I am only going to cover three: replication, number of partitions per topic, and producer acks. These three settings share a common theme of durability and scalability, which lines up with the requirements for our solution. Check out my explanations of each setting below.

Replication: Durability at the Topic Level

Replication is the process of writing each message across multiple brokers for the availability and durability of the cluster. If one broker goes down, the data on that broker is replicated on another. We tested two and three replicas: we want a level of replication for durability, but also wanted to see how the number of replicas affects the number of messages we can send. Below is a table displaying the results of many tests we ran with one topic deployed to two different sized clusters, using a message size of 3 KB.

| Instance Type | Brokers | Topics | Partitions per Topic | Replicas | Acks | Average Messages per Second |
|---------------|---------|--------|----------------------|----------|------|-----------------------------|
| m5.large      | 3       | 1      | 12                   | 2        | 1    | ~130,000                    |
| m5.large      | 3       | 1      | 12                   | 3        | 1    | ~110,000                    |
| m5.xlarge     | 3       | 1      | 12                   | 2        | 1    | ~150,000                    |
| m5.xlarge     | 3       | 1      | 12                   | 3        | 1    | ~135,000                    |

With a higher number of replicas, the cluster handles lower throughput, since it has to spend more time and CPU replicating the data. We decided to go with two replicas, which gives us a level of availability and durability while still handling a larger throughput of messages per second.
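The replication overhead can be reasoned about with simple arithmetic: every message is persisted once per replica, so total broker writes grow linearly with the replication factor. A quick back-of-the-envelope sketch using the throughput figures from the table above:

```python
def broker_writes_per_sec(messages_per_sec: int, replicas: int) -> int:
    """Each message is persisted once per replica, so the cluster as a
    whole performs messages_per_sec * replicas writes every second."""
    return messages_per_sec * replicas

# m5.large cluster figures from the table above
two_replicas = broker_writes_per_sec(130_000, 2)    # 260,000 writes/sec
three_replicas = broker_writes_per_sec(110_000, 3)  # 330,000 writes/sec

# Even at a lower message rate, three replicas push more total writes
# through the brokers -- which is where the extra time and CPU go.
print(two_replicas, three_replicas)
```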

Acks: Durability at the Producer Level

Acks is a setting that lets you control durability on the producer side. We considered two values: All and 1. With All, the producer waits until the message has been delivered to all of the replicas within the topic; this is the most durable option. The value 1 means “leader acknowledgment”: the message is written to the leader's local log, but the producer does not wait for the replicas to confirm they have received it. With this setting, message loss is possible.
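As a sketch of how the two modes look in producer configuration (shown as plain Kafka producer properties rather than any specific client's API; the broker addresses are placeholders):

```python
# Kafka producer properties for the two acks modes discussed above.
# Property names follow the standard Kafka producer configuration.

durable_producer = {
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",  # placeholder addresses
    "acks": "all",  # wait for every in-sync replica to confirm the write
}

fast_producer = {
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",
    "acks": "1",  # leader acknowledgment only; replicas may lag behind
}
```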

We ran tests with these two settings to measure message throughput on a cluster of 3 brokers of size m5.large:

| Topics | Partitions per Topic | Replicas | Acks | Average Messages per Second |
|--------|----------------------|----------|------|-----------------------------|
| 5      | 12                   | 2        | 1    | ~150,000                    |
| 5      | 12                   | 2        | All  | ~22,500                     |
| 5      | 24                   | 2        | 1    | ~195,000                    |
| 5      | 24                   | 2        | All  | ~36,000                     |
| 5      | 50                   | 2        | 1    | ~202,500                    |
| 5      | 50                   | 2        | All  | ~40,500                     |

When acks is set to 1, the cluster clearly handles a much higher produce throughput, since it avoids the overhead of confirming delivery to the replicas. If we were aware of and accepted the risk of losing a message, setting acks to 1 would give better performance; since we cannot accept that risk, setting acks to All is the best option for durability. Because we cannot accept the loss of messages, we set our acks to All.

Partitions per Topic

A partition allows you to split the data in a topic across multiple brokers in the cluster, which in turn allows parallel consumers to process the data. We ran tests to see how increasing the number of partitions affects message throughput on a cluster of 3 brokers of size m5.xlarge:

| Topics | Partitions per Topic | Replicas | Acks | Average Messages per Second |
|--------|----------------------|----------|------|-----------------------------|
| 5      | 12                   | 2        | 1    | ~150,000                    |
| 5      | 24                   | 2        | 1    | ~157,500                    |
| 5      | 50                   | 2        | 1    | ~165,000                    |
| 5      | 12                   | 2        | All  | ~63,750                     |
| 5      | 24                   | 2        | All  | ~72,000                     |
| 5      | 50                   | 2        | All  | ~75,000                     |

Consistently, when the number of partitions increases, the average number of messages per second that can be produced and consumed also increases. In order for the partitions to be used optimally, consumers should be parallelized to consume those messages across the multiple partitions.
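That parallelism typically comes from a consumer group, in which Kafka divides a topic's partitions among the group's members. A simplified sketch of a round-robin assignment (Kafka's real assignment strategy is configurable; this only illustrates the idea):

```python
def assign_partitions(num_partitions: int, consumer_ids: list[str]) -> dict[str, list[int]]:
    """Spread partitions round-robin across consumers, mimicking (in
    simplified form) how a Kafka consumer group divides up a topic."""
    assignment: dict[str, list[int]] = {cid: [] for cid in consumer_ids}
    for partition in range(num_partitions):
        owner = consumer_ids[partition % len(consumer_ids)]
        assignment[owner].append(partition)
    return assignment

# 12 partitions shared by 3 consumers -> 4 partitions each,
# so the topic is consumed by three readers in parallel.
print(assign_partitions(12, ["c0", "c1", "c2"]))
```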

The partition key is an important component of managing partitions. When a message is sent to a topic, the partition key determines which partition that message lands on. For example, in a topic with 5 partitions, a message whose key maps to partition 1 is placed on partition 1; one whose key maps to partition 3 is placed on partition 3, and so on. Kafka handles this automatically under the hood, but you can set the partition key manually if needed. If you do, be careful to avoid unbalancing the partitions across the brokers, because imbalance skews CPU usage across the brokers (covered in a section below). We decided to let Kafka handle partition assignment automatically until we have a good enough use case to maintain it ourselves.
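The routing itself can be sketched as a hash of the key modulo the partition count. Kafka's default partitioner uses a murmur2 hash of the key bytes; the md5 stand-in below keeps the sketch dependency-free but illustrates the same property, namely that equal keys always land on the same partition:

```python
import hashlib

def partition_for_key(key: str, num_partitions: int) -> int:
    """Pick a partition from the message key. Kafka's default partitioner
    uses murmur2 on the key bytes; md5 here is only a stand-in, but the
    idea is the same: the same key always maps to the same partition,
    which preserves per-key ordering."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Equal keys are routed to the same partition every time.
p1 = partition_for_key("account-123", 5)
p2 = partition_for_key("account-123", 5)
assert p1 == p2
print(p1)
```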

Traffic of Multiple Topics

Question: How many messages per second can a producer send to 1, 5, or 10 topics that reside in one MSK cluster given our desired configurations? Does the number of topics on the cluster affect the traffic?

To deploy and maintain infrastructure cost-effectively in production, we want to ensure that multiple topics can reside in one cluster without impacting performance across the individual topics or across the cluster as a whole. Below is a table displaying the results of tests we ran with 1 topic, 5 topics, and 10 topics deployed to two different sized clusters.

| Instance Type | Brokers | Topics | Partitions per Topic | Replicas | Acks | Average Messages per Second |
|---------------|---------|--------|----------------------|----------|------|-----------------------------|
| m5.large      | 3       | 1      | 12                   | 2        | All  | ~22,000                     |
| m5.large      | 3       | 5      | 12                   | 2        | All  | ~22,500                     |
| m5.large      | 3       | 10     | 12                   | 2        | All  | ~20,000                     |
| m5.xlarge     | 3       | 1      | 12                   | 2        | All  | ~64,000                     |
| m5.xlarge     | 3       | 5      | 12                   | 2        | All  | ~63,750                     |
| m5.xlarge     | 3       | 10     | 12                   | 2        | All  | ~62,000                     |

Overall, the number of topics on the cluster has no direct effect on the throughput of messages produced and consumed per second. This is great news, because it means we don't need to maintain a separate cluster for each topic, which would multiply the cost.

Handling High CPU Usage

Question: How is the CPU usage of the cluster and the individual brokers affected by large volumes of data?

I encountered some interesting behavior while attempting to throttle Amazon MSK. I ran the performance tests using the Apache Kafka CLI deployed on compute-heavy EC2 instances. The CLI has built-in producer and consumer performance testing classes that stress-test the cluster they are configured to produce to and consume from. Here is an example CLI command for producer performance testing:

kafka_2.12-2.2.1/bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --topic test --num-records 50000000 --record-size 3000 --throughput -1 --producer-props bootstrap.servers=$broker_servers --print-metrics

Example Output:

40 records sent, 8.0 records/sec (0.08 MB/sec), 2595.8 ms avg latency, 4582.0 ms max latency.
56 records sent, 11.1 records/sec (0.11 MB/sec), 7010.7 ms avg latency, 9631.0 ms max latency.
...
55 records sent, 11.0 records/sec (0.10 MB/sec), 22335.0 ms avg latency, 24834.0 ms max latency.
1000 records sent, 9.984225 records/sec (0.10 MB/sec), 49693.23 ms avg latency, 99652.00 ms max latency, 49552 ms 50th, 94396 ms 95th, 98785 ms 99th, 99652 ms 99.9th.

Using multiple EC2 instances let us test performance with parallel producers and consumers, as well as behavior under throttling. I started with one producer firing messages at my test topic, then began adding other producers running the performance tool against the same topic. As I spun up more producers, the first two were still able to increase the amount of traffic to the topic; as I continued adding producers, Amazon MSK throttled them and the number of messages each producer could send per second decreased. Throughout all these tests, broker CPU usage never went above 80%, averaging between 60% and 70%. This gave us confidence in the health of the Amazon MSK cluster, but showed that we need logic in our producers to handle being throttled by Amazon MSK.
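That throttle-handling logic can be as simple as retrying with exponential backoff when a send is rejected. A minimal sketch, where `send` and the exception raised are hypothetical stand-ins rather than a real client API:

```python
import time

def send_with_backoff(send, message, max_retries=5, base_delay=0.1):
    """Retry a throttled send with exponential backoff. `send` is a
    stand-in for whatever produce call the client exposes; it is
    expected to raise when the broker throttles or rejects the send."""
    for attempt in range(max_retries):
        try:
            return send(message)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...

# Example with a fake send that is "throttled" twice before succeeding.
attempts = {"n": 0}
def flaky_send(msg):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

print(send_with_backoff(flaky_send, "event", base_delay=0.001))  # ok after retries
```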

CPU usage should be evenly distributed across the brokers. If a topic's partitions are not properly distributed across the brokers, the CPU usage of each broker will not be equal and you will not get the highest possible throughput from your cluster. Below is an example of a CPU usage graph for a topic that is not properly partitioned across the brokers:

[Figure: per-broker CPU usage for an unbalanced topic (5 topics, 3 m5.xlarge brokers, acks=All, 24 partitions, before re-partitioning)]

When the topic has been balanced appropriately across the brokers, the CPU usage will be balanced across the brokers. Here is an example of a CPU usage graph with a topic that is properly partitioned across the brokers:

[Figure: per-broker CPU usage for a balanced topic (5 topics, 3 m5.xlarge brokers, acks=All, 50 partitions)]

The CPU usage may not always be as balanced as in the image above, but the goal is to maximize the utilization of each broker for maximum throughput. A difference of 5-10% in CPU usage between brokers is an acceptable difference.
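That tolerance can be expressed as a simple check over per-broker CPU readings; a hypothetical helper (the threshold and sample values are illustrative only):

```python
def cpu_spread_ok(broker_cpu: list[float], max_spread: float = 10.0) -> bool:
    """Return True if the gap between the busiest and idlest broker is
    within max_spread percentage points -- a rough stand-in for the
    5-10% tolerance described above."""
    return (max(broker_cpu) - min(broker_cpu)) <= max_spread

print(cpu_spread_ok([62.0, 65.5, 68.0]))  # True: a well-balanced cluster
print(cpu_spread_ok([40.0, 55.0, 85.0]))  # False: one hot broker; repartition
```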

Cost Savings

The projected cost savings exceed 50%, since a cluster of 3 m5.xlarge brokers will handle the same throughput as our RabbitMQ configuration of over 20 nodes. This does not include the savings in engineer-hours no longer spent maintaining the health of individual nodes, so the overall real-world savings are even higher.
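As a back-of-the-envelope illustration of why the node-count reduction dominates, with hypothetical hourly rates (placeholders for illustration, not our actual pricing):

```python
# Hypothetical on-demand hourly rates -- placeholders, not real pricing.
RABBITMQ_NODE_RATE = 0.10  # per node-hour (assumed)
MSK_BROKER_RATE = 0.30     # per broker-hour (assumed managed-service premium)

HOURS_PER_MONTH = 730
rabbitmq_monthly = 20 * RABBITMQ_NODE_RATE * HOURS_PER_MONTH  # 20+ node cluster
msk_monthly = 3 * MSK_BROKER_RATE * HOURS_PER_MONTH           # 3 m5.xlarge brokers

savings = 1 - msk_monthly / rabbitmq_monthly
# Even at a 3x per-node premium for the managed service, 3 brokers
# replacing 20 nodes still comes out well ahead of 50% savings.
print(f"estimated savings: {savings:.0%}")
```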

Conclusion

In the end, we made the decision to switch from RabbitMQ to Amazon MSK. The cost savings, ease of maintaining managed services, and all around scalability were more than enough to convince us it was the right move going forward. As an added bonus, this will allow us to take a look at the entire Kafka ecosystem, including Kafka Streaming, Flink, and Schema Registries, for future scalability.
