The OpenMessaging Benchmark Framework
The OpenMessaging Benchmark Framework is a suite of tools that makes it easy to benchmark distributed messaging systems in the cloud.
Project goals
The goal of the OpenMessaging Benchmark Framework is to provide benchmarking suites for an ever-expanding variety of messaging platforms. These suites are intended to be:
- Cloud based — All benchmarks are run on cloud infrastructure, not on your laptop
- Easy to use — Just a few CLI commands get you from zero to completed benchmarks
- Transparent — All benchmarking code is open source, with pull requests very welcome
- Realistic — Benchmarks should be largely oriented toward standard use cases rather than bizarre edge cases
If you’re interested in contributing to the OpenMessaging Benchmark Framework, you can find the code on GitHub.
Supported messaging systems
OpenMessaging benchmarking suites are currently available for the following systems:
For each platform, the benchmarking suite includes easy-to-use scripts for deploying that platform on Alibaba Cloud and Amazon Web Services (AWS) and then running benchmarks on the resulting deployment. For end-to-end instructions, see the platform-specific docs for:
OpenMessaging Benchmark Framework Components
The OpenMessaging Benchmark Framework contains two components: the driver and the workers.
Driver - The benchmark executor. The driver is responsible for assigning tasks, creating the benchmark topics, and creating the producers and consumers.
Worker - A benchmark worker listens for tasks from the driver and performs them. The worker ensemble communicates over HTTP (default port 8080).
Basic usage & flags
Driver
$ sudo bin/benchmark \
--drivers driver-kafka/kafka-exactly-once.yaml \
--workers 1.2.3.4:8080,4.5.6.7:8080 \
workloads/1-topic-16-partitions-1kb.yaml   # --workers can be shortened to -w
Flag | Description | Default
:---|:---|---:
`-c` / `--csv` | Print results from this directory to a CSV file. | N/A
`-d` / `--drivers` | Drivers list, e.g. `pulsar/pulsar.yaml,kafka/kafka.yaml` | N/A
`-x` / `--extra` | Allocate extra consumer workers when your backlog builds. | false
`-w` / `--workers` | List of worker nodes, e.g. `http://1.2.3.4:8080,http://4.5.6.7:8080` | N/A
`-wf` / `--workers-file` | Path to a YAML file containing the list of worker addresses. | N/A
`-h` / `--help` | Print help message | false
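The `--workers-file` / `-wf` option expects a small YAML document listing the worker addresses. The snippet below is a minimal sketch of that shape; the top-level key name is an assumption made for illustration, so compare it against the workers file produced by the deployment scripts.

```yaml
# Minimal sketch of a workers file for --workers-file / -wf.
# The "workers" key is assumed for illustration; check the file generated by
# the deployment scripts for the exact schema.
workers:
  - http://1.2.3.4:8080
  - http://4.5.6.7:8080
```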
Workers
$ sudo bin/benchmark-worker \
--port 9090 \
--stats-port 9091
Flag | Description | Default
:---|:---|---:
`-p` / `--port` | HTTP port to listen on. | 8080
`-sp` / `--stats-port` | Stats port to listen on. | 8081
`-h` / `--help` | Print help message | false
Benchmarking workloads
Benchmarking workloads are specified in YAML configuration files that are available in the workloads directory. The table below describes each workload in terms of the following parameters (a sketch of one such file follows the table):
- The number of topics
- The number of partitions per topic
- The size of the messages being produced and consumed
- The number of subscriptions per topic
- The number of producers per topic
- The rate at which producers produce messages (per second). Note: a value of 0 means that messages are produced as quickly as possible, with no rate limiting.
- The size of the consumer’s backlog (in gigabytes)
- The total duration of the test (in minutes)
Workload | Topics | Partitions per topic | Message size | Subscriptions per topic | Producers per topic | Producer rate (per second) | Consumer backlog size (GB) | Test duration (minutes) |
---|---|---|---|---|---|---|---|---|
simple-workload.yaml | 1 | 10 | 1 kB | 1 | 1 | 10000 | 0 | 5 |
1-topic-1-partition-1kb.yaml | 1 | 1 | 1 kB | 1 | 1 | 50000 | 0 | 15 |
1-topic-1-partition-100b.yaml | 1 | 1 | 100 bytes | 1 | 1 | 50000 | 0 | 15 |
1-topic-16-partitions-1kb.yaml | 1 | 16 | 1 kB | 1 | 1 | 50000 | 0 | 15 |
backlog-1-topic-1-partition-1kb.yaml | 1 | 1 | 1 kB | 1 | 1 | 100000 | 100 | 5 |
backlog-1-topic-16-partitions-1kb.yaml | 1 | 16 | 1 kB | 1 | 1 | 100000 | 100 | 5 |
max-rate-1-topic-1-partition-1kb.yaml | 1 | 1 | 1 kB | 1 | 1 | 0 | 0 | 5 |
max-rate-1-topic-1-partition-100b.yaml | 1 | 1 | 100 bytes | 1 | 1 | 0 | 0 | 5 |
1-topic-3-partition-100b-3producers.yaml | 1 | 3 | 100 bytes | 1 | 3 | 0 | 0 | 15 |
max-rate-1-topic-16-partitions-1kb.yaml | 1 | 16 | 1 kB | 1 | 1 | 0 | 0 | 5 |
max-rate-1-topic-16-partitions-100b.yaml | 1 | 16 | 100 bytes | 1 | 1 | 0 | 0 | 5 |
max-rate-1-topic-100-partitions-1kb.yaml | 1 | 100 | 1 kB | 1 | 1 | 0 | 0 | 5 |
max-rate-1-topic-100-partitions-100b.yaml | 1 | 100 | 100 bytes | 1 | 1 | 0 | 0 | 5 |
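For reference, a workload file is simply these parameters written as YAML. The sketch below mirrors the 1-topic-16-partitions-1kb.yaml row above; the field names are illustrative assumptions, so treat the files in the workloads directory as the authoritative schema.

```yaml
# Sketch of a workload definition mirroring 1-topic-16-partitions-1kb.yaml.
# Field names are illustrative assumptions; see the workloads directory for
# the real files.
name: 1 topic / 16 partitions / 1kb

topics: 1
partitionsPerTopic: 16
messageSize: 1024            # 1 kB payloads

subscriptionsPerTopic: 1
producersPerTopic: 1
producerRate: 50000          # messages per second; 0 would mean no rate limiting

consumerBacklogSizeGB: 0
testDurationMinutes: 15
```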
Instructions for running specific workloads—or all workloads sequentially—can be found in the platform-specific documentation.
Interpreting the results
Initially, you should see a log message like this, which indicates that the warm-up phase is starting:
22:03:19.125 [main] INFO - ----- Starting warm-up traffic ------
You should then see a handful of readings, followed by an aggregation message that looks like this:
22:04:19.329 [main] INFO - ----- Aggregated Pub Latency (ms) avg: 2.1 - 50%: 1.7 - 95%: 3.0 - 99%: 11.8 - 99.9%: 45.4 - 99.99%: 52.6 - Max: 55.4
At this point, the benchmarking traffic will begin. You’ll start to see readings like this emitted every few seconds:
22:03:29.199 [main] INFO - Pub rate 50175.1 msg/s / 4.8 Mb/s | Cons rate 50175.2 msg/s / 4.8 Mb/s | Backlog: 0.0 K | Pub Latency (ms) avg: 3.5 - 50%: 1.9 - 99%: 39.8 - 99.9%: 52.3 - Max: 55.4
The table below breaks down the information presented in the benchmarking log messages (all figures are for the most recent 10-second time window):
Measure | Meaning | Units |
---|---|---|
Pub rate | The rate at which messages are published to the topic | Messages per second / megabytes per second |
Cons rate | The rate at which messages are consumed from the topic | Messages per second / megabytes per second |
Backlog | The number of messages in the messaging system’s backlog | Number of messages (in thousands) |
Pub latency (ms) avg | The time taken by the producer to publish a message; values above 60 seconds are not captured | Milliseconds (average, 50th percentile, 99th percentile, 99.9th percentile, and maximum) |
Pub delay (µs) avg | The time by which message production lags behind the ideal (requested) throughput; values above 60 seconds are not captured | Microseconds (average, 50th percentile, 99th percentile, 99.9th percentile, and maximum) |
If you see a large or increasing Pub delay, it is a sign that the workload configuration cannot achieve the requested throughput.
At the end of each workload, you’ll see a log message that aggregates the results:
22:19:20.577 [main] INFO - ----- Aggregated Pub Latency (ms) avg: 1.8 - 50%: 1.7 - 95%: 2.8 - 99%: 3.0 - 99.9%: 8.0 - 99.99%: 17.1 - Max: 58.9
You’ll also see a message like this indicating the JSON file to which the benchmarking results have been saved (all JSON results are written to the /opt/benchmark directory):
22:19:20.592 [main] INFO - Writing test result into 1-topic-1-partition-100b-Kafka-2018-01-29-22-19-20.json
The process explained above will repeat for each benchmarking workload that you run.
Adding a new platform
In order to add a new platform for benchmarking, you need to provide the following:
- A Terraform configuration for creating the necessary Alibaba Cloud or AWS resources (example)
- An Ansible playbook for installing and starting the platform on AWS (example)
- An implementation of the Java driver-api library (example)
- A YAML configuration file that provides any necessary client configuration info (example); a sketch of such a file follows this list
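As a rough illustration of that last item, a driver configuration file typically names the platform, points at the driver-api implementation class, and carries whatever client settings that implementation needs. The keys and class name below are assumptions modelled on the Kafka driver; consult the linked examples for the real schema.

```yaml
# Illustrative sketch of a driver configuration file. Keys and the class name
# are assumptions modelled on the Kafka driver; see the linked example for the
# authoritative layout.
name: Kafka
driverClass: io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkDriver

# Settings passed through to the underlying messaging client.
commonConfig: |
  bootstrap.servers=localhost:9092
producerConfig: |
  acks=all
consumerConfig: |
  auto.offset.reset=earliest
```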