Azure Event Grid is a pub-sub message broker that enables you to integrate your solutions at scale using HTTP pull delivery, HTTP push delivery, and the MQTT broker capability. The MQTT broker capability in Event Grid enables your clients to communicate on custom MQTT topic names using a publish-subscribe messaging model. This capability addresses the need for MQTT, the primary communication standard for IoT scenarios, which is driving digital transformation efforts across a wide spectrum of industries, including but not limited to automotive, manufacturing, energy, and retail.
Performance testing is a crucial aspect of software development, especially for services that manage large numbers of client connections and message transfers. Our team has been dedicated to assessing the performance of our systems from the early stages of development. In this blog post, we provide an overview of our testing approach, share some test results, and discuss the insights we have gained from our efforts.
To show the system operating at its current scale boundaries, the tests were conducted on a namespace with 40 throughput units (TUs). The following list summarizes the main scale limits that can be achieved using 40 TUs and the latency highlights from the tests shared in this blog.
Note that the latency results in this report are end-to-end latencies for messages. They don’t represent a promise or SLA by Azure Event Grid. Rather, these results are provided to give you an idea of the performance characteristics of Event Grid under controlled test scenarios. Additionally, we are continuously investing in performance improvements to achieve higher scale and reduce latency.
For solutions that use MQTT brokers to communicate, we have to account for the fact that we might be working with numerous small, intermittently connected devices that talk on many changing channels determined by MQTT topics. We needed a test environment that could mimic all these devices and scale along the multiple dimensions that characterize realistic IoT solutions.
For our testing cluster, we chose Azure Kubernetes Service (AKS) because it offers suitable features for managing and scaling our workloads. We put a certain number of simulated devices into one container deployed as a Kubernetes pod, as illustrated in the following image. The pods run test runners with simulated publisher/subscriber devices that talk to Event Grid MQTT brokers. The connection lines in the image are illustrative examples, showing that some test scenarios can use multiple brokers. All publishers and subscribers hold MQTT connections to an Event Grid MQTT broker. The brokers can also have a message routing setup to send messages to an Event Grid topic. The routed messages are relayed to Event Hubs, and the test runners read them back to count them, measure latencies, and so on. All test runners send telemetry (logs and metrics) to a telemetry collection service. The metrics we collect cover operation and message processing counts and latencies, and we bring them together in one dashboard that lets us monitor test runs closely as they happen.
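To make the setup more concrete, here is a minimal sketch of what one simulated publisher device inside a test runner pod could look like, assuming the paho-mqtt 1.x client library; the namespace hostname, topic, client identity, and certificate paths are placeholders rather than the values used in our tests.

```python
# Minimal sketch of one simulated publisher device running inside a test
# runner pod (paho-mqtt 1.x assumed). The hostname, topic, client id, and
# certificate paths are placeholders, not the values from our tests.
import json
import time

import paho.mqtt.client as mqtt

BROKER_HOST = "contoso-ns.westus2-1.ts.eventgrid.azure.net"  # placeholder namespace host
TOPIC = "devices/device-0001/telemetry"                      # placeholder topic

client = mqtt.Client(client_id="device-0001", protocol=mqtt.MQTTv311)
client.username_pw_set("device-0001")  # username matches the client's authentication name
client.tls_set(certfile="device-0001.pem", keyfile="device-0001.key")  # client certificate auth
client.connect(BROKER_HOST, 8883)
client.loop_start()

# Publish one message per second and embed the send timestamp, so that the
# receiving side (subscriber or Event Hubs reader) can compute end-to-end latency.
while True:
    payload = json.dumps({"deviceId": "device-0001", "sentAt": time.time_ns()})
    client.publish(TOPIC, payload, qos=1)
    time.sleep(1)
```

In the real test runners, many such simulated devices share one container and report their latency samples to the telemetry collection service instead of printing them.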
We split our test scenarios into the communication models that we want to emulate. Here are the main scenarios that we have evaluated:
In this scenario, several publishers send messages to an MQTT broker that forwards all the messages to an Event Grid topic. Then, another Azure cloud component (Event Hubs in our tests) can receive these messages.
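As an illustration of how the routed messages can be read back and timed, here is a minimal sketch of an Event Hubs consumer, assuming the azure-eventhub Python SDK and that each message carries the publisher's send timestamp; the connection string and event hub name are placeholders, and depending on the routing configuration the MQTT payload may arrive wrapped in a CloudEvents envelope that needs to be unwrapped first.

```python
# Minimal sketch of the Event Hubs reader that counts routed messages and
# measures routed-message latency (azure-eventhub SDK; names are placeholders).
import json
import time

from azure.eventhub import EventHubConsumerClient

consumer = EventHubConsumerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>",  # placeholder
    consumer_group="$Default",
    eventhub_name="routed-messages",            # placeholder
)

def on_event(partition_context, event):
    # Depending on the routing configuration, the MQTT payload may arrive
    # wrapped in a CloudEvents envelope; here we assume the publisher's JSON
    # payload (with its send timestamp) is directly readable.
    body = json.loads(event.body_as_str())
    latency_ms = (time.time_ns() - body["sentAt"]) / 1e6
    print(f"routed latency: {latency_ms:.1f} ms")

# starting_position="-1" reads each partition from the beginning.
consumer.receive(on_event=on_event, starting_position="-1")
```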
This scenario involves MQTT clients communicating with each other. Publishers send messages to specific subscribers, and each channel of that communication has its own MQTT topic. The topics shown in the image are examples; there could be many incoming and outgoing communication channels used by the publishers and subscribers.
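A minimal sketch of one simulated subscriber in this scenario could look as follows, again assuming paho-mqtt 1.x; the per-pair topic template, client identity, hostname, and certificate paths are hypothetical.

```python
# Minimal sketch of one simulated subscriber in the device-to-device scenario
# (paho-mqtt 1.x assumed; the per-pair topic template and names are hypothetical).
import json
import time

import paho.mqtt.client as mqtt

SUBSCRIBER_ID = "sub-0001"
PAIR_TOPIC = f"pairs/{SUBSCRIBER_ID}/in"  # one dedicated topic per publisher/subscriber pair

def on_connect(client, userdata, flags, rc):
    client.subscribe(PAIR_TOPIC, qos=1)

def on_message(client, userdata, msg):
    # The paired publisher embeds its send time, so every delivery yields one
    # MQTT end-to-end latency sample for the dashboard.
    body = json.loads(msg.payload)
    latency_ms = (time.time_ns() - body["sentAt"]) / 1e6
    print(f"{msg.topic}: {latency_ms:.1f} ms")

client = mqtt.Client(client_id=SUBSCRIBER_ID, protocol=mqtt.MQTTv311)
client.username_pw_set(SUBSCRIBER_ID)
client.tls_set(certfile=f"{SUBSCRIBER_ID}.pem", keyfile=f"{SUBSCRIBER_ID}.key")
client.on_connect = on_connect
client.on_message = on_message
client.connect("contoso-ns.westus2-1.ts.eventgrid.azure.net", 8883)  # placeholder host
client.loop_forever()
```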
This scenario involves one subscriber that receives messages from multiple channels. The actual tests contain several of these N-to-one communication groups, and the value of N can change between test runs.
In contrast to the previous scenario, this one involves publishers sending messages on various channels (topics). Like the previous test, the communication clusters are repeated to achieve high numbers of clients and message rates.
The broadcast scenario has one topic for sending messages and many subscribers for the same topic. The full test setup contains many clusters with one-to-many relationships.
This is the device-to-device communication scenario with 200,000 publishers sending messages to 200,000 subscribers. The overall inbound/outbound message rates are 40,000 messages/second. All connections belong to a single Event Grid namespace. Half of the messages use QoS 0 and the other half use QoS 1.
Connections are dialed up to 400,000 during the first 15 minutes and dialed down during the last 15 minutes of the test. All messages are routed to an external Event Grid topic and then pushed to an event hub. The test consumes messages from that event hub to measure the total routed-message latency. We also measure each message's latency from the time it is published by a simulated publisher device until it is received by a simulated subscriber device.
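For a rough sense of the ramp, here is the back-of-the-envelope arithmetic for spreading 400,000 connection attempts evenly over the 15-minute dial-up window; the actual schedule across pods may differ.

```python
# Back-of-the-envelope view of the connection ramp in this test: 400,000
# connections dialed up evenly over the first 15 minutes.
RAMP_SECONDS = 15 * 60
TOTAL_CLIENTS = 400_000  # 200,000 publishers + 200,000 subscribers

connects_per_second = TOTAL_CLIENTS / RAMP_SECONDS      # ~444 new connections/second
delay_between_connects = RAMP_SECONDS / TOTAL_CLIENTS   # ~2.25 ms between connects

print(f"{connects_per_second:.0f} connections/s, one connect every "
      f"{delay_between_connects * 1000:.2f} ms")
```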
The graphs for this test show the published and received message rates and the MQTT end-to-end latencies.
The graphs above represent the routed messages.
The graphs above show operations, active connections, and throughputs.
This test is like the previous one. The difference is that the connections and messages are equally divided across 10 namespaces. We can notice slightly better MQTT end-to-end and routing latencies.
The charts above are like the ones from the previous section. The only difference is that the total published/received message counters are broken down per namespace; the other metrics are aggregated across all namespaces.
The routing metrics are also similar to the ones from the previous test. The messages from all namespaces go to the same Event Grid topic and a single event hub.
The operations and connections charts show all 10 namespaces together.
This is a broadcast scenario: each publisher has 1,000 subscribers that receive its messages on one topic. This means the number of outgoing messages is amplified 1,000 times compared to the incoming messages.
These charts show a couple of insights.
The charts above indicate a lower number of active connections compared to the device-to-device tests. This makes sense, as we have fewer devices publishing data. The throughput graph displays lower inbound values.
This test is a case of maximum broadcast: a single publishing device sends one message every second, and those messages are delivered to 40,000 subscribers.
These graphs show the extra price paid in end-to-end message latencies, as the messages are additionally multiplied by the fan-out, and the 1-to-40,000 ratio of published to received messages. The graphs that show the counters are displayed on a per-minute scale.
In conclusion, the above charts indicate an extremely low inbound throughput and the anticipated number of active connections.
We are continuously developing new capabilities and enhancements for our Event Grid MQTT broker. Our testing activities will concentrate on two areas:
Once we validate these new improvements, we will continue sharing the results with everyone.
As you might expect, testing at this scale is not easy and comes with many difficulties. These difficulties can stem from both practical problems and product/test code reliability. Here are some of the insights we gained while testing our services so far:
You can learn more about Azure Event Grid by visiting the links below. If you have questions or feedback, you can contact us at askmqtt@microsoft.com.