Azure Event Grid's Pull Delivery Performance
Published Dec 07 2023 04:11 PM

Overview

Azure Event Grid is a highly scalable, serverless messaging service that you can use to integrate solutions spanning cloud application workloads and IoT devices. With flexible consumption methods that include HTTP Push and HTTP Pull delivery, your cloud applications are integrated asynchronously and decoupled, attaining greater independent scalability and overall resiliency. Event Grid’s MQTT broker feature enables bidirectional device-to-device, device-to-cloud, and cloud-to-device communication. The kinds of data you can transmit through MQTT include device telemetry, device control messages, and general application messages.

 

Enterprise applications rely on distributed components and asynchronous messaging to scale. Following this approach, publisher applications send messages to Event Grid’s HTTP endpoint at rates of up to 40 MB/s, and HTTP subscriber clients connect to Event Grid to read them at rates of up to 80 MB/s. With this architecture, publisher and subscriber clients are decoupled and work asynchronously. Clients can scale independently of each other, which is a key concern for distributed applications like microservices. It is for this kind of scenario that we recently introduced Pull delivery with the new Namespace resource. With Pull delivery, subscriber clients can poll Event Grid for messages at their own pace and timing, allowing them to adapt to varying workloads.
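The receive-then-acknowledge interaction at the heart of Pull delivery can be modeled with a short sketch. The class below is a toy in-memory stand-in for the broker, not the Event Grid service or its SDK; the lock-token mechanics are an assumption made only to illustrate the pattern of reading a batch of messages and then deleting the ones that were processed successfully.

```python
import uuid
from collections import deque

class PullDeliveryQueue:
    """Toy in-memory model of pull-delivery semantics (illustrative only):
    receive() hands out messages with lock tokens, and acknowledge()
    deletes the messages those tokens refer to."""

    def __init__(self):
        self._pending = deque()  # messages waiting to be received
        self._locked = {}        # lock token -> in-flight message

    def publish(self, event):
        self._pending.append(event)

    def receive(self, max_events=10):
        """Return up to max_events (event, lock_token) pairs."""
        batch = []
        while self._pending and len(batch) < max_events:
            event = self._pending.popleft()
            token = str(uuid.uuid4())
            self._locked[token] = event
            batch.append((event, token))
        return batch

    def acknowledge(self, tokens):
        """Delete successfully processed messages by lock token."""
        return [self._locked.pop(t) for t in tokens if t in self._locked]

# The subscriber polls at its own pace, decoupled from the publisher.
q = PullDeliveryQueue()
for i in range(5):
    q.publish({"id": i, "type": "OrderPlaced"})

batch = q.receive(max_events=3)               # read a batch of messages
acked = q.acknowledge([t for _, t in batch])  # delete after processing
```

Because clients poll, a slow subscriber simply reads smaller or less frequent batches rather than being overwhelmed by pushed traffic.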

 

As scale and performance are critical to enterprise applications, the Azure Event Grid team conducted performance tests under several conditions to illustrate the service’s characteristics under load. All tests used HTTP to publish and read messages with the Pull delivery API and were orchestrated with Azure Load Testing. Event Grid Namespaces were configured with different throughput capacity configurations to match the intended load scenario. The following sections report the results. First, a few concepts are worth clarifying to understand the report.

 

Concepts

  • A Throughput Unit (TU) is a unit of namespace capacity equivalent to a data rate of 1MB/s.
  • When configuring capacity for a namespace, the ratio of ingress to egress capacity when using HTTP is 1:2. For example, if a namespace has a capacity of 10 TUs, it supports ingress rates of up to 10MB/s and egress rates of up to 20MB/s.
  • A receive operation is an operation performed by a connected client to read messages from Event Grid. This operation is provided by the HTTP Pull delivery API.
  • An acknowledge operation is another Pull delivery operation used to delete messages from Event Grid that have been successfully received and processed by the client application.
  • An event is a kind of message that typically announces a fact about an application, such as a state change.
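The TU-to-capacity rule above can be expressed as a tiny helper. This is only a restatement of the 1:2 ingress-to-egress ratio defined in the concepts list, not an API of the service:

```python
def namespace_capacity(tus):
    """For HTTP, the ingress:egress capacity ratio is 1:2, so a
    namespace with N TUs supports up to N MB/s in and 2*N MB/s out."""
    ingress_mbps = tus        # 1 TU = 1 MB/s of ingress
    egress_mbps = 2 * tus     # egress capacity is twice ingress
    return ingress_mbps, egress_mbps

# The example from the text: 10 TUs -> up to 10 MB/s in, 20 MB/s out.
print(namespace_capacity(10))  # -> (10, 20)
```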

 

Performance Results

Our performance results follow. Please note that the latency results in this report don’t represent a promise or SLA by Azure Event Grid. Rather, these results are provided to give you an idea of the performance characteristics of Event Grid under controlled test scenarios.

 

Scenario 1: 1,000 1KB events/s (1 MB/s) ingress and egress

In this scenario, the test client publishes 1,000 events/sec and receives and acknowledges those messages.

 

Setup

  • 1 Namespace with 1 TU
  • 1 namespace topic
  • 1 event subscription
  • Event size: 1KB
  • Publish request event batch size: 15.
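A quick back-of-envelope check of this setup, using only the figures above: 1,000 events/s at 1 KB each is 1 MB/s of ingress, which exactly matches the namespace’s 1 TU, and at a batch size of 15 the publisher issues roughly 67 HTTP publish requests per second.

```python
import math

event_size_kb = 1        # from the setup above
events_per_sec = 1_000   # target publish rate
batch_size = 15          # events per publish request

# 1,000 events/s x 1 KB = 1 MB/s, matching the namespace's 1 TU.
ingress_mb_per_sec = events_per_sec * event_size_kb / 1_000

# At 15 events per request, the publisher sends ~67 requests/s.
publish_requests_per_sec = math.ceil(events_per_sec / batch_size)
```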

Result

 

Throughput
  • Publish throughput quickly reaches 1K events/s and maintains this average throughput consistently.
  • Receive and acknowledge throughput keeps pace with publish throughput.

1K-IN-1K-OUT-Throughput.png

Y-axis: events per second. X-axis: test time duration.

Note: Acknowledge operations are immediately followed by receive operations in our tests. Therefore, the receive line chart follows the same pattern as that of acknowledge operations.

 

Latency
  • The average publish latency is consistently maintained at ~10 ms, and at the 99th percentile (P99) latency is about 50 ms.

1K-IN-1K-OUT-Latency.png

Y-axis: milliseconds. X-axis: test time duration.

 

 

Scenario 2: 10,000 1KB events/s (10 MB/s) ingress and egress

 

Setup

  • 1 Namespace with 10 TUs
  • 1 namespace topic
  • 1 event subscription
  • Event size: 1KB
  • Publish request event batch size: 15.

Results

 

Throughput
  • Publish throughput scales to 10K events/s and maintains this average throughput consistently.
  • Receive and acknowledge throughput keeps up with publish throughput with some blips.

 

10K-IN-10K-OUT-Throughput.png

Y-axis: events per second. X-axis: test time duration.

 

Latency
  • The average publish latency is consistently maintained at ~11ms and at P99, latency is about 50ms.

10K-IN-10K-OUT-Latency.png

Y-axis: milliseconds. X-axis: test time duration.

 

Looking into the future

While public ingress rate limits currently stand at 40K events/s (40 MB/s), Azure Event Grid is gearing up to support up to 100K events/s (100 MB/s). Our goal is to preserve low latencies and smooth out hiccups at all data rates.

Important: The following sections provide test results for data rates not yet supported. You cannot configure namespaces to use more than 40 TUs.

 

Scenario: 100,000 1KB events/s (100 MB/s) ingress and egress

 

Setup

  • 1 Namespace with 100 TUs.
  • 1 namespace topic
  • 1 event subscription
  • Event size: 1KB
  • Publish request event batch size: 15.

Results

 

Throughput
  • Publish throughput reaches 100K events/s in ~20 minutes and then maintains this throughput consistently. Behind the scenes, the platform reacts to the incoming load and automatically adds more partitions. We are working on an improvement to further reduce this scale-up time.
  • Receive and acknowledge throughput takes about 45 minutes to scale up to match the publish throughput and maintain those levels.  

100K-IN-100K-OUT-Throughput.png

Y-axis: events per second. X-axis: test time duration.

 

Latency
  • During the initial scale-out period, publish latencies are high; afterward, the average publish latency stays consistent at ~11 ms.

Graph showing autoscaling period

 

100K-IN-100K-OUT-Latency-Overall.png

Y-axis: milliseconds. X-axis: test time duration.

Graph after autoscaling period

100K-IN-100K-OUT-Latency-After-Autoscale.png

Y-axis: milliseconds. X-axis: test time duration.

 

Multi-topic publishing and receiving under a single namespace with 85 TUs

In real-world scenarios, events come in all shapes and sizes, and Azure Event Grid is designed to handle this diversity. In our tests, published event sizes ranged from a compact 600 bytes to a substantial 25 KB, and publish rates ranged from 30 to 12,000 events/s.

 

Setup

  • 1 Namespace with 85 TUs
  • 95 namespace topics.
  • 95 event subscriptions.
  • Event size: Ranging from 600 bytes to 25KB.
  • Publish request event batch size: Ranging from 1-10.
  • Publish throughput per topic: Ranging from 30 to 12000 events/s.
  • Ingress throughput (in bytes at Namespace level): ~85 MB/s.
  • Egress throughput (in bytes at Namespace level): ~85 MB/s.
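One practical question this mixed workload raises is whether a planned per-topic mix fits within the namespace’s ingress capacity. The sketch below is a hypothetical planning helper; the per-topic rates in `mix` are illustrative examples drawn from the ranges above, not the actual test’s per-topic breakdown. It applies the 1 TU = 1 MB/s ingress rule from the concepts section:

```python
def fits_capacity(topics, tus):
    """Check whether a planned per-topic publish mix stays within the
    namespace's HTTP ingress capacity (1 TU = 1 MB/s of ingress)."""
    ingress_mbps = sum(t["rate_eps"] * t["size_bytes"] for t in topics) / 1_000_000
    return ingress_mbps, ingress_mbps <= tus

# Hypothetical mix (not the actual test's per-topic breakdown):
mix = [
    {"rate_eps": 12_000, "size_bytes": 1_000},   # 12 MB/s hot topic
    {"rate_eps": 2_000, "size_bytes": 25_000},   # 50 MB/s large events
    {"rate_eps": 30, "size_bytes": 600},         # ~0.02 MB/s trickle
]
used, ok = fits_capacity(mix, tus=85)  # ~62 MB/s of 85 MB/s -> fits
```

The same check, scaled across 95 topics, is how one might budget the ~85 MB/s aggregate ingress the test sustained.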

Result

 

Throughput
  • Publish throughput scales to ~35K events/s and maintains this average throughput consistently.
  • Receive and acknowledge operations throughput aligns seamlessly with publishing rates.
  • Whenever receive operations throughput momentarily falters, it swiftly catches up.
  • With a diverse event size, total ingress data rate reached ~85MB/s throughout the test run.

A-Events-Throughput.png

Y-axis: events per second. X-axis: test time duration.

 

Publish and receive MB/s throughput

 

A-Bytes-Throughput.png

Y-axis: Megabytes per second. X-axis: test time duration.

 

Latency

Publish latency across all 95 topics.

  • Average observed: ~11ms.
  • Observed at P99: ~50ms.

A-Latency.png

Y-axis: milliseconds. X-axis: test time duration.

 

What’s Next?

Messaging infrastructure plays a key role in mission-critical applications. As applications grow from serving a small number of users to serving millions of users or data sources within a few years, the underlying messaging infrastructure needs to keep up with the increasing demand. Whether you're orchestrating moderate or intensive workloads, Azure Event Grid adapts to varying data rates, enabling your solutions to scale. We will continue investing in performance improvements to achieve higher throughput and reduce jitters and hiccups.

 

Resources

You can learn more about Azure Event Grid by visiting the links below. If you have questions, you can contact us at askgrid@microsoft.com.

 

 

Version history
Last update: Dec 07 2023 05:24 PM