Azure Functions is Azure’s primary serverless code service, used in production by hundreds of thousands of customers who run several trillion executions on it monthly across many different hosting options. It was first released in early 2016, and since then we have learnt a lot from our customers about what works and where they would like to see more.
At Ignite this year we are very excited to be releasing the Flex Consumption plan to GA. This is our "no-compromise SKU", where customers get all the features they have been looking for across scale, networking, and ease of use, while still supporting scale-to-zero "serverless" semantics. During the Public Preview time frame, we have had thousands of apps and hundreds of customers try out the service and give us valuable feedback. This blog post is a peek behind the curtain at some of the tech behind building the Flex Consumption SKU.
Flex Consumption: Burst scale your apps with networking support
This SKU addresses a lot of the feedback that we have received over the years on the Functions Consumption plans - including faster scale, more instance sizes, VNET support, higher instance limits and much more. We have looked at each part of the stack and made improvements at all levels. There are many new, first-in-Functions capabilities, including:
- Faster scaling than before - hundreds of instances in a minute
- User-controlled per-instance concurrency
- Scale to many more instances than before (up to 1000)
- VNET integration support (while still supporting scale-to-zero semantics)
- Always-allocated workers/reserved instances to enable zero cold start
- Multiple memory sizes (including smaller sizes)
- Availability Zones support
Purpose-built backend “Legion”
To enable Flex Consumption, we have created a brand-new purpose-built backend internally called Legion.
To host customer code, Legion relies on nested virtualization on Azure VMSS. This gives us the Hyper-V isolation that is a prerequisite for hostile multi-tenant workloads. Legion was built from the outset to support scaling to thousands of instances with VNET injection. Efficient use of subnet IP addresses through kernel-level routing was another unique achievement in Legion.
Functions has a strict cold start goal for every language. To achieve this cold start metric across all languages and versions, and to support Functions image updates for all of these variants, we created a construct called Pool Groups that allows Functions to specify all the parameters of a pool, as well as its networking and upgrade policies.
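To make the idea concrete, here is a minimal sketch of the kind of information a pool group specification captures. The field names are made up for illustration and are not the actual Legion schema.

```python
from dataclasses import dataclass

# Illustrative only: field names are assumptions, not the real Pool Group definition.
@dataclass
class PoolGroupSpec:
    runtime: str               # e.g. "python"
    runtime_version: str       # e.g. "3.11"
    instance_memory_mb: int    # memory size offered by this pool
    min_pooled_instances: int  # pre-warmed instances kept ready to meet cold start goals
    vnet_injection: bool       # whether pool instances are injected into a customer VNET
    upgrade_policy: str        # how new Functions images roll out to the pool, e.g. "rolling"

pool_groups = [
    PoolGroupSpec("python", "3.11", 2048, min_pooled_instances=50,
                  vnet_injection=True, upgrade_policy="rolling"),
    PoolGroupSpec("dotnet-isolated", "8.0", 4096, min_pooled_instances=50,
                  vnet_injection=False, upgrade_policy="rolling"),
]
```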
For GA, we also moved to using regional-level pools (instead of stamp-level pools), which helped us better amortize compute and ensure we have enough capacity for burst scale scenarios.
All this work led us to a solid, scalable and fast infrastructure on which to build Flex Consumption.
“Trigger Monitor” – scale to 0 and scale out with network restrictions
Flex Consumption also introduces networking features to limit access to the Function app and to trigger on event sources that are network restricted. Because these event sources are network restricted, the multi-tenant scaling component (the scale controller) that monitors the rate of events to decide whether to scale out or scale in cannot access them. In the Elastic Premium plan, which scales down to 1 instance, we solved this by having that instance access the network-restricted event source and communicate scale decisions to the scale controller. In the Flex Consumption plan, however, we wanted to scale down to 0 instances.
To solve this in the Flex Consumption SKU, we implemented a small scaling component we call the “Trigger Monitor” that is injected into the customer's VNET. This component is able to access the network-restricted event source, and the scale controller communicates with it to get scaling decisions.
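A rough sketch of the idea is below. The names are made up stand-ins for the internal protocol between the Trigger Monitor and the scale controller; they only illustrate the division of responsibilities.

```python
# Illustrative sketch: the helpers below are hypothetical, not the platform's real API.
TARGET_EVENTS_PER_INSTANCE = 16

def get_backlog_from_restricted_source() -> int:
    """Runs inside the customer's VNET, so it can reach the network-restricted event source."""
    raise NotImplementedError

def get_scale_decision() -> int:
    """Called by the scale controller to obtain the desired instance count."""
    backlog = get_backlog_from_restricted_source()
    if backlog == 0:
        return 0  # nothing queued: the app can scale all the way down to zero
    return -(-backlog // TARGET_EVENTS_PER_INSTANCE)  # ceiling division over the backlog
```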
Scaling Http-based apps based on concurrency
When scaling Http-based workloads on Function apps, our previous implementation used an internal heuristic to decide when to scale out. This heuristic was based on Front End servers pinging the workers currently running customers' workloads and deciding to scale based on the latency of the responses. That implementation used SQL Azure to track workers and their assignments.
In Flex Consumption we have rewritten this logic: scaling is now based on user-configured concurrency. User-configured concurrency gives customers the flexibility to decide, based on their language and workload, what concurrency they want per instance. For example, Python customers don’t have to think about multithreading and can set concurrency = 1 (which is also the default for Python apps). This approach makes the scaling behavior predictable, and it gives customers the ability to control the cost vs performance tradeoff – if they are willing to tolerate the potential for higher latency, they might unlock cost savings by running each worker at a higher level of concurrency.
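As a back-of-the-envelope illustration of that tradeoff (the numbers and helper below are made up, not a prescription): Little's law says the number of in-flight requests is roughly request rate times latency, and the instance count is that divided by per-instance concurrency.

```python
import math

def instances_needed(requests_per_second: float, avg_latency_s: float,
                     per_instance_concurrency: int) -> int:
    """Rough Little's law estimate: in-flight requests / per-instance concurrency."""
    in_flight = requests_per_second * avg_latency_s
    return max(1, math.ceil(in_flight / per_instance_concurrency))

print(instances_needed(400, 0.2, 1))   # ~80 instances at concurrency 1
print(instances_needed(400, 0.2, 16))  # ~5 instances at concurrency 16
```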
In our implementation, we use "request slots" that are managed by the Data Role: instances are split into request slots that are assigned to different Front End servers. For example, if the per-instance concurrency is set to 16, then once the Data Role chooses an instance to allocate a Function app to, there are 16 request slots that it can hand out to Front Ends. It might give all 16 to a single Front End, or share them across multiple. This removes the need for any coordination between Front Ends – they can use the request slots they receive as much as they like, with the restriction of only one concurrent request per request slot. Also, this implementation uses Cosmos DB to track workers and assignments.
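To make the mechanics concrete, here is a minimal sketch of the request-slot idea. The class and method names are illustrative, not the actual Data Role implementation.

```python
# Illustrative sketch of splitting an instance into request slots for Front Ends.
from collections import defaultdict

class DataRole:
    def __init__(self, per_instance_concurrency: int):
        self.per_instance_concurrency = per_instance_concurrency
        self.slots_by_frontend = defaultdict(list)

    def allocate_instance(self, instance_id: str, frontends: list[str]) -> None:
        """Split a newly allocated instance into request slots and hand them to Front Ends."""
        slots = [f"{instance_id}/slot-{i}" for i in range(self.per_instance_concurrency)]
        # Hand out slots round-robin; a single Front End could also receive all of them.
        for i, slot in enumerate(slots):
            self.slots_by_frontend[frontends[i % len(frontends)]].append(slot)

data_role = DataRole(per_instance_concurrency=16)
data_role.allocate_instance("vm-42", frontends=["fe-1", "fe-2"])
# Each Front End may now run at most one concurrent request per slot it holds,
# with no coordination needed between Front Ends.
```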
Along with Legion as the compute provider, a significantly larger compute allocation per app, rapid scale-in, and fast capacity reclamation allow us to give customers a much better experience than before.
Scaling non-Http-based apps based on concurrency
Similar to Http apps, we have also enabled non-Http apps to scale based on concurrency; we refer to this as Target Based Scaling. From an implementation perspective, the various extensions now implement their scaling logic within the extension itself, and the scale controller hosts these extensions. This puts the scaling logic in one place and unifies all scaling around concurrency.
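Conceptually, the desired instance count comes from the event source backlog divided by a per-instance target. A minimal sketch, assuming a simplified per-extension scaler interface rather than the actual extension SDK surface:

```python
import math
from typing import Protocol

class TargetScaler(Protocol):
    """Simplified stand-in for the per-extension scaling logic hosted by the scale controller."""
    def get_metric(self) -> int: ...                    # e.g. queue length or unprocessed event count
    def target_executions_per_instance(self) -> int: ...  # user-configurable concurrency target

def desired_instances(scaler: TargetScaler) -> int:
    # Target-based scaling: size the app to the backlog divided by the per-instance target.
    return math.ceil(scaler.get_metric() / scaler.target_executions_per_instance())
```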
Moving configuration to the Control Plane
One more directional change we are making, based on feedback from our customers, is to move various configuration properties out of AppSettings and into the Control Plane. For Public Preview we are doing this for the areas of Deployment, Scaling, and Language. The example below gives a feel for the new Control Plane properties.
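As a minimal sketch, assuming a functionAppConfig-style grouping of runtime, scaling, and deployment settings (the property names here are illustrative rather than the exact schema), such Control Plane configuration could look like this:

```python
# Illustrative shape of Control Plane configuration grouped by area; names may differ.
function_app_config = {
    "runtime": {"name": "python", "version": "3.11"},
    "scaleAndConcurrency": {
        "maximumInstanceCount": 1000,
        "instanceMemoryMB": 2048,
        "triggers": {"http": {"perInstanceConcurrency": 16}},
    },
    "deployment": {
        "storage": {"type": "blobContainer", "value": "<blob-container-url>"},
    },
}
```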
Availability Zones Support
Availability Zones are separated groups of datacenters within an Azure region that are designed such that if one zone experiences an outage, then regional services, capacity and high availability are supported by the remaining zones. We have implemented this capability in the zone-redundant category, wherein compute is automatically distributed so that the user's application runs across multiple availability zones. (This capability is coming soon, in early 2025.)
The underlying philosophy for allocating new instances to an AZ-enabled app is that they are always distributed as evenly as possible across zones.
From an engineering perspective, the work involved making sure every layer of our infrastructure is zone-redundant, changing our compute orchestrator to allocate compute across zones when an app is AZ enabled, and keeping track of instances across zones so that we can adjust accordingly and maintain the balance.
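A minimal sketch of that allocation philosophy, assuming a simple "fewest instances wins" zone picker (illustrative only, not the actual orchestrator logic):

```python
# Always place the next instance in the zone that currently has the fewest instances.
from collections import Counter

def pick_zone(zones: list[str], instances_per_zone: Counter) -> str:
    return min(zones, key=lambda z: instances_per_zone[z])

zones = ["zone-1", "zone-2", "zone-3"]
placement: Counter = Counter()
for _ in range(7):
    placement[pick_zone(zones, placement)] += 1
print(placement)  # Counter({'zone-1': 3, 'zone-2': 2, 'zone-3': 2}) - as even as possible
```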
Azure Load Testing integration
Customers have always asked us how to configure their Function apps for optimum throughput. Until now, we have only given them guidance to run performance tests on their own. Now they have another option: we are introducing native integration with Azure Load Testing. A new performance optimizer is now available that helps you decide the right configuration for your app by helping you create and run tests with different memory and Http concurrency configurations.
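Conceptually, the optimizer sweeps memory and Http concurrency combinations, runs a load test for each, and picks the configuration that best fits your goals. A minimal sketch of that idea follows; run_load_test is a hypothetical stand-in and does not represent the Azure Load Testing API.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class TestResult:
    memory_mb: int
    http_concurrency: int
    rps: float             # sustained requests per second observed
    p95_latency_ms: float  # 95th percentile latency observed

def run_load_test(memory_mb: int, http_concurrency: int) -> TestResult:
    """Hypothetical placeholder: configure the app with this combination and run a load test."""
    raise NotImplementedError

def pick_best(memory_sizes, concurrencies, latency_budget_ms: float = 500.0):
    """Sweep memory x concurrency and keep the cheapest configuration within the latency budget."""
    results = [run_load_test(m, c) for m, c in product(memory_sizes, concurrencies)]
    within_budget = [r for r in results if r.p95_latency_ms <= latency_budget_ms]
    # Rough cost proxy: memory consumed per request served.
    return min(within_budget, key=lambda r: r.memory_mb / r.rps) if within_budget else None
```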