This post was authored by Gopal Kakivaya, CVP, Windows Azure Development.
Earlier this year at the BUILD conference we announced Azure Service Fabric and released a preview of our SDK. In this inaugural post on the Service Fabric blog, I’ll touch on our motivation for building the platform and explain how it can help you build cloud applications that are more reliable, perform better, and are easier to manage. I’ll also cover our plans for this blog and highlight some of the other ways that we want to connect with you to help make the product better.
Once upon a time not that long ago, developers mostly focused on application features. Going live and running/managing the application day-to-day was left to IT administrators. Thus, important capabilities including performance, scalability, latency, throughput, availability, reliability, the management of the app lifecycle and data integrity were all handled by IT. Maintenance windows and downtimes during app upgrades were the norm. An application was just a handful of monolithic services running on their own machines at best, so updates to all services were bundled together. All this worked because release cycles were long, typically months or years, and IT managers had time to tune things while they waited for the next release. Furthermore, a little downtime every few years to roll out the next big release wasn’t too bad…
Fast forward to today, where applications run continuously in the cloud and release cycles are tiny (as short as hours, or sometimes even minutes). To minimize the impact of dependencies within an application and enable faster, team-independent deployments, developers are turning to more loosely coupled, fine-grained “microservice” architectures over the traditional monolithic, tiered approach, which accumulated many features over time for one large, higher-risk deployment.
There is no separate “IT” team that can handle the operational management of the services given the rapid cadence and number of components. Now developers from the start are expected to deal with the non-functional requirements of their services such as scalability, availability, latency, lifecycle management, data integrity and portability.
As if those changes weren’t disruptive enough, building reliable services in today’s virtual & commoditized infrastructure is not easy. Developers need code for things including failure detection (what was running on this VM when it died?), leader election (who is updating the data now with multiple requests), agreement, replication, failover, monitoring, diagnostics, and lifecycle management/upgrades - in distributed asynchronous systems these are challenging problems to get right. If that wasn’t hard enough, the massive scale of any large service adds another layer of complexity: building geo-scale services in geographically distributed data centers which are not necessarily under your control, along with the additional challenges of working across higher latency network infrastructures.
At Microsoft, we faced the same challenges as we built cloud services out of our on-premises software products like SQL Server and Skype for Business, as well as those services which were “born in the cloud” like Bing’s Cortana. It was inefficient and error prone to solve the cross-cutting requirements separately for each team. The early going was slow going.
Some folks started to have an idea: “Wouldn’t it be great if we had a platform that dealt with all the common aspects of developing, running, and managing services at scale, so that our developers can go back to focus on the functionality that actually matters to them?” And thus, Azure Service Fabric was conceived and designed to serve those real needs for us internally. Now we want to give it to you!
Service Fabric makes it easy for a developer to write highly reliable, scalable, microservice-based applications and then manage them in the real world. And before you ask, yes, Service Fabric is available both on-premises with Windows Server and in the cloud with Azure. Not just that, we’re also building a port for Linux, so you can pretty much use it everywhere.
Skeptics abound. I can hear someone asking the following questions:
Let’s put those questions to rest - Azure Service Fabric has been powering our products and services on-premises and in Azure for over 5 years. SQL Server, Skype for Business (formerly Lync Server and Lync Online), Cortana, DocDB, EventHub, InTune and others are using this technology. Look at the scale of some of those services in the figure below – if Azure DB with 1.4 million databases, and Bing Cortana with 500 million evals/sec can take a bet on Service Fabric over several years, it’s pretty clear that we’re not just a shiny new toy.
Time for a cool story. An entire data center going down is a very rare Black Swan event. Most services rely on data centers being available, while coding to be resilient to failures within a data center. With Murphy’s law being what it is, Bing Cortana had a data center outage a while ago. But because they had the underlying availability provided by Service Fabric (through techniques such as failover and replication), there was no gap in service – the Cortana service didn’t blink (though some data center managers surely had a long night).
So, we’ve already had a fair amount of hardening. But maybe the SDK we shipped isn’t those bits? The code we released at //Build earlier this year is the same code that is deployed in the data centers, not some handicapped bits, an example, an emulation, or anything like that. You have what we use. With so many organizations moving to the cloud, we decided to release the same technology to make their transition easier and more successful.
At its core, Service Fabric is a technology for clustering a number of virtual or physical OS instances into a single ‘fabric’ (pool of resources) for computation and storage. This is called a cluster. Once the cluster is provisioned, developers simply deploy services to the cluster with a declarative model that the Service Fabric runtime uses to dynamically manage the services’ availability, reliability and scalability, rebalancing based on changes in throughput demand and availability due to failures.
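To make the declarative idea concrete, here is a minimal sketch in Python. It is not the Service Fabric API or its placement algorithm: the function `place_instances`, the service names, and the node names are all hypothetical, and the "least-loaded node first" rule is an assumption chosen for illustration. The point is only that you declare how many instances you want, and placement is recomputed from that declaration whenever the set of healthy nodes changes.

```python
from typing import Dict, List


def place_instances(desired: Dict[str, int], healthy_nodes: List[str]) -> Dict[str, List[str]]:
    """Assign each service's instances to the least-loaded healthy nodes.

    Hypothetical placement rule for illustration: at most one instance of a
    given service per node, ties broken by node name.
    """
    load = {node: 0 for node in healthy_nodes}
    placement: Dict[str, List[str]] = {}
    for service, count in desired.items():
        if count > len(healthy_nodes):
            raise ValueError(f"not enough healthy nodes for {service}")
        chosen = sorted(healthy_nodes, key=lambda n: (load[n], n))[:count]
        for node in chosen:
            load[node] += 1
        placement[service] = chosen
    return placement


# The declaration: how many instances of each (hypothetical) service we want.
desired = {"web": 3, "orders": 2}

before = place_instances(desired, ["n1", "n2", "n3", "n4"])
# Node n3 fails; the same declaration is simply re-evaluated.
after = place_instances(desired, ["n1", "n2", "n4"])
```

The caller never says where a service should run, only what the end state should look like, so the same declaration stays valid before and after the failure.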
In Service Fabric, every microservice registers a unique name with the system naming service. Each microservice may be written as either a stateless or stateful one. Stateless microservices don’t maintain any internal state between requests and must rely on an external data store for data that they need to persist. This is the same approach taken by Azure worker roles. If your service is stateless, you can get the many benefits from the Service Fabric platform including high availability, density, zero downtime upgrades, and such.
One of the core and novel features that Service Fabric enables is stateful compute. In a stateful microservice, both state and compute are co-located on the same node (or VM). This provides many performance benefits, including lower latency and higher throughput. Perhaps more importantly, it offers a powerful programming paradigm that simplifies application architecture by providing state consistency with built-in support for transaction semantics, and by making it simple to partition applications to scale as demand increases. Stateful services also benefit from the same availability, zero downtime upgrades, and other platform-provided features that Service Fabric offers. Stay tuned for an upcoming blog post with more details on stateful services.
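As a rough illustration of co-locating state with compute and partitioning by key, consider the following Python sketch. It is not Service Fabric code: the class `PartitionedCounterStore` and its methods are hypothetical, the hash-modulo partitioning scheme is an assumption, and replication and transactions are left out entirely. Each partition keeps its own in-memory state, and a request for a key is routed to the one partition that owns it.

```python
import hashlib
from typing import Dict, List


class PartitionedCounterStore:
    """Hypothetical stateful service: each partition holds its own state locally."""

    def __init__(self, partition_count: int) -> None:
        # One private dictionary per partition; in a real cluster each
        # partition would live on (and be replicated across) different nodes.
        self._partitions: List[Dict[str, int]] = [{} for _ in range(partition_count)]

    def partition_for(self, key: str) -> int:
        # Stable hash so the same key always maps to the same partition.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self._partitions)

    def increment(self, key: str) -> int:
        # The state is read and written where the code runs: no external store.
        state = self._partitions[self.partition_for(key)]
        state[key] = state.get(key, 0) + 1
        return state[key]


store = PartitionedCounterStore(partition_count=4)
store.increment("alice")
store.increment("alice")
store.increment("bob")
```

Adding capacity in this model means adding partitions and the nodes that host them, which is the sense in which partitioning lets an application scale as demand grows.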
Service Fabric currently offers two high-level programming models that abstract away most of the complexity in coding a scalable and reliable service, and let you focus on your application logic. The Reliable Services API should look familiar to anyone who has developed Cloud Services projects, with a platform-provided base class and a set of simple entry points where your code plugs in, such as RunAsync for long-running operations. The Reliable Actors API provides a model that is similar to traditional object-oriented development on a single machine, while reaping the core platform benefits of scalability and reliability and adding Actor-specific concepts such as a single-threaded execution environment.
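The single-threaded execution guarantee is easier to appreciate with a small model. The sketch below uses Python's asyncio rather than the Reliable Actors API, and the class `CounterActor` is hypothetical. It assumes the guarantee means that calls into one actor are processed one at a time, which is modeled here with a per-actor lock.

```python
import asyncio


class CounterActor:
    """Hypothetical actor: calls into one instance are processed one at a time."""

    def __init__(self) -> None:
        self._count = 0
        self._turn = asyncio.Lock()  # serializes calls into this actor

    async def increment(self) -> int:
        async with self._turn:
            current = self._count
            await asyncio.sleep(0)  # yield to other tasks, as real I/O would
            self._count = current + 1
            return self._count

    @property
    def count(self) -> int:
        return self._count


async def main() -> int:
    actor = CounterActor()
    # 100 concurrent callers, but the actor handles them one turn at a time.
    await asyncio.gather(*(actor.increment() for _ in range(100)))
    return actor.count


total = asyncio.run(main())
```

Because each call completes before the next begins, the read-modify-write inside `increment` needs no further synchronization; without the per-actor turn, the concurrent callers would overwrite each other's updates.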
As most organizations building cloud services know, writing the code is often the easy part. Keeping a service available and reliable amidst software upgrades, major fluctuations in customer traffic, and failures or changes in the underlying infrastructure can be extremely challenging. This becomes even more difficult when state is involved, as failures to correctly manage services can lead to data loss or corruption. Service Fabric provides a comprehensive set of microservice lifecycle management capabilities, including automated upgrades with monitoring and rollback. With Service Fabric, it is possible to have zero downtime upgrades (aka live upgrades) – the service remains live and available as the update rolls forward from one upgrade domain (a group of nodes in a Service Fabric cluster) to another. The runtime can monitor health metrics during an upgrade, automatically rolling back or pausing upgrades after detecting unhealthy conditions but before customers are impacted.
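The upgrade flow described above can be sketched as a small simulation. This is Python, not the Service Fabric runtime: `rolling_upgrade`, the domain and node names, and the `is_healthy` callback are all hypothetical, and the sketch always rolls back on an unhealthy result, whereas the text notes that the real runtime can also pause.

```python
from typing import Callable, Dict, List


def rolling_upgrade(
    domains: Dict[str, List[str]],
    versions: Dict[str, str],
    new_version: str,
    is_healthy: Callable[[str], bool],
) -> bool:
    """Upgrade one domain at a time; restore every node on a failed health check.

    Returns True if the upgrade completed, False if it was rolled back.
    """
    previous = dict(versions)  # snapshot so we can restore on failure
    for nodes in domains.values():
        for node in nodes:
            versions[node] = new_version
        # Health gate: check this domain before touching the next one, so
        # the remaining domains keep serving the old version meanwhile.
        if not all(is_healthy(node) for node in nodes):
            versions.clear()
            versions.update(previous)
            return False
    return True


domains = {"UD0": ["node0", "node1"], "UD1": ["node2", "node3"]}

# Case 1: every node reports healthy, so the upgrade rolls through both domains.
versions = {node: "1.0" for nodes in domains.values() for node in nodes}
ok = rolling_upgrade(domains, versions, "2.0", is_healthy=lambda node: True)

# Case 2: node2 reports unhealthy after upgrading, so everything is restored.
failed_versions = {node: "1.0" for nodes in domains.values() for node in nodes}
completed = rolling_upgrade(
    domains, failed_versions, "2.0", is_healthy=lambda node: node != "node2"
)
```

Upgrading domain by domain is what keeps the service available: at any moment only one group of nodes is in flux, and a bad build is caught at the first health gate it fails rather than after it has reached the whole cluster.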
In this inaugural post on the team blog, I wanted to give you a brief overview of the capabilities of Service Fabric and set the scene for the deep dives and real-world use cases that will follow in the coming weeks and months. There’s nothing like getting hands-on for really understanding the ins and outs of the platform and its capabilities. Our preview SDK release provides a full Service Fabric cluster that you can run on a single development machine and is the best way to get started, along with our documentation and samples.
Finally, after such a long time in development and use internally at Microsoft, we can’t say how excited we are to make Service Fabric publicly available for you to use. We want to hear from you! Please reach out with questions, comments, and feedback based on your experience with the product. Here are some places that you can leave feedback.
We look forward to all of the cloud scale applications we are going to be able to build together.