The Preferred Architecture
Published Apr 21 2014 06:00 AM

During my session at the recent Microsoft Exchange Conference (MEC), I revealed Microsoft’s preferred architecture (PA) for Exchange Server 2013. The PA is the Exchange Engineering Team’s prescriptive approach to what we believe is the optimum deployment architecture for Exchange 2013, and one that is very similar to what we deploy in Office 365.

While Exchange 2013 offers a wide variety of architectural choices for on-premises deployments, the architecture discussed below is our most scrutinized one ever. While there are other supported deployment architectures, they are not recommended.

The PA is designed with several business requirements in mind. For example, requirements that the architecture be able to:

  • Include both high availability within the datacenter, and site resilience between datacenters
  • Support multiple copies of each database, thereby allowing for quick activation
  • Reduce the cost of the messaging infrastructure
  • Increase availability by optimizing around failure domains and reducing complexity

The specific prescriptive nature of the PA means of course that not every customer will be able to deploy it (for example, customers without multiple datacenters). And some of our customers have different business requirements or other needs, which necessitate an architecture different from that shown here. If you fall into those categories and you want to deploy Exchange on-premises, there are still advantages to adhering to the PA as closely as possible, deviating only where your requirements differ significantly. Alternatively, you can consider Office 365, where you can take advantage of the PA without having to deploy or manage servers.

Before I delve into the PA, I think it is important that you understand a concept that is the cornerstone for this architecture – simplicity.

Simplicity

Failure happens. There is no technology that can change this. Disks, servers, racks, network appliances, cables, power substations, generators, operating systems, applications (like Exchange), drivers, and other services – there is simply no part of an IT services offering that is not subject to failure.

One way to mitigate failure is to build in redundancy. Where one entity is likely to fail, two or more entities are used. This pattern can be observed in Web server arrays, disk arrays, and the like. But redundancy by itself can be prohibitively expensive (simple multiplication of cost). For example, the cost and complexity of the SAN-based storage system that was at the heart of Exchange until the 2007 release drove the Exchange Team to step up its investment in the storage stack and to evolve the Exchange application to integrate the important elements of storage directly into its architecture. We recognized that every SAN system would ultimately fail, and that implementing a highly redundant system using SAN technology would be cost-prohibitive. In response, Exchange has evolved from requiring expensive, scaled-up, high-performance SAN storage and related peripherals, to now being able to run on cheap, scaled-out servers with commodity, low-performance SAS/SATA drives in a JBOD configuration with commodity disk controllers. This architecture enables Exchange to be resilient to any storage-related failure, while enabling you to deploy large mailboxes at a reasonable cost.

By building the replication architecture into Exchange and optimizing Exchange for commodity storage, the failure mode becomes predictable from a storage perspective. This approach does not stop at the storage layer; redundant NICs, power supplies, etc., can also be removed from the server hardware. Whether it is a disk, controller, or motherboard that fails, the end result should be the same: another database copy is activated and takes over.

The more complex the hardware or software architecture, the more unpredictable failure events can be. Managing failure at any scale is all about making recovery predictable, which requires predictable failure modes. Examples of complex redundancy are active/passive network appliance pairs, aggregation points on the network with complex routing configurations, network teaming, RAID, multiple fiber pathways, etc. Removing complex redundancy seems unintuitive on its face – how can removing redundancy increase availability? Moving away from complex redundancy models to a software-based redundancy model creates a predictable failure mode.

The PA removes complexity and redundancy where necessary to drive the architecture to a predictable recovery model: when a failure occurs, another copy of the affected database is activated.

The PA is divided into four areas of focus:

  1. Namespace design
  2. Datacenter design
  3. Server design
  4. DAG design

Namespace Design

In the Namespace Planning and Load Balancing Principles articles, I outlined the various configuration choices that are available with Exchange 2013. From a namespace perspective, the choices are to either deploy a bound namespace (having a preference for the users to operate out of a specific datacenter) or an unbound namespace (having the users connect to any datacenter without preference).

The recommended approach is to utilize the unbound model, deploying a single namespace per client protocol for the site resilient datacenter pair (where each datacenter is assumed to represent its own Active Directory site - see more details on that below). For example:

  • autodiscover.contoso.com
  • For HTTP clients: mail.contoso.com
  • For IMAP clients: imap.contoso.com
  • For SMTP clients: smtp.contoso.com

Figure 1: Namespace Design
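
As a minimal sketch of pointing the client protocol namespaces above at a server's virtual directories and settings, with a hypothetical server name (EX1) and hypothetical URLs that are not part of the original article:

  # HTTP namespace for Outlook Web App (the other HTTP virtual directories follow the same pattern)
  Set-OwaVirtualDirectory -Identity "EX1\owa (Default Web Site)" -InternalUrl "https://mail.contoso.com/owa" -ExternalUrl "https://mail.contoso.com/owa"
  # Autodiscover namespace
  Set-ClientAccessServer -Identity EX1 -AutoDiscoverServiceInternalUri "https://autodiscover.contoso.com/Autodiscover/Autodiscover.xml"
  # IMAP namespace
  Set-ImapSettings -Server EX1 -ExternalConnectionSettings "imap.contoso.com:993:SSL"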

Each namespace is load balanced across both datacenters in a configuration that does not leverage session affinity, resulting in fifty percent of traffic being proxied between datacenters. Traffic is equally distributed across the datacenters in the site resilient pair via DNS round-robin, geo-DNS, or a similar solution you may have at your disposal. From our perspective, the simpler solution is easier to manage, so our recommendation is to leverage DNS round-robin.
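
As a sketch of the DNS round-robin approach, assuming the Windows Server DnsServer module and hypothetical load balancer VIPs in each datacenter, the same name simply resolves to both datacenters:

  # Datacenter 1 VIP
  Add-DnsServerResourceRecordA -ZoneName "contoso.com" -Name "mail" -IPv4Address 192.0.2.10 -TimeToLive 00:05:00
  # Datacenter 2 VIP
  Add-DnsServerResourceRecordA -ZoneName "contoso.com" -Name "mail" -IPv4Address 198.51.100.10 -TimeToLive 00:05:00

A short TTL lets clients fall back to the surviving datacenter's VIP more quickly after an outage.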

In the event that you have multiple site resilient datacenter pairs in your environment, you will need to decide if you want to have a single worldwide namespace, or if you want to control the traffic to each specific datacenter pair by using regional namespaces. Ultimately your decision depends on your network topology and the associated cost with using an unbound model; for example, if you have datacenters located in North America and Europe, the network link between these regions might not only be costly, but it might also have high latency, which can introduce user pain and operational issues. In that case, it makes sense to deploy a bound model with a separate namespace for each region.

Site Resilient Datacenter Pair Design

To achieve a highly available and site resilient architecture, you must have two or more datacenters that are well-connected (ideally, you want a low round-trip network latency, otherwise replication and the client experience are adversely affected). In addition, the datacenters should be connected via redundant network paths supplied by different operating carriers.

While we support stretching an Active Directory site across multiple datacenters, for the PA we recommend having each datacenter be its own Active Directory site. There are two reasons:

  1. Transport site resilience via Shadow Redundancy and Safety Net can only be achieved when the DAG has members located in more than one Active Directory site.
  2. Active Directory has published guidance that states that subnets should be placed in different Active Directory sites when the round trip latency is greater than 10ms between the subnets.

Server Design

In the PA, all servers are physical, multi-role servers. Physical hardware is deployed rather than virtualized hardware for two reasons:

  1. The servers are scaled to utilize eighty percent of resources during the worst-failure mode.
  2. Virtualization adds an additional layer of management and complexity, which introduces additional recovery modes that do not add value, as Exchange provides equivalent functionality out of the box.

By deploying multi-role servers, the architecture is simplified as all servers have the same hardware, installation process, and configuration options. Consistency across servers also simplifies administration. Multi-role servers provide more efficient use of server resources by distributing the Client Access and Mailbox resources across a larger pool of servers. Client Access and Database Availability Group (DAG) resiliency is also increased, as there are more servers available for the load-balanced pool and for the DAG.

Commodity server platforms (e.g., 2U, dual socket servers with no more than 24 processor cores and 96GB of memory, that hold 12 large form-factor drive bays within the server chassis) are used in the PA. Additional drive bays can be deployed per server depending on the number of mailboxes, mailbox size, and the server's scalability.

Each server houses a single RAID1 disk pair for the operating system, Exchange binaries, protocol/client logs, and transport database. The rest of the storage is configured as JBOD, using large capacity 7.2K RPM serially attached SCSI (SAS) disks (while SATA disks are also available, the SAS equivalent provides better IO and a lower annualized failure rate). BitLocker is used to encrypt each disk, thereby providing data encryption at rest and mitigating concerns around data theft via disk replacement.
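
As a sketch of the encryption-at-rest piece, assuming a hypothetical data volume letter (in production the recovery keys would typically also be backed up, for example to Active Directory):

  # Encrypt an Exchange data volume; -UsedSpaceOnly speeds up initial encryption on freshly provisioned disks
  Enable-BitLocker -MountPoint "E:" -EncryptionMethod Aes256 -UsedSpaceOnly -RecoveryPasswordProtector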

To ensure that the capacity and IO of each disk is used as efficiently as possible, four database copies are deployed per disk. The normal run-time copy layout (calculated in the Exchange 2013 Server Role Requirements Calculator) ensures that there is no more than a single copy activated per disk.

Figure 2: Server Design

At least one disk in the disk pool is reserved as a hot spare. AutoReseed is enabled and quickly restores database redundancy after a disk failure by activating the hot spare and initiating database copy reseeds.
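
A minimal AutoReseed sketch, with hypothetical folder paths; the copies-per-volume value matches the four-copies-per-disk layout described above:

  Set-DatabaseAvailabilityGroup -Identity DAG1 `
      -AutoDagDatabasesRootFolderPath "C:\ExchangeDatabases" `
      -AutoDagVolumesRootFolderPath "C:\ExchangeVolumes" `
      -AutoDagDatabaseCopiesPerVolume 4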

Database Availability Group Design

Within each site resilient datacenter pair you will have one or more DAGs.

DAG Configuration

As with the namespace model, each DAG within the site resilient datacenter pair operates in an unbound model with active copies distributed equally across all servers in the DAG. This model provides two benefits:

  1. Ensures that each DAG member’s full stack of services is being validated (client connectivity, replication pipeline, transport, etc.).
  2. Distributes the load across as many servers as possible during a failure scenario, thereby only incrementally increasing resource utilization across the remaining members within the DAG.

Each datacenter is symmetrical, with an equal number of member servers within a DAG residing in each datacenter. This means that each DAG contains an even number of servers and uses a witness server for quorum arbitration.

The DAG is the fundamental building block in Exchange 2013. With respect to DAG size, a larger DAG provides more redundancy and resources. Within the PA, the goal is to deploy larger DAGs (typically starting out with an eight member DAG and increasing the number of servers as required to meet your requirements) and only create new DAGs when scalability introduces concerns over the existing database copy layout.
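
As a sketch of standing up an eight-member DAG split across the datacenter pair, using hypothetical server and witness names:

  New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer FS1.contoso.com -WitnessDirectory "C:\DAG1"
  # Four members in each datacenter (EX1-EX4 in datacenter 1, EX5-EX8 in datacenter 2)
  "EX1","EX2","EX3","EX4","EX5","EX6","EX7","EX8" | ForEach-Object {
      Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer $_
  }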

DAG Network Design

Since the introduction of continuous replication in Exchange 2007, Exchange has recommended multiple replication networks for separating client traffic from replication traffic. Deploying two networks allows you to isolate certain traffic along different network pathways and ensure that during certain events (e.g., reseed events) the network interface is not saturated (which is an issue with 100Mb, and to a certain extent, 1Gb interfaces). However, for most customers, having two networks operating in this manner was only a logical separation, as the same copper fabric was used by both networks in the underlying network architecture.

With 10Gb networks becoming the standard, the PA moves away from the previous guidance of separating client traffic from replication traffic. A single network interface is all that is needed because ultimately our goal is to achieve a standard recovery model regardless of the failure - whether a server failure occurs or a network failure occurs, the result is the same: a database copy is activated on another server within the DAG. This architectural change simplifies the network stack and obviates the need to manually eliminate heartbeat cross-talk.
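
With a single network per server, the DAG network configuration can simply be left in its automatic (default) mode. A sketch, assuming a DAG named DAG1:

  Set-DatabaseAvailabilityGroup -Identity DAG1 -ManualDagNetworkConfiguration $false
  # Verify the single, automatically configured DAG network
  Get-DatabaseAvailabilityGroupNetwork -Identity DAG1 | Format-List Name,Subnets,MapiAccessEnabled,ReplicationEnabled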

Witness Server Placement

Ultimately, the placement of the witness server determines whether the architecture can provide automatic datacenter failover capabilities or whether it will require a manual activation to enable service in the event of a site failure.

If your organization has a third location with a network infrastructure that is isolated from network failures that affect the site resilient datacenter pair in which the DAG is deployed, then the recommendation is to deploy the DAG’s witness server in that third location. This configuration gives the DAG the ability to automatically failover databases to the other datacenter in response to a datacenter-level failure event, regardless of which datacenter has the outage.

Figure 3: DAG (Three Datacenter) Design

If your organization does not have a third location, then place the witness server in one of the datacenters within the site resilient datacenter pair. If you have multiple DAGs within the site resilient datacenter pair, then place the witness server for all DAGs in the same datacenter (typically the datacenter where the majority of the users are physically located). Also, make sure the Primary Active Manager (PAM) for each DAG is also located in the same datacenter.
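
As a sketch of the witness placement and PAM alignment, with hypothetical names (Move-ClusterGroup comes from the FailoverClusters module; the PAM is the DAG member that owns the default cluster group):

  Set-DatabaseAvailabilityGroup -Identity DAG1 -WitnessServer FS1.contoso.com -WitnessDirectory "C:\DAG1"
  # Keep the PAM in the datacenter where the majority of users are located
  Move-ClusterGroup -Cluster DAG1 -Name "Cluster Group" -Node EX1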

Data Resiliency

Data resiliency is achieved by deploying multiple database copies. In the PA, database copies are distributed across the site resilient datacenter pair, thereby ensuring that mailbox data is protected from software, hardware and even datacenter failures.

Each database has four copies, with two copies in each datacenter, which means that at a minimum the PA requires four servers. Out of these four copies, three of them are configured as highly available. The fourth copy (the copy with the highest Activation Preference number) is configured as a lagged database copy. Due to the server design, each copy of a database is isolated from its other copies, thereby reducing failure domains and increasing the overall availability of the solution as discussed in DAG: Beyond the “A”.
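
As a sketch of this copy layout for a single database, using hypothetical server names (EX1/EX2 in one datacenter, EX3/EX4 in the other); the fourth, least-preferred copy is the lagged copy described below:

  Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer EX2 -ActivationPreference 2
  Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer EX3 -ActivationPreference 3
  # Lagged copy with a seven-day replay lag
  Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer EX4 -ActivationPreference 4 -ReplayLagTime 7.00:00:00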

The purpose of the lagged database copy is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption. It is not intended for individual mailbox recovery or mailbox item recovery.

The lagged database copy is configured with a seven-day ReplayLagTime. In addition, Replay Lag Manager is enabled to provide dynamic log file play down for lagged copies (see the sketch after the list below). This feature ensures that the lagged database copy can be automatically played down and made highly available in the following scenarios:

  • When a low disk space threshold is reached
  • When the lagged copy has physical corruption and needs to be page patched
  • When there are fewer than three available healthy copies (active or passive) for more than 24 hours
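
A one-line sketch of enabling Replay Lag Manager for a DAG (assuming a DAG named DAG1), which allows the automatic play down described in the scenarios above:

  Set-DatabaseAvailabilityGroup -Identity DAG1 -ReplayLagManagerEnabled $true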

When using the lagged database copy in this manner, it is important to understand that it is not a guaranteed point-in-time backup. The lagged database copy will have an availability threshold, typically around 90%, due to periods where the disk containing the lagged copy is lost to a disk failure, periods where the lagged copy becomes an HA copy (due to automatic play down), and periods where the lagged database copy is rebuilding its replay queue.

To protect against accidental (or malicious) item deletion, Single Item Recovery or In-Place Hold technologies are used, and the Deleted Item Retention window is set to a value that meets or exceeds any defined item-level recovery SLA.
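
A sketch of the item-level protection settings, assuming a hypothetical mailbox and a 30-day retention window sized to the recovery SLA:

  Set-Mailbox -Identity "kim@contoso.com" -SingleItemRecoveryEnabled $true -RetainDeletedItemsFor 30.00:00:00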

With all of these technologies in play, traditional backups are unnecessary; as a result, the PA leverages Exchange Native Data Protection.

Summary

The PA takes advantage of the changes made in Exchange 2013 to simplify your Exchange deployment, without decreasing the availability or the resiliency of the deployment. And in some scenarios, when compared to previous generations, the PA increases availability and resiliency of your deployment.

Ross Smith IV
Principal Program Manager
Office 365 Customer Experience

32 Comments
Not applicable
What exactly is the corruption that the lagged copy is protecting us from, and how would it be detected? I have seen the comment documented in multiple documents, with no clear description of the specific risk.
Not applicable
Great Framework!!
Not applicable
This article, and the countless other information on the Exchange Team blog, deserves a compliment.
THANK YOU INDEED.
With best regards, Hassan Sayed Isssa
Not applicable
Thank you to this team.
Not applicable
Nice to see all the recommendations in one place - great article.

The main concern I have about relying on single item recovery for restoring lost data is what happens if a user completely trashes their mailbox (it happens occasionally). Recovering the individual items, if possible, is one thing, but without being able to recover the folder structure you still have a very unhappy end user.

Any prospect on improvements in this area in the next version of exchange?

Nick
Not applicable
@Petri X - With respect to populating databases, I recommend a random approach - do not isolate sets of users onto particular databases. Randomly distribute. There will always be times when you will have to rebalance mailboxes across databases due to IO concerns, capacity concerns, regional issues, etc. As for shrinking the DB size, see the section I wrote on this topic in http://blogs.technet.com/b/exchange/archive/2012/01/30/3470667.aspx.

If you follow our guidance and scale out on commodity hardware, I doubt you will see very many servers with 100 database copies. I think I can count on one hand the number of large enterprise customers that have 100-databases-per-server deployments.

There is no such thing as an archive database. You only have a single database type in E2013 - mailbox database. That database can have multiple copies which are either HA copies or lagged copies. All copies, with the exception of the recovery database, count toward the 100 database limit.

Ross
Not applicable
Great Article.. Thank you Exchange Team
Not applicable
@Bruce A Anderson - there are two types of logical corruption events that we are generally interested in.

Database Logical Corruption:  This is the case where the database pages checksum, but the data on the page is wrong logically.  This can occur when ESE attempts to write a DB page and even though the OS storage stack returns success, the data either never makes it to disk or gets written to the wrong place.  We call this a lost flush. We have addressed this case by implementing a lost flush detection mechanism in the ESE database itself and coupling that with the single database page restore feature. This type of logical corruption event can occur with writing to the log files as well - this is handled via block mode continuous replication. Whenever we see a database logical corruption event, we will analyze it and determine if there is a way for us to solve it from a code perspective.

Store Logical Corruption:  This is the case where data is added/deleted/munged in a way that the user does not expect.  These cases are generally caused by 3rd party client and server applications.  It is generally only corruption in the sense that the user views it as corruption.  The Store just sees it as a series of valid MAPI operations.  The in-place hold feature provides a good deal of protection from this case (since no content can be permanently deleted by the rogue app and there is version control) but there may be scenarios where user mailboxes get so “corrupted” on such a large scale, that it would simply be easier to restore the database to a point back in time and export the user mailbox in a state before the corruption event occurred. This is where the lagged copy and Safety Net feature shine, as you don't have to do any data manipulation (no log pruning, no mailbox exports, etc.). Unfortunately, store logical corruption events are not easy to detect, and rely for the most part on user notification via the help desk.

Ross
Not applicable
Thank You Ross for the Great Article :)
Not applicable
Thanks Ross! Nice to have all of this in one post! Any PA items in relation to the file system on the JBOD arrays? In one of the MEC sessions, it sounded like ReFS was given a non-official recommendation. Any gotchas if choosing to go that route?
Not applicable
Thanks :) Great Article for the Exchange Server 2013 On-Premises Customers.
Not applicable
Ross,

Nice to see you posting good, cool, and helpful info again. I read a previous post of yours concerning RPC Client Access and wanted to bounce something off of you. I realize it's a little late and not directly relevant to this post, but hopefully you will find some time to comment on it. TIA.

Our Exchange 2010 environment differs a bit from what appears to be the most common setup using primary and failover data centers. Our company has operations in two regions and each region has a datacenter that services clients from that region.

Naturally each datacenter is in a separate AD site and houses one Exchange 2010 server running all roles. Clients in each region point their Outlook client or web browsers to the Exchange server for their region. We have not implemented CAS arrays but we do have a cross-site DAG cluster going.

Each server has four databases. Two are active databases servicing clients for the server's region and the other two are copies from the other region. I use the "Suspend-MailboxDatabaseCopy –ActivationOnly" command on the copies to make sure that they are not inadvertently activated due to a short-term WAN outage, latency, or a DAG member reboot.

So right now things run pretty smoothly. Clients get good Outlook performance because the CAS and MBX they are using is close by and has a high bandwidth connection. The issue we have, of course, is that in the event of a server or datacenter failure clients need to update their profiles or change the URL they use for OWA. Doing this for 400-600 clients is not ideal to say the least. So much so that in the rare instances that we have had trouble we usually opt for a few hours of downtime rather than a reconfigure of all the clients. We do not want to go "nuclear" unless we run into real big trouble!

So in our particular case, what options do we have to try to improve this situation? Is there some way a CAS Array will help? Should we just use low TTL DNS records?

BTW, I have doubts about clients using a non-region located CAS or CAS array due to a comment you made:
"I will point out that Outlook RPC is more resilient to network conditions than CAS to MBX RPC"

Does this mean that it is better for Outlook to go across the WAN to use a CAS with high-speed connectivity to MBX server or is it better for Outlook to use high-speed connection to CAS and let CAS go over WAN to MBX?

Any and all input will be greatly appreciated and I apologize again for the late and long post!
Not applicable
@Josh - at this time NTFS is still the best practice. ReFS is supported; our recommendation, if you go that route, is to turn off the integrity check as that has perf implications.
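
As a hedged sketch of that recommendation (hypothetical drive letter; the Windows Storage module's Format-Volume cmdlet is assumed), integrity streams can be disabled at format time:

  Format-Volume -DriveLetter E -FileSystem ReFS -SetIntegrityStreams $false -Confirm:$false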
Not applicable
Couple of things..

How do you populate the DBs? Do you size the DBs so that if all mailboxes reach their quotas your database drives are still not full? Meaning, if your mailboxes grow you do not need to start moving users onto new servers/databases? If you do need to move users, then how do you shrink the DB file size?

In what kind of scenarios can you see a single server with close to 100 DBs in use?

When you say one server can have 100 DBs, which DBs does that include? Primary, Archive, Lagged, Passive? 25 per DB type or some other split?
Not applicable
@tat386 - Yes you should use a CAS array namespace and follow our datacenter activation steps (http://technet.microsoft.com/en-us/library/dd351049(v=exchg.141).aspx) so that clients aren't forced to use a different namespace.

As for client latency, there is no right answer - it's very subjective with respect to the end user. The higher the latency the more time it will take for the operations to complete, which a user may or may not notice. Given that in E2010, CAS to MBX is server RPC (and not tolerant to latency), you definitely want CAS to MBX to be well connected.

Ross
Not applicable
Why 4 copies per disk? Assuming it's a 4TB spindle, is the idea to keep the DB size small?
Not applicable
Very interesting blog Ross. Most is accurate but honestly the PA is based on least common criteria. I see you don't suggest virtualization, but Exchange and Windows still have the same issues that make virtualization effective, which are changing workloads, underutilization of systems, and overall system maintenance. With the limited amount of overhead a hypervisor adds, it seems that it would actually provide more scalability than physical hardware if configured properly. It also eliminates the inherent issues with cluster networking problems that have plagued not only Exchange 2013 but also Exchange 2010, which is more OS related than Exchange.

Also, have we seen performance benchmarks for Bitlocker and Exchange 2013 yet? I can't seem to find them.
Not applicable
Great Information Ross! Thank you
Not applicable
Nice Article, Thanks !!
Not applicable
The referenced links, like the Active Directory published guidance, aren't working.
Not applicable
Does Exchange Native Data Protection in Exchange 2013 bring the mailbox back to a point in time including which items were in which folders?

If not, then an Exchange backup system is needed because, to my knowledge, Dumpster 2.0 retains the deleted items but not the folder they were deleted from. Until Exchange can bring a mailbox back to a specific point in time, including the folder hierarchy and item placement in that hierarchy, Exchange backups still have a valid place.
Not applicable
Thanks for the great post... :)
Not applicable
What exactly is the following quoted text referring to? "The purpose of the lagged database copy is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption." What type of corruption and how would this corruption be detected? Especially in 7 days. I have seen this reference before, with no clear description.
Not applicable
Very interesting! Can you provide some additional details on the JBOD? Do you let each of the 10 data disks become a separate volume and drive letter (or mount in a directory), or do you configure Storage Spaces with Simple/Mirror/Parity?
Not applicable
Great info. It would be great to hear about some of the patching strategies that O365 has adopted and if some large on-prem folks could take away any good ideas. Lastly, does O365 utilize any of the PowerShell DSC coolness?
Not applicable
First, excellent post and thanks.
1. Regarding DNS, if we use DNS round robin (client gets a different response every time) we rely on the DNS cache, right? So isn't it better to use two IPs but disable round robin (at the "internet"/access level) so the clients can recover from a datacenter outage quicker?
2. How can we utilize even a 1Gb link when the underlying disk subsystem (SATA/SAS) cannot even get to these speeds on JBODs? (That's one of the main things which was always negative for me when discussing the JBOD approach, which I support, with clients.)

Thanks again
Not applicable
Hello Ross,

Great article that compiles the main parts of the Exchange 2013 design, but maybe it would be interesting to have guidance about "Dynamic cluster" with Windows Server 2012 R2 as explained in the Schott blog post.

Any input about that: dynamic quorum and no DAG CNO?

Thanks a lot.
Not applicable
it would be nice if Microsoft's preferred architecture was revealed at the same TIME that the product became GA, and even nicer if Microsoft Exchange advisers knew it before they advise customers....
Not applicable
great post.

are there any plans to highlight this critical bug on the blog? it would be good to let anyone who hasn't migrated mailboxes yet know not to get rid of their old DBs - this is wreaking havoc for us and a lot of others:

http://social.technet.microsoft.com/Forums/en-US/291b583f-f228-4502-b7f0-604499fd8d37/error-database...
Not applicable
@Ross - Thanks, that covers my question superbly. I had been hesitant to recommend Exchange Native Data Protection because that answer was missing. With this, I am much more comfortable in supporting that choice.
Cheers.
Not applicable
Why the recommendation for hosting Exchange on physical servers? Microsoft is propagating all workloads to be virtualized on Hyper-V, so I don't get why Exchange 2013 should be running on physical servers. The arguments given in this article to host Exchange on physical servers are quite weak. First, most of the infrastructures I've seen have the capacity to host Exchange in VMs in worst case scenarios; second, the complexity which is added by virtualizing Exchange is actually no complexity, as SysAdmins are getting more and more proficient at managing virtual platforms. It's like the Exchange department is contradicting Microsoft's vision of putting everything in the cloud (which means in this case virtualizing the workloads). Please enlighten me (-:

Not applicable
Ross, would you still say that traditional backups are unnecessary even when a network worm takes out all IP connected servers? Wouldn't an off-site backup be the only way to recover your servers in that scenario? Remember, there is a reason we have to apply Windows updates every month ;)

Also, the same threat exists with a malicious attacker (could be an internal disgruntled employee) who targets the host operating systems. Without an off-site backup your organization would have 100% data loss. Then you would be wishing for traditional backups that are carried off-site.
