Over the last several months, I have been involved in several critical customer escalations (what we refer to as critsits) for Exchange 2010 and Exchange 2013. As a result of my involvement, I have noticed several common themes and trends. The intent of this blog post is to describe some of these common issues and problems, and hopefully this post will lead you to come to the same conclusion that I have – that many of these issues could have been avoided by taking sensible, proactive steps.
By far, the most common issue was that almost every customer was running out-of-date software. This included OS patches, Exchange patches, Outlook client patches, drivers, and firmware. One might think that being out-of-date is not such a bad thing, but in almost every case, the customer was experiencing known issues that were resolved in current releases. Maintaining currency also ensures an environment is protected from known security defects. In addition, as the software version ages, it eventually goes out of support (e.g., Exchange Server 2010 Service Pack 2).
Software patching is not simply an issue for Microsoft software. You must also ensure that all inter-dependent solutions (e.g., Blackberry Enterprise Server, backup software, etc.) are kept up-to-date for a specific release as this ensures optimal reliability and compatibility.
Microsoft recommends adopting a software update strategy that ensures all software follows N to N-1 policy, where N is a service pack, update rollup, cumulative update, maintenance release, or whatever terminology is used by the software vendor. We strongly recommend that our customers also adopt a similar strategy with respect to hardware firmware and drivers ensuring that network cards, BIOS, and storage controllers/interfaces are kept up to date.
Customers must also follow the software vendor’s Software Lifecycle and appropriately plan on upgrading to a supported version in the event that support for a specific version is about to expire or is already out of support.
For Exchange 2010, this means having all servers deployed with Service Pack 3 and either Rollup 7 or Rollup 8 (at the time of this writing). For Exchange 2013, this means having all servers deployed with Cumulative Update 6 or Cumulative Update 7 (at the time of this writing).
For environments that have a hybrid configuration with Office 365, the servers participating in the hybrid configuration must be running the latest version (e.g., Exchange 2010 SP3 RU8 or Exchange 2013 CU7) or the prior version (e.g., Exchange 2010 SP3 RU7 or Exchange 2013 CU6) in order to maintain and ensure compatibility with Office 365. There are some required dependencies for hybrid deployments, so it’s even more critical you keep your software up to date if you choose to go hybrid.
Change control is a critical process that is used to ensure an environment remains healthy. Change control enables you to build a process by which you can identify, approve, and reject proposed changes. It also provides a means by which you can develop a historical accounting of changes that occur. Often times I find that customers only leverage a change control process for “big ticket” items, and forego the change control process for what are deemed as “simple changes.”
In addition to building a change control process, it is also critical to ensure that all proposed changes are vetted in a lab environment that closely mirrors production, and includes any 3rdparty applications you have integrated (the number of times I have seen Exchange get updated and heard the integrated app has failed is non-zero, to use a developer’s phrase).
While lab environments provide a great means to validate the functionality of a proposed change, they often do not provide a view on the scalability impact of a change. One way to address this is to leverage a “slice in production” where a change is deployed to a subset of the user population. This subset of the user population can be isolated using a variety of means, depending on the technology (e.g., dedicated forests, dedicated hardware, etc.). Within Office 365, we use slices in productions a variety of different ways; for example, we leverage them to test (or what we call dogfood) new functionality prior to customer release and we use it as a First Release mechanism so that customers can experience new functionality prior to worldwide deployment.
If you can’t build a scale impact lab, you should at a minimum build an environment that includes all of the component pieces you have in place, and make sure you keep it updated so you can validate changes within your core usage scenarios.
The other common theme I saw is bundling multiple changes together in a single change control request. While bundling multiple changes together may seem innocuous, when you are troubleshooting an issue, the last thing you want to do is make multiple changes. First, if the issue gets resolved, you do not know which particular change resolved the issue. Second, it is entirely possible the changes may exacerbate the current issue.
Failure happens. There is no technology that can change that fact. Disks, servers, racks, network appliances, cables, power substations, pumps, generators, operating systems, applications, drivers, and other services – there is simply no part of an IT service that is not subject to failure.
This is why we use built-in redundancy to mitigate failures. Where one entity is likely to fail, two or more entities are used. This pattern can be observed in Web server arrays, disk arrays, front-end and back-end pools, and the like. But redundancy can be prohibitively expensive (as a simple multiplication of cost). For example, the cost and complexity of the SAN-based storage system that was at the heart of Exchange until the 2007 release, drove the Exchange Team to evolve Exchange to integrate key elements of storage directly into its architecture. Every SAN system and every disk will ultimately fail, and implementing a highly-redundant system using SAN technology is cost-prohibitive, so Exchange evolved from requiring expensive, scaled-up, high-performance storage systems, to being optimized for commodity scaled-out servers with commodity low-performance SAS/SATA drives in a JBOD configuration with commodity disk controllers. This architecture enables Exchange to be resilient to any storage failure.
By building a replication architecture into Exchange and optimizing Exchange for commodity hardware, failure modes are predictable from a hardware perspective, and that redundancy can removed from other hardware layers, as well. Redundant NICs, redundant power supplies, etc., can also be removed from the server hardware. Whether it is a disk, a controller, or a motherboard that fails, the end result is the same: another database copy is activated on another server.
The more complex the hardware or software architecture, the more unpredictable failure events can be. Managing failure at scale requires making recovery predictable, which drives the necessity for predictable failure modes. Examples of complex redundancy are active/passive network appliance pairs, aggregation points on a network with complex routing configurations, network teaming, RAID, multiple fiber pathways, and so forth.
Removing complex redundancy seems counter-intuitive – how can removing hardware redundancy increase availability? Moving away from complex redundancy models to a software-based redundancy model creates a predictable failure mode.
Several of my critsit escalations involved customers with complex architectures where components within the architecture were part of the syst emic issue trying to be resolved:
How can the complexity be reduced? For Exchange, we use predictable recovery models (for example, activation of a database copy). Our Preferred Architecture is designed to reduce complexity and deliver a symmetrical design that ensures that the user experience is maintained when failures occur.
Another concerning trend I witnessed is that customers repeatedly ignored recommendations from their product vendors. There are many reasons I’ve heard to explain away why a vendor’s advice about configuring or managing their own product was ignored, but it’s rare to see a case where a customer honestly knows more about how a vendor’s product works than does the vendor. If the vendor tells you to configure X or update to version Y, chances are they are telling you for a reason, and you would be wise to follow that advice and not ignore it.
Microsoft’s recommendations are grounded upon data- the data we collect during a support call, the data we collect during a Risk Assessment, and the data we get from you. All of this data is analyzed before recommendations are made. And because we have a lot of customers, the collective learnings we get from you plays a big part.
When deploying a new version of software, whether it's Exchange or another product, it's important to follow an appropriate deployment plan. Customers that don't take on the unnecessary risk of running into unexpected issues during the deployment.
Proper planning of an Exchange deployment is imperative. At a minimum, any deployment plan you use should include the following steps:
The last step is important. Far too often, I see customers implement an architecture and then question why the system is overloaded. The landscape is constantly evolving. Years ago, bring your own device (BYOD) was not an option in many customer environments, whereas, now it is becoming the norm. As a result, your messaging environment is constantly changing – users are adapting to the larger mailbox quotas, the proliferation of devices, the capabilities within the devices, etc. These changes affect your design and can consume more resources. In order to account for this, you must baseline, monitor, and evaluate how the system is performing and make changes, if necessary.
To run a successful service at any scale, you must be able to monitor the solution to not only identify issues as they occur in real-time, but to also proactively predict and trend how the user base or user base activity is growing. Performance, event log and protocol logging data provides two valuable functions:
The data collected can also be used to build intelligent reports that expose the overall health of the environment. These reports can then be shared at monthly service reviews that outline the health and metrics, actions taken within the last month, plans for the next month, issues occurring within the environment and steps being taken to resolve the issues.
If you do not have a monitoring solution capable of collecting and storing historical data, you can still collect the data you need.
As I said at the beginning of this article, many of these customer issues could have been avoided by taking sensible, proactive steps. I hope this article inspires you to investigate how many of these might affect your environments, and more importantly, to take steps to resolve them, before you are my next critsit escalation.
Ross Smith IV
Principal Program Manager
Office 365 Customer Experience
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.