High Availability
Released: Exchange Server Role Requirements Calculator 8.3
Today, we released an updated version of the Exchange Server Role Requirements Calculator. This release focuses on two specific enhancements.

Exchange 2016 designs now take into account the CU3 improvement that reduces the bandwidth required between active and passive HA copies, as the local search instance can read data from its local database copy.

The calculator now supports the ability to automatically calculate the number of DAGs and the corresponding number of Mailbox servers that should be deployed to support the defined requirements. This process takes into account memory, CPU cores, and disk configuration when determining the optimal configuration, ensuring that recommended thresholds are not exceeded. As a result of this change, you will find that the Input tab has been rearranged. Specifically, the DAG variables have been moved to the end of the worksheet to ensure that you have completely entered all information before attempting an automatic calculation. As with everything else in the calculator, you can turn the automatic calculation off and manually select the number of Mailbox servers and DAGs you would like to deploy.

For all the other improvements and bug fixes, please review the readme or download the update. As always, we welcome feedback; please report any issues you encounter while using the calculator by emailing strgcalc AT microsoft DOT com.

Ross Smith IV
Principal Program Manager
Office 365 Customer Experience

The Exchange 2016 Preferred Architecture
The Preferred Architecture (PA) is the Exchange Engineering Team’s best practice recommendation for what we believe is the optimum deployment architecture for Exchange 2016, and one that is very similar to what we deploy in Office 365. While Exchange 2016 offers a wide variety of architectural choices for on-premises deployments, the architecture discussed below is our most scrutinized one ever. While there are other supported deployment architectures, they are not recommended.

The PA is designed with several business requirements in mind, such as the requirement that the architecture be able to:

- Include both high availability within the datacenter, and site resilience between datacenters
- Support multiple copies of each database, thereby allowing for quick activation
- Reduce the cost of the messaging infrastructure
- Increase availability by optimizing around failure domains and reducing complexity

The specific prescriptive nature of the PA means of course that not every customer will be able to deploy it (for example, customers without multiple datacenters). And some of our customers have different business requirements or other needs which necessitate a different architecture. If you fall into those categories, and you want to deploy Exchange on-premises, there are still advantages to adhering as closely as possible to the PA, and deviating only where your requirements widely differ. Alternatively, you can consider Office 365, where you can take advantage of the PA without having to deploy or manage servers.

The PA removes complexity and redundancy where necessary to drive the architecture to a predictable recovery model: when a failure occurs, another copy of the affected database is activated. The PA is divided into four areas of focus:

- Namespace design
- Datacenter design
- Server design
- DAG design

Namespace Design

In the Namespace Planning and Load Balancing Principles articles, I outlined the various configuration choices that are available with Exchange 2016. For the namespace, the choices are to either deploy a bound namespace (having a preference for the users to operate out of a specific datacenter) or an unbound namespace (having the users connect to any datacenter without preference). The recommended approach is to utilize the unbound model, deploying a single Exchange namespace per client protocol for the site resilient datacenter pair (where each datacenter is assumed to represent its own Active Directory site - see more details on that below). For example:

- autodiscover.contoso.com
- For HTTP clients: mail.contoso.com
- For IMAP clients: imap.contoso.com
- For SMTP clients: smtp.contoso.com

Each Exchange namespace is load balanced across both datacenters in a layer 7 configuration that does not leverage session affinity, resulting in fifty percent of traffic being proxied between datacenters. Traffic is equally distributed across the datacenters in the site resilient pair, via round robin DNS, geo-DNS, or other similar solutions. From our perspective, the simplest solution is the least complex and the easiest to manage, so our recommendation is to leverage round robin DNS. For the Office Online Server farm, a namespace is deployed per datacenter, with the load balancer utilizing layer 7, maintaining session affinity using cookie based persistence.
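As a hedged illustration of the round robin DNS approach, the sketch below publishes one A record per datacenter VIP for the unified namespace. The zone and host name come from the example above; the VIP addresses are hypothetical, and the DnsServer module is assumed to be available on a Windows DNS server.

# Minimal sketch: one A record per datacenter VIP so DNS round robin
# distributes client traffic across both datacenters in the pair.
Add-DnsServerResourceRecordA -ZoneName "contoso.com" -Name "mail" -IPv4Address 192.0.2.10    # Datacenter 1 VIP (example)
Add-DnsServerResourceRecordA -ZoneName "contoso.com" -Name "mail" -IPv4Address 198.51.100.10 # Datacenter 2 VIP (example)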
Figure 1: Namespace Design in the Preferred Architecture

In the event that you have multiple site resilient datacenter pairs in your environment, you will need to decide if you want to have a single worldwide namespace, or if you want to control the traffic to each specific datacenter by using regional namespaces. Ultimately your decision depends on your network topology and the associated cost of using an unbound model; for example, if you have datacenters located in North America and Europe, the network link between these regions might not only be costly, but it might also have high latency, which can introduce user pain and operational issues. In that case, it makes sense to deploy a bound model with a separate namespace for each region. However, options like geographical DNS offer you the ability to deploy a single unified namespace, even when you have costly network links; geo-DNS allows you to have your users directed to the closest datacenter based on their client’s IP address.

Figure 2: Geo-distributed Unbound Namespace

Site Resilient Datacenter Pair Design

To achieve a highly available and site resilient architecture, you must have two or more datacenters that are well-connected (ideally, you want a low round-trip network latency, otherwise replication and the client experience are adversely affected). In addition, the datacenters should be connected via redundant network paths supplied by different operating carriers. While we support stretching an Active Directory site across multiple datacenters, for the PA we recommend that each datacenter be its own Active Directory site. There are two reasons:

- Transport site resilience via Shadow Redundancy and Safety Net can only be achieved when the DAG has members located in more than one Active Directory site.
- Active Directory has published guidance that states that subnets should be placed in different Active Directory sites when the round trip latency is greater than 10ms between the subnets.

Server Design

In the PA, all servers are physical servers. Physical hardware is deployed rather than virtualized hardware for two reasons:

- The servers are scaled to use 80% of resources during the worst-failure mode.
- Virtualization adds an additional layer of management and complexity, which introduces additional recovery modes that do not add value, particularly since Exchange provides that functionality.

Commodity server platforms are used in the PA. Commodity platforms include:

- 2U, dual socket servers (20-24 cores)
- up to 192GB of memory
- a battery-backed write cache controller
- 12 or more large form factor drive bays within the server chassis

Additional drive bays can be deployed per-server depending on the number of mailboxes, mailbox size, and the server’s scalability. Each server houses a single RAID1 disk pair for the operating system, Exchange binaries, protocol/client logs, and transport database. The rest of the storage is configured as JBOD, using large capacity 7.2K RPM serially attached SCSI (SAS) disks (while SATA disks are also available, the SAS equivalent provides better IO and a lower annualized failure rate). Each disk that houses an Exchange database is formatted with ReFS (with the integrity feature disabled), and the DAG is configured such that AutoReseed formats the disks with ReFS:

Set-DatabaseAvailabilityGroup <DAG> -FileSystem ReFS

BitLocker is used to encrypt each disk, thereby providing data encryption at rest and mitigating concerns around data theft or disk replacement. For more information, see Enabling BitLocker on Exchange Servers.
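If you are preparing a data volume by hand rather than relying on AutoReseed, the following is a minimal sketch of formatting it as ReFS with the integrity feature disabled. The drive letter and label are hypothetical; adjust them to your own mount point layout.

# Format an Exchange database volume as ReFS with integrity streams disabled.
Format-Volume -DriveLetter E -FileSystem ReFS -NewFileSystemLabel "EXVOL01" -SetIntegrityStreams $false -Confirm:$false

AutoReseed applies the equivalent settings automatically once the DAG's FileSystem property is set to ReFS, as shown above.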
To ensure that the capacity and IO of each disk is used as efficiently as possible, four database copies are deployed per-disk. The normal run-time copy layout ensures that there is no more than a single active copy per disk. At least one disk in the disk pool is reserved as a hot spare. AutoReseed is enabled and quickly restores database redundancy after a disk failure by activating the hot spare and initiating database copy reseeds.

Database Availability Group Design

Within each site resilient datacenter pair you will have one or more DAGs.

DAG Configuration

As with the namespace model, each DAG within the site resilient datacenter pair operates in an unbound model with active copies distributed equally across all servers in the DAG. This model:

- Ensures that each DAG member’s full stack of services (client connectivity, replication pipeline, transport, etc.) is being validated during normal operations.
- Distributes the load across as many servers as possible during a failure scenario, thereby only incrementally increasing resource use across the remaining members within the DAG.

Each datacenter is symmetrical, with an equal number of DAG members in each datacenter. This means that each DAG has an even number of servers and uses a witness server for quorum maintenance. The DAG is the fundamental building block in Exchange 2016. With respect to DAG size, a larger DAG provides more redundancy and resources. Within the PA, the goal is to deploy larger DAGs (typically starting out with an eight member DAG and increasing the number of servers as required to meet your requirements). You should only create new DAGs when scalability introduces concerns over the existing database copy layout.

DAG Network Design

The PA leverages a single, non-teamed network interface for both client connectivity and data replication. A single network interface is all that is needed because ultimately our goal is to achieve a standard recovery model regardless of the failure - whether a server failure occurs or a network failure occurs, the result is the same: a database copy is activated on another server within the DAG. This architectural change simplifies the network stack and obviates the need to manually eliminate heartbeat cross-talk.

Note: While your environment may not use IPv6, IPv6 remains enabled per IPv6 support in Exchange.

Witness Server Placement

Ultimately, the placement of the witness server determines whether the architecture can provide automatic datacenter failover capabilities or whether it will require a manual activation to enable service in the event of a site failure. If your organization has a third location with a network infrastructure that is isolated from network failures that affect the site resilient datacenter pair in which the DAG is deployed, then the recommendation is to deploy the DAG’s witness server in that third location. This configuration gives the DAG the ability to automatically failover databases to the other datacenter in response to a datacenter-level failure event, regardless of which datacenter has the outage. If your organization does not have a third location, consider placing the witness in Azure; alternatively, place the witness server in one of the datacenters within the site resilient datacenter pair.
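As a hedged sketch of pointing a DAG at a witness hosted in a third location, the commands below use hypothetical DAG, server, and directory names:

# Configure the DAG to use a file share witness in an isolated third site.
Set-DatabaseAvailabilityGroup -Identity DAG1 -WitnessServer fs01.thirdsite.contoso.com -WitnessDirectory C:\DAG1Witness

# Verify which witness the DAG is currently using.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | Format-List Name, WitnessServer, WitnessDirectory, WitnessShareInUse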
If you have multiple DAGs within the site resilient datacenter pair, then place the witness server for all DAGs in the same datacenter (typically the datacenter where the majority of the users are physically located). Also, make sure the Primary Active Manager (PAM) for each DAG is located in the same datacenter.

Data Resiliency

Data resiliency is achieved by deploying multiple database copies. In the PA, database copies are distributed across the site resilient datacenter pair, thereby ensuring that mailbox data is protected from software, hardware and even datacenter failures. Each database has four copies, with two copies in each datacenter, which means that at a minimum, the PA requires four servers. Out of these four copies, three of them are configured as highly available. The fourth copy (the copy with the highest Activation Preference number) is configured as a lagged database copy. Due to the server design, each copy of a database is isolated from its other copies, thereby reducing failure domains and increasing the overall availability of the solution as discussed in DAG: Beyond the “A”.

The purpose of the lagged database copy is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption. It is not intended for individual mailbox recovery or mailbox item recovery. The lagged database copy is configured with a seven day ReplayLagTime. In addition, the Replay Lag Manager is enabled to provide dynamic log file play down for lagged copies when availability is compromised. When using the lagged database copy in this manner, it is important to understand that the lagged database copy is not a guaranteed point-in-time backup. The lagged database copy will have an availability threshold, typically around 90%, due to periods where the disk containing a lagged copy is lost due to disk failure, the lagged copy becomes an HA copy (due to automatic play down), or the lagged database copy is re-building the replay queue.

To protect against accidental (or malicious) item deletion, Single Item Recovery or In-Place Hold technologies are used, and the Deleted Item Retention window is set to a value that meets or exceeds any defined item-level recovery SLA. With all of these technologies in play, traditional backups are unnecessary; as a result, the PA leverages Exchange Native Data Protection.

Office Online Server Design

At a minimum, you will want to deploy two Office Online Servers in each datacenter that hosts Exchange 2016 servers. Each Office Online Server should have 8 processor cores, 32GB of memory, and at least 40GB of space dedicated for log files.

Note: The Office Online Server infrastructure does not need to be exclusive to Exchange. As such, the hardware guidance takes into account usage by SharePoint and Skype for Business. Be sure to work with any other teams using the Office Online Server infrastructure to ensure the servers are adequately sized for your specific deployment.

The Exchange servers within a particular datacenter are configured to use the local Office Online Server farm via the following cmdlet:

Set-MailboxServer <East MBX Server> –WACDiscoveryEndPoint https://oos-east.contoso.com/hosting/discovery
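Pulling the data resiliency settings described above together, here is a hedged configuration sketch. The database, server, and DAG names are hypothetical, and the 30 day retention value is only an example to be matched to your own item-level recovery SLA.

# Configure the fourth copy as a lagged copy with a seven day replay lag.
Set-MailboxDatabaseCopy -Identity "DB01\EX16-04" -ReplayLagTime 7.00:00:00

# Ensure Replay Lag Manager can dynamically play down the lag when availability is compromised.
Set-DatabaseAvailabilityGroup -Identity DAG1 -ReplayLagManagerEnabled $true

# Protect against accidental or malicious deletion with Single Item Recovery and a
# deleted item retention window sized to the recovery SLA (30 days in this sketch).
Set-MailboxDatabase -Identity "DB01" -DeletedItemRetention 30.00:00:00
Get-Mailbox -Database "DB01" -ResultSize Unlimited |
    Set-Mailbox -SingleItemRecoveryEnabled $true -RetainDeletedItemsFor 30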
Summary

Exchange Server 2016 continues the investments introduced in previous versions of Exchange by reducing the server role architecture complexity, aligning with the Preferred Architecture and Office 365 design principles, and improving coexistence with Exchange Server 2013. These changes simplify your Exchange deployment, without decreasing the availability or the resiliency of the deployment. And in some scenarios, when compared to previous generations, the PA increases availability and resiliency of your deployment.

Ross Smith IV
Principal Program Manager
Office 365 Customer Experience

Site Resilience Impact on Availability
This article continues the analysis I started in my previous article, DAG: beyond the “A”. We all understand that a good technology solution must have high levels of availability, and that simplicity and redundancy are the two major factors that drive solution availability. More specifically:

- The simpler the solution (the fewer independent critical components it has), the higher the availability of the solution;
- The more redundant the solution (the more identical components duplicate each other and provide redundant functionality), the higher the availability of the solution.

My previous article provides mathematical formulas that allow you to calculate planned availability levels for your specific designs. However, this analysis was performed from the standpoint of a single datacenter (site). Recently the question was asked: how does bringing site resilience to an Exchange design affect the solution's overall level of availability? How much, if any, will we increase overall solution availability if we deploy Exchange in a site resilient configuration? Is it worth it?

Availability of a Single Datacenter Solution

Let us reiterate some of the important points for a single site/datacenter solution first. Within a datacenter, there are multiple critical components of a solution, and the availability of an entire solution can be analyzed using the principles described in DAG: beyond the “A”, based on the individual availability and redundancy levels of the solution components. Most interestingly, availability depends on the number of redundant database copies deployed. If the availability of a single database copy is A = 1 – P (this includes the database copy, and the server and disk that are hosting it), then the availability of a set of N database copies will be A(N) = 1 – P^N = 1 – (1 – A)^N. The more copies, the higher the availability; the fewer copies, the lower the availability. The graph below illustrates this formula, showing the dependency of A(N) on N:

Figure 1: Availability dependence on the number of redundant copies

Note: All plots in this article were built using the Wolfram Alpha online mathematical computation engine.

For example, if A = 90% (the value selected on the graph above) and N = 4, then A(4) = 99.99%. However, the full solution consists not just of the redundant database copies but of many other critical components as well: Active Directory, DNS, load balancing, network, power, etc. We can assume that the availability of these components remains the same regardless of how many database copies are deployed. Let’s say the overall availability of all of these components taken together in a given datacenter is A_infra. Then, the overall availability of a solution that has N database copies deployed in a single datacenter is A_1(N) = A_infra x A(N). For example, if A_infra = 99.9%, A = 90%, and N = 4, then A_1(4) = 99.89%.
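As a quick sanity check of these formulas, a short PowerShell sketch whose inputs simply mirror the example above:

# Availability of N redundant copies, each with availability A: A(N) = 1 - (1 - A)^N
$A      = 0.90     # availability of a single database copy
$Ainfra = 0.999    # availability of the shared datacenter infrastructure
$N      = 4

$AofN = 1 - [math]::Pow(1 - $A, $N)
$A1   = $Ainfra * $AofN

"{0:P4}" -f $AofN   # ~99.99% (copies only)
"{0:P4}" -f $A1     # ~99.89% (copies plus infrastructure)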
Adding Site Resilience

So we figured that the availability of a single datacenter solution is A_1 (for example, A_1 = 99.9% = 0.999). Correspondingly, the probability of a datacenter failure is P_1 = 1 – A_1 (in this example, P_1 = 0.1% = 0.001). Let’s assume that a second site/datacenter has availability of A_2 (it could be the same value as A_1 or it could be different – it depends on the site configuration). Correspondingly, its probability of failure is P_2 = 1 – A_2. Site resilience means that if the solution fails in the first datacenter, there is still a second datacenter that can take over and continue servicing users.

Therefore, with site resilience the solution will fail only when *both* datacenters fail. If both datacenters are fully independent and don’t share any failure domains (for example, they don’t depend on the same power source or network core switch), then the probability of a failure of both datacenters is P = P_1 x P_2. Correspondingly, the availability of the solution that involves site resilience based on two datacenters is A = 1 – P = 1 – (1 – A_1) x (1 – A_2). Because the values of both P_1 and P_2 are very small, the availability of a site resilient solution effectively sums the “number of nines” for both datacenters. In other words, if DC1 has 3 nines availability (99.9%), and DC2 has 2 nines availability (99%), the combined site resilient solution will have 5 nines availability (99.999%).

This is actually a very interesting result. For illustration, let us use the datacenter tier definitions adopted by ANSI/TIA (Standard TIA-942) and the Uptime Institute, with the availability values for four datacenter tiers defined as follows:

- Tier 1: Basic – 99.671%
- Tier 2: Redundant Components – 99.741%
- Tier 3: Concurrently Maintainable – 99.982%
- Tier 4: Fault Tolerant – 99.995%

We can see that if we deploy two relatively inexpensive Tier 2 datacenters, the resulting availability of the solution will be higher than if we deploy one very expensive Tier 4 datacenter:

- Datacenter 1 (DC1): 99.741%
- Datacenter 2 (DC2): 99.741%
- Site Resilient Solution (DC1 + DC2): 99.9993%

Of course, this logic applies not only to datacenter considerations but also to any solution that involves redundant components. Instead of deploying an expensive single component (e.g., a disk, a server, a SAN, a switch) with a very high level of availability, it might be cheaper to deploy two or three less expensive components with properly implemented redundancy, and it will actually result in better availability. This is one of the fundamental reasons why we recommend using redundant commodity servers and storage in the Exchange Preferred Architecture model.

Practical Impact of Site Resilience

The advantage of having two site resilient datacenters instead of a single datacenter is obvious if we assume that site resilient solutions are based on the same single datacenter design implemented in each of the two redundant datacenters. For example, if we compare one site with 2 database copies and two sites with 2 database copies in each, obviously the second solution has much higher availability, not so much because of site resilience but simply because now we have more total copies – we moved from 2 total copies to 4. But this is not a fair comparison. What is the effect of the site resilience configuration itself? What if we compare the single datacenter solution and the site resilient solution when they have the same number of copies? For example, a single datacenter solution with 4 database copies and a site resilient solution with two sites and 2 database copies in each site (so that both solutions have 4 total database copies). Here the calculation becomes more complex. Using the results from above, let’s say the availability of a solution with a single site and M database copies is A_1(M) (for example, A_1(4) = 99.9% = 0.999). Obviously, the availability of the same solution but with fewer database copies will be lower (for example, A_1(2) = 90% = 0.9). Let’s assume similar logic for the second site: let it have N copies and a corresponding availability of A_2(N).
Now we need to compare the following values:

- Availability of a single site solution with M+N copies: A_S = A_1(M+N)
- Availability of a site resilient solution with M copies in the 1st site and N copies in the 2nd site: A_SR = 1 – (1 – A_1(M)) x (1 – A_2(N))

These values are not very easy to calculate, so let us assume for simplicity that both datacenters are equivalent (A_1 = A_2) and both have an equal number of copies (M = N). Then we have:

A_S = A_1(2N)
A_SR = 1 – (1 – A_1(N))^2

We know that A_1 = A_infra x A(N), and that A(N) = 1 – P^N = 1 – (1 – A)^N. Since we consider the datacenters equivalent, we can assume that A_infra is the same for both datacenters. This gives us:

A_S = A_infra x (1 – (1 – A)^(2N))
A_SR = 1 – (1 – A_infra x (1 – (1 – A)^N))^2

These values depend on three variables: A_infra, A, and N. To compare these values, let us fix two of the variables and see how the result depends on the third one. One comparison is to see how the values change depending on A if A_infra and N are fixed. For example, let A_infra = 99% = 0.99, and N = 2:

Figure 2: Availability dependence on number of redundant copies for the single site and site resilient scenarios

The blue line (bottom curved line) represents the single datacenter solution, and the purple line (top curved line) represents the site resilient solution. We can see that the site resilient solution always provides better availability, and the difference is steady even if the availability of an individual database copy approaches 1. This is because the availability of the other critical components (A_infra) is not perfect. The better A_infra (the closer it is to 1), the smaller the difference between the two solutions. To perform another comparison and confirm the last conclusion, let us see how availability changes depending on A_infra if A and N are fixed. For example, let A = 0.9 and N = 2:

Figure 3: Availability dependence on the datacenter infrastructure availability for the single site and site resilient scenarios

Again, we can see that the site resilient solution provides better availability, but the difference between the two availability results is proportional to 1 – A_infra and so it vanishes when A_infra –> 1, which confirms the conclusion made earlier. In other words, if your single datacenter has a perfect 100% availability, then a site resilient solution is not needed. Now isn’t that obvious without any calculations? The following table illustrates these results:

Inputs:
- Availability of a single copy: 90.000%
- Datacenter infrastructure availability (A_infra): 99.900%

Impact of site resilience:
- Single Datacenter – 4 copies per site – availability 99.890010%
- Two Datacenters – 2 copies per site – availability 99.987922%
- Difference (on the order of 1 – A_infra = 0.100%): 0.097912%

You can leverage this simple Excel spreadsheet (attached to this blog post) that allows you to play with the numbers representing A_infra, A, and N (they are formatted in red), and see for yourself how it affects the resulting availability values.
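For readers who want to check the numbers without the spreadsheet, here is a short PowerShell sketch that evaluates both formulas for the example inputs above:

# Compare a single datacenter with 2N copies against two datacenters with N copies each.
$A      = 0.90     # availability of a single database copy
$Ainfra = 0.999    # availability of the datacenter infrastructure
$N      = 2        # database copies per site in the site resilient case

$AS  = $Ainfra * (1 - [math]::Pow(1 - $A, 2 * $N))                      # single datacenter, 2N copies
$ASR = 1 - [math]::Pow(1 - $Ainfra * (1 - [math]::Pow(1 - $A, $N)), 2)  # two datacenters, N copies each

"Single datacenter : {0:P6}" -f $AS           # ~99.890010%
"Two datacenters   : {0:P6}" -f $ASR          # ~99.987922%
"Difference        : {0:P6}" -f ($ASR - $AS)  # ~0.097912%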
Summary

Deploying a site resilient design increases the availability of a solution, but the benefit of site resilience diminishes if a single datacenter solution has a high level of availability by itself. Using the formulas above, you can calculate exact availability levels for your specific scenarios if you use proper input values.

Note: To avoid confusion, everywhere above we are talking about planned availability. This purely theoretical value demonstrates what can be expected of a given solution. By comparison, the actually observed availability is a statistical result; in actual operations, you might observe better or worse availability values, but the averages over a long period of monitoring should be close to the theoretical values.

Acknowledgement: The author is grateful to Ramon Infante, Director of Worldwide Messaging Community at Microsoft, and Jeffrey Rosen, Solution Architect and US Messaging Community Lead, for helpful and stimulating discussions.

Boris Lokhvitsky
Delivery Architect
Microsoft Consulting Services

Analyzing Exchange Transaction Log Generation Statistics
Update 1/31/2017: Please see the updated version of this post that explains a significant update to this script. To download the script, see the attachment to this blog post.

Overview

When designing a site resilient Exchange Server solution, one of the required planning tasks is to determine how many transaction logs are generated on an hourly basis. This helps figure out how much bandwidth will be required when replicating database copies between sites, and what the effects will be of adding additional database copies to the solution. If designing an Exchange solution using the Exchange Server Role Requirements Calculator, the percent of logs generated per hour is an optional input field.

Previously, the most common method of collecting this data involved taking captures of the files in each log directory on a scheduled basis (using dir, Get-ChildItem, or CollectLogs.vbs). Although the log number could be extracted by looking at the names of the log files, there was a lot of manual work involved in figuring out the highest log generation from each capture, and getting rid of duplicate entries. Once cleaned up, the data still had to be analyzed manually using a spreadsheet or a calculator. Trying to gather data across multiple servers and databases further complicated matters.

To improve upon this situation, I decided to write an all-in-one script that could collect transaction log statistics, and analyze them after collection. The script is called GetTransactionLogStats.ps1. It has two modes: Gather and Analyze.

Gather mode is designed to be run on an hourly basis, at the top of the hour. When run, it will take a single set of snapshots of the current log generation number for all configured databases. These snapshots will be sent, along with the time the snapshots were taken, to an output file, LogStats.csv. Each subsequent time the script is run in Gather mode, another set of snapshots will be appended to the file.

Analyze mode is used to process the snapshots that were taken in Gather mode, and should be run after a sufficient number of snapshots have been collected (at least 2 weeks of data is recommended). When run, it compares the log generation number in each snapshot to the previous snapshot to determine how many logs were created during that period.

Script Features

Less Data to Collect

Instead of looking at the files within log directories, the script uses Perfmon to get the current log file generation number for a specific database or storage group. This number, along with the time it was obtained, is the only information kept in the output log file, LogStats.csv. The performance counters that are used are as follows:

- Exchange 2013/2016: MSExchangeIS HA Active Database\Current Log Generation Number
- Exchange 2010: MSExchange Database ==> Instances\Log File Current Generation

Note: The counter used for Exchange 2013/2016 contains the active databases on that server, as well as any now passive databases that had been activated on that server at some point since the last reboot. The counter used for Exchange 2010 contains all databases on that server, including all passive copies. To only get data from active databases, make sure to manually specify the databases for that server in the TargetServers.txt file. Alternately, you can use the DontAnalyzeInactiveDatabases parameter when performing the analysis to exclude databases that did not increment their log count.
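For reference, here is a hedged sketch of reading the same counter manually with Get-Counter. The server name is hypothetical, and on Exchange 2013/2016 the counter exposes one instance per database; swap in the Exchange 2010 counter path listed above if needed.

# Read the current log generation number for every database instance on an Exchange 2013/2016 server.
Get-Counter -ComputerName EXCH01 -Counter "\MSExchangeIS HA Active Database(*)\Current Log Generation Number" |
    Select-Object -ExpandProperty CounterSamples |
    Select-Object InstanceName, CookedValue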
Multi Server/Database Support

The script takes a simple input file, TargetServers.txt, where each line in the file specifies the server, or server and databases, to process. If you want to get statistics for all databases on a server, only the server name is necessary. If you only want to get a subset of databases on a server (for instance, if you wanted to omit secondary copies on an Exchange 2010 server), then you can specify the server name, followed by each database you want to process.

Built In Analysis Capability

The script has the ability to analyze the output log file, LogStats.csv, that was created when run in Gather mode. It does a number of common calculations for you, but also leaves the original data in case any other calculations need to be done. Output from running in Analyze mode is sent to multiple .CSV files, where one file is created for each database, and one more file is created containing the average statistics for all analyzed databases. The following columns are added to the CSV files:

- Hour: The hour that log stats are being gathered for. Can be between 0 – 23.
- TotalLogsCreated: The total number of logs created during that hour for all days present in LogStats.csv.
- TotalSampleIntervalSeconds: The total number of seconds between each valid pair of samples for that hour. Because the script gathers Perfmon data over the network, the sample interval may not always be exactly one hour.
- NumberOfSamples: The number of times that the log generation was sampled for the given hour.
- AverageSample: The average number of logs generated for that hour, regardless of sample interval size. Formula: TotalLogsCreated / NumberOfSamples.
- PercentDailyUsage: The percent of all logs that that particular hour accounts for. Formula: LogsCreatedForHour / LogsCreatedForAllHours * 100.
- PercentDailyUsageForCalc: The ratio of all logs for this hour compared to all logs for all hours. Formula: LogsCreatedForHour / LogsCreatedForAllHours.
- AverageSamplePer60Minutes: Similar to AverageSample, but adjusts the value as if each sample was taken exactly 60 minutes apart. Formula: TotalLogsCreated / TotalSampleIntervalSeconds * 3600.

Database Heat Map

As of version 2.0, this script now also generates a database heat map when run in Analyze mode. The heat map shows how many logs were generated for each database during the duration of the collection. This information can be used to figure out if databases, servers, or entire Database Availability Groups are over or underutilized compared to their peers. The database heat map consists of two files:

- HeatMap-AllCopies.csv: A heat map of all tracked databases, including databases that may have failed over during the collection duration and were tracked on multiple servers. This heat map shows the server specific instance of each database.
- HeatMap-DBsCombined.csv: A heat map containing only a single instance of each unique database. In cases where multiple copies of the same database had generated logs, the log count from each will be combined into a single value.

Requirements

The script has the following requirements:

- Target Exchange Servers must be running Exchange 2010, 2013, or 2016.
- PowerShell Remoting must be enabled on the target Exchange Servers, and configured to allow connections from the machine where the script is being executed.

Parameters

The script has the following parameters:

- -Gather: Switch specifying we want to capture current log generations. If this switch is omitted, the -Analyze switch must be used.
- -Analyze: Switch specifying we want to analyze already captured data. If this switch is omitted, the -Gather switch must be used.
- -ResetStats: Switch indicating that the output file, LogStats.csv, should be cleared and reset. Only works if combined with –Gather.
- -WorkingDirectory: The directory containing TargetServers.txt and LogStats.csv. If omitted, the working directory will be the current working directory of PowerShell (not necessarily the directory the script is in).
- -LogDirectoryOut: The directory to send the output log files from running in Analyze mode to. If omitted, logs will be sent to WorkingDirectory.
- -MaxSampleIntervalVariance: The maximum number of minutes that the duration between two samples can vary from 60. If we are past this amount, the sample will be discarded. Defaults to a value of 10.
- -MaxMinutesPastTheHour: How many minutes past the top of the hour a sample can be taken. Samples past this amount will be discarded. Defaults to a value of 15.
- -MonitoringExchange2013: Whether there are Exchange 2013/2016 servers configured in TargetServers.txt. Defaults to $true. If there are no 2013/2016 servers being monitored, set this to $false to increase performance.
- -DontAnalyzeInactiveDatabases: When running in Analyze mode, this specifies that any databases that have been found that did not generate any logs during the collection duration will be excluded from the analysis. This is useful in excluding passive databases from the analysis.

Usage

Runs the script in Gather mode, taking a single snapshot of the current log generation of all configured databases:

PS C:\> .\GetTransactionLogStats.ps1 -Gather

Runs the script in Gather mode, and indicates that no Exchange 2013/2016 servers are configured in TargetServers.txt:

PS C:\> .\GetTransactionLogStats.ps1 -Gather -MonitoringExchange2013 $false

Runs the script in Gather mode, changes the directory where TargetServers.txt is located and where LogStats.csv will be written to, and resets LogStats.csv:

PS C:\> .\GetTransactionLogStats.ps1 -Gather -WorkingDirectory "C:\GetTransactionLogStats" -ResetStats

Runs the script in Analyze mode:

PS C:\> .\GetTransactionLogStats.ps1 -Analyze

Runs the script in Analyze mode, and excludes database copies that did not generate any logs during the collection duration:

PS C:\> .\GetTransactionLogStats.ps1 -Analyze -DontAnalyzeInactiveDatabases $true

Runs the script in Analyze mode, sending the output files for the analysis to a different directory. Specifies that only sample durations between 55-65 minutes are valid, and that each sample can be taken a maximum of 10 minutes past the hour before being discarded:

PS C:\> .\GetTransactionLogStats.ps1 -Analyze -LogDirectoryOut "C:\GetTransactionLogStats\LogsOut" -MaxSampleIntervalVariance 5 -MaxMinutesPastTheHour 10

Example TargetServers.txt

The following example shows what the TargetServers.txt input file should look like. For the server1 and server3 lines, no databases are specified, which means that all databases on the server will be sampled. For the server2 and server4 lines, we will only sample the specified databases on those servers. Note that no quotes are necessary for databases with spaces in their names.
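The original post showed the file as a screenshot. As a hedged reconstruction of the description above, a file might look like the sketch below; the server and database names are hypothetical, and the comma delimiter on the server-plus-databases lines is an assumption about the script's expected format.

server1
server2,Mailbox Database 01,Mailbox Database 02
server3
server4,Mailbox Database 03,Mailbox Database 04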
Output File After Running in Gather Mode

When run in Gather mode, the log generation snapshots that are taken are sent to LogStats.csv. The following shows what this file looks like:

Output File After Running in Analyze Mode

The following shows the analysis for a single database after running the script in Analyze mode:

Running As a Scheduled Task

Since the script is designed to be run on an hourly basis, the easiest way to accomplish that is to run the script via a Scheduled Task. The way I like to do that is to create a batch file which calls Powershell.exe and launches the script, and then create a Scheduled Task which runs the batch file. The following is an example of the command that should go in the batch file:

powershell.exe -noninteractive -noprofile -command "& {C:\LogStats\GetTransactionLogStats.ps1 -Gather -WorkingDirectory C:\LogStats}"

In this example, the script, as well as TargetServers.txt, are located in C:\LogStats. Note that I specified a WorkingDirectory of C:\LogStats so that if the Scheduled Task runs in an alternate location (by default C:\Windows\System32), the script knows where to find TargetServers.txt and where to write LogStats.csv. Also note that the command does not load any Exchange snapin, as the script doesn’t use any Exchange specific commands.

Notes

The following information only applies to versions of this script older than 2.0: By default, the Windows Firewall on an Exchange 2013 server running on Windows Server 2012 does not allow remote Perfmon access. I suspect this is also the case with Exchange 2013 running on Windows Server 2008 R2, but haven’t tested. If either of the below errors is logged, you may need to open the Windows Firewall on these servers to allow access from the computer running the script.

ERROR: Failed to read perfmon counter from server SERVERNAME
ERROR: Failed to get perfmon counters from server SERVERNAME

Update: After noticing that multiple people were having issues getting this to work through the Windows Firewall, I tried enabling different combinations of built in firewall rules until I could figure out which ones were required. I only tested on an Exchange 2013 server running on Windows Server 2012, but this should apply to other Windows versions as well. The rules I had to enable were:

- File and Printer Sharing (NB-Datagram-In)
- File and Printer Sharing (NB-Name-In)
- File and Printer Sharing (NB-Session-In)

Mike Hendrickson

Updates

11/5/2013: Added a section on firewall rules to try.

7/17/2014: Added a section on running as a scheduled task.

3/28/2016 Version 2.0:
- Instead of running Get-Counter -ComputerName to remotely access Perfmon counters, the script now uses PowerShell Remoting, specifically Invoke-Command -ComputerName, so that all counter collection is done locally on each target server. This significantly speeds up the collection duration.
- The script now supports using the -Verbose switch to provide information during script execution.
- Per Thomas Stensitzki's script variation, added in functionality so that DateTimes can be properly parsed on non-English (US) based computers.
- Added functionality to generate a database heat map based on log usage.

6/22/2016 Version 2.1:
- When run in -Gather mode, the script now uses Test-WSMan against each target computer to verify Remote PowerShell connectivity prior to doing the log collection.
- Added a new column to the log analysis files, PercentDailyUsageForCalc, which allows for direct copy/paste into the Exchange Server Role Requirements Calculator. Additionally, the script will try to ensure that all rows in the column add up to exactly 1 (requires samples from all 24 hours of the day).
- Significantly increased performance of analysis operations.

Responding to Managed Availability
I’ve written a few blog posts now that get into the deep technical details of Managed Availability. I hope you’ve liked them, and I’m not about to stop! However, I’ve gotten a lot of feedback that we also need some simpler overview articles. Fortunately, we’ve just completed documentation on TechNet with an overview of Managed Availability. This was written to address how the feature may be managed day-to-day.

Even that documentation doesn’t address how you respond when Managed Availability cannot resolve a problem on its own. This is the most common interaction with Managed Availability, but we haven’t described specifically how to do it. When Managed Availability is unable to recover the health of a server, it logs an event. Exchange Server has a long history of logging warning, error, and critical events into various channels when things go wrong. However, there are two things about Managed Availability events that make them more generally useful than our other error events:

- They all go to the same place on a server without any clutter
- They will only be logged when the standard recovery actions fail to restore the health of the component

When one of these events is logged on any server in our datacenters, a member of the product group team responsible for that health set gets an immediate phone call. No one likes to wake up at 2 AM to investigate and fix a problem with a server. This keeps us motivated to only have Managed Availability alerts for problems that really matter, and also to eliminate the cause of the alert by fixing underlying code bugs or automating the recovery. At the same time, there is nothing worse than finding out about incidents from customer calls to support. Every time that happens we have painful meetings about how we should have detected the condition first and woken someone up. These two conflicting forces strongly motivate the entire engineering team to keep these events accurate and useful.

The GUI

Along with a phone call, the on-call engineer receives an email with some information about the failure. The contents of this email are pulled from the event’s description. The path in Event Viewer for these events is Microsoft-Exchange-ManagedAvailability/Monitoring. Error event 4 means that a health set has failed and gives the details of the monitor that has detected the failure. Information event 1 means that all monitors of a health set have become healthy. The Exchange 2013 Management Pack for System Center Operations Manager nicely shows only the health sets that are currently failed, instead of the Event Viewer’s method of displaying all health sets that have ever failed. SCOM will also roll up health sets into four primary health groups or three views.

The Shell

This wouldn’t be EHLO without some in-depth PowerShell scripts. The event viewer is nice and SCOM is great, but not everyone has SCOM. It would be pretty sweet to get the same behavior as SCOM and show only the health sets on a server that are currently failed.

Note: these logs serve a slightly different purpose than Get-HealthReport. Get-HealthReport shows the current health state of all of a server’s monitors. On the other hand, events are only logged in this channel once all the recovery actions for that monitor have been exhausted without fixing the problem. Also know that these events detail the failure. If you’re only going to take action based on one health metric, the events in this log are the better one. Get-HealthReport is still the best tool to show you the up-to-the-minute user experience.
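Before looking at the sample script, here is a minimal sketch of pulling the raw events from this channel with Get-WinEvent. Run it locally on a server, or wrap it in Invoke-Command for remote servers; it is an illustration rather than a replacement for the script discussed below.

# Error event 4 = a health set failed and recovery actions did not restore it.
# Information event 1 = all monitors in a health set are healthy again.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Exchange-ManagedAvailability/Monitoring'
    Id      = 4
} -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message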
We have a sample script that can help you with this; it is commented in a way that you can see what we were trying to accomplish. You can get the Get-ManagedAvailabilityAlerts.ps1 script as an attachment to this blog post. Either this method or Event Viewer will work pretty well for a handful of servers. If you have tens or hundreds of servers, we really recommend investing in SCOM or another robust and scalable event-collection system.

My other posts have dug deeply into troubleshooting difficult problems, and how Managed Availability gives an overwhelmingly immense amount of information about a server’s health. We rarely need to use these troubleshooting methods when running our datacenters. However, the only thing you need to resolve Exchange problems the way we do in Office 365 is a little bit of Event Viewer or a scheduled script.

Abram Jackson
Program Manager, Exchange Server

Exchange 2010 datacenter switchover tool now available
Exchange 2010 includes a feature called Datacenter Activation Coordination (DAC) mode that is designed to prevent split brain at the database level during switchback procedures that are performed after a datacenter switchover has occurred. One of the side benefits of enabling DAC mode is that it enables you to use the built-in recovery cmdlets to perform the datacenter switchover and switchback. In the real world, there are several different factors that determine what commands to run and when to run them. For example:

- Are Exchange Servers available in the primary datacenter?
- Is network connectivity available between the primary and remote datacenter?
- Is Exchange deployed in a topology with a single Active Directory site or multiple sites?

The answers to these questions determine not only the specific commands to run but also where the commands should be run. In addition, administrators need to understand what the desired outcomes of those commands are. For example:

- How do I verify that Stop-DatabaseAvailabilityGroup was successful?
- How do I verify that Restore-DatabaseAvailabilityGroup performed the correct steps?
- When is it appropriate to run Start-DatabaseAvailabilityGroup?

Each of these requires a different set of verification steps before proceeding. And of course, as with any process, there are those occasional expected errors. With this in mind, I want to introduce the Datacenter Switchover Tool, a kiosk-based PowerPoint application that allows administrators to work through the flow of questions to determine:

- What commands to run and where to run them
- How to verify the commands completed successfully
- How to walk through a datacenter switchover from the Mailbox server / database availability group perspective

To use the tool, simply download it and open it in PowerPoint. Make sure to use only the buttons that are available on the screen. The tool will walk you through the correct questions, in the correct order, and provide feedback on the commands to execute and their verification.
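For orientation only, here is a hedged sketch of the kind of command sequence the tool helps you reason through during a switchover and switchback. The DAG and site names are hypothetical, and the correct commands, parameters, and order for your topology depend on the answers to the questions above - follow the tool's prompts rather than this sketch.

# DAC mode must already be enabled on the DAG for the built-in recovery cmdlets to be used this way.
Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode DagOnly

# 1. Mark the failed primary datacenter's DAG members as stopped.
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite PrimarySite -ConfigurationOnly

# 2. Activate the DAG members in the surviving datacenter.
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite SecondarySite

# 3. Later, once the primary datacenter is healthy again, bring its members back into the DAG.
Start-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite PrimarySite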
The tool can be found as an attachment to this blog post. Note: We have now also published the Exchange Server 2010 DAG Switchover GWT (Guided Walkthrough). You can use that as an additional resource to help you get through the switchover process. Enjoy!

Tim McMichael

Exchange Server Role Requirements Calculator Update

v7.8 of the calculator introduces support for Exchange 2016! Yes, that’s right, you don’t need a separate calculator; v7.8 and later supports Exchange 2013 or Exchange 2016 deployments. Moving forward, the calculator is branded as the Exchange Server Role Requirements Calculator. When you open the calculator you will find a new drop-down option on the Input tab that allows you to select the deployment version. Simply choose 2013 or 2016.

When you choose 2016, you will notice the Server Multi-Role Configuration option is disabled, due to the fact that Exchange 2016 no longer provides the Client Access Server role. As discussed in the Exchange 2016 Architecture and Preferred Architecture articles, the volume format best practice recommendation for Exchange data volumes has changed in Exchange 2016, as we now recommend ReFS (with the integrity feature disabled). By default, for Exchange 2016 deployments, the calculator scripts will default to ReFS (Exchange 2013 deployments will default to NTFS). This is exposed in the Export Mount Points File dialog.

The DiskPart.ps1 and CreateDag.ps1 scripts have been updated to support formatting the volume as ReFS (and disabling the integrity feature at the volume level) and enabling AutoReseed support for ReFS. This release also improves the inputs of all dialogs for the distribution scripts by persisting values across the various dialogs (e.g., global catalog values).

For all the other improvements and bug fixes, please review the readme or download the update. As always we welcome feedback and please report any issues you may encounter while using the calculator by emailing strgcalc AT microsoft DOT com.

Ross Smith IV
Principal Program Manager
Office 365 Customer Experience

The case of Replay Lag Manager not letting lagged copy lag
In a previous blog post, Ross Smith IV explained what the Replay Lag Manager is and what it does. It's a great feature that's somewhat underappreciated. We've seen a few support cases that seemed to have been opened out of a misunderstanding of what the Replay Lag Manager is doing. I wanted to cover a real world scenario I dealt with recently with a customer that I believe will clarify some things.

What is the Replay Lag Manager?

In a nutshell, Replay Lag Manager provides higher availability for Exchange through the automatic invocation of a lagged database copy. To further explain, a lagged database copy is a database copy that Exchange delays committing changes to for a specified period of time. The Replay Lag Manager was first introduced in Exchange 2013 and is actually enabled by default beginning with Exchange 2016 CU1. To understand what it is, let's look at the Preferred Architecture (PA) in regards to a database layout. The PA uses 4 database copies like the following:

As you can see, the 4th copy is a lagged copy. Even though we're showing it in a secondary site, it can exist in any site where a node in the same DAG resides. The Replay Lag Manager will constantly watch for any of three things to happen to the copies of DB1. Ross Smith's post does a wonderful job of explaining them and how Exchange will take other factors (e.g., disk IO) into consideration before invoking the lagged copy. In general, a log play down will occur:

- When a low disk space threshold (10,000MB) is reached
- When the lagged DB copy has physical corruption and needs to be page patched
- When there are fewer than three available healthy HA copies for more than 24 hours

A log "play down" essentially means that Replay Lag Manager is going to force that lagged database copy to catch up on all of the changes to make that copy current. By doing this it ensures that Exchange maintains at least 3 copies of each database.

When things are less than perfect…

In the real world we don't always see Exchange set up according to our Preferred Architecture because of environment constraints or business requirements. There was a recent case that was the best example of the Replay Lag Manager working in the real world. The customer had over 100 DBs, all with 6 copies each. There were 3 copies in the main site and 3 copies in the Disaster Recovery site, with one of those copies at each site being lagged. The DB copies were configured like this for all databases.

As you can see, in this particular instance the lagged copy at Site A was being forced to play down while the other copy showed a Replay Queue Length (RQL) of 4919. This case was opened due to the fact that the lagged DB copy at Site A was not lagging. The customer stated that the DB was lagging fine until recently. However, after a quick check of the Replay Queue Length counter in the Daily Performance Logs, it didn't appear that this copy had ever lagged successfully. So, what we're seeing is that the database has 6 copies, 2 of them lagged, but 1 of those lagged copies isn't lagging.

Naturally, you may try removing the lag by setting the -ReplayLagTime to 0 and then changing it back to 7 (or what it was before). You may even try recreating the database copy thinking something was wrong with it. These still don't cause Exchange to lag this copy. The next step is to check if it's actually the Replay Lag Manager causing the log play down. You can quickly see this by running the following command, specifying the lagged DB\Server Name.
In this example we will use SERVER3 as the server hosting the lagged copy of DB1.

Get-MailboxDatabaseCopyStatus DB1\SERVER3 | Select Id,ReplayLagStatus

Id : DB1\SERVER3
ReplayLagStatus : Enabled:False; PlayDownReason:LagDisabled; ReplaySuspendReason:None; Percentage:0; Configured:7.00:00:00; MaxDelay:1.00:00:00; Actual:00:01:22

What we see is that the ReplayLagStatus is actually disabled and the PlayDownReason is LagDisabled. That tells us it's disabled, but it doesn't really give us more detail as to why. We can dig further by looking at the Microsoft-Exchange/HighAvailability log, where we see a pattern of 3 events. The first event we encounter is the 708, but it doesn't give us any more information than the previous command does.

Time: 11/31/2017 3:32:55 PM
ID: 708
Level: Information
Source: Microsoft-Exchange-HighAvailability
Machine: server3.domain.com
Message: Log Replay for database 'DB1' is replaying logs in the replay lag range. Reason: Replay lag has been disabled. (LogFileAge=00:06:00.8929066, ReasonCode=LagDisabled)

The second event we see has a little more information. At this point we know for sure it's the Replay Lag Manager because of its FastLagPlaydownDesired status.

Time: 11/31/2017 3:32:55 PM
ID: 2001
Level: Warning
Source: Microsoft-Exchange-HighAvailability
Machine: server3.domain.com
Message: Database scanning during passive replay is disabled on 'DB1'. Explanation: FastLagPlaydownDesired.

On the third event we see the 738, which actually explains what's going on here.

Time: 11/30/2017 1:50:15 PM
ID: 738
Level: Information
Source: Microsoft-Exchange-HighAvailability
Machine: server3.domain.com
Message: Replay Lag Manager suppressed a request to disable replay lag for database copy 'DB1\SERVER3' after a suppression interval of 1.00:00:00. Disable Reason: There were database availability check failures for database 'DB1' that may be lowering its availability. Availability Count: 3. Expected Availability Count: 3. Detailed error(s): SERVER4: Server 'server4.domain.com' has database copy auto activation policy configuration of 'Blocked'. SERVER5: Server 'server5.domain.com' has database copy auto activation policy configuration of 'Blocked'. SERVER6: Server 'server6.domain.com' has database copy auto activation policy configuration of 'Blocked'.

The "Availability Count: 3. Expected Availability Count: 3." is a tad confusing, but the heart of the issue is in the detailed errors below that…

It's Replay Lag Manager doing it… but why?

The entire reason for this blog post comes out of the fact that we've seen the Replay Lag Manager blamed for not letting a lagged copy lag. So, the next step someone will do is to disable it. Please don't do that! It only wants to help! Let's look at how we can resolve our example above. The logs are showing that it's expecting 3 copies but there aren't 3 available. How can that be? They have at least 4 copies of this database available?!? If we run the following command we see a hint at the culprit.

Get-MailboxDatabaseCopyStatus DB1 | Select Identity,AutoActivationPolicy

Identity       AutoActivationPolicy
--------       --------------------
DB1\SERVER1    Unrestricted
DB1\SERVER2    Unrestricted
DB1\SERVER3    Unrestricted - Lagged Copy (Not lagging)
DB1\SERVER4    Blocked
DB1\SERVER5    Blocked
DB1\SERVER6    Blocked - Lagged Copy (Working)

There it is! There are 6 database copies; however, the copies in Site B are all blocked due to the AutoActivationPolicy. Now things are starting to make sense.
In the eyes of the Replay Lag Manager, those copies in Site B are not available because Exchange cannot activate them automatically. So, what's happening is that the Replay Lag Manager only sees the 2 copies (the two non-lagged copies at Site A) as available. Therefore, it forces a play down of the logs on the lagged copy to maintain its 3 available copies. That explains why the lagged copy at Site A isn't lagging, but why is the lagged copy at Site B working fine? This is because, from the perspective of that database, there are 3 available copies in Site A once that lagged copy is played down.

That's cool… how do I fix it?

There are essentially two ways to resolve this example and allow that lagged copy at Site A to properly lag. The first way is to revisit the decision to block Auto Activation at Site B. The mindset in this particular instance was that their other site was actually for Disaster Recovery. They wanted some manual intervention if databases needed to fail over to the DR site. That's all well and good, but it doesn't allow a lagged copy at Site A to work properly due to the Replay Lag Manager. The customer did actually end up allowing 1 copy at the DR site (Site B in our example) for Auto Activation. To do this you can run the following command (note that the auto activation policy is a server-level setting):

Set-MailboxServer SERVER4 -DatabaseCopyAutoActivationPolicy Unrestricted

The other option here would be to create another database copy at Site A. Obviously, that's going to require a lot more effort and storage. However, doing this would allow the Replay Lag Manager to resume lagging on the lagged database copy.

I hope this post clarifies some things in regards to the Replay Lag Manager. It's a great feature that will provide some automation in keeping your Exchange databases highly available.

Michael Schatte

DAG Activation Preference Behavior Change in Exchange Server 2016 CU2
Every copy of a mailbox database in a DAG is assigned an activation preference number. This number is used by the system as part of the passive database activation process, and by administrators when performing database balancing operations for a DAG. This number is expressed as the ActivationPreference property of a mailbox database copy. The value for the ActivationPreference property is a number equal to or greater than 1, where 1 is at the top of the preference order. When a DAG is first implemented, by default all active database copies have an ActivationPreference of 1.

However, due to the inherent nature of DAGs (e.g., databases experience switchovers and failovers), active mailbox database copies will change hosts several times throughout a DAG's lifetime. As a result of this inherent behavior, a mailbox database may remain active on a database copy which is not the most preferred copy. Prior to Exchange 2016 Cumulative Update 2 (CU2), Exchange Server administrators had to either manually activate their preferred database copy, or use the RedistributeActiveDatabases.ps1 script to balance the database copies across a DAG.

Starting with CU2 (which will be releasing soon), the ability for the Primary Active Manager in the DAG to perform periodic discretionary moves that activate the copy the administrator has defined as most preferred is now built into the product. A new DAG property called PreferenceMoveFrequency has been added that defines the frequency (measured in time) at which the Microsoft Exchange Replication service will rebalance the database copies by performing a lossless switchover that activates the copy with an ActivationPreference of 1 (assuming the target server and database copy are healthy).

Note: In order to take advantage of this feature, ensure all Mailbox servers within the DAG are upgraded to Exchange 2016 CU2.

By default, the Replication service will inspect the database copies and perform a rebalance every one hour. You can modify this behavior using the following command:

Set-DatabaseAvailabilityGroup <Name> -PreferenceMoveFrequency <value in the format of 00:00:00>

To disable this behavior, configure the PreferenceMoveFrequency value to ([System.Threading.Timeout]::InfiniteTimeSpan). If you are leaving the behavior enabled, and you have created a scheduled task to execute RedistributeActiveDatabases.ps1, you can remove the scheduled task after upgrading the DAG to CU2. We recommend taking advantage of this behavior to ensure that your DAG remains optimally balanced. This feature continues our work to improve the Preferred Architecture by ensuring that users have the best possible experience on Exchange Server. As always, we welcome your feedback.

Ross Smith IV
Principal Program Manager
Office 365 Customer Experience

Updates

6/21/16: Updated information on how to disable PreferenceMoveFrequency without requiring a Replication service restart. If you set it to [Timespan]::Zero, you will need to cycle the Replication service.
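For reference, a hedged sketch of inspecting and adjusting the setting; the DAG name and the 4 hour interval are hypothetical, and the disable value is the one given above.

# Check the current rebalance frequency on the DAG.
Get-DatabaseAvailabilityGroup -Identity DAG1 | Format-List Name, PreferenceMoveFrequency

# Example: rebalance every 4 hours instead of the default of 1 hour.
Set-DatabaseAvailabilityGroup -Identity DAG1 -PreferenceMoveFrequency 04:00:00

# Disable the periodic rebalance entirely.
Set-DatabaseAvailabilityGroup -Identity DAG1 -PreferenceMoveFrequency ([System.Threading.Timeout]::InfiniteTimeSpan)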