high availability
The Exchange 2016 Preferred Architecture
The Preferred Architecture (PA) is the Exchange Engineering Team’s best practice recommendation for what we believe is the optimum deployment architecture for Exchange 2016, and one that is very similar to what we deploy in Office 365. While Exchange 2016 offers a wide variety of architectural choices for on-premises deployments, the architecture discussed below is our most scrutinized one ever. While there are other supported deployment architectures, they are not recommended. The PA is designed with several business requirements in mind, such as the requirement that the architecture be able to: Include both high availability within the datacenter, and site resilience between datacenters Support multiple copies of each database, thereby allowing for quick activation Reduce the cost of the messaging infrastructure Increase availability by optimizing around failure domains and reducing complexity The specific prescriptive nature of the PA means of course that not every customer will be able to deploy it (for example, customers without multiple datacenters). And some of our customers have different business requirements or other needs which necessitate a different architecture. If you fall into those categories, and you want to deploy Exchange on-premises, there are still advantages to adhering as closely as possible to the PA, and deviate only where your requirements widely differ. Alternatively, you can consider Office 365 where you can take advantage of the PA without having to deploy or manage servers. The PA removes complexity and redundancy where necessary to drive the architecture to a predictable recovery model: when a failure occurs, another copy of the affected database is activated. The PA is divided into four areas of focus: Namespace design Datacenter design Server design DAG design Namespace Design In the Namespace Planning and Load Balancing Principles articles, I outlined the various configuration choices that are available with Exchange 2016. For the namespace, the choices are to either deploy a bound namespace (having a preference for the users to operate out of a specific datacenter) or an unbound namespace (having the users connect to any datacenter without preference). The recommended approach is to utilize the unbounded model, deploying a single Exchange namespace per client protocol for the site resilient datacenter pair (where each datacenter is assumed to represent its own Active Directory site - see more details on that below). For example: autodiscover.contoso.com For HTTP clients: mail.contoso.com For IMAP clients: imap.contoso.com For SMTP clients: smtp.contoso.com Each Exchange namespace is load balanced across both datacenters in a layer 7 configuration that does not leverage session affinity, resulting in fifty percent of traffic being proxied between datacenters. Traffic is equally distributed across the datacenters in the site resilient pair, via round robin DNS, geo-DNS, or other similar solutions. From our perspective, the simpler solution is the least complex and easier to manage, so our recommendation is to leverage round robin DNS. For the Office Online Server farm, a namespace is deployed per datacenter, with the load balancer utilizing layer 7, maintaining session affinity using cookie based persistence. 
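As a rough sketch of what the round robin DNS piece of this design might look like, the HTTP namespace can simply be published with one A record per datacenter load balancer VIP. The zone name and VIP addresses below are placeholders for illustration, not values prescribed by the PA:

# Publish mail.contoso.com with one A record per datacenter VIP so that round robin DNS
# distributes client connections across both datacenters (requires the DnsServer module)
Add-DnsServerResourceRecordA -ZoneName 'contoso.com' -Name 'mail' -IPv4Address '192.0.2.10'
Add-DnsServerResourceRecordA -ZoneName 'contoso.com' -Name 'mail' -IPv4Address '198.51.100.10'
# The same pattern applies to the autodiscover, imap, and smtp namespaces.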
Figure 1: Namespace Design in the Preferred Architecture
In the event that you have multiple site resilient datacenter pairs in your environment, you will need to decide if you want to have a single worldwide namespace, or if you want to control the traffic to each specific datacenter by using regional namespaces. Ultimately your decision depends on your network topology and the associated cost with using an unbound model; for example, if you have datacenters located in North America and Europe, the network link between these regions might not only be costly, but it might also have high latency, which can introduce user pain and operational issues. In that case, it makes sense to deploy a bound model with a separate namespace for each region. However, options like geographical DNS offer you the ability to deploy a single unified namespace, even when you have costly network links; geo-DNS allows you to have your users directed to the closest datacenter based on their client's IP address.
Figure 2: Geo-distributed Unbound Namespace
Site Resilient Datacenter Pair Design
To achieve a highly available and site resilient architecture, you must have two or more datacenters that are well-connected (ideally, you want a low round-trip network latency, otherwise replication and the client experience are adversely affected). In addition, the datacenters should be connected via redundant network paths supplied by different operating carriers. While we support stretching an Active Directory site across multiple datacenters, for the PA we recommend that each datacenter be its own Active Directory site. There are two reasons. First, transport site resilience via Shadow Redundancy and Safety Net can only be achieved when the DAG has members located in more than one Active Directory site. Second, Active Directory has published guidance that states that subnets should be placed in different Active Directory sites when the round trip latency is greater than 10ms between the subnets.
Server Design
In the PA, all servers are physical servers. Physical hardware is deployed rather than virtualized hardware for two reasons. First, the servers are scaled to use 80% of resources during the worst-failure mode. Second, virtualization adds an additional layer of management and complexity, which introduces additional recovery modes that do not add value, particularly since Exchange provides that functionality. Commodity server platforms are used in the PA. Commodity platforms include 2U, dual socket servers (20-24 cores), up to 192GB of memory, a battery-backed write cache controller, and 12 or more large form factor drive bays within the server chassis. Additional drive bays can be deployed per-server depending on the number of mailboxes, mailbox size, and the server's scalability. Each server houses a single RAID1 disk pair for the operating system, Exchange binaries, protocol/client logs, and transport database. The rest of the storage is configured as JBOD, using large capacity 7.2K RPM serially attached SCSI (SAS) disks (while SATA disks are also available, the SAS equivalent provides better IO and a lower annualized failure rate). Each disk that houses an Exchange database is formatted with ReFS (with the integrity feature disabled) and the DAG is configured such that AutoReseed formats the disks with ReFS:
Set-DatabaseAvailabilityGroup <DAG> -FileSystem ReFS
BitLocker is used to encrypt each disk, thereby providing data encryption at rest and mitigating concerns around data theft or disk replacement.
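The DAG FileSystem setting above lets AutoReseed handle formatting for you, but as an illustrative sketch, a single data volume could also be prepared by hand along these lines. The drive letter, volume label, and encryption settings are assumptions for the example, not prescriptive values:

# Format the data volume with ReFS and the integrity feature disabled at the volume level
Format-Volume -DriveLetter E -FileSystem ReFS -SetIntegrityStreams $false -NewFileSystemLabel 'ExVol01' -Confirm:$false
# Encrypt the volume at rest with BitLocker, using a recovery password protector
Enable-BitLocker -MountPoint 'E:' -EncryptionMethod Aes256 -RecoveryPasswordProtector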
For more information, see Enabling BitLocker on Exchange Servers. To ensure that the capacity and IO of each disk is used as efficiently as possible, four database copies are deployed per-disk. The normal run-time copy layout ensures that there is no more than a single active copy per disk. At least one disk in the disk pool is reserved as a hot spare. AutoReseed is enabled and quickly restores database redundancy after a disk failure by activating the hot spare and initiating database copy reseeds. Database Availability Group Design Within each site resilient datacenter pair you will have one or more DAGs. DAG Configuration As with the namespace model, each DAG within the site resilient datacenter pair operates in an unbound model with active copies distributed equally across all servers in the DAG. This model: Ensures that each DAG member’s full stack of services (client connectivity, replication pipeline, transport, etc.) is being validated during normal operations. Distributes the load across as many servers as possible during a failure scenario, thereby only incrementally increasing resource use across the remaining members within the DAG. Each datacenter is symmetrical, with an equal number of DAG members in each datacenter. This means that each DAG has an even number of servers and uses a witness server for quorum maintenance. The DAG is the fundamental building block in Exchange 2016. With respect to DAG size, a larger DAG provides more redundancy and resources. Within the PA, the goal is to deploy larger DAGs (typically starting out with an eight member DAG and increasing the number of servers as required to meet your requirements). You should only create new DAGs when scalability introduces concerns over the existing database copy layout. DAG Network Design The PA leverages a single, non-teamed network interface for both client connectivity and data replication. A single network interface is all that is needed because ultimately our goal is to achieve a standard recovery model regardless of the failure - whether a server failure occurs or a network failure occurs, the result is the same: a database copy is activated on another server within the DAG. This architectural change simplifies the network stack, and obviates the need to manually eliminate heartbeat cross-talk. Note: While your environment may not use IPv6, IPv6 remains enabled per IPv6 support in Exchange. Witness Server Placement Ultimately, the placement of the witness server determines whether the architecture can provide automatic datacenter failover capabilities or whether it will require a manual activation to enable service in the event of a site failure. If your organization has a third location with a network infrastructure that is isolated from network failures that affect the site resilient datacenter pair in which the DAG is deployed, then the recommendation is to deploy the DAG’s witness server in that third location. This configuration gives the DAG the ability to automatically failover databases to the other datacenter in response to a datacenter-level failure event, regardless of which datacenter has the outage. If your organization does not have a third location, consider placing the witness in Azure; alternatively, place the witness server in one of the datacenters within the site resilient datacenter pair. 
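For illustration, pointing a DAG at a witness in a third location (or at a file share witness hosted on an Azure IaaS VM) is a single property change; the server and directory names here are placeholders:

# Configure the witness server and witness directory for the DAG
Set-DatabaseAvailabilityGroup -Identity DAG1 -WitnessServer fsw01.contoso.com -WitnessDirectory 'C:\DAGWitness\DAG1'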
If you have multiple DAGs within the site resilient datacenter pair, then place the witness server for all DAGs in the same datacenter (typically the datacenter where the majority of the users are physically located). Also, make sure the Primary Active Manager (PAM) for each DAG is also located in the same datacenter. Data Resiliency Data resiliency is achieved by deploying multiple database copies. In the PA, database copies are distributed across the site resilient datacenter pair, thereby ensuring that mailbox data is protected from software, hardware and even datacenter failures. Each database has four copies, with two copies in each datacenter, which means at a minimum, the PA requires four servers. Out of these four copies, three of them are configured as highly available. The fourth copy (the copy with the highest Activation Preference number) is configured as a lagged database copy. Due to the server design, each copy of a database is isolated from its other copies, thereby reducing failure domains and increasing the overall availability of the solution as discussed in DAG: Beyond the “A”. The purpose of the lagged database copy is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption. It is not intended for individual mailbox recovery or mailbox item recovery. The lagged database copy is configured with a seven day ReplayLagTime. In addition, the Replay Lag Manager is also enabled to provide dynamic log file play down for lagged copies when availability is compromised. By using the lagged database copy in this manner, it is important to understand that the lagged database copy is not a guaranteed point-in-time backup. The lagged database copy will have an availability threshold, typically around 90%, due to periods where the disk containing a lagged copy is lost due to disk failure, the lagged copy becoming an HA copy (due to automatic play down), as well as, the periods where the lagged database copy is re-building the replay queue. To protect against accidental (or malicious) item deletion, Single Item Recovery or In-Place Hold technologies are used, and the Deleted Item Retention window is set to a value that meets or exceeds any defined item-level recovery SLA. With all of these technologies in play, traditional backups are unnecessary; as a result, the PA leverages Exchange Native Data Protection. Office Online Server Design At a minimum, you will want to deploy two Office Online Servers in each datacenter that hosts Exchange 2016 servers. Each Office Online Server should have 8 processor cores, 32GB of memory and at least 40GB of space dedicated for log files. Note: The Office Online Server infrastructure does not need to be exclusive to Exchange. As such, the hardware guidance takes into account usage by SharePoint and Skype for Business. Be sure to work with any other teams using the Office Online Server infrastructure to ensure the servers are adequately sized for your specific deployment. The Exchange servers within a particular datacenter are configured to use the local Office Online Server farm via the following cmdlet: Set-MailboxServer <East MBX Server> –WACDiscoveryEndPoint https://oos-east.contoso.com/hosting/discovery Summary Exchange Server 2016 continues in the investments introduced in previous versions of Exchange by reducing the server role architecture complexity, aligning with the Preferred Architecture and Office 365 design principles, and improving coexistence with Exchange Server 2013. 
These changes simplify your Exchange deployment, without decreasing the availability or the resiliency of the deployment. And in some scenarios, when compared to previous generations, the PA increases availability and resiliency of your deployment. Ross Smith IV Principal Program Manager Office 365 Customer Experience
The Preferred Architecture
During my session at the recent Microsoft Exchange Conference (MEC), I revealed Microsoft’s preferred architecture (PA) for Exchange Server 2013. The PA is the Exchange Engineering Team’s prescriptive approach to what we believe is the optimum deployment architecture for Exchange 2013, and one that is very similar to what we deploy in Office 365. While Exchange 2013 offers a wide variety of architectural choices for on-premises deployments, the architecture discussed below is our most scrutinized one ever. While there are other supported deployment architectures, other architectures are not recommended. The PA is designed with several business requirements in mind. For example, requirements that the architecture be able to: Include both high availability within the datacenter, and site resilience between datacenters Support multiple copies of each database, thereby allowing for quick activation Reduce the cost of the messaging infrastructure Increase availability by optimizing around failure domains and reducing complexity The specific prescriptive nature of the PA means of course that not every customer will be able to deploy it (for example, customers without multiple datacenters). And some of our customers have different business requirements or other needs, which necessitate an architecture different from that shown here. If you fall into those categories, and you want to deploy Exchange on-premises, there are still advantages to adhering as closely as possible to the PA where possible, and deviate only where your requirements widely differ. Alternatively, you can consider Office 365 where you can take advantage of the PA without having to deploy or manage servers. Before I delve into the PA, I think it is important that you understand a concept that is the cornerstone for this architecture – simplicity. Simplicity Failure happens. There is no technology that can change this. Disks, servers, racks, network appliances, cables, power substations, generators, operating systems, applications (like Exchange), drivers, and other services – there is simply no part of an IT services offering that is not subject to failure. One way to mitigate failure is to build in redundancy. Where one entity is likely to fail, two or more entities are used. This pattern can be observed in Web server arrays, disk arrays, and the like. But redundancy by itself can be prohibitively expensive (simple multiplication of cost). For example, the cost and complexity of the SAN based storage system that was at the heart of Exchange until the 2007 release, drove the Exchange Team to step up its investment in the storage stack and to evolve the Exchange application to integrate the important elements of storage directly into its architecture. We recognized that every SAN system would ultimately fail, and that implementing a highly redundant system using SAN technology would be cost-prohibitive. In response, Exchange has evolved from requiring expensive, scaled-up, high-performance SAN storage and related peripherals, to now being able to run on cheap, scaled-out servers with commodity, low-performance SAS/SATA drives in a JBOD configuration with commodity disk controllers. This architecture enables Exchange to be resilient to any storage related failure, while enabling you to deploy large mailboxes at a reasonable cost. By building the replication architecture into Exchange and optimizing Exchange for commodity storage, the failure mode is predictable from a storage perspective. 
This approach does not stop at the storage layer; redundant NICs, power supplies, etc., can also be removed from the server hardware. Whether it is a disk, controller, or motherboard that fails, the end result should be the same, another database copy is activated and takes over. The more complex the hardware or software architecture, the more unpredictable failure events can be. Managing failure at any scale is all about making recovery predictable, which drives the necessity to having predictable failure modes. Examples of complex redundancy are active/passive network appliance pairs, aggregation points on the network with complex routing configurations, network teaming, RAID, multiple fiber pathways, etc. Removing complex redundancy seems unintuitive on its face – how can removing redundancy increase availability? Moving away from complex redundancy models to a software-based redundancy model, creates a predictable failure mode. The PA removes complexity and redundancy where necessary to drive the architecture to a predictable recovery model: when a failure occurs, another copy of the affected database is activated. The PA is divided into four areas of focus: Namespace design Datacenter design Server design DAG design Namespace Design In the Namespace Planning and Load Balancing Principles articles, I outlined the various configuration choices that are available with Exchange 2013. From a namespace perspective, the choices are to either deploy a bound namespace (having a preference for the users to operate out of a specific datacenter) or an unbound namespace (having the users connect to any datacenter without preference). The recommended approach is to utilize the unbound model, deploying a single namespace per client protocol for the site resilient datacenter pair (where each datacenter is assumed to represent its own Active Directory site - see more details on that below). For example: autodiscover.contoso.com For HTTP clients: mail.contoso.com For IMAP clients: imap.contoso.com For SMTP clients: smtp.contoso.com Figure 1: Namespace Design Each namespace is load balanced across both datacenters in a configuration that does not leverage session affinity, resulting in fifty percent of traffic being proxied between datacenters. Traffic is equally distributed across the datacenters in the site resilient pair, via DNS round-robin, geo-DNS, or other similar solution you may have at your disposal. Though from our perspective, the simpler solution is the least complex and easier to manage, so our recommendation is to leverage DNS round-robin. In the event that you have multiple site resilient datacenter pairs in your environment, you will need to decide if you want to have a single worldwide namespace, or if you want to control the traffic to each specific datacenter pair by using regional namespaces. Ultimately your decision depends on your network topology and the associated cost with using an unbound model; for example, if you have datacenters located in North America and Europe, the network link between these regions might not only be costly, but it might also have high latency, which can introduce user pain and operational issues. In that case, it makes sense to deploy a bound model with a separate namespace for each region. 
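As a hedged example of what the unbound, single-namespace model described above translates to in configuration terms, the HTTP client endpoints on each server are simply pointed at the shared names. The server name and URLs below are placeholders:

# Point the OWA virtual directory and Autodiscover at the shared unbound namespaces
Get-OwaVirtualDirectory -Server EX01 | Set-OwaVirtualDirectory -InternalUrl 'https://mail.contoso.com/owa' -ExternalUrl 'https://mail.contoso.com/owa'
Set-ClientAccessServer -Identity EX01 -AutoDiscoverServiceInternalUri 'https://autodiscover.contoso.com/Autodiscover/Autodiscover.xml'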
Site Resilient Datacenter Pair Design
To achieve a highly available and site resilient architecture, you must have two or more datacenters that are well-connected (ideally, you want a low round-trip network latency, otherwise replication and the client experience are adversely affected). In addition, the datacenters should be connected via redundant network paths supplied by different operating carriers. While we support stretching an Active Directory site across multiple datacenters, for the PA we recommend having each datacenter be its own Active Directory site. There are two reasons. First, transport site resilience via Shadow Redundancy and Safety Net can only be achieved when the DAG has members located in more than one Active Directory site. Second, Active Directory has published guidance that states that subnets should be placed in different Active Directory sites when the round trip latency is greater than 10ms between the subnets.
Server Design
In the PA, all servers are physical, multi-role servers. Physical hardware is deployed rather than virtualized hardware for two reasons. First, the servers are scaled to utilize eighty percent of resources during the worst-failure mode. Second, virtualization adds an additional layer of management and complexity, which introduces additional recovery modes that do not add value, as Exchange provides equivalent functionality out of the box. By deploying multi-role servers, the architecture is simplified as all servers have the same hardware, installation process, and configuration options. Consistency across servers also simplifies administration. Multi-role servers provide more efficient use of server resources by distributing the Client Access and Mailbox resources across a larger pool of servers. Client Access and Database Availability Group (DAG) resiliency is also increased, as there are more servers available for the load-balanced pool and for the DAG. Commodity server platforms (e.g., 2U, dual socket servers with no more than 24 processor cores and 96GB of memory that hold 12 large form-factor drive bays within the server chassis) are used in the PA. Additional drive bays can be deployed per-server depending on the number of mailboxes, mailbox size, and the server's scalability. Each server houses a single RAID1 disk pair for the operating system, Exchange binaries, protocol/client logs, and transport database. The rest of the storage is configured as JBOD, using large capacity 7.2K RPM serially attached SCSI (SAS) disks (while SATA disks are also available, the SAS equivalent provides better IO and a lower annualized failure rate). BitLocker is used to encrypt each disk, thereby providing data encryption at rest and mitigating concerns around data theft via disk replacement. To ensure that the capacity and IO of each disk is used as efficiently as possible, four database copies are deployed per-disk. The normal run-time copy layout (calculated in the Exchange 2013 Server Role Requirements Calculator) ensures that there is no more than a single copy activated per-disk.
Figure 2: Server Design
At least one disk in the disk pool is reserved as a hot spare. AutoReseed is enabled and quickly restores database redundancy after a disk failure by activating the hot spare and initiating database copy reseeds.
Database Availability Group Design
Within each site resilient datacenter pair you will have one or more DAGs.
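A minimal sketch of standing up such a DAG across the datacenter pair might look like the following; the DAG, server, and witness names are placeholders, and the witness placement guidance discussed below still applies:

# Create the DAG and add an equal number of members from each datacenter
New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer fsw01.contoso.com -WitnessDirectory 'C:\DAGWitness\DAG1'
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EAST-MBX01
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer WEST-MBX01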
DAG Configuration As with the namespace model, each DAG within the site resilient datacenter pair operates in an unbound model with active copies distributed equally across all servers in the DAG. This model provides two benefits: Ensures that each DAG member’s full stack of services is being validated (client connectivity, replication pipeline, transport, etc.). Distributes the load across as many servers as possible during a failure scenario, thereby only incrementally increasing resource utilization across the remaining members within the DAG. Each datacenter is symmetrical, with an equal number of member servers within a DAG residing in each datacenter. This means that each DAG contains an even number of servers and uses a witness server for quorum arbitration. The DAG is the fundamental building block in Exchange 2013. With respect to DAG size, a larger DAG provides more redundancy and resources. Within the PA, the goal is to deploy larger DAGs (typically starting out with an eight member DAG and increasing the number of servers as required to meet your requirements) and only create new DAGs when scalability introduces concerns over the existing database copy layout. DAG Network Design Since the introduction of continuous replication in Exchange 2007, Exchange has recommended multiple replication networks for separating client traffic from replication traffic. Deploying two networks allows you to isolate certain traffic along different network pathways and ensure that during certain events (e.g., reseed events) the network interface is not saturated (which is an issue with 100Mb, and to a certain extent, 1Gb interfaces). However, for most customers, having two networks operating in this manner was only a logical separation, as the same copper fabric was used by both networks in the underlying network architecture. With 10Gb networks becoming the standard, the PA moves away from the previous guidance of separating client traffic from replication traffic. A single network interface is all that is needed because ultimately our goal is to achieve a standard recovery model despite the failure - whether a server failure occurs or a network failure occurs, the result is the same, a database copy is activated on another server within the DAG. This architectural change simplifies the network stack, and obviates the need to eliminate heartbeat cross-talk. Witness Server Placement Ultimately, the placement of the witness server determines whether the architecture can provide automatic datacenter failover capabilities or whether it will require a manual activation to enable service in the event of a site failure. If your organization has a third location with a network infrastructure that is isolated from network failures that affect the site resilient datacenter pair in which the DAG is deployed, then the recommendation is to deploy the DAG’s witness server in that third location. This configuration gives the DAG the ability to automatically failover databases to the other datacenter in response to a datacenter-level failure event, regardless of which datacenter has the outage. Figure 3: DAG (Three Datacenter) Design If your organization does not have a third location, then place the witness server in one of the datacenters within the site resilient datacenter pair. If you have multiple DAGs within the site resilient datacenter pair, then place the witness server for all DAGs in the same datacenter (typically the datacenter where the majority of the users are physically located). 
Also, make sure the Primary Active Manager (PAM) for each DAG is located in the same datacenter.
Data Resiliency
Data resiliency is achieved by deploying multiple database copies. In the PA, database copies are distributed across the site resilient datacenter pair, thereby ensuring that mailbox data is protected from software, hardware and even datacenter failures. Each database has four copies, with two copies in each datacenter, which means at a minimum, the PA requires four servers. Out of these four copies, three of them are configured as highly available. The fourth copy (the copy with the highest Activation Preference) is configured as a lagged database copy. Due to the server design, each copy of a database is isolated from its other copies, thereby reducing failure domains and increasing the overall availability of the solution as discussed in DAG: Beyond the "A". The purpose of the lagged database copy is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption. It is not intended for individual mailbox recovery or mailbox item recovery. The lagged database copy is configured with a seven day ReplayLagTime. In addition, the Replay Lag Manager is also enabled to provide dynamic log file play down for lagged copies. This feature ensures that the lagged database copy can be automatically played down and made highly available in the following scenarios: when a low disk space threshold is reached; when the lagged copy has physical corruption and needs to be page patched; and when there are fewer than three available healthy copies (active or passive) for more than 24 hours. When using the lagged database copy in this manner, it is important to understand that the lagged database copy is not a guaranteed point-in-time backup. The lagged database copy will have an availability threshold, typically around 90%, due to periods where the disk containing a lagged copy is lost due to disk failure, the lagged copy becoming an HA copy (due to automatic play down), as well as the periods where the lagged database copy is re-building the replay queue. To protect against accidental (or malicious) item deletion, Single Item Recovery or In-Place Hold technologies are used, and the Deleted Item Retention window is set to a value that meets or exceeds any defined item-level recovery SLA. With all of these technologies in play, traditional backups are unnecessary; as a result, the PA leverages Exchange Native Data Protection.
Summary
The PA takes advantage of the changes made in Exchange 2013 to simplify your Exchange deployment, without decreasing the availability or the resiliency of the deployment. And in some scenarios, when compared to previous generations, the PA increases availability and resiliency of your deployment. Ross Smith IV Principal Program Manager Office 365 Customer Experience
Lagged Database Copy Enhancements in Exchange Server 2016 CU1
The high availability capabilities of the lagged database copy are enhanced in the upcoming release of Exchange 2016 Cumulative Update 1. ReplayLagManager As you may recall, lagged copies can care for themselves by invoking automatic log replay to play down the log files in certain scenarios: When a low disk space threshold (10,000MB) is reached When the lagged copy has physical corruption and needs to be page patched When there are fewer than three available healthy HA copies for more than 24 hours Play down based on health copy status requires ReplayLagManager to be enabled. Beginning with Exchange 2016 CU1, ReplayLagManager is enabled by default. You can change this via the following command: Set-DatabaseAvailabilityGroup <DAGName> -ReplayLagManagerEnabled $false Deferred Lagged Copy Play Down When one of the above conditions is triggered, the Replication Service will initiate a play down event for the lagged database copy. However, there are times where this may not be ideal. For example, consider the scenario where there are four database copies on a disk, one passive, one lagged, and two active. Initiating a play down event on the lagged copy has the potential to impact any active copies on that disk – replaying log files generates IO and introduces disk latency as the disk head moves, which impacts users accessing their data on the active copies. To address this concern, beginning with Cumulative Update 1 for Exchange 2016, the lagged copy’s play down activity is tied to the health of the disk by evaluating the disk’s IO latency: If the disk’s read IO latency is above 35ms, the play down event is deferred. In the event that there is a disk capacity concern, the disk latency deferral will be ignored and the lagged copy will play down. Once the disk’s read IO latency drops below 25ms, the play down event is resumed. As a result, deferred lagged copy play down reduces the IO burstiness of lagged copy play down events and ensures that local active copies on the lagged copies disk are not affected. IO sizing of a lagged database copy does not change with this feature (nor does it affect the IO sizing of an active copy); you still must ensure there is available IO headroom in the event that the lagged copy becomes active. Consider the following example: The y axis is disk latency, measured in milliseconds. The x axis is a 24-hour period. As you can see from the graph, between the hours of 1am to 9am, the disk IO latency is below 25ms, meaning that lagged copy replay is allowed. At 10am, the latency exceeds 35ms and this continues until about 2pm; during this time period, lagged copy replay is delayed or deferred. At 2pm, the latency drops below 25ms and lagged copy replay resumes. Latency increases again at 4pm and the process repeats itself. By default, the maximum amount of time that a play down event can be deferred is 24 hours. You can adjust this via the following command: Set-MailboxDatabaseCopy <database name\server> -ReplayLagMaxDelay:<value in the format of 00:00:00> If you want to disable deferred play down, you can set the ReplayLagMaxDelay value to ([TimeSpan]::Zero). The following events are recorded in the Microsoft-Exchange-HighAvailability/Monitoring crimson channel when log replay is deferred or resumed: Event 750 – Replay Lag Manager requested activating replay lag delay (suspending log replay) for database copy '%1\%2' after a suppression interval of %4. 
Delay Reason: %6
Event 751 – Replay Lag Manager successfully activated replay lag delay (suspended log replay) for database copy '%1\%2'. Delay Reason: %4
Event 752 – Replay Lag Manager failed to activate replay lag delay (suspend log replay) for database copy '%1\%2'. Error: %4
Event 753 – Replay Lag Manager requested deactivating replay lag (resuming log replay) for database copy '%1\%2' after a suppression interval of %4. Reason: %5
Event 754 – Replay Lag Manager successfully deactivated replay lag (resumed log replay) for database copy '%1\%2'. Reason: %4
Event 755 - Replay Lag Manager failed to deactivate replay lag (resume log replay) for database copy '%1\%2'. Error: %4
Event 756 - Replay Lag Manager will attempt to deactivate replay lag (resume log replay) for database copy '%1\%2' because it has reached the maximum allowed lag duration. Detailed Reason: %5
The following events are recorded in the Microsoft-Exchange-HighAvailability/Operational crimson channel when log replay is deferred or resumed:
Event 748 – Log Replay suspend/resume state for database '%1' has changed. (LastSuspendReason=%3, CurrentSuspendReason=%4, CurrentSuspendReasonMessage=%5)
Event 2050 – Suspend log replay requested for database guid=%1, reason='%2'.
Event 2051 – Suspend log replay for database guid=%1 succeeded.
Event 2052 – Suspend log replay for database guid=%1 failed: %2.
Event 2053 – Resume log replay requested for database guid=%1.
Event 2054 – Resume log replay for database guid=%1 succeeded.
Event 2055 – Resume log replay for database guid=%1 failed: %2.
Summary
The changes discussed above continue our work in improving the Preferred Architecture by ensuring that users have the best possible experience on the Exchange platform. As always, we welcome your feedback. Ross Smith IV Principal Program Manager Office 365 Customer Experience
DAG Activation Preference Behavior Change in Exchange Server 2016 CU2
Every copy of a mailbox database in a DAG is assigned an activation preference number. This number is used by the system as part of the passive database activation process, and by administrators when performing database balancing operations for a DAG. This number is expressed as the ActivationPreference property of a mailbox database copy. The value for the ActivationPreference property is a number equal to or greater than 1, where 1 is at the top of the preference order. When a DAG is first implemented, by default all active database copies have an ActivationPreference of 1. However, due to the inherent nature of DAGs (e.g., databases experience switchovers and failovers), active mailbox database copies will change hosts several times throughout a DAG's lifetime. As a result of this inherent behavior, a mailbox database may remain active on a database copy which is not the most preferred copy. Prior to Exchange 2016 Cumulative Update 2 (CU2), Exchange Server administrators had to either manually activate their preferred database copy, or use the RedistributeActiveDatabases.ps1 script to balance the database copies across a DAG. Starting with CU2 (which will be releasing soon), this rebalancing behavior is built into the product: the Primary Active Manager in the DAG performs periodic discretionary moves to activate the copy that the administrator has defined as most preferred. A new DAG property called PreferenceMoveFrequency has been added that defines the frequency (measured in time) at which the Microsoft Exchange Replication service will rebalance the database copies by performing a lossless switchover that activates the copy with an ActivationPreference of 1 (assuming the target server and database copy are healthy). Note: In order to take advantage of this feature, ensure all Mailbox servers within the DAG are upgraded to Exchange 2016 CU2. By default, the Replication service will inspect the database copies and perform a rebalance every one hour. You can modify this behavior using the following command:
Set-DatabaseAvailabilityGroup <Name> -PreferenceMoveFrequency <value in the format of 00:00:00>
To disable this behavior, configure the PreferenceMoveFrequency value to ([System.Threading.Timeout]::InfiniteTimeSpan). If you are leaving the behavior enabled, and you have created a scheduled task to execute RedistributeActiveDatabases.ps1, you can remove the scheduled task after upgrading the DAG to CU2. We recommend taking advantage of this behavior to ensure that your DAG remains optimally balanced. This feature continues our work to improve the Preferred Architecture by ensuring that users have the best possible experience on Exchange Server. As always, we welcome your feedback. Ross Smith IV Principal Program Manager Office 365 Customer Experience Updates 6/21/16: Updated information on how to disable PreferenceMoveFrequency without requiring a Replication service restart. If you set it to [Timespan]::Zero, you will need to cycle the Replication service.
New Support Policy for Repaired Exchange Databases
The database repair process is often used as a last ditch effort to recover an Exchange database when no other means of recovery is available. The process should only be followed at the advice of Microsoft Support and after determining that all other recovery options have been exhausted. For many years in many versions of Exchange, the repair process has largely been the same. However, that process is changing, based on information Microsoft has gathered from an extensive analysis of support cases. In short, Microsoft is changing the support policy for databases that have had a repair operation performed on them. Originally a database was supported if the repair was performed using ESEUTIL and ISINTEG/repair cmdlets. Under the new support policy, any database where the repair count is greater than 0 will need to be evacuated – all mailboxes on such a database will need to be moved to a new database. Existing Repair Process The process consists of three steps: Repair the database at the page level Defragmentation of the database to restructure and recreate the database Repair of the logical structures within the database Step 1 of the repair process is accomplished by using ESEUTIL /p. This is typically performed when there is page level corruption in the database - for example, a -1018 JET error, or when a database is left in dirty shutdown state as the result of not having the necessary log files to bring the database to a clean shutdown state. After executing ESEUTIL /p you are prompted to confirm that data loss may result. Selecting OK is required to continue. [PS] C:\Program Files\Microsoft\Exchange Server\Mailbox\First Storage Group>eseutil /p '.\Mailbox Database.edb' Extensible Storage Engine Utilities for Microsoft(R) Exchange Server Version 08.03 Copyright (C) Microsoft Corporation. All Rights Reserved. Initiating REPAIR mode... Database: .\Mailbox Database.edb Temp. Database: TEMPREPAIR4520.EDB Checking database integrity. The database is not up-to-date. This operation may find that this database is corrupt because data from the log files has yet to be placed in the database. To ensure the database is up-to-date please use the 'Recovery' operation. Scanning Status (% complete) 0 10 20 30 40 50 60 70 80 90 100 |----|----|----|----|----|----|----|----|----|----| . Rebuilding MSysObjectsShadow from MSysObjects. Scanning Status (% complete) 0 10 20 30 40 50 60 70 80 90 100 |----|----|----|----|----|----|----|----|----|----| ................................................... Checking the database. Scanning Status (% complete) 0 10 20 30 40 50 60 70 80 90 100 |----|----|----|----|----|----|----|----|----|----| ................................................... Scanning the database. Scanning Status (% complete) 0 10 20 30 40 50 60 70 80 90 100 |----|----|----|----|----|----|----|----|----|----| ................................................... Repairing damaged tables. Scanning Status (% complete) 0 10 20 30 40 50 60 70 80 90 100 |----|----|----|----|----|----|----|----|----|----| ................................................... Repair completed. Database corruption has been repaired! Note: It is recommended that you immediately perform a full backup of this database. If you restore a backup made before the repair, the database will be rolled back to the state it was in at the time of that backup. Operation completed successfully with 595 (JET_wrnDatabaseRepaired, Database corruption has been repaired) after 30.187 seconds. 
At this point, the database should be in a clean shutdown state and the repair process may proceed. This can be verified with ESEUTIL /mh. [PS] C:\Program Files\Microsoft\Exchange Server\Mailbox\First Storage Group>eseutil /mh '.\Mailbox Database.edb' State: Clean Shutdown Step 2 is to defragment the database using ESEUTIL /d. Defragmentation requires significant free space on the volume that will host the temporary database (typically 110% of the size of the database must be available as free disk space). [PS] C:\Program Files\Microsoft\Exchange Server\Mailbox\First Storage Group>eseutil /d '.\Mailbox Database.edb' Extensible Storage Engine Utilities for Microsoft(R) Exchange Server Version 08.03 Copyright (C) Microsoft Corporation. All Rights Reserved. Initiating DEFRAGMENTATION mode... Database: .\Mailbox Database.edb Defragmentation Status (% complete) 0 10 20 30 40 50 60 70 80 90 100 |----|----|----|----|----|----|----|----|----|----| ................................................... Moving 'TEMPDFRG3620.EDB' to '.\Mailbox Database.edb'... DONE! Note: It is recommended that you immediately perform a full backup of this database. If you restore a backup made before the defragmentation, the database will be rolled back to the state it was in at the time of that backup. Operation completed successfully in 7.547 seconds. Step 3 is the logical repair of the objects within the database. The method used to accomplish this varies by Exchange version. In Exchange 2007, ISINTEG is used to perform the logical repair, as illustrated in the following example: C:\>isinteg -s wingtip-e2k7 -fix -test alltests -verbose -l c:\isinteg.log Databases for server wingtip-e2k7: Only databases marked as Offline can be checked Index Status Database-Name Storage Group Name: First Storage Group 1 Offline Mailbox Database Enter a number to select a database or press Return to exit. 1 You have selected First Storage Group / Mailbox Database. 
Continue?(Y/N)y Test Categorization Tables result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time: 0h:0m:0s Test Restriction Tables result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time: 0h:0m:0s Test Search Folder Links result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s);time: 0h:0m:0s Test Global result: 0 error(s); 0 warning(s); 0 fix(es); 1 row(s); time: 0h:0m:0s Test Delivered To result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time: 0h:0m:0s Test Repl Schedule result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time:0h:0m:0s Test Timed Events result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time: 0h:0m:0s Test reference table construction result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time: 0h:0m:0s Test Folder result: 0 error(s); 0 warning(s); 0 fix(es); 4996 row(s); time: 0h:0m:2s Test Deleted Messages result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time: 0h:0m:0s Test Message result: 0 error(s); 0 warning(s); 0 fix(es); 1789 row(s); time: 0h:0m:0s Test Attachment result: 0 error(s); 0 war ning(s); 0 fix(es); 406 row(s); time: 0h:0m:0s Test Mailbox result: 0 error(s); 0 warning(s); 0 fix(es); 249 row(s); time: 0h:0m:0s Test Sites result: 0 error(s); 0 warning(s); 0 fix(es); 996 row(s); time: 0h:0m:0s Test Categories result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time: 0h:0m:0s Test Per-User Read result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time:0h:0m:0s Test special folders result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time: 0h:0m:0s Test Message Tombstone result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time: 0h:0m:0s Test Folder Tombstone result: 0 error(s); 0 warning(s); 0 fix(es); 0 row(s); time: 0h:0m:0s Now in test 20(reference count verification) of total 20 tests; 100% complete. Typically when ISINTEG completes, it advises reviewing the isinteg.log file. At the end of the file is a summary section, listing the number of errors encountered. If the number of errors is greater than zero, you need to re-run the command. Continued repairs need to be performed until the error count reaches 0 or the same number of errors is encountered after two executions. . . . . . SUMMARY . . . . . Total number of tests : 20 Total number of warnings : 0 Total number of errors : 0 Total number of fixes : 0 Total time : 0h:0m:3s In Exchange 2010 and later, ISINTEG was deprecated and certain functions were replaced by the New-MailboxRepairRequest and New-PublicFolderDatabaseRepairRequest cmdlets, both of which allow for repair operations to occur while the database is online. Exchange 2010: [PS] C:\Windows\system32>New-MailboxRepairRequest -Mailbox user252 -CorruptionType SearchFolder,FolderView,AggregateCounts,ProvisionedFolder,MessagePtagCN,MessageID RequestID Mailbox ArchiveMailbox Database Server --------- ------- -------------- -------- ------ 7f499ce3-e Wingtip False Mailbox. WINGTIP-E2K10.Wingti... Exchange 2013: [PS] C:\>New-MailboxRepairRequest -Mailbox User532 -CorruptionType SearchFolder,FolderView,AggregateCounts, ProvisionedFolder,ReplState,MessagePTAGCn,MessageID,RuleMessageClass,RestrictionFolder,FolderACL, UniqueMidIndex,CorruptJunkRule,MissingSpecialFolders,DropAllLazyIndexes,ImapID,ScheduledCheck,Extension1, Extension2,Extension3,Extension4,Extension5 Identity Task Detect Only Job State Progress -------- ---- ----------- --------- -------- a44acf2b {Sea False Queued 0 Upon completion of these repair options, typically the database could be mounted and normal user operations continued. 
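In Exchange 2013 and later, the state of these online repair requests can be reviewed before the database is considered healthy again. A simple sketch, with a placeholder database name:

# Review the progress and outcome of outstanding repair requests for a database
Get-MailboxRepairRequest -Database 'Mailbox Database 1' | Format-List Identity,Tasks,DetectOnly,JobState,Progress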
Support Change for Repaired Databases
Over the course of the last two years, we have reviewed Watson dumps for Information Store crashes that have been automatically uploaded by customers' servers. The crashes were caused by inexplicable, seemingly impossible store level corruption. The types of store level corruption varied and they came from many different databases, servers, Exchange versions, and customers. In almost all of these cases one significant fact was noted: the repair count recorded on the database was > 0. When ESEUTIL /p is executed, and a repair to the database is necessary, the repair count is incremented and the repair time is recorded in the header of the database. The repair information stored in the database header will be retained after offline defragmentation. Repair information in the header may be viewed with ESEUTIL /mh. [PS] C:\Program Files\Microsoft\Exchange Server\Mailbox\First Storage Group>eseutil /mh '.\Mailbox Database.edb' Extensible Storage Engine Utilities for Microsoft(R) Exchange Server Version 08.03 Copyright (C) Microsoft Corporation. All Rights Reserved. Initiating FILE DUMP mode... Database: .\Mailbox Database.edb File Type: Database Format ulMagic: 0x89abcdef Engine ulMagic: 0x89abcdef Format ulVersion: 0x620,12 Engine ulVersion: 0x620,12 Created ulVersion: 0x620,12 DB Signature: Create time:04/05/2015 08:39:24 Rand:2178804664 Computer: cbDbPage: 8192 dbtime: 1059112 (0x102928) State: Clean Shutdown Log Required: 0-0 (0x0-0x0) Log Committed: 0-0 (0x0-0x0) Streaming File: No Shadowed: Yes Last Objid: 4020 Scrub Dbtime: 0 (0x0) Scrub Date: 00/00/1900 00:00:00 Repair Count: 2 Repair Date: 04/05/2015 08:39:24 Old Repair Count: 0 Last Consistent: (0x0,0,0) 04/05/2015 08:39:25 Last Attach: (0x0,0,0) 04/05/2015 08:39:24 Last Detach: (0x0,0,0) 04/05/2015 08:39:25 Dbid: 1 Log Signature: Create time:00/00/1900 00:00:00 Rand:0 Computer: OS Version: (6.1.7601 SP 1 NLS 60101.60101) Previous Full Backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00 Previous Incremental Backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00 Previous Copy Backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00 Previous Differential Backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00 Current Full Backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00 Current Shadow copy backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00 cpgUpgrade55Format: 0 cpgUpgradeFreePages: 0 cpgUpgradeSpaceMapPages: 0 ECC Fix Success Count: none Old ECC Fix Success Count: none ECC Fix Error Count: none Old ECC Fix Error Count: none Bad Checksum Error Count: none Old bad Checksum Error Count: none Operation completed successfully in 0.78 seconds. Because uncorrectable corruption can linger in a repaired database and cause store crashes and server instability, we have changed our support policy to require an evacuation of any Exchange database that persistently has a repair count or old repair count equal to or greater than 1. Moving mailboxes (and public folders) to new databases will ensure that the underlying database structure is good, free from any corruption that might not be corrected by the database repair process, and it helps prevent store crashes and server instability. Tim McMichael
Exchange Server Role Requirements Calculator Update
v7.8 of the calculator introduces support for Exchange 2016! Yes, that's right, you don't need a separate calculator, v7.8 and later supports Exchange 2013 or Exchange 2016 deployments. Moving forward, the calculator is branded as the Exchange Server Role Requirements Calculator. When you open the calculator you will find a new drop-down option in the Input tab that allows you to select the deployment version. Simply choose 2013 or 2016: When you choose 2016, you will notice the Server Multi-Role Configuration option is disabled due to the fact that Exchange 2016 no longer provides the Client Access Server role. As discussed in the Exchange 2016 Architecture and Preferred Architecture articles, the volume format best practice recommendation for Exchange data volumes has changed in Exchange 2016 as we now recommend ReFS (with the integrity feature disabled). By default, for Exchange 2016 deployments, the calculator scripts will default to ReFS (Exchange 2013 deployments will default to NTFS). This is exposed in the Export Mount Points File dialog: The DiskPart.ps1 and CreateDag.ps1 scripts have been updated to support formatting the volume as ReFS (and disabling the integrity feature at the volume level) and enabling AutoReseed support for ReFS. This release also improves the inputs of all dialogs for the distribution scripts by persisting values across the various dialogs (e.g., global catalog values). For all the other improvements and bug fixes, please review the readme or download the update. As always we welcome feedback and please report any issues you may encounter while using the calculator by emailing strgcalc AT microsoft DOT com. Ross Smith IV Principal Program Manager Office 365 Customer Experience
Per-Server Database Limits Explained
Over the past year, we have discussed the architectural changes that have been introduced in Exchange Server 2013. I wrote about the reduction in complexity that the new server role architecture introduces, as well as one of the new capabilities introduced in Exchange 2013, Managed Availability's recovery oriented computing. However, we haven't been clear on other architectural changes that have shaped decisions we've made about the Exchange 2013 product. One example is the decision to reduce the number of databases supported per-server from 100 to 50. There were three main reasons for this: server architecture changes, use of commodity hardware, and testing. Let me explain each of these in more detail.
Server Architecture
Exchange 2013 includes fundamental changes to the search and store components and to how data is processed and rendered. The old content indexing service was replaced with Search Foundation. Search Foundation is an actively developed search platform that is used across the Office Server products. Search Foundation allows us to have notification-driven content indexing, which improves indexing performance; in addition, we now annotate during transport, significantly reducing the number of times a message must be indexed. The monolithic store.exe process was re-written; store is now written in managed code and there are now at least three processes that make up the Information Store service: the Microsoft Exchange Replication service, the Information Store service process controller, and the Information Store worker process. By utilizing the worker process model, each database is now isolated from every other database (e.g., a database crashing due to a malformed message will not bring down the rest of the databases on the server). In addition, there is a core shift in the server role architecture such that the protocol responsible for servicing a user's request is the protocol instance that is local to the user's active mailbox database copy. This means that the Mailbox server role now performs more work when compared to its Exchange 2010 counterpart. The end result is that with the server architecture changes we introduced in Exchange 2013, search, store, and the protocols typically are CPU and memory bound, as opposed to disk IO or capacity bound.
Commodity Hardware
As discussed in our server sizing guidance, we are big fans of commodity server hardware. Office 365 is designed to run on commodity hardware that leverages 2 processor sockets and 12 disks; we do not leverage external storage chassis, as this increases the operational complexity in the environment. Our Exchange 2013 Mailbox servers have fewer than 50 database copies per-server in Office 365.
Testing
The last reason why we limited support to 50 databases per-server is that we did not have actual deployments at any scale to validate that store, search, the protocols, and Managed Availability could handle 100 databases per-server. Automation and lab testing can only take you so far; the lack of real world usage was one of the key reasons why we chose to limit the database count.
Moving Forward
The Exchange Product Group takes pride in the feedback mechanisms we have invested in with the Exchange community. Since the release of Exchange 2013, we've received an inordinate amount of feedback regarding the reduction in supported databases per-server.
The driving response has been "we currently deploy more than 50 databases per-server in Exchange 2010; with this change, we will need to deploy more servers, which increases our capital expenditures significantly." Rest assured, that is not the message we want to send with Exchange 2013. It is true that Exchange 2013 utilizes more CPU and memory than its predecessors; this is due to the architecture changes we've made, as well as the changes we've made to reduce disk IO so that you can deploy more mailboxes per disk. But we do not want to see architectures artificially limited by the supported databases per-server constraint. Over the last several months, we've been working to resolve our concerns and improve our test matrices to validate supporting more databases per-server. As a result of the work done by the Mailbox Intelligence and Operations teams, I am pleased to announce that when Exchange Server 2013 RTM Cumulative Update 2 (CU2) releases, we are increasing the supported number of databases per-server back to 100. Both the Exchange 2013 Server Role Calculator and our sizing guidance will be updated to reflect this architectural change in tandem with CU2's release. CU2 will release later this summer.
As always, we continue to identify ways to better serve your needs through our regular servicing releases. We hope you find this architectural change useful. Please keep the feedback coming; we are listening.
Ross Smith IV
Principal Program Manager
Exchange Customer Experience
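As a quick, hedged aside (not from the original post): the worker process model described above means each database copy hosted on an Exchange 2013 Mailbox server is typically serviced by its own Information Store worker process, so the two counts below would normally line up on a healthy server.
# Hedged sketch; run from the Exchange Management Shell on a Mailbox server.
# Assumes one Microsoft.Exchange.Store.Worker process per database copy hosted on the server.
$workers = @(Get-Process -Name "Microsoft.Exchange.Store.Worker" -ErrorAction SilentlyContinue).Count
$copies  = @(Get-MailboxDatabaseCopyStatus -Server $env:COMPUTERNAME).Count
"Store worker processes: $workers; database copies hosted here: $copies"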
Analyzing Exchange Transaction Log Generation Statistics
Update 1/31/2017: Please see the updated version of this post that explains a significant update to this script. To download the script, see the attachment to this blog post.
Overview
When designing a site resilient Exchange Server solution, one of the required planning tasks is to determine how many transaction logs are generated on an hourly basis. This helps figure out how much bandwidth will be required when replicating database copies between sites, and what the effects will be of adding additional database copies to the solution. If designing an Exchange solution using the Exchange Server Role Requirements Calculator, the percent of logs generated per hour is an optional input field.
Previously, the most common method of collecting this data involved taking captures of the files in each log directory on a scheduled basis (using dir, Get-ChildItem, or CollectLogs.vbs). Although the log number could be extracted by looking at the names of the log files, there was a lot of manual work involved in figuring out the highest log generation from each capture and getting rid of duplicate entries. Once cleaned up, the data still had to be analyzed manually using a spreadsheet or a calculator. Trying to gather data across multiple servers and databases further complicated matters.
To improve upon this situation, I decided to write an all-in-one script that could collect transaction log statistics, and analyze them after collection. The script is called GetTransactionLogStats.ps1. It has two modes: Gather and Analyze. Gather mode is designed to be run on an hourly basis, at the top of the hour. When run, it will take a single set of snapshots of the current log generation number for all configured databases. These snapshots will be sent, along with the time the snapshots were taken, to an output file, LogStats.csv. Each subsequent time the script is run in Gather mode, another set of snapshots will be appended to the file. Analyze mode is used to process the snapshots that were taken in Gather mode, and should be run after a sufficient number of snapshots have been collected (at least 2 weeks of data is recommended). When run, it compares the log generation number in each snapshot to the previous snapshot to determine how many logs were created during that period.
Script Features
Less Data to Collect
Instead of looking at the files within log directories, the script uses Perfmon to get the current log file generation number for a specific database or storage group. This number, along with the time it was obtained, is the only information kept in the output log file, LogStats.csv. The performance counters that are used are as follows:
Exchange 2013/2016: MSExchangeIS HA Active Database\Current Log Generation Number
Exchange 2010: MSExchange Database ==> Instances\Log File Current Generation
Note: The counter used for Exchange 2013/2016 contains the active databases on that server, as well as any now-passive databases that had been activated on that server at some point since the last reboot. The counter used for Exchange 2010 contains all databases on that server, including all passive copies. To only get data from active databases, make sure to manually specify the databases for that server in the TargetServers.txt file. Alternatively, you can use the DontAnalyzeInactiveDatabases parameter when performing the analysis to exclude databases that did not increment their log count.
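As a hedged aside (not part of the script itself), the Exchange 2013/2016 counter described above can also be inspected manually with Get-Counter; EX01 below is a placeholder for one of your Mailbox servers.
# Hedged sketch: read the current log generation number for each database instance exposed by the counter on EX01.
Get-Counter -ComputerName EX01 -Counter "\MSExchangeIS HA Active Database(*)\Current Log Generation Number" |
    Select-Object -ExpandProperty CounterSamples |
    Select-Object InstanceName, CookedValue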
Multi Server/Database Support
The script takes a simple input file, TargetServers.txt, where each line in the file specifies the server, or server and databases, to process. If you want to get statistics for all databases on a server, only the server name is necessary. If you want to only get a subset of databases on a server (for instance, if you wanted to omit secondary copies on an Exchange 2010 server), then you can specify the server name, followed by each database you want to process.
Built In Analysis Capability
The script has the ability to analyze the output log file, LogStats.csv, that was created when run in Gather mode. It does a number of common calculations for you, but also leaves the original data in case any other calculations need to be done. Output from running in Analyze mode is sent to multiple .CSV files, where one file is created for each database, and one more file is created containing the average statistics for all analyzed databases. The following columns are added to the CSV files:
Hour: The hour that log stats are being gathered for. Can be between 0 and 23.
TotalLogsCreated: The total number of logs created during that hour for all days present in LogStats.csv.
TotalSampleIntervalSeconds: The total number of seconds between each valid pair of samples for that hour. Because the script gathers Perfmon data over the network, the sample interval may not always be exactly one hour.
NumberOfSamples: The number of times that the log generation was sampled for the given hour.
AverageSample: The average number of logs generated for that hour, regardless of sample interval size. Formula: TotalLogsCreated / NumberOfSamples.
PercentDailyUsage: The percentage of all logs that that particular hour accounts for. Formula: LogsCreatedForHour / LogsCreatedForAllHours * 100.
PercentDailyUsageForCalc: The ratio of all logs for this hour compared to all logs for all hours. Formula: LogsCreatedForHour / LogsCreatedForAllHours.
AverageSamplePer60Minutes: Similar to AverageSample, but adjusts the value as if each sample were taken exactly 60 minutes apart. Formula: TotalLogsCreated / TotalSampleIntervalSeconds * 3600.
Database Heat Map
As of version 2.0, this script also generates a database heat map when run in Analyze mode. The heat map shows how many logs were generated for each database during the duration of the collection. This information can be used to figure out whether databases, servers, or entire Database Availability Groups are over- or underutilized compared to their peers. The database heat map consists of two files:
HeatMap-AllCopies.csv: A heat map of all tracked databases, including databases that may have failed over during the collection duration and were tracked on multiple servers. This heat map shows the server-specific instance of each database.
HeatMap-DBsCombined.csv: A heat map containing only a single instance of each unique database. In cases where multiple copies of the same database had generated logs, the log count from each will be combined into a single value.
Requirements
The script has the following requirements:
Target Exchange Servers must be running Exchange 2010, 2013, or 2016.
PowerShell Remoting must be enabled on the target Exchange Servers, and configured to allow connections from the machine where the script is being executed.
Parameters
The script has the following parameters:
-Gather: Switch specifying we want to capture current log generations. If this switch is omitted, the -Analyze switch must be used.
-Analyze: Switch specifying we want to analyze already captured data. If this switch is omitted, the -Gather switch must be used.
-ResetStats: Switch indicating that the output file, LogStats.csv, should be cleared and reset. Only works if combined with -Gather.
-WorkingDirectory: The directory containing TargetServers.txt and LogStats.csv. If omitted, the working directory will be the current working directory of PowerShell (not necessarily the directory the script is in).
-LogDirectoryOut: The directory to send the output log files from running in Analyze mode to. If omitted, logs will be sent to WorkingDirectory.
-MaxSampleIntervalVariance: The maximum number of minutes that the duration between two samples can vary from 60. If we are past this amount, the sample will be discarded. Defaults to a value of 10.
-MaxMinutesPastTheHour: How many minutes past the top of the hour a sample can be taken. Samples past this amount will be discarded. Defaults to a value of 15.
-MonitoringExchange2013: Whether there are Exchange 2013/2016 servers configured in TargetServers.txt. Defaults to $true. If there are no 2013/2016 servers being monitored, set this to $false to increase performance.
-DontAnalyzeInactiveDatabases: When running in Analyze mode, this specifies that any databases that have been found that did not generate any logs during the collection duration will be excluded from the analysis. This is useful in excluding passive databases from the analysis.
Usage
Runs the script in Gather mode, taking a single snapshot of the current log generation of all configured databases:
PS C:\> .\GetTransactionLogStats.ps1 -Gather
Runs the script in Gather mode, and indicates that no Exchange 2013/2016 servers are configured in TargetServers.txt:
PS C:\> .\GetTransactionLogStats.ps1 -Gather -MonitoringExchange2013 $false
Runs the script in Gather mode, changes the directory where TargetServers.txt is located and where LogStats.csv will be written to, and resets (clears) the existing LogStats.csv:
PS C:\> .\GetTransactionLogStats.ps1 -Gather -WorkingDirectory "C:\GetTransactionLogStats" -ResetStats
Runs the script in Analyze mode:
PS C:\> .\GetTransactionLogStats.ps1 -Analyze
Runs the script in Analyze mode, and excludes database copies that did not generate any logs during the collection duration:
PS C:\> .\GetTransactionLogStats.ps1 -Analyze -DontAnalyzeInactiveDatabases $true
Runs the script in Analyze mode, sending the output files for the analysis to a different directory. Specifies that only sample durations between 55-65 minutes are valid, and that each sample can be taken a maximum of 10 minutes past the hour before being discarded:
PS C:\> .\GetTransactionLogStats.ps1 -Analyze -LogDirectoryOut "C:\GetTransactionLogStats\LogsOut" -MaxSampleIntervalVariance 5 -MaxMinutesPastTheHour 10
Example TargetServers.txt
The following describes what the TargetServers.txt input file should look like. For the server1 and server3 lines, no databases are specified, which means that all databases on the server will be sampled. For the server2 and server4 lines, we will only sample the specified databases on those servers. Note that no quotes are necessary for databases with spaces in their names.
Output File After Running in Gather Mode
When run in Gather mode, the log generation snapshots that are taken are sent to LogStats.csv.
Output File After Running in Analyze Mode
When run in Analyze mode, a separate analysis file is generated for each database, using the columns described above.
Running As a Scheduled Task
Since the script is designed to be run on an hourly basis, the easiest way to accomplish that is to run the script via a Scheduled Task. The way I like to do that is to create a batch file which calls Powershell.exe and launches the script, and then create a Scheduled Task which runs the batch file (a hedged task-registration sketch appears at the end of this post). The following is an example of the command that should go in the batch file:
powershell.exe -noninteractive -noprofile -command "& {C:\LogStats\GetTransactionLogStats.ps1 -Gather -WorkingDirectory C:\LogStats}"
In this example, the script and TargetServers.txt are located in C:\LogStats. Note that I specified a WorkingDirectory of C:\LogStats so that if the Scheduled Task runs in an alternate location (by default C:\Windows\System32), the script knows where to find TargetServers.txt and where to write LogStats.csv. Also note that the command does not load any Exchange snapin, as the script doesn't use any Exchange-specific commands.
Notes
The following information only applies to versions of this script older than 2.0: By default, the Windows Firewall on an Exchange 2013 server running on Windows Server 2012 does not allow remote Perfmon access. I suspect this is also the case with Exchange 2013 running on Windows Server 2008 R2, but I haven't tested it. If either of the errors below is logged, you may need to open the Windows Firewall on these servers to allow access from the computer running the script.
ERROR: Failed to read perfmon counter from server SERVERNAME
ERROR: Failed to get perfmon counters from server SERVERNAME
Update: After noticing that multiple people were having issues getting this to work through the Windows Firewall, I tried enabling different combinations of built-in firewall rules until I could figure out which ones were required. I only tested on an Exchange 2013 server running on Windows Server 2012, but this should apply to other Windows versions as well. The rules I had to enable were:
File and Printer Sharing (NB-Datagram-In)
File and Printer Sharing (NB-Name-In)
File and Printer Sharing (NB-Session-In)
Mike Hendrickson
Updates
11/5/2013: Added a section on firewall rules to try.
7/17/2014: Added a section on running as a scheduled task.
3/28/2016, Version 2.0:
Instead of running Get-Counter -ComputerName to remotely access Perfmon counters, the script now uses PowerShell Remoting, specifically Invoke-Command -ComputerName, so that all counter collection is done locally on each target server. This significantly speeds up the collection duration.
The script now supports using the -Verbose switch to provide information during script execution.
Per Thomas Stensitzki's script variation, added functionality so that DateTimes can be properly parsed on non-English (US) computers.
Added functionality to generate a database heat map based on log usage.
6/22/2016, Version 2.1:
When run in -Gather mode, the script now uses Test-WSMan against each target computer to verify Remote PowerShell connectivity prior to doing the log collection.
Added a new column to the log analysis files, PercentDailyUsageForCalc, which allows for direct copy/paste into the Exchange Server Role Requirements Calculator. Additionally, the script will try to ensure that all rows in the column add up to exactly 1 (requires samples from all 24 hours of the day).
Significantly increased performance of analysis operations.
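To complement the Running As a Scheduled Task section above, here is a hedged sketch (not from the original post) of registering an hourly task that runs the batch file at the top of each hour; the task name and batch file path are placeholders.
# Hedged sketch: create the hourly Scheduled Task from an elevated prompt on the collection machine.
schtasks.exe /Create /TN "GetTransactionLogStats-Gather" /TR "C:\LogStats\RunLogStats.bat" /SC HOURLY /ST 00:00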
Public folders in the new Office
Modern public folders – ready for Office 365
We made significant architectural changes with modern public folders to deliver against the feedback we got from customers:
"Having a separate management and replication model is complicated and expensive."
"We don't get public folders in Office 365 so we cannot migrate to the service."
"We want public folders, don't take them away!"
Built on mailbox infrastructure
Building on known mailbox infrastructure greatly simplifies management and backup/restore for public folders. IT administrators no longer need to learn two different management approaches for mailboxes and public folders. Furthermore, storage costs can be significantly reduced by leveraging mailbox infrastructure. Future improvements in mailbox storage management will automatically accrue to public folders as well.
No more public folder replication required
High availability is ensured through the existing high availability service for mailboxes instead of managing public folder replication separately. Public folder replication is a thing of the past.
Scaling to Internet scale
The new public folders scale to Internet scale. As public folders grow, content simply spans out to a new mailbox. In Office 365, a new mailbox will be added automatically; on-premises, IT will split content out to a new mailbox.
Office 365
In Office 365, management and storage are handled by Microsoft. You can keep supporting your public folder business scenarios but outsource management and storage to Microsoft by moving public folder deployments to the cloud. Terabytes of public folder data turn from a cost into a differentiating asset for the business.
Figure 1: Public folders in Office 365 Customer Preview
The architectural work and redesign we did for public folders in the new Office is really focused on the IT admin. Management and scalability are much improved over the previous architecture, and public folders are now ready for the service, while we kept the end user experience the same for this release.
Modern public folder architecture
Core concepts
Here are some of the core concepts of the modern public folder architecture. We will talk through their impact in the following paragraphs.
Hierarchy
Each public folder mailbox has a copy of the public folder hierarchy.
There is only one writeable copy of the hierarchy at any given time.
Clients connect to their home hierarchy.
Content
Public folder content is stored in a public folder mailbox.
It is not replicated across multiple public folder mailboxes (although passive copies of the mailbox are supported through high availability services).
All clients access the same public folder mailbox for a given set of content.
Let's discuss next what those core concepts mean for some of the key IT scenarios.
Scaling out public folders
As public folders grow, the content will outgrow the capacity of the existing mailbox(es). IT will have a script to split off a branch of the public folder tree and have its content moved to and stored in a new public folder mailbox (remember: that new mailbox will also host a full copy of the public folder hierarchy); see the hedged sketch at the end of this post. In Office 365, the content will be automatically branched out into a new mailbox as the threshold is reached.
Scale out flow
Existing public folder deployment with two mailboxes. Content for the folders 'Contoso' and 'Sales' is stored in mailbox 1; content for 'Marketing' and 'Finance' is stored in mailbox 2. The complete hierarchy is stored in each mailbox.
As content in mailbox 2 grows, the 'Finance' folder will be branched out to a new mailbox. New content for the 'Finance' folder or new folders under that branch will be stored in mailbox 3. It's completely transparent to end users whether content is stored in one mailbox or another.
Availability and redundancy
Modern public folders build on mailbox infrastructure and leverage the same mechanisms for availability and redundancy. Every public folder mailbox can have multiple redundant copies with automatic failover in the case of failures.
Failover flow
Public folder mailboxes have copies for high availability. If one of the mailboxes fails, a passive copy will automatically take over.
Client access and geo scale
Clients connect to the datacenter through their closest Client Access Server, which will proxy them to the public folder mailbox with their home hierarchy. Access to public folder mailboxes is routed within the high-bandwidth corporate network or Office 365. Client home hierarchies are equally distributed across all public folder mailboxes for load balancing. If needed, admins can override this per-user setting to optimize for proximity. This optimizes for proximity to the client as well as for partitioning the access load on any given copy of the hierarchy.
Client access flow across datacenters
A client in Europe connects to the datacenter closest to his location (usually that datacenter will be hosted in the same geography as the user, but let's say for a moment the user is travelling). The Client Access Server will reroute him to the public folder mailbox with his home hierarchy. The connection will be made within the datacenter network, avoiding slow client connections to remote datacenters over the Internet. When that client accesses the 'Finance' folder, he will be redirected to the mailbox hosting that content.
Writing to the public folder hierarchy
There is only one writeable copy of the hierarchy within a public folder deployment. Writes against the hierarchy (e.g., creating a new folder) will always be performed against that writeable copy. In the case that the mailbox hosting the writeable copy fails, the writeable hierarchy role will automatically fail over to a passive mailbox copy in the DAG.
What is the difference between site mailboxes, shared mailboxes and public folders?
Site mailboxes
For groups of people that are working together on a shared set of deliverables. They want to keep important emails and documents in one place. The content is scoped to a particular project that a small team is working on. As such, all content in that mailbox is highly relevant to the team members. Users will not see a site mailbox in their Outlook client unless they are an owner or member of that site mailbox.
Shared mailboxes
A group of people is working on behalf of a virtual entity (e.g., help@contoso.com). They are triaging incoming emails against a shared inbox and responding on behalf of the virtual entity. Integrated document collaboration is not a requirement for this scenario. Users will usually only do this for one shared mailbox, and the mailbox is added manually to the user's Outlook profile.
Public folders
Public folders hold the full body of shared email knowledge in an organization. Public folders are a great technology for distribution group (DG) archiving. A public folder can be mail-enabled and added to the DG. Emails that are sent to the DG will be automatically added to the public folder for later reference. With the new Office, public folders are now also available in Office 365.
Distribution groups
Distribution groups are not actually a shared store in Exchange. They are rather a way of sending emails to a defined set of people, such that emails are delivered to those users' inboxes for triage.
Summary
We redesigned modern public folders and moved them to a new architecture that scales well into the future. Public folders are now also available in Office 365. Public folders support the same high availability strategies as regular mailboxes. They are massively scalable by branching content out into a new mailbox. Building on the mailbox infrastructure reduces storage costs for customers.
Stay tuned
This was just a quick overview of some of the important changes in modern public folders. This blog will have more detailed technical posts on public folders coming soon. Some of the topics we will cover are:
Operating Modern Public Folders
Managing Modern Public Folders
Modern Public Folders for developers
Public Folders in Office 365
Alfons Staerk, Nikhil Aggarwal
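As a hedged illustration of the scale-out flow described earlier in this post (the mailbox and folder names are placeholders, and this is a sketch rather than the exact script referred to above): creating an additional public folder mailbox and moving a folder's contents into it might look like the following. Note that New-PublicFolderMoveRequest moves only the folders you explicitly list, so moving an entire branch typically means enumerating its subfolders as well (or using the Move-PublicFolderBranch.ps1 script that ships with Exchange).
# Hedged sketch: add a third public folder mailbox and move the Finance folder's contents into it.
New-Mailbox -PublicFolder -Name "PFMailbox3"
New-PublicFolderMoveRequest -Folders "\Finance" -TargetMailbox "PFMailbox3"
# Check progress of the move:
Get-PublicFolderMoveRequest | Get-PublicFolderMoveRequestStatistics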
Using an Azure VM as a DAG Witness Server
Update 8/13/15: Since publication of this blog post, there have been changes in our Azure support statement related to Exchange Server. Please see KB 2721672 for more information.
I'm happy to announce support for the use of an Azure virtual machine as an Exchange 2013 Database Availability Group witness server. Automatic datacenter failover in Exchange 2013 requires three physical sites, but many of our customers with stretched DAGs only have two physical sites deployed today. Enabling the use of Azure as a third physical site provides many of our customers with a cost-effective method for improving the overall availability and resiliency of their Exchange deployment. You can learn more about the deployment and configuration process, as well as our best practices, in the TechNet Library article.
It's important to remember that deployment of production Exchange servers is still unsupported on Azure virtual machines, so it's not yet possible to stretch a DAG into Azure. This announcement is limited to deployment of a file share witness in the Azure cloud. Also note that this is not related to the "Cloud Witness" feature in the Windows Server Technical Preview. Stay tuned for future announcements about additional support for Azure deployment scenarios.
Jeff Mealiffe
Principal PM Manager
Office 365 Customer Experience
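Once the Azure VM is provisioned and reachable from the DAG members (see the TechNet Library article referenced above for the full requirements), pointing the DAG at it uses the standard witness configuration cmdlet. A hedged sketch; the DAG name, witness server name, and witness directory below are placeholders.
# Hedged sketch: configure the Azure VM as the DAG's file share witness, then verify the configuration.
Set-DatabaseAvailabilityGroup -Identity "DAG1" -WitnessServer "AZ-FSW01.contoso.com" -WitnessDirectory "C:\DAG1\Witness"
Get-DatabaseAvailabilityGroup -Identity "DAG1" -Status | Format-List Name, WitnessServer, WitnessDirectory, WitnessShareInUse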