Ask the Perf Guy: Sizing Exchange 2013 Deployments
Since the release to manufacturing (RTM) of Exchange 2013, you have been waiting for our sizing and capacity planning guidance. This is the first official release of our guidance in this area, and updates to our TechNet content will follow in a future milestone. As we continue to learn more from our own internal deployments of Exchange 2013, as well as from customer feedback, you will see further updates to our sizing and capacity planning guidance in two forms: changes to the numbers mentioned in this document, as well as further guidance on specific areas not covered here. Let us know what you think we are missing and we will do our best to respond with better information over time. First, some context Historically, the Exchange Server product group has used various sources of data to produce sizing guidance. Typically, this data would come from scale tests run early in the product development cycle, and we would then fine-tune that guidance with observations from production deployments closer to final release. Production deployments have included Exchange Dogfood (our internal pre-release deployment that hosts the Exchange team and various other groups at Microsoft), Microsoft IT’s corporate Exchange deployment, and various early adopter programs. For Exchange 2013, our guidance is primarily based on observations from the Exchange Dogfood deployment. Dogfood hosts some of the most demanding Exchange users at Microsoft, with extreme messaging profiles and many client sessions per user across multiple client types. Many users in the Dogfood deployment send and receive more than 500 messages per day, and typically have multiple Outlook clients and multiple mobile devices simultaneously connected and active. This allows our guidance to be somewhat conservative, taking into account additional overhead from client types that we don’t regularly see in our internal deployments as well as client mixes that might be different from what's considered “normal” at Microsoft. 
Does this mean that you should take this conservative guidance and adjust the recommendations such that you deploy less hardware? Absolutely not. One of the many things we have learned from operating our own very high-scale service is that availability and reliability are very dependent on having capacity available to deal with those unexpected peaks. Sizing is both a science and an art form. Attempting to apply too much science to the process (trying to get too accurate) usually results in not having enough extra capacity available to deal with peaks, and in the end, results in a poor user experience and decreased system availability. On the other hand, there does need to be some science involved in the process, otherwise it’s very challenging to have a predictable and repeatable methodology for sizing deployments. We strive to achieve the right balance here.

Impact of the new architecture

From a sizing and performance perspective, there are a number of advantages with the new Exchange 2013 architecture. As many of you are aware, a couple of years ago we began recommending multi-role deployment for Exchange 2010 (combining the Mailbox, Hub Transport, and Client Access Server (CAS) roles on a single server) as a great way to take advantage of hardware resources on modern servers, as well as a way to simplify capacity planning and deployment. These same advantages apply to the Exchange 2013 Mailbox role as well. We like to think of the services running on the Mailbox role as providing a balanced utilization of resources rather than having a set of services on a role that are very disk intensive, and a set of services on another role that are very CPU intensive. Another example to consider for the Mailbox role is cache effectiveness. Software developers use in-memory caching to prevent having to use higher-latency methods to retrieve data (like LDAP queries, RPCs, or disk reads).
In the Exchange 2007/2010 architecture, processing for operations related to a particular user could occur on many servers throughout the topology. One CAS might be handling Outlook Web App for that user, while another (or more than one) CAS might be handling Exchange ActiveSync connections, and even more CAS might be processing Outlook Anywhere RPC proxy load for that same user. It’s even possible that the set of servers handling that load could be changing on a regular basis. Any data associated with that user stored in a cache would become useless (effectively a waste of memory) as soon as those connections moved to other servers. In the Exchange 2013 architecture, all workload processing for a given user occurs on the Mailbox server hosting the active copy of that user’s mailbox. Therefore, cache utilization is much more effective. The new CAS role has some nice benefits as well. Given that the role is totally stateless from a user perspective, it becomes very easy to scale up and down as demands change by simply adding or removing servers from the topology. Compared to the CAS role in prior releases, hardware utilization is dramatically reduced meaning that fewer CAS role machines will be required. Additionally, it may make sense for many customers to consider a multi-role deployment in which CAS and Mailbox are co-located – this allows further simplification of capacity planning and deployment, and also increases the number of available CAS which has a positive effect on service availability. Look for a follow up post on the benefits of a multi-role deployment soon. Start to finish, what’s the process? Sizing an Exchange deployment has six major phases, and I will go through each of them in this post in some detail. You begin the process by making sure you fully understand the available guidance on this topic. If you are reading this post, that’s a great start. There may have been updates posted either here on the Exchange team blog, or over on TechNet. 
Make sure you take a look before proceeding.

The second step is to gather any available data on the existing messaging deployment (if there is one) or estimate user profile requirements if this is a totally new solution.

The third step is perhaps the most difficult. At this point, you need to figure out all of the requirements for the Exchange solution that might impact the sizing process. This can include decisions like the desired mailbox size (mailbox quota), service level objectives, number of sites, number of mailbox database copies, storage architecture, growth plans, deployment of 3rd party products or line-of-business applications, etc. Essentially, you need to understand any aspect of the design that could impact the number of servers, user count, and utilization of servers.

Once you have collected all of the requirements, constraints, and user profile data, it’s time to calculate Exchange requirements. The easiest way to do this is with the calculator tool, but it can also be done manually as I will describe in this post. Clearly the calculator makes the process much easier, so if the calculator is available, use it!

Once the Exchange requirements have been calculated, it’s time to consider various options that are available. For example, there may be a choice between scaling up (deploying fewer larger servers) and scaling out (deploying a larger number of smaller servers), and the options could have various implications on high availability, as well as the total number of hardware or software failures that the solution can sustain while remaining available to users. Another typical decision is around storage architecture, and this often comes down to cost. There are a range of costs and benefits to different storage choices, and the Exchange requirements can often be met by more than one of these options.

The last step is to finalize the design.
At this point, it’s time to document all of the decisions that were made, order some hardware, use Jetstress to validate that the storage requirements can be met, and perform any other necessary pre-production lab testing to ensure that the production rollout and implementation will go smoothly.

Gather requirements and user data

The primary input to all of the calculations that you will perform later is the average user profile of the deployment, where the user profile is defined as the sum of total messages sent and total messages received per-user, per-workday (on average). Many organizations have quite a bit of variability in user profiles. For example, a segment of users might be considered “Information Workers” and spend a good part of their day in their mailbox sending and reading mail, while another segment of users might be more focused on other tasks and use email infrequently. Sizing for these segments of users can be accomplished by either looking at the entire system using weighted averages, or by breaking up the sizing process to align with the various segments of users. In general it’s certainly easier to size the whole system as a unit, but there may be specific requirements (like the use of certain 3rd party tools or devices) which will significantly impact the sizing calculation for one or more of the user segments, and it can be very difficult to apply sizing factors to a user segment while attempting to size the entire solution as a unit.

The obvious question in your mind is how to go get this user profile information. If you are starting with an existing Exchange deployment, there are a number of options that can be used, assuming that you aren’t the elusive Exchange admin who actually tracks statistics like this on an ongoing basis.
If you are using Exchange 2007 or earlier, you can utilize the Exchange Profile Analyzer (EPA) tool, which will provide overall user profile statistics for your Exchange organization as well as detailed per-user statistics if required. If you are on Exchange 2010, the EPA tool is not an option for you. One potential option is to evaluate message traffic using performance counters to come up with user profile averages on a per-server basis. This can be done by monitoring the MSExchangeIS\Messages Submitted/sec and MSExchangeIS\Messages Delivered/sec counters during peak average periods and extrapolating the recorded data to represent daily per-user averages. I will cover this methodology in a future blog post, as it will take a fair amount of explanation. Another option is to use message tracking logs to generate these statistics. This could be done via some crafty custom PowerShell scripting, or you could look for scripts that attempt to do this work for you already. One of our own consultants points to an example on his blog. Typical user profiles range from 50-500 messages per-user/per-day, and we provide guidance for those profiles. When in doubt, round up. The other important piece of profile information for sizing is the average message size seen in the deployment. This can be obtained from EPA, or from the other mentioned methods (via transport performance counters, or via message tracking logs). Within Microsoft, we typically see average message sizes of around 75KB, but we certainly have worked with customers that have much higher average message sizes. This can vary greatly by industry, and by region. Start with the Mailbox servers Just as we recommended for Exchange 2010, the right way to start with sizing calculations for Exchange 2013 is with the Mailbox role. In fact, those of you who have sized deployments for Exchange 2010 will find many similarities with the methodology discussed here. 
Example scenario

Throughout this article, we will be referring to an example deployment. The deployment is for a relatively large organization with the following attributes:

- 100,000 mailboxes
- 200 message/day profile, with 75KB average message size
- 10GB mailbox quota
- Single site
- 4 mailbox database copies, no lagged copies
- 2U commodity server hardware platform with internal drive bays and an external storage chassis will be used (total of 24 available large form-factor drive bays)
- 7200 RPM 4TB midline SAS disks are used
- Mailbox databases are stored on JBOD direct attached storage, utilizing no RAID
- Solution must survive double failure events

High availability model

The first thing you need to determine is your high availability model, e.g., how you will meet the availability requirements that you determined earlier. This likely includes multiple database copies in one or more Database Availability Groups, which will have an impact on storage capacity and IOPS requirements. The TechNet documentation on this topic provides some background on the capabilities of Exchange 2013 and should be reviewed as part of the sizing process. At a minimum, you need to be able to answer the following questions:

- Will you deploy multiple database copies?
- How many database copies will you deploy?
- Will you have an architecture that provides site resilience?
- What kind of resiliency model will you deploy?
- How will you distribute database copies?
- What storage architecture will you use?

Capacity requirements

Once you have an understanding of how you will meet your high availability requirements, you should know the number of database copies and sites that will be deployed. Given this, you can begin to evaluate capacity requirements. At a basic level, you can think of capacity requirements as consisting of storage for mailbox data (primarily based on mailbox storage quotas), storage for database log files, storage for content indexing files, and overhead for growth.
Every copy of a mailbox database is a multiplier on top of these basic storage requirements. As a simplistic example, if I was planning for 500 mailboxes of 1GB each, the storage for mailbox data would be 500GB, and then I would need to apply various factors to that value to determine the per-copy storage requirement. From there, if I needed 3 copies of the data for high availability, I would then need to multiply by 3 to obtain the overall capacity requirement for the solution (all servers). In reality, the storage requirements for Exchange are far more complex, as you will see below.

Mailbox size

To determine the actual size of a mailbox on disk, we must consider 3 factors: the mailbox storage quota, database white space, and recoverable items. The mailbox storage quota is what most people think of as the “size of the mailbox” – it’s the user perceived size of their mailbox and represents the maximum amount of data that the user can store in their mailbox on the server. While this certainly represents the majority of space utilization for Exchange databases, it’s not the only element by which we have to size. Database whitespace is the amount of space in the mailbox database file that has been allocated on disk but doesn’t contain any in-use database pages. Think of it as available space to grow into. As content is deleted out of mailbox databases and eventually removed from the mailbox recoverable items, the database pages that contained that content become whitespace. We recommend planning for whitespace size equal to 1 day worth of messaging content.
Estimated Database Whitespace per Mailbox = per-user daily message profile x average message size This means that a user with the 200 message/day profile and an average message size of 75KB would be expected to consume the following whitespace: 200 messages/day x 75KB = 14.65MB When items are deleted from a mailbox, they are really “soft-deleted” and moved temporarily to the recoverable items folder for the duration of the deleted item retention period. Like Exchange 2010, Exchange 2013 has a feature known as single item recovery which will prevent purging data from the recoverable items folder prior to reaching the deleted item retention window. When this is enabled, we expect to see a 1.2 percent increase in mailbox size for a 14 day deleted item retention window. Additionally, we expect to see a 3 percent increase in the size of the mailbox for calendar item version logging which is enabled by default. Given that a mailbox will eventually reach a steady state where the amount of new content will be approximately equal to the amount of deleted content in order to remain under quota, we would expect the size of the items in the recoverable items folder to eventually equal the size of new content sent & received during the retention window. 
This means that the overall size of the recoverable items folder can be calculated as follows:

Recoverable Items Folder Size = (per-user daily message profile x average message size x deleted item retention window) + (mailbox quota size x 0.012) + (mailbox quota size x 0.03)

If we carry our example forward with the 200 message/day profile, a 75KB average message size, a deleted item retention window of 14 days, and a mailbox quota of 10GB, the expected recoverable items folder size would be:

(200 messages/day x 75KB x 14 days) + (10GB x 0.012) + (10GB x 0.03) = 210,000KB + 125,829.12KB + 314,572.8KB = 635.16MB

Given the results from these calculations, we can sum up the mailbox capacity factors to get our estimated mailbox size on disk:

Mailbox Size on disk = 10GB mailbox quota + 14.65MB database whitespace + 635.16MB Recoverable Items Folder = 10.63GB

Content indexing

The space required for files related to the content indexing process can be estimated as 20% of the database size.

Per-Database Content Indexing Space = database size x 0.20

In addition, you must size for one additional content index (e.g. an additional 20% of one of the mailbox databases on the volume) in order to allow content indexing maintenance tasks (specifically the master merge process) to complete.
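The mailbox size arithmetic earlier in this section (whitespace, recoverable items, and quota) can be checked with a short script. This is only a sketch: the function and parameter names are made up for illustration, and the 1.2% single item recovery factor is the value quoted for a 14 day retention window, so treat it as an assumption for other windows.

```python
# Sketch of the mailbox-size-on-disk arithmetic; all names are illustrative.
KB_PER_MB = 1024
MB_PER_GB = 1024


def whitespace_mb(messages_per_day, avg_message_kb):
    """Database whitespace: one day of messaging content, in MB."""
    return messages_per_day * avg_message_kb / KB_PER_MB


def recoverable_items_mb(messages_per_day, avg_message_kb, retention_days,
                         quota_gb, sir_overhead=0.012, calendar_overhead=0.03):
    """Recoverable items folder size in MB.

    sir_overhead: single item recovery growth (1.2% quoted for 14-day retention).
    calendar_overhead: calendar item version logging growth (3%).
    """
    churn_mb = messages_per_day * avg_message_kb * retention_days / KB_PER_MB
    quota_mb = quota_gb * MB_PER_GB
    return churn_mb + quota_mb * sir_overhead + quota_mb * calendar_overhead


def mailbox_size_on_disk_gb(messages_per_day, avg_message_kb,
                            retention_days, quota_gb):
    """Quota + whitespace + recoverable items, in GB."""
    total_mb = (quota_gb * MB_PER_GB
                + whitespace_mb(messages_per_day, avg_message_kb)
                + recoverable_items_mb(messages_per_day, avg_message_kb,
                                       retention_days, quota_gb))
    return total_mb / MB_PER_GB


# Example scenario: 200 messages/day, 75KB messages, 14 days, 10GB quota.
example_gb = mailbox_size_on_disk_gb(200, 75, 14, 10)
print(f"{example_gb:.2f} GB")
```

For the example scenario this reproduces the 10.63GB figure used in the rest of the article.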
The best way to express the need for the master merge space requirement would be to look at the average database file size across all databases on a volume and add 1 database worth of disk consumption to the calculation when determining the per-volume content indexing space requirement:

Per-Volume Content Indexing Space = (average database size x (databases on the volume + 1) x 0.20)

As a simple example, if we had 2 mailbox databases on a single volume and each database consumed 100GB of space, we would compute the per-volume content indexing space requirement like this:

100GB database size x (2 databases + 1) x 0.20 = 60GB

Log space

The amount of space required for ESE transaction log files can be computed using the same method as Exchange 2010. You can find details on the process in the Exchange 2010 TechNet guidance. To summarize the process, you must first determine the base guideline for number of transaction logs generated per-user, per-day, using the following table. As in Exchange 2010, log files are 1MB in size, making the math for log capacity quite straightforward.

| Message profile (75 KB average message size) | Number of transaction logs generated per day |
|---|---|
| 50 | 10 |
| 100 | 20 |
| 150 | 30 |
| 200 | 40 |
| 250 | 50 |
| 300 | 60 |
| 350 | 70 |
| 400 | 80 |
| 450 | 90 |
| 500 | 100 |

Once you have the appropriate value from the table which represents guidance for a 75KB average message size, you may need to adjust the value based on differences in the target average message size. Every time you double the average message size, you must increase the logs generated per day by an additional factor of 1.9. For example:

Transaction logs at 200 messages/day with 150KB average message size = 40 logs/day (at 75KB average message size) x 1.9 = 76

Transaction logs at 200 messages/day with 300KB average message size = 40 logs/day (at 75KB average message size) x (1.9 x 2) = 152

While daily log volume is interesting, it doesn’t represent the entire requirement for log capacity.
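As a rough illustration of the table lookup and message size adjustment, here is a small sketch. The names are hypothetical, and it only handles average message sizes that are exact doublings of 75KB (150KB, 300KB, and so on), because the guidance above gives examples only for those cases; how to treat sizes in between is left open here.

```python
# Sketch of the logs-per-day guidance; names are illustrative only.
BASE_LOGS_PER_DAY = {  # message profile -> logs/day at 75KB average message size
    50: 10, 100: 20, 150: 30, 200: 40, 250: 50,
    300: 60, 350: 70, 400: 80, 450: 90, 500: 100,
}
BASELINE_MESSAGE_KB = 75


def logs_per_day(profile, avg_message_kb=BASELINE_MESSAGE_KB):
    """Transaction logs generated per user per day (1MB each)."""
    base = BASE_LOGS_PER_DAY[profile]

    # Count how many times the baseline size was doubled.
    ratio = avg_message_kb / BASELINE_MESSAGE_KB
    doublings = 0
    while ratio > 1:
        ratio /= 2
        doublings += 1
    if ratio != 1:
        raise ValueError("only exact doublings of 75KB are covered by this sketch")

    if doublings == 0:
        return base
    # 1.9 for the first doubling, then x2 per further doubling,
    # matching the 76 and 152 examples in the text.
    return base * 1.9 * 2 ** (doublings - 1)


for size_kb in (75, 150, 300):
    print(size_kb, "KB:", round(logs_per_day(200, size_kb)), "logs/day")
```

Raising an error for unsupported sizes is a deliberate choice in this sketch: it avoids silently interpolating a curve that the guidance does not define.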
If traditional backups are being used, logs will remain on disk for the interval between full backups. When mailboxes are moved, that volume of change to the target database will result in a significant increase in the amount of logs generated during the day. In a solution where Exchange native data protection is in use (e.g., you aren’t using traditional backups), logs will not be truncated if a mailbox database copy is failed or if an entire server is unreachable unless an administrator intervenes. There are many factors to consider when sizing for required log capacity, and it is certainly worth spending some time in the Exchange 2010 TechNet guidance mentioned earlier to fully understand these factors before proceeding.

Thinking about our example scenario, we could consider log space required per database if we estimate the number of users per database at 65. We will also assume that 1% of our users are moved per week in a single day, and that we will allocate enough space to support 3 days of logs in the case of failed copies or servers.

Log Capacity to Support 3 Days of Truncation Failure = (65 mailboxes/database x 40 logs/day x 1MB log size) x 3 days = 7.62GB

Log Capacity to Support 1% mailbox moves per week = 65 mailboxes/database x 0.01 x 10.63GB mailbox size = 6.91GB

Total Log Capacity Required per Database = 7.62GB + 6.91GB = 14.53GB

Putting all of the capacity requirements together

The easiest way to think about sizing for storage capacity without having a calculator tool available is to make some assumptions up front about the servers and storage that will be used. Within the product group, we are big fans of 2U commodity server platforms with ~12 large form-factor drive bays in the chassis. This allows for a 2 drive RAID array for the operating system, Exchange install path, transport queue database, and other ancillary files, and ~10 remaining drives to use as mailbox database storage in a JBOD direct attached storage configuration with no RAID.
Fill this server up with 4TB SATA or midline SAS drives, and you have a fantastic Exchange 2013 server. If you need even more storage, it’s quite easy to add an additional shelf of drives to the solution. Using the large deployment example and thinking about how we might size this on the commodity server platform, we can consider a server scaling unit that has a total of 24 large form-factor drive bays containing 4TB midline SAS drives. We will use 2 of those drives for the OS & Exchange, and the remaining drive bays will be used for Exchange mailbox database capacity. Let’s use 12 of those drive bays for databases – that leaves 10 remaining drive bays that could contain spares or remain empty. For this sizing exercise, let’s also plan for 4 databases per drive. Each of those drives has a formatted capacity of ~3725GB. The first step in figuring out the number of mailboxes per database is to look at overall capacity requirements for the mailboxes, content indexes, and required free space (which we will set to 5%). To calculate the maximum amount of space available for mailboxes, let’s apply a formula (note that this doesn’t consider space for logs – we will make sure that the volume will have enough space for logs later in the process). First, we can remove our required free space from the available storage on the drive: Available Space (excluding required free space) = Formatted capacity of the drive x (1 – free space) Then we can remove the space required for content indexing. As discussed above, the space required for content indexing will be 20% of the database size, with an additional 20% of one database for content indexing maintenance tasks. Given the additional 20% requirement, we can’t model the overall space requirement as a simple 20% of the remaining space on the volume. Instead we need to compute a new percentage that takes the number of databases per-volume into consideration. 
Content Indexing Percentage = 0.20 x (databases per-volume + 1) / databases per-volume

Now we can remove the space for content indexing from our available space on the volume:

Available Space for Databases = Available Space (excluding required free space) / (1 + Content Indexing Percentage)

And we can then divide by the number of databases per-volume to get our maximum database size:

Maximum Database Size = Available Space for Databases / databases per-volume

In our example scenario, we would obtain the following result:

Maximum Database Size = (3725GB x 0.95) / (1 + (0.20 x 5 / 4)) / 4 databases = 707.75GB

Given this value, we can then calculate our maximum users per database (from a capacity perspective, as this may change when we evaluate the IO requirements):

Maximum Users per Database = 707.75GB / 10.63GB mailbox size = 66 users (rounded down)

Let’s see if that number is actually reasonable given our 4 copy configuration. We are going to use 16-node DAGs for this deployment to take full advantage of the scalability and high-availability benefits of large DAGs. While we have many drives available on our selected hardware platform, we will be limited by the maximum of 50 database copies per-server in Exchange 2013. Considering this maximum and our desire to have 4 databases per volume, we can calculate the maximum number of drives for mailbox database usage as:

Maximum Drives for Mailbox Databases = 50 database copies per-server / 4 databases per-volume = 12.5, rounded down to 12 drives

With 12 database volumes and 4 database copies per-volume, we will have 48 total database copies per server. With 66 users per database and 100,000 total users, we end up with the following required DAG count for the user population:

Required DAG Count = (100,000 users / 66 users per database) x 4 copies / 48 database copies per-server / 16 servers per-DAG = 7.89 DAGs

In this very large deployment, we are using a DAG as a unit of scale or “building block” (i.e. we perform capacity planning based on the number of DAGs required to meet demand, and we deploy an entire DAG when we need additional capacity), so we don’t intend to deploy a partial DAG.
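The chain of capacity calculations in this section can be followed end to end in a few lines of code. The sketch below is one reading of the steps described in the text, not the calculator tool's logic: the function names are invented, and the way content indexing overhead is removed (dividing by 1 plus the per-volume content indexing percentage) is an assumption chosen because it reproduces the 66 users per database quoted above.

```python
import math


def max_database_size_gb(formatted_gb, free_fraction, dbs_per_volume,
                         ci_fraction=0.20):
    """Largest database that fits once free space and content indexes are reserved."""
    available_gb = formatted_gb * (1 - free_fraction)
    # 20% per database plus 20% of one extra database for master merge.
    ci_percentage = ci_fraction * (dbs_per_volume + 1) / dbs_per_volume
    return available_gb / (1 + ci_percentage) / dbs_per_volume


def plan_dags(total_users, users_per_db, copies_per_db, dbs_per_volume,
              max_copies_per_server=50, servers_per_dag=16):
    """Whole-DAG building-block plan for a given users-per-database target."""
    volumes = max_copies_per_server // dbs_per_volume
    copies_per_server = volumes * dbs_per_volume
    databases_needed = math.ceil(total_users / users_per_db)
    servers_needed = databases_needed * copies_per_db / copies_per_server
    dags = math.ceil(servers_needed / servers_per_dag)
    databases_deployed = dags * servers_per_dag * copies_per_server // copies_per_db
    return {
        "volumes": volumes,
        "copies_per_server": copies_per_server,
        "dags": dags,
        "databases": databases_deployed,
        "users_per_db": total_users / databases_deployed,
    }


max_db_gb = max_database_size_gb(3725, 0.05, 4)
users_per_db = int(max_db_gb // 10.63)  # 10.63GB mailbox size on disk
plan = plan_dags(100_000, users_per_db, copies_per_db=4, dbs_per_volume=4)
print(round(max_db_gb, 2), users_per_db, plan)
```

Because whole DAGs are deployed, the final users-per-database figure comes out slightly lower than the capacity maximum, which is the effect the text describes when rounding up the DAG count.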
If we round up to 8 DAGs we can compute our final users per database count:

Users per Database = 100,000 users / (8 DAGs x 16 servers x 48 database copies / 4 copies per database) = 100,000 users / 1,536 databases = ~65 users

With 65 users per-database, that means we will expect to consume the following space for mailbox databases:

Estimated Database Size = 65 users x 10.63GB = 690.95GB

Database Consumption / Volume = 690.95GB x 4 databases = 2763.8GB

Using the formula mentioned earlier, we can compute our estimated content index consumption as well:

690.95GB database size x (4 databases + 1) x 0.20 = 690.95GB

You’ll recall that we computed transaction log space requirements earlier, and it turns out that we magically computed those values with the assumption that we would have 65 users per-database. What a pleasant coincidence! So we will need 14.53GB of space for transaction logs per-database, or to get a more useful result:

Log Space Required / Volume = 14.53GB x 4 databases = 58.12GB

To sum it up, we can estimate our total per-volume space utilization and make sure that we have plenty of room on our target 4TB drives:

Total Space Required / Volume = 2763.8GB (databases) + 690.95GB (content indexes) + 58.12GB (logs) = 3512.87GB of ~3725GB formatted capacity

Looks like our database volumes are sized perfectly!

IOPS requirements

To determine the IOPS requirements for a database, we look at the number of users hosted on the database and consider the guidance provided in the following table to compute total required IOPS when the database is active or passive.

| Messages sent or received per mailbox per day | Estimated IOPS per mailbox (Active or Passive) |
|---|---|
| 50 | 0.034 |
| 100 | 0.067 |
| 150 | 0.101 |
| 200 | 0.134 |
| 250 | 0.168 |
| 300 | 0.201 |
| 350 | 0.235 |
| 400 | 0.268 |
| 450 | 0.302 |
| 500 | 0.335 |

For example, with 50 users in a database, with an average message profile of 200, we would expect that database to require 50 x 0.134 = 6.7 transactional IOPS when the database is active, and 50 x 0.134 = 6.7 transactional IOPS when the database is passive. Don’t forget to consider database placement which will impact the number of databases with IOPS requirements on a given storage volume (which could be a single JBOD drive or might be a more complex storage configuration).
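The IOPS table lends itself to a simple lookup. The following is a minimal sketch with made-up function names; it assumes every database on a volume hosts the same number of mailboxes with the same profile, which real deployments may not.

```python
# Sketch of the IOPS guidance; names are illustrative only.
IOPS_PER_MAILBOX = {  # messages per mailbox per day -> IOPS (active or passive)
    50: 0.034, 100: 0.067, 150: 0.101, 200: 0.134, 250: 0.168,
    300: 0.201, 350: 0.235, 400: 0.268, 450: 0.302, 500: 0.335,
}


def database_iops(mailboxes, profile):
    """Transactional IOPS for one database copy (active or passive)."""
    return mailboxes * IOPS_PER_MAILBOX[profile]


def volume_iops(mailboxes_per_db, dbs_on_volume, profile):
    """Transactional IOPS for a volume hosting several identical databases."""
    return dbs_on_volume * database_iops(mailboxes_per_db, profile)


# 50 users on the 200 messages/day profile, as in the example above.
print(f"{database_iops(50, 200):.1f} IOPS per database")
```

Since the table gives the same per-mailbox value for active and passive copies, the volume total does not depend on which copies happen to be active; only the number of databases placed on the volume matters.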
Going back to our example scenario, we can evaluate the IOPS requirement of the solution, recalling that the average user profile in that deployment is the 200 message/day profile. We have 65 users per database and 4 databases per JBOD drive, so we can estimate our IOPS requirement in worst-case (all databases active) as: 65 mailboxes x 4 databases per-drive x 0.134 IOPS/mailbox at 200 messages/day profile = ~34.84 IOPS per drive Midline SAS drives typically provide ~57.5 random IOPS (based on our own internal observations and benchmark tests), so we are well within design constraints when thinking about IOPS requirements. Storage bandwidth requirements While IOPS requirements are usually the primary storage throughput concern when designing an Exchange solution, it is possible to run up against bandwidth limitations with various types of storage subsystems. The IOPS sizing guidance above is looking specifically at transactional (somewhat random) IOPS and is ignoring the sequential IO portion of the workload. One place that sequential IO becomes a concern is with storage solutions that are running a large amount of sequential IO through a common channel. A common example of this type of load is the ongoing background database maintenance (BDM) which runs continuously on Exchange mailbox databases. While this BDM workload might not be significant for a few databases stored on a JBOD drive, it may become a concern if all of the mailbox database volumes are presented through a common iSCSI or Fibre Channel interface. In that case, the bandwidth of that common channel must be considered to ensure that the solution doesn’t bottleneck due to these IO patterns. In Exchange 2013, we expect to consume approximately 1MB/sec/database copy for BDM which is a significant reduction from Exchange 2010. 
This helps to enable the ability to store multiple mailbox databases on the same JBOD drive spindle, and will also help to avoid bottlenecks on networked storage deployments such as iSCSI. This bandwidth utilization is in addition to bandwidth consumed by the transactional IO activity associated with user and system workload processes, as well as storage bandwidth consumed by the log replication and replay process in a DAG.

Transport storage requirements

Since transport components (with the exception of the front-end transport component on the CAS role) are now part of the Mailbox role, we have included CPU and memory requirements for transport with the general Mailbox role requirements described later. Transport also has storage requirements associated with the queue database. These requirements, much like I described earlier for mailbox storage, consist of capacity factors and IO throughput factors.

Transport storage capacity is driven by two needs: queuing (including shadow queuing) and Safety Net (which is the replacement for transport dumpster in this release). You can think of the transport storage capacity requirement as the sum of message content on disk in a worst-case scenario, consisting of three elements:

- The current day’s message traffic, along with messages which exist on disk longer than normal expiration settings (like poison queue messages)
- Queued messages waiting for delivery
- Messages persisted in Safety Net in case they are required for redelivery

Of course, all three of these factors are also impacted by shadow queuing in which a redundant copy of all messages is stored on another server. At this point, it would be a good idea to review the TechNet documentation on Transport High Availability if you aren’t familiar with the mechanics of shadow queuing and Safety Net. In order to figure out the messages per day that you expect to run through the system, you can look at the user count and messaging profile.
Simply multiplying these together will give you a total daily mail volume, but it will be a bit higher than necessary since it is double counting messages that are sent within the organization (i.e. a message sent to a coworker will count towards the profile of the sending user as well as the profile of the receiving user, but it’s really just one message traversing the system). The simplest way to deal with that would be to ignore this fact and oversize transport, which will provide additional capacity for unexpected peaks in message traffic. An alternative way to determine daily message flow would be to evaluate performance counters within your existing messaging system.

To determine the maximum size of the transport database, we can look at the entire system as a unit and then come up with a per-server value.

Overall Daily Message Traffic = number of users x message profile

Overall Transport DB Size = average message size x overall daily message traffic x (1 + (percentage of messages queued x maximum queue days) + Safety Net hold days) x 2 copies for high availability

Let’s use the 100,000 user sizing example again and size the transport database using the simple method.

Overall Transport DB Size = 75KB x (100,000 users x 200 messages/day) x (1 + (50% x 2 maximum queue days) + 2 Safety Net hold days) x 2 copies = 11,444GB

In our example scenario, we have 8 DAGs, each containing 16 nodes, and we are designing to handle double node failures in each DAG. This means that in a worst-case failure event we would have 112 servers online with 2 failed servers in each DAG. We can use this value to determine a per-server transport DB size:

Per-Server Transport DB Size = 11,444GB / 112 servers = ~102GB

Sizing for transport IO throughput requirements is actually quite simple. Transport has taken advantage of many of the IO reduction changes to the ESE database that have been made in recent Exchange releases. As a result, the number of IOPS required to support transport is significantly lower.
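The transport database sizing formula given earlier in this section can be expressed as a short function. This is a sketch of the "simple method" only (it deliberately keeps the double counting of internal mail), and the function and parameter names are hypothetical.

```python
# Sketch of the simple transport database sizing method; names are illustrative.
KB_PER_GB = 1024 * 1024


def transport_db_size_gb(users, profile, avg_message_kb, queued_fraction,
                         max_queue_days, safety_net_days, copies=2):
    """Overall transport database size across the whole system, in GB."""
    daily_messages = users * profile
    days_on_disk = 1 + queued_fraction * max_queue_days + safety_net_days
    return avg_message_kb * daily_messages * days_on_disk * copies / KB_PER_GB


def servers_online(dags, nodes_per_dag, failed_per_dag):
    """Servers remaining in a worst-case failure event."""
    return dags * (nodes_per_dag - failed_per_dag)


# Example scenario: 100,000 users, 200 messages/day, 75KB messages,
# 50% of messages queued for up to 2 days, 2 Safety Net hold days.
total_gb = transport_db_size_gb(100_000, 200, 75, 0.5, 2, 2)
online = servers_online(dags=8, nodes_per_dag=16, failed_per_dag=2)
per_server_gb = total_gb / online
print(round(total_gb), online, round(per_server_gb, 1))
```

Dividing by the surviving server count rather than the full server count sizes each server for the worst case, which is consistent with the double-failure requirement in the example scenario.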
In the internal deployment we used to produce this sizing guidance, we see approximately 1 DB write IO per message and virtually no DB read IO, with an average message size of ~75KB. We expect that as average message size increases, the amount of transport IO required to support delivery and queuing would increase. We do not currently have specific guidance on what that curve looks like, but it is an area of active investigation. In the meantime, our best practices guidance for the transport database is to leave it in the Exchange install path (likely on the OS drive) and ensure that the drive supporting that directory path is using a protected write cache disk controller, set to 100% write cache if the controller allows optimization of read/write cache settings. The write cache allows transport database log IO to become effectively “free” and allows transport to handle a much higher level of throughput. Processor requirements Once we have our storage requirements figured out, we can move on to thinking about CPU. CPU sizing for the Mailbox role is done in terms of megacycles. A megacycle is a unit of processing work equal to one million CPU cycles. In very simplistic terms, you could think of a 1 MHz CPU performing a megacycle of work every second. Given the guidance provided below for megacycles required for active and passive users at peak, you can estimate the required processor configuration to meet the demands of an Exchange workload. Following are our recommendations on the estimated required megacycles for the various user profiles. 
| Messages sent or received per mailbox per day | Mcycles per User, Active DB Copy or Standalone (MBX only) | Mcycles per User, Active DB Copy or Standalone (Multi-Role) | Mcycles per User, Passive DB Copy |
|---|---|---|---|
| 50 | 2.13 | 2.93 | 0.69 |
| 100 | 4.25 | 5.84 | 1.37 |
| 150 | 6.38 | 8.77 | 2.06 |
| 200 | 8.50 | 11.69 | 2.74 |
| 250 | 10.63 | 14.62 | 3.43 |
| 300 | 12.75 | 17.53 | 4.11 |
| 350 | 14.88 | 20.46 | 4.80 |
| 400 | 17.00 | 23.38 | 5.48 |
| 450 | 19.13 | 26.30 | 6.17 |
| 500 | 21.25 | 29.22 | 6.85 |

The second column represents the estimated megacycles required on the Mailbox role server hosting the active copy of a user’s mailbox database. In a DAG configuration, the required megacycles for the user on each server hosting passive copies of that database can be found in the fourth column. If the solution is going to include multi-role (Mailbox+CAS) servers, use the value in the third column rather than the second, as it includes the additional CPU requirements for the CAS role.

It is important to note that while many years ago you could make an assumption that a 500 MHz processor could perform roughly double the work per unit of time as a 250 MHz processor, clock speeds are no longer a reliable indicator of performance. The internal architecture of modern processors is different enough between manufacturers as well as within product lines of a single manufacturer that it requires an additional normalization step to determine the available processing power for a particular CPU. We recommend using the SPECint_rate2006 benchmark from the Standard Performance Evaluation Corporation.

The baseline system used to generate this guidance was a Hewlett-Packard DL380p Gen8 server containing Intel Xeon E5-2650 2 GHz processors. The baseline system SPECint_rate2006 score is 540, or 33.75 per-core, given that the benchmarked server was configured with a total of 16 physical processor cores.
Please note that this is a different baseline system than what was used to generate our Exchange 2010 guidance, so any tools or calculators that make assumptions based on the 2010 baseline system would not provide accurate results for sizing an Exchange 2013 solution.

Using the same general methodology we have recommended in prior releases, you can determine the estimated Exchange workload megacycles available on a different processor through the following process:

1. Find the SPECint_rate2006 score for the processor that you intend to use for your Exchange solution. You can do this the hard way (described below) or use Scott Alexander’s fantastic Processor Query Tool to get the per-server score and processor core count for your hardware platform.
   a. On the website of the Standard Performance Evaluation Corporation, select Results, highlight CPU2006, and select Search all SPECint_rate2006 results.
   b. Under Simple Request, enter the search criteria for your target processor, for example Processor Matches E5-2630.
   c. Find the server and processor configuration you are interested in using (or if the exact combination is not available, find something as close as possible) and note the value in the Result column and the value in the # Cores column.
2. Obtain the per-core SPECint_rate2006 score by dividing the value in the Result column by the value in the # Cores column. For example, in the case of the Hewlett-Packard DL380p Gen8 server with Intel Xeon E5-2630 processors (2.30GHz), the Result is 430 and the # Cores is 12, so the per-core value would be 430 / 12 = 35.83.
3. To determine the estimated available Exchange workload megacycles on the target platform, use the following formula:

   Available megacycles per core = (target platform per-core score x 2,000 MHz per core on the baseline platform) / 33.75 (baseline per-core score)

   Using the example HP platform with E5-2630 processors mentioned previously, we would calculate the following result:

   ((35.83 x 2,000) / 33.75) x 12 processor cores = 25,479 available megacycles per-server

Keep in mind that a good Exchange design should never plan to run servers at 100% of CPU capacity.
In general, 80% CPU utilization in a failure scenario is a reasonable target for most customers. Given the caveat that the high CPU utilization occurs during a failure scenario, this means that servers in a highly available Exchange solution will often run with relatively low CPU utilization during normal operation. Additionally, there may be very good reasons to target a lower CPU utilization as maximum, particularly in cases where unanticipated spikes in load may result in acute capacity issues.

Going back to the example I used previously of 100,000 users with the 200 message/day profile, we can estimate the total required megacycles for the deployment. We know that there will be 4 database copies in the deployment, and that will help to calculate the passive megacycles required. We also know that this deployment will be using multi-role (Mailbox+CAS) servers. Given this information, we can calculate megacycle requirements as follows:

100,000 users x ((11.69 mcycles per active mailbox) + (3 passive copies x 2.74 mcycles per passive mailbox)) = 1,991,000 total mcycles required

You could then take that number and attempt to come up with a required server count. I would argue that it’s actually a much better practice to come up with a server count based on high availability requirements (taking into account how many component failures your design can handle in order to meet business requirements) and then ensure that those servers can meet CPU requirements in a worst-case failure scenario. You will either meet CPU requirements without any additional changes (if your server count is bound on another aspect of the sizing process), or you will adjust the server count (scale out), or you will adjust the server specification (scale up).
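As a rough sketch of the arithmetic above, the megacycle normalization and the total megacycle estimate can be written in a few lines of Python. The function names are illustrative and are not part of any Microsoft tool. The per-core SPECint_rate2006 score is rounded to two decimals only to reproduce the 25,479 figure quoted in the text, and the naive server count is the simple division mentioned above, not the high-availability-driven count that the guidance actually recommends.

```python
import math

# Baseline platform from the guidance: HP DL380p Gen8, Xeon E5-2650 2 GHz,
# SPECint_rate2006 score of 540 across 16 physical cores.
BASELINE_MHZ_PER_CORE = 2000
BASELINE_SCORE_PER_CORE = 540 / 16  # 33.75


def available_megacycles(spec_result, cores):
    """Estimate available Exchange workload megacycles for a target server."""
    # Rounded to two decimals to match the worked example (430 / 12 = 35.83).
    per_core_score = round(spec_result / cores, 2)
    per_core_mcycles = per_core_score * BASELINE_MHZ_PER_CORE / BASELINE_SCORE_PER_CORE
    return per_core_mcycles * cores


def total_required_megacycles(users, active_mcycles, passive_mcycles, passive_copies):
    """Total megacycles for all users: one active copy plus N passive copies each."""
    return users * (active_mcycles + passive_copies * passive_mcycles)


def naive_server_count(required_mcycles, mcycles_per_server, max_cpu=0.80):
    """Lower bound on servers from CPU alone, ignoring high availability needs."""
    return math.ceil(required_mcycles / (mcycles_per_server * max_cpu))


per_server = available_megacycles(430, 12)                    # about 25,479
required = total_required_megacycles(100_000, 11.69, 2.74, 3)  # about 1,991,000
print(round(per_server), round(required), naive_server_count(required, per_server))
```

Treat the last value as a sanity check only: a design driven by failure tolerance, such as the 8 x 16-node DAG example, will normally deploy more servers than this CPU-only lower bound.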
Continuing with our hypothetical example, if we knew that the high availability requirements for the design of the 100,000 user example resulted in a maximum of 16 databases being active at any time out of 48 total database copies per server, and we know that there are 65 users per database, we can determine the per-server CPU requirements for the deployment.

(16 databases x 65 mailboxes x 11.69 mcycles per active mailbox) + (32 databases x 65 mailboxes x 2.74 mcycles per passive mailbox) = 12,157.6 + 5,699.2 = 17,856.8 mcycles per server

Using the processor configuration mentioned in the megacycle normalization section (E5-2630 2.3 GHz processors on an HP DL380p Gen8), we know that we have 25,479 available mcycles on the server, so we would estimate a peak average CPU in worst-case failure of:

17,857 / 25,479 = 70.1%

That is below our guidance of 80% maximum CPU utilization (in a worst-case failure scenario), so we would not consider the servers to be CPU bound in the design. In fact, we could consider adjusting the CPU selection to a cheaper option with reduced performance, getting us closer to a peak average CPU in worst-case failure of 80% and reducing the cost of the overall solution.

Memory requirements

To calculate memory per server, you will need to know the per-server user count (both active and passive users) as well as determine whether you will run the Mailbox role in isolation or deploy multi-role servers (Mailbox+CAS). Keep in mind that regardless of whether you deploy roles in isolation or deploy multi-role servers, the minimum amount of RAM on any Exchange 2013 server is 8GB. Memory on the Mailbox role is used for many purposes. As in prior releases, a significant amount of memory is used for ESE database cache and plays a large part in the reduction of disk IO in Exchange 2013. The new content indexing technology in Exchange 2013 also uses a large amount of memory.
The remaining large consumers of memory are the various Exchange services that provide either transactional services to end-users or handle background processing of data. While each of these individual services may not use a significant amount of memory, the combined footprint of all Exchange services can be quite large.

Following is our recommended amount of memory for the Mailbox role on a per mailbox basis that we expect to be used at peak.

| Messages sent or received per mailbox per day | Mailbox role memory per active mailbox (MB) |
|---|---|
| 50 | 12 |
| 100 | 24 |
| 150 | 36 |
| 200 | 48 |
| 250 | 60 |
| 300 | 72 |
| 350 | 84 |
| 400 | 96 |
| 450 | 108 |
| 500 | 120 |

To determine the amount of memory that should be provisioned on a server, take the number of active mailboxes per-server in a worst-case failure and multiply by the value associated with the expected user profile. From there, round up to a value that makes sense from a purchasing perspective (i.e. it may be cheaper to configure 128GB of RAM compared to a smaller amount of RAM depending on slot options and memory module costs).

Mailbox Memory per-server = (worst-case active database copies per-server x users per-database x memory per-active mailbox)

For example, on a server with 48 database copies (16 active in worst-case failure), 65 users per-database, expecting the 200 profile, we would recommend:

16 x 65 x 48MB = 48.75GB, round up to 64GB

It’s important to note that the content indexing technology included with Exchange 2013 uses a relatively large amount of memory to allow both indexing and query processing to occur very quickly. This memory usage scales with the number of items indexed, meaning that as the number of total items stored on a Mailbox role server increases (for both active and passive copies), memory requirements for the content indexing processes will increase as well.
In general, the guidance on memory sizing presented here assumes approximately 15% of the memory on the system will be available for the content indexing processes, which means that with a 75KB average message size, we can accommodate mailbox sizes of 3GB at the 50 message profile up to 32GB at the 500 message profile without adjusting the memory sizing. If your deployment will have an extremely small average message size or an extremely large average mailbox size, you may need to add additional memory to accommodate the content indexing processes.

Multi-role server deployments will have an additional memory requirement beyond the amounts specified above. CAS memory is computed as a base memory requirement for the CAS components (2GB) plus additional memory that scales based on the expected workload. This overall CAS memory requirement on a multi-role server can be computed using the following formula:

Per-Server CAS Memory = 2GB + 2GB x ((worst-case active database copies per-server x users per-database x mcycles per-user x 0.375) / mcycles per-core)

Essentially this is 2GB of memory for the base requirement, plus 2GB of memory for each processor core (or fractional processor core) serving active load at peak in a worst-case failure scenario. Reusing the example scenario, if I have 16 active databases per-server in a worst-case failure and my processor is providing 2123 mcycles per-core, I would need:

2GB + 2GB x ((16 x 65 x 8.5 mcycles x 0.375) / 2123 mcycles per-core) = 5.12GB

If we add that to the memory requirement for the Mailbox role calculated above, our total memory requirement for the multi-role server would be:

48.75GB for Mailbox + 5.12GB for CAS = 53.87GB, round up to 64GB

Regardless of whether you are considering a multi-role or a split-role deployment, it is important to ensure that each server has a minimum amount of memory for efficient use of the database cache. There are some scenarios that will produce a relatively small memory requirement from the memory calculations described above. We recommend comparing the per-server memory requirement you have calculated with the following table to ensure you meet the minimum database cache requirements.
The guidance is based on total database copies per-server (both active and passive). If the value shown in this table is higher than your calculated per-server memory requirement, adjust your per-server memory requirement to meet the minimum listed in the table.

| Per-Server DB Copies | Minimum Physical Memory (GB) |
|---|---|
| 1-10 | 8 |
| 11-20 | 10 |
| 21-30 | 12 |
| 31-40 | 14 |
| 41-50 | 16 |

In our example scenario, we are deploying 48 database copies per-server, so the minimum physical memory to provide necessary database cache would be 16GB. Since our computed memory requirement based on per-user guidance including memory for the CAS role (53.87GB) was higher than the minimum of 16GB, we don’t need to make any further adjustments to accommodate database cache needs.

Unified messaging

With the new architecture of Exchange, Unified Messaging is now installed and ready to be used on every Mailbox and CAS. The CPU and memory guidance provided here assumes some moderate UM utilization. In a deployment with significant UM utilization with very high call concurrency, additional sizing may need to be performed to provide the best possible user experience. As in Exchange 2010, we recommend using a 100 concurrent call per-server limit as the maximum possible UM concurrency, and scaling out the deployment if the sizing of your deployment becomes bound on this limit. Additionally, voicemail transcription is a very CPU-intensive operation, and by design will only transcribe messages when there is enough available CPU on the machine. Each voicemail message requires 1 CPU core for the duration of the transcription operation, and if that amount of CPU cannot be obtained, transcription will be skipped. In deployments that anticipate a high amount of voicemail transcription concurrency, server configurations may need to be adjusted to increase CPU resources, or the number of users per server may need to be scaled back to allow for more available CPU for voicemail transcription operations.
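Before moving on to the CAS role, the multi-role memory arithmetic from the memory requirements discussion can be pulled together in a short Python sketch. The function names are illustrative. The CAS expression in particular is an assumption: it combines the 2GB base, the 2GB per core serving active load, and the 37.5% CAS-to-Mailbox CPU ratio stated in this guidance, and it reproduces the 5.12GB figure from the worked example, but it should not be read as the official formula.

```python
# Minimum physical memory (GB) by total database copies per server,
# transcribed from the table in the text: (upper bound of range, GB).
MIN_MEMORY_GB_BY_DB_COPIES = [(10, 8), (20, 10), (30, 12), (40, 14), (50, 16)]


def mailbox_memory_gb(active_dbs, users_per_db, mb_per_active_mailbox):
    """Mailbox role memory for worst-case active mailboxes, in GB."""
    return active_dbs * users_per_db * mb_per_active_mailbox / 1024


def cas_memory_gb(active_dbs, users_per_db, mbx_only_mcycles, mcycles_per_core,
                  cas_ratio=0.375):
    """ASSUMED form: 2GB base + 2GB per (fractional) core serving CAS load."""
    cas_cores = active_dbs * users_per_db * mbx_only_mcycles * cas_ratio / mcycles_per_core
    return 2 + 2 * cas_cores


def minimum_memory_gb(total_db_copies):
    """Look up the database cache minimum for a given number of DB copies."""
    for upper_bound, gigabytes in MIN_MEMORY_GB_BY_DB_COPIES:
        if total_db_copies <= upper_bound:
            return gigabytes
    raise ValueError("table in the text only covers up to 50 database copies")


def required_memory_gb(active_dbs, total_db_copies, users_per_db,
                       mb_per_active_mailbox, mbx_only_mcycles, mcycles_per_core):
    """Per-server multi-role requirement before rounding up to a purchasable size."""
    computed = (mailbox_memory_gb(active_dbs, users_per_db, mb_per_active_mailbox)
                + cas_memory_gb(active_dbs, users_per_db, mbx_only_mcycles, mcycles_per_core))
    return max(computed, minimum_memory_gb(total_db_copies))


# Example scenario: 16 of 48 copies active, 65 users per database, 200 profile.
print(round(required_memory_gb(16, 48, 65, 48, 8.5, 2123), 2))
```

The final step, rounding up to 64GB, is left out on purpose because it depends on slot options and memory module pricing for the chosen server.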
Sizing and scaling the Client Access Server role

In the case where you are going to place the Mailbox and CAS roles on separate servers, the process of sizing CAS is relatively straightforward. CAS sizing is primarily focused on CPU and memory requirements. There is some disk IO for logging purposes, but it is not significant enough to warrant specific sizing guidance.

CAS CPU is sized as a ratio from Mailbox role CPU. Specifically, we need to get 37.5% of the megacycles used to support active users on the Mailbox role. You could think of this as a 3:8 ratio (CAS CPU to active Mailbox CPU) compared to the 3:4 ratio we recommended in Exchange 2010. One way to compute this would be to look at the total active user megacycles required for the solution, take 37.5% of that, and then determine the required CAS server count based on high availability requirements and multi-site design constraints.

For example, consider the 100,000 user example using the 200 message/day profile:

Total CAS Required Mcycles = 100,000 users x 8.5 mcycles x 0.375 = 318,750 mcycles

Assuming that we want to target a maximum CPU utilization of 80% and the servers we plan to deploy have 25,479 available megacycles, we can compute the required number of servers quite easily:

Required CAS Server Count = 318,750 mcycles / (25,479 mcycles per server x 0.80) = 15.64, round up to 16 servers

Obviously we would need to then consider whether the 16 required servers meet our high availability requirements considering the maximum CAS server failures that we must design for given business requirements, as well as the site configuration where some of the CAS servers may be in different sites handling different portions of the workload. Since we specified in our example scenario that we want to survive a double failure in the single site, we would increase our 16 CAS to 18 such that we could sustain 2 CAS failures and still handle the workload.
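The CAS CPU calculation above is simple enough to sketch in Python. The function names and the `tolerated_failures` parameter are illustrative choices, not part of the guidance; the 37.5% ratio, the 80% utilization target and the example inputs come from the text.

```python
import math


def cas_required_megacycles(users, active_mbx_mcycles, cas_ratio=0.375):
    """CAS megacycles as a fixed ratio of active Mailbox role megacycles."""
    return users * active_mbx_mcycles * cas_ratio


def cas_server_count(required_mcycles, mcycles_per_server,
                     max_cpu=0.80, tolerated_failures=0):
    """Servers needed to carry the load at max_cpu, plus spares for failures."""
    for_load = math.ceil(required_mcycles / (mcycles_per_server * max_cpu))
    return for_load + tolerated_failures


required = cas_required_megacycles(100_000, 8.5)       # 318,750 mcycles
print(cas_server_count(required, 25_479))               # servers for load alone
print(cas_server_count(required, 25_479, tolerated_failures=2))
```

Adding the tolerated failures after rounding up keeps the load-bearing count intact, which matches the way the example goes from 16 to 18 servers. Multi-site designs would need this applied per site rather than once for the whole deployment.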
To size memory, we will use the same formula that was used for Exchange 2010:

Per-Server CAS Memory = 2GB + 2GB per physical processor core

Using the example scenario we have been using, and counting the processor cores needed to serve the CAS load on each of the 16 required servers, we can calculate the per-server CAS memory requirement as:

Per-Server CAS Memory = 2GB + 2GB x ((318,750 mcycles / 16 servers) / 2123 mcycles per-core) = 20.77GB

In this example, 20.77GB would be the guidance for required CAS memory, but obviously you would need to round up to the next highest possible (or highest performing) memory configuration for the server platform: perhaps 24GB.

Active Directory capacity for Exchange 2013

Active Directory sizing remains the same as it was for Exchange 2010. As we gain more experience with production deployments we may adjust this in the future. For Exchange 2013, we recommend deploying a ratio of 1 Active Directory global catalog processor core for every 8 Mailbox role processor cores handling active load, assuming 64-bit global catalog servers:

Required GC Cores = Mailbox role processor cores handling active load / 8

If we revisit our example scenario, we can easily calculate the number of GC cores required:

Required GC Cores = ((100,000 users x 8.5 mcycles) / 2123 mcycles per-core) / 8 = 50.05 GC cores

Assuming that my Active Directory GCs are also deployed on the same server hardware configuration as my CAS & Mailbox role servers in the example scenario with 12 processor cores, then my GC server count would be:

GC Server Count = 50.05 GC cores / 12 cores per server = 4.17, round up to 5 GC servers

In order to sustain double failures, we would need to add 2 more GCs to this calculation, which would take us to 7 GC servers for the deployment.

As a best practice, we recommend sizing memory on the global catalog servers such that the entire NTDS.DIT database file can be contained in RAM. This will provide optimal query performance and a much better end-user experience for Exchange workloads.

Hyperthreading: Wow, free processors!

Turn it off. While modern implementations of simultaneous multithreading (SMT), also known as hyperthreading, can absolutely improve CPU throughput for most applications, the benefits to Exchange 2013 do not outweigh the negative impacts.
It turns out that there can be a significant impact to memory utilization on Exchange servers when hyperthreading is enabled due to the way the .NET server garbage collector allocates heaps. The server garbage collector looks at the total number of logical processors when an application starts up and allocates a heap per logical processor. This means that the memory usage at startup for one of our services using the server garbage collector will be close to double with hyperthreading turned on vs. when it is turned off. This significant increase in memory, along with an analysis of the actual CPU throughput increase for Exchange 2013 workloads in internal lab tests has led us to a best practice recommendation that hyperthreading should be disabled for all Exchange 2013 servers. The benefits don’t outweigh the negative impact. There’s an important caveat to this recommendation for customers who are virtualizing Exchange. Since the number of logical processors visible to a virtual machine is determined by the number of virtual CPUs allocated in the virtual machine configuration, hyperthreading will not have the same impact on memory utilization described above. It’s certainly acceptable to enable hyperthreading on physical hardware that is hosting Exchange virtual machines, but make sure that any capacity planning calculations for that hardware are based purely on physical CPUs. Follow the best practice recommendations of your hypervisor vendor on whether or not to enable hyperthreading. Note that the extra logical CPUs that are added when hyperthreading is enabled must not be considered when allocating virtual machine resources during the sizing and deployment process. For example, on a physical host running Hyper-V with 40 physical processor cores and hyperthreading enabled, 80 logical processor cores will be visible to the root operating system. 
If your Exchange design required 16-core servers, you could place 2 Exchange VMs on the physical host as those 2 VMs would consume 32 physical processor cores without enough physical processor cores to host another 16-core VM (32+16 = 48, which is greater than 40).

You are going to give me a calculator, right?

Now that you have digested all of this guidance, you are probably thinking about how much more of a pain it will be to size a deployment compared to using the Mailbox Role Requirements Calculator for Exchange 2010. UPDATE: You can now read about and download the calculator from here.

Hopefully that leaves you with enough information to begin to properly size your Exchange 2013 deployments. If you have further questions, you can obviously post comments here, but I’d also encourage you to consider attending one of the upcoming TechEd events. I’ll be at TechEd North America as well as TechEd Europe with a session specifically on this topic, and would be happy to answer your questions in person, either in the session or at the “Ask the Experts” event. Recordings of those sessions will also be posted to MSDN Channel9 after the events have concluded.

Jeff Mealiffe
Principal Program Manager Lead
Exchange Customer Experience

Updates
4/3/2014: Updated CAS CPU and Memory sizing with SP1 guidance
1/6/2015: Updated the hyperthreading section

Released: Exchange 2013 Server Role Requirements Calculator
To download the calculator, please see the attachment on this post (also note that the calculator can be used for E2016 deployments as per this).

It’s been a long road, but the initial release of the Exchange Server Role Requirements Calculator is here. No, that isn’t a mistake, the calculator has been rebranded. Yes, this is no longer a Mailbox server role calculator; this calculator includes recommendations on sizing Client Access servers too! Originally, marketing wanted to brand it as the Microsoft Exchange Server 2013 Client Access and Mailbox Server Roles Theoretical Capacity Planning Calculator, On-Premises Edition. Wow, that’s a mouthful and reminds me of this branding parody. Thankfully, I vetoed that name (you’re welcome!).

The calculator supports the architectural changes made possible with Exchange 2013 and later:

Client Access Servers

As with Exchange 2010, the recommendation in Exchange 2013 is to deploy multi-role servers. There are very few reasons you would need to deploy dedicated Client Access servers (CAS); CPU constraints, use of Windows Network Load Balancing in small deployments (even with our architectural changes in client connectivity, we still do not recommend Windows NLB for any large deployments) and certificate management are a few examples that may justify dedicated CAS. When deploying multi-role servers, the calculator will take into account the impact that the CAS role has and make recommendations for sizing the entire server’s memory and CPU. So when you see the CPU utilization value, this will include the impact both roles have! When deploying dedicated server roles, the calculator will recommend the minimum number of Client Access processor cores and memory per server, as well as the minimum number of CAS you should deploy in each datacenter.

Transport

Now that the Mailbox server role includes additional components like transport, it only makes sense to include transport sizing in the calculator.
This release does just that and will factor in message queue expiration and Safety Net hold time when calculating the database size. The calculator even makes a recommendation on where to deploy the mail.que database, either on the system disk or on a dedicated disk!

Multiple Databases / JBOD Volume Support

Exchange 2010 introduced the concept of 1 database per JBOD volume when deploying multiple database copies. However, this architecture did not ensure that the drive was utilized effectively across all three dimensions: throughput, IO, and capacity. Typically, the system was balanced from an IO and capacity perspective, but throughput was where we saw an imbalance, because during reseeds only a portion of the target disk’s total capable throughput was utilized. In addition, capacity on 7.2K disks continues to increase, with 4TB disks now available, thus impacting our ability to remain balanced along that dimension. Exchange 2013 also includes a 33% reduction in IO when compared to Exchange 2010. Naturally, the concept of 1 database / JBOD volume needed to evolve. As a result, Exchange 2013 made several architectural changes in the store process, ESE, and HA architecture to support multiple databases per JBOD volume. If you would like more information, please see Scott’s excellent TechEd session in a few weeks on Exchange 2013 High Availability and Site Resilience or the High Availability and Site Resilience topic on TechNet.

By default, the calculator will recommend multiple databases per JBOD volume. This architecture is supported for single datacenter deployments and multi-datacenter deployments when there is copy and/or server symmetry. The calculator supports highly available database copies and lagged database copies with this volume architecture type. The distribution algorithm will lay out the copies appropriately, as well as generate the deployment scripts correctly to support AutoReseed.
High Availability Architecture Improvements

The calculator has been improved in several ways for high availability architectures:

- You can now specify the Witness Server location, either primary, secondary, or tertiary datacenter.
- The calculator allows you to simulate WAN failures, so that you can see how the databases are distributed during the worst failure mode.
- The calculator allows you to name servers and define a database prefix, which are then used in the deployment scripts.
- The distribution algorithm supports single datacenter HA deployments, Active/Passive deployments, and Active/Active deployments.
- The calculator includes a PowerShell script to automate DAG creation.
- In the event you are deploying your high availability architecture with direct attached storage, you can now specify the maximum number of database volumes each server will support. For example, if you are deploying a server architecture that can support 24 disks, you can specify a maximum support of 20 database volumes (leaving 2 disks for system, 1 disk for Restore Volume, and 1 disk as a spare for AutoReseed).

Additional Mailbox Tiers (sort of!)

Over the years, a few vocal members of the community have requested that I add more mailbox tiers to the calculator. As many of you know, I rarely recommend sizing multiple mailbox tiers, as that simply adds operational complexity and I am all about removing complexity in your messaging environments. While I haven’t specifically added additional mailbox tiers, I have added the ability for you to define a percentage of the mailbox tier population that should have the IO and Megacycle Multiplication Factors applied. In a way, this allows you to define up to eight different mailbox tiers.

Processors

I’ve received a number of questions regarding processor sizing in the calculator. People are comparing the Exchange 2010 Mailbox Server Role Requirements Calculator output with the Exchange 2013 Server Role Requirements Calculator.
As mentioned in our Exchange 2013 Performance Sizing article, the megacycle guidance in Exchange 2013 leverages a new server baseline; therefore, you cannot directly compare the output from the Exchange 2010 calculator with the Exchange 2013 calculator.

Conclusion

There are many other minor improvements sprinkled throughout the calculator. We hope you enjoy this initial release. All of this work wouldn’t have occurred without the efforts of Jeff Mealiffe (for without our sizing guidance there would be no calculator!), David Mosier (VBA scripting guru and the master of crafting the distribution worksheet), and Jon Gollogy (deployment scripting master). As always we welcome feedback and please report any issues you may encounter while using the calculator by emailing strgcalc AT microsoft DOT com.

Ross Smith IV
Principal Program Manager
Exchange Customer Experience

Exchange 2010 Server Role Requirements Calculator
EDIT: This post has been updated on 5/24/2013 for the new version of calculator. For the list of latest major changes, please see THIS.

Overview

The Exchange Mailbox Server role is arguably one of the most important roles within an Exchange deployment, for it stores the data that users will ultimately access on a daily basis. Therefore, ensuring that you design the mailbox server role correctly is critical to your design. With Exchange 2010 you can deploy a solution that leverages mailbox resiliency and has multiple database copies deployed across datacenters, implements single item recovery for data recovery, and has the flexibility in storage design to allow you to deploy on storage area networks utilizing fibre-channel or SATA class disks or on direct attached storage utilizing SAS or SATA class disks with or without RAID protection. But, in order to design your solution, you need to understand the following criteria:

- User profile - the message profile, the mailbox size, and the number of users
- High availability architecture - the number of database copies you plan to deploy, whether the solution will be site resilient, the desired number of mailbox servers
- Server's CPU platform
- Storage architecture - the disk capacity / type and storage solution
- Backup architecture - whether to use hardware or software VSS and the frequency of the backups, or leverage the Exchange native data protection features
- Network architecture - the utilization, throughput, and latency aspects

Previous versions of Exchange were somewhat rigid in terms of the choices you had in designing your mailbox server role. The flexibility in the architecture with Exchange 2010 allows you the freedom to design the solution to meet your needs.
Prior to making any decisions, please review the following topics from the Exchange 2010 Online Help: Understanding High Availability and Site Resilience Planning for High Availability and Site Resilience Understanding Backup, Restore and Disaster Recovery Mailbox Server Storage Design Recommendations Performance and Scalability After you have determined the design you would like to implement, you can follow the steps in the Exchange 2010 Mailbox Server Role Design Example article within the Exchange 2010 Online Help to calculate your solution's CPU, memory, and storage requirements, or you can leverage the Exchange 2010 Mailbox Server Role Requirements Calculator. The calculator is broken out into the following sections (worksheets): Input Role Requirements Activation Scenarios Distribution LUN Requirements Backup Requirements Log Replication Requirements Storage Design Important: The data points provided in the calculator are an example configuration. As such any data points entered into the Input worksheet are specific to that particular configuration and do not apply for other configurations. Please ensure you are using the correct data points for your design Input When you launch the Exchange 2010 Mailbox Server Role Requirements Calculator, you are presented with the Input worksheet. This worksheet is broken down into 5 key areas. This section is where you enter in all the relevant information regarding your intended design, so that the calculator can generate what you need in order to achieve it. Note: There are many input factors that need to be accounted for before you can design your solution. Each input factor is briefly listed below; there are additional notes within the calculator that explain them in more detail. 
Environment Configuration Within Step 1 you will enter in the appropriate information concerning your messaging environment's configuration - the high availability architecture and database copy configuration, the data and I/O configuration, and CPU inputs. Note: For optimal sizing, choose a multiple of the total number of database copies you have selected for the number of mailbox servers. Exchange Environment Configuration What server architecture are you deploying for your global catalogs? You can deploy either the 32-bit or 64-bit architecture for your Active Directory servers. The architecture you deploy will affect your core ratio planning. For more information, please see http://technet.microsoft.com/en-us/library/dd346701.aspx. Do these servers only have the mailbox server role installed? Having the Hub Transport and Client Access server roles installed on the same server along with the mailbox server role affects your design in the areas of load balancing client requests, memory utilization, and CPU utilization. Will these servers be deployed as guest machines in a virtualized environment? There is CPU overhead when deploying guest machines that must be accounted for in the design. For Hyper-V deployments the overhead is about 10%. Check with your hypervisor vendor to determine their overhead and adjust the “Hypervisor CPU Adjustment Factor” accordingly. Are you deploying a database availability group (DAG)? Deploying the solution as a DAG provides you additional flexibility and resiliency choices like having multiple mailbox database copies, leveraging flexible mailbox protection features in lieu of traditional backups, and flexibility in your storage architecture (e.g. RAID or JBOD). How many mailbox servers are you going to deploy within the primary datacenter?
If you enter more than a single server (remember a DAG requires at least two and can support a maximum of 16), the calculator will evenly distribute the user mailboxes across the total number of mailbox servers and make performance and capacity recommendations for each server, as well as for the entire environment. As for the secondary datacenter, the calculator will determine the number of mailbox servers you need to deploy there based on the requirements (number of databases, number of copies, etc.). How many DAGs are you planning to deploy in the environment? If you enter more than a single DAG, then the calculator will distribute the user mailboxes across the total number of DAGs and make performance and capacity recommendations for each server and each DAG, as well as for the entire environment. Site Resilience Configuration Are you deploying the DAG in a site resilient configuration? A DAG can be stretched across 2 or more datacenters (the calculator only allows for 1 datacenter) without requiring the AD site or network subnet to be stretched. What user distribution model will you be leveraging in your site resilient architecture? When planning a site resilience model with Exchange 2010, keep in mind there are two variables that need to be considered: namespace model and user distribution model. For the namespace or datacenter model, Exchange 2010 requires both datacenters to be in an Active/Active configuration. This means that both datacenters participating in the DAG solution must have active, reachable namespaces and have the ability to support active load at any time. For the user distribution model, the design can support both Active/Passive and Active/Active user distribution.
The calculator supports three scenarios for these user distribution models:
Active/Passive User Distribution Model - An Active/Passive user distribution architecture simply has database copies deployed in the secondary datacenter, but no active mailboxes are hosted there and no database copies will be activated there during normal runtime operations. However, the datacenter supports both single cross-datacenter database *overs and full datacenter activation.
Active/Active User Distribution Model - An Active/Active user distribution architecture has the user population dispersed across both datacenters (usually evenly) with each datacenter being the primary datacenter for its specific user population. In the event of a failure, the user population can be activated in the secondary datacenter (either via cross-datacenter single database *over or via full datacenter activation). There are two types of Active/Active user distribution models:
Active/Active (Single DAG) - This model stretches a DAG across the two datacenters and has active mailboxes located in each datacenter. A corresponding passive copy is located in the alternate datacenter. This scenario does have a potential single point of failure, the WAN connection. Loss of the WAN connection will result in the mailbox servers in one of the datacenters going into a failed state from a failover cluster perspective (due to loss of quorum):
---DC1---          ---DC2---
DAG1               DAG1
Active Copies      Passive Copies
Passive Copies     Active Copies
Active/Active (Multiple DAGs) - This model leverages multiple DAGs to remove single points of failure (e.g., the WAN). In this model, there are at least two DAGs, with each DAG having its active copies in the alternate datacenter:
---DC1---                        ---DC2---
DAG-A (Active/Passive Copies)    DAG-A (Passive Copies)
DAG-B (Passive Copies)           DAG-B (Active/Passive Copies)
In your site resilient architecture, how far behind can you get in terms of log shipping between datacenters?
The effect of the RPO is to evaluate the non-contiguous peak hours (defined in Step 5), say 8am and 4pm, and determine the resulting throughput requirement, assuming that you can take the time in between 8 and 4 to catch up (within the specified RPO, of course). Allowing replication to get behind has two outcomes: 1. Active Manager is less likely to choose a database copy that has a high copy queue length (unless no more viable alternatives are available). 2. If the copy queue length is greater than the target server's AutoDatabaseMountDial setting, the database will not automatically mount once activated. Manually mounting that database will result in the loss of data that had not been copied. Will you activation block the mailbox servers in the secondary datacenter? In certain situations (e.g. highly utilized network connection), you may want to control whether cross-site failovers occur automatically. This can be controlled by placing an activation block on the remote mailbox servers, thereby preventing Active Manager from selecting those copies during a failover. When deploying an Active/Active (Single DAG) architecture, do you want to deploy dedicated disaster recovery Mailbox servers in the alternate datacenter? If deploying a single DAG Active/Active solution, you can choose to have dedicated DR Mailbox servers deployed in the secondary datacenter to be used in the event of disaster or utilize the existing mailbox servers that are hosting active mailboxes. Mailbox Database Copy Configuration How many highly available (HA) mailbox database copy instances per database do you plan to deploy within a DAG? Enter in the number of highly available database copies you plan to have within the environment. This value excludes lagged database copies, but includes both the active copy and all passive HA database copies you plan to deploy.
For optimal sizing, choose a multiple of the total number of mailbox servers you have selected. How many lagged database copy instances per database do you plan to deploy within a DAG? Lagged database copies are an optional feature that can provide protection against certain disaster scenarios (like logical corruption). Lagged database copies should not be considered an HA database copy as the replay will delay the availability of the database for use once activated. While technically there is no limit to how many lagged copies you can deploy within a DAG, the calculator limits you to a maximum of 2 copies. How many highly available mailbox database copy instances per database do you plan to deploy within the secondary datacenter within a DAG? If you are deploying a site resilient solution, you can choose to have a portion of the total HA database copies deployed in the secondary datacenter. How many lagged database copy instances per database do you plan to deploy within the secondary datacenter within a DAG? If you are deploying a site resilient solution, you can choose to have a portion or all of your lagged database copies deployed in the secondary datacenter. Lagged Database Copy Configuration Will you deploy the lagged database copies on a dedicated server? Using a dedicated server for lagged database copies certainly makes them easier to manage. For DAGs where the lagged database copies are evenly distributed across all the DAG mailbox servers, you will need to use the Suspend-MailboxDatabaseCopy cmdlet with the -ActivationOnly parameter to prevent them from being mounted, but there are scenarios that can clear this setting. With a dedicated server you can activation block the entire server and the setting is persistent. The choice can also affect your storage design in terms of choosing RAID or JBOD. Unless you have multiple lagged copies, lagged copies should be placed on storage that is utilizing RAID to provide additional protection.
The calculator will determine the appropriate number of lagged copy servers you need to deploy based on the requirements (number of databases, number of copies, etc). How long will you delay transaction log replay on your lagged copy? This parameter is used to specify the amount of time that the Microsoft Exchange Information Store service should wait before replaying log files that have been copied to the lagged database. The maximum amount of replay delay you can set is 14 days. The value you specify here will influence the log capacity requirements for all copies and the amount of time required to mount a lagged copy. How long will you delay transaction log truncation on your lagged copy? This parameter is used to specify the amount of time that the Microsoft Exchange Replication service should wait before truncating log files that have been copied to the lagged database. The time period begins after the log has been successfully replayed into the lagged copy. The maximum allowable setting for this value is 14 days. The minimum allowable setting is 0, although setting this value to 0 effectively eliminates any delay in log truncation activity. The value you specify here will influence the log capacity requirements for all copies. Exchange Data Configuration What will be the Data Overhead Factor? Microsoft recommends using 20% to account for any extraneous growth that may occur. How many mailboxes do you move per week? In terms of transactions, you have to take into account how many mailboxes you will either be moving to this server or within this server, as transactions totaling the size of the mailbox will always get generated at the target database. Are you going to deploy a Dedicated Restore LUN? A dedicated restore LUN is used as a staging point for the restoration of data or could be used during maintenance activities; if one is selected then additional capacity will not be factored into each database LUN. 
What percentage of disk space do you want to ensure remains free on the LUN? Most operations management programs have capacity thresholds that alert when a LUN is more than 80% utilized. This value allows you to ensure that each LUN has a certain percentage of disk space available so that the LUN is not designed and implemented at maximum capacity. Do you have log shipping compression enabled within the DAG? By default, each DAG is configured to compress and encrypt the socket connection used to ship logs across different IP subnets (you can disable these features altogether or enable them for all communications regardless of subnet). What is your compression rate? The compression capability that is obtained for the socket connection used to ship logs will vary with each customer, based on the data contained in the transaction log files. By default, Microsoft recommends using a value of 30%; however, you can determine this value by analyzing your environment (e.g., once Exchange 2010 is deployed you could evaluate the throughput rate with compression disabled and then compare it with the rate when compression is enabled). Database Configuration Do you want to follow Microsoft's recommendations regarding maximum database size? For standalone mailbox server role solutions, Microsoft recommends that the database size should not be more than 200GB. For solutions leveraging mailbox resiliency, Microsoft recommends that the database size should not exceed 2TB. Neither of these is by any means a hard limit, but a recommendation based on the impact database size has on recovery times. If you want to follow Microsoft's recommendation, then select Yes. Otherwise, select No. Do you want to specify a custom Maximum Database Size? If you selected No for the previous field, then you need to enter in a custom maximum database size. Would you like the calculator to determine the optimum number of databases for the design?
By default the calculator will determine the optimum number of databases for the architecture. In the event that you want to have a defined number of databases, select No to "Automatically Calculate Number of Databases" and enter in a custom number of databases. Do you want to specify the number of databases that should be deployed? If you selected "No" to the previous field, then you need to enter in the number of databases you would like to have deployed within your mailbox server or DAG architecture. Do you want to design your database infrastructure such that you deploy the correct number of databases to ensure symmetrical distribution during server failure events? If possible, the calculator will deploy the correct number of databases such that you can achieve a symmetrical distribution of the active copies across the remaining server infrastructure as the DAG experiences server failures. Note that to use this option, you must allow the calculator to automatically calculate the required number of databases. IOPS Configuration What will be the I/O Overhead Factor? Microsoft recommends using 20% to ensure adequate headroom in terms of I/O to allow for abnormal spikes in I/O that may occur from time to time. What additional I/O requirements do you need to factor into the solution for each mailbox server's storage design? For example, let's say the solution requires 500 IOPS for the mailboxes and you have decided you want to ensure there is extra I/O capacity for additional products (e.g. antivirus) that generate load during the peak user usage window. So you enter 300 IOPS in this input factor. The result is that from a host perspective, the solution needs to achieve 800 IOPS. Determining this value may require additional testing by comparing a baseline system against a system that has the I/O generating application installed and running. Mailbox Configuration Within Step 2 you will define your user profile for up to four different tiers of user populations.
How many mailboxes will you deploy in the environment? If deploying a single server environment, this is how many mailboxes you will deploy on this server. If you are deploying multiple servers, then this is how many mailboxes you will deploy in the environment. If you are deploying multiple DAGs, then this is how many mailboxes you will deploy across all of the DAGs. For example, if you choose to deploy 5 servers, and want 3000 mailboxes per server, then enter 15000 here. Or if you plan to deploy 2 DAGs, each with 6 servers, and you entered 24000 total mailboxes, then 12000 mailboxes will be deployed per DAG. What is the solution's projected growth in terms of number of mailboxes over its lifecycle? Enter in the total percentage by which you believe the number of mailboxes will grow during the solution's lifecycle. For example, if you believe the solution will increase by 30% over the lifecycle of the design and you are starting out with 1000 mailboxes, then at the end of the lifecycle, the solution will have 1300 mailboxes. The calculator will utilize the projected growth plus the number of mailboxes to ensure that the capacity and performance requirements can be sustained throughout the solution's lifecycle. How much mail do the users send and receive per day on average? The usage profiles found here are based on the work done around the memory and processor scalability requirements. What is the average message size? For most customers the average message size is around 75KB. What will be the prohibit send & receive mailbox size limit? If you want to adequately control your capacity requirements, you need to set a hard mailbox size limit (prohibit send and receive) for the majority of your users. If deploying a personal archive mailbox, what will be the personal archive quota limit? If you want to adequately control your capacity requirements, you need to set a hard mailbox size limit (prohibit send and receive) for the majority of your users. 
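The mailbox count and growth examples above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the calculator's own logic: the function names are hypothetical, the even split assumes the total divides cleanly across DAGs, and rounding the projected count up to a whole mailbox is an assumption.

```python
def total_mailboxes(server_count: int, mailboxes_per_server: int) -> int:
    """Total to enter in the Input worksheet when you plan per server."""
    return server_count * mailboxes_per_server


def mailboxes_per_dag(total: int, dag_count: int) -> int:
    """Even split across DAGs (assumes the total divides evenly)."""
    return total // dag_count


def projected_mailboxes(current: int, growth_percent: int) -> int:
    """Mailbox count at the end of the lifecycle.

    Uses integer ceiling division; rounding up is an assumption,
    not something the article states.
    """
    return (current * (100 + growth_percent) + 99) // 100


print(total_mailboxes(5, 3000))       # 5 servers x 3000 mailboxes -> 15000
print(mailboxes_per_dag(24000, 2))    # 24000 mailboxes over 2 DAGs -> 12000
print(projected_mailboxes(1000, 30))  # 1000 mailboxes + 30% growth -> 1300
```

The three printed values match the figures quoted in the text (15000, 12000, and 1300).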
What is the deleted item retention period? Enter in the deleted item retention period you plan to utilize within the environment. The default retention period is 14 days; however, you should adjust this to match your policy concerning deleted item recovery when enabling Single Item Recovery to eliminate going to backup media to recover deleted items. Are you deploying Single Item Recovery? Single Item Recovery ensures that all deleted and modified items are preserved for the duration of the deleted item retention window. By default in Exchange 2010 RTM, this is not enabled. When enabled, this feature increases the capacity requirements for the mailbox. Will you have calendar version logging enabled? By default, all changes to a calendar item are recorded in the user's mailbox, keeping older versions of meeting items for 120 days so that they can be used to repair the calendar in the event of an issue. This data is stored in the mailbox's dumpster folder. When enabled, this feature increases the capacity requirements for the mailbox. Do you want to include an IOPS Multiplication Factor in the prediction or custom I/O profile? The IOPS Multiplication Factor can be used to increase the IOPS/mailbox footprint for mailboxes that require additional I/O (for example, these mailboxes may use third-party mobile devices). The way this value is used is as follows: (IOPS value * Multiplication Factor) = new IOPS value. Do you want to include a Megacycles Multiplication Factor when sizing the CPU requirements for the mailbox tier? The Megacycles Multiplication Factor can be used to increase the CPU cost/mailbox footprint for mailboxes that perform more CPU work than a typical mailbox (for example, these mailboxes may use third-party mobile devices). The way this value is used is as follows: (Megacycles value * Multiplication Factor) = new Megacycles value.
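Both multiplication factors follow the same pattern: the per-mailbox value is simply scaled by the factor. Here is a minimal sketch of that arithmetic. The function name and the sample per-mailbox values are hypothetical placeholders chosen for illustration; they are not figures taken from the calculator.

```python
def apply_multiplication_factor(base_value: float, factor: float) -> float:
    """Return (base value * Multiplication Factor), as described above."""
    if factor <= 0:
        raise ValueError("multiplication factor must be positive")
    return base_value * factor


# Hypothetical per-mailbox inputs, for illustration only.
base_iops_per_mailbox = 0.10
base_megacycles_per_mailbox = 2.0

# A tier of mailboxes assumed to generate twice the I/O and 1.5x the CPU work.
new_iops = apply_multiplication_factor(base_iops_per_mailbox, 2.0)
new_megacycles = apply_multiplication_factor(base_megacycles_per_mailbox, 1.5)

print(new_iops, new_megacycles)
```

A factor of 1 leaves the tier unchanged, which is a reasonable starting point for tiers with no unusual client mix.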
Do your Outlook Online Mode clients have versions of Windows Desktop Search older than 4.0 or third-party desktop search engines deployed? The addition of these indexing tools to the online mode clients incurs additional read I/O penalties on the mailbox server storage subsystem. Care should be taken when enabling these desktop search engines. Windows Desktop Search 4.0 and later utilizes synchronization protocols that are similar to how Outlook operates in cached mode to index the mailbox contents, and thus has a very minor impact in terms of disk read I/O. Are you planning to use the I/O prediction formula or define your own IOPS profile to design toward? This question asks whether you want to override the calculator in determining the IOPS / mailbox value. By default the calculator will predict the IOPS / mailbox value based on the number of messages per mailbox and the user memory profile. For customers that want to design toward a specific I/O profile, the prediction will not be suitable. Therefore, if you want to design toward a specific I/O profile, select No. What is your custom IOPS profile / mailbox? Only enter a value in this field if you selected "No" to the "Predict IOPS Value" question. What will be the database read:write ratio for your custom IOPS profile? Only adjust this value if you selected "No" to the "Predict IOPS Value" parameter. When IOPS prediction is enabled, the calculator will calculate the read:write ratio based on the user profile. Backup Configuration Within Step 3 you will define your backup model and your tolerance settings, as well as choose whether to isolate the transaction logs from the database. What backup methodology will be used to back up the solution? You have several options for a backup methodology, including leveraging a VSS solution (hardware or software based) or leveraging the native data protection features that Exchange provides. The solution you choose will depend on many factors.
For example, if you are deploying the mailbox resiliency and single item recovery features, you may be able to forgo a traditional backup architecture in favor of leveraging Exchange as its own backup. Or if you still require a backup (e.g. legal/compliance reasons), then you need to deploy a VSS solution. The type of VSS solution you deploy will depend on your storage architecture. Hardware VSS solutions are available with storage area networks. Software VSS solutions can be leveraged against either storage area networks or direct attached storage architectures. Also, the backup methodology will affect the LUN design; for example, hardware VSS solutions require a LUN architecture that is 2 LUNs / Database. What will be the backup frequency? You can choose Daily Full, Weekly Full with Daily Differential, Weekly Full with Daily Incremental, or Bi-Monthly Full with Daily Incremental. The backup frequency will affect the LUN design and the disk space requirements (e.g. if performing daily differentials, then you need to account for 7 days of log generation in your capacity design). Will you isolate the logs from the database? Database/Log isolation refers to placing the database file and logs from the same mailbox database onto different volumes backed by different physical disks. For standalone architectures, the Exchange best practice is to separate the database file (.edb) and logs from the same database onto different volumes backed by different physical disks for recoverability purposes. For solutions leveraging mailbox resiliency, isolation is not required. How many times can you operate without log truncation? Select how many times you can survive without a full backup or an incremental backup (the minimum value is 1). For example, if you are performing a weekly full backup and daily differential backups, the only time log truncation occurs is during the full backup.
If the full backup fails, then you have to wait an entire week to perform another full backup or perform an emergency full backup. This parameter allows you to ensure that you have enough capacity to not have to perform an immediate full backup. If you are leveraging the native data protection features within Exchange as your backup mechanism, then you should enter 3 here to ensure you have enough capacity to allow for 3 days' worth of log generation to occur as a result of potential log replication issues. How long can you survive a network outage? When a network outage occurs, log replication cannot occur. As a result, the copy queue length will increase on the source; in addition, log truncation cannot occur on the source. For geographically dispersed DAG deployments, network outages can seriously affect the solution's usefulness. If the outage is too long, log capacity on the source may become compromised and as a result, capacity must be increased or a manual log truncation event must occur. Once that happens, the remote copies must be reseeded. The Network Failure Tolerance parameter ensures there is enough capacity on the log LUNs so that you can survive an excessive network outage. Storage Configuration Within Step 4 you will define your storage configuration. Storage Options Do you want to consider storage designs that leverage JBOD? JBOD storage refers to placing a database and its transaction logs on a single disk without leveraging RAID. In order to deploy this type of storage solution for your mailbox server environment, you must have 3 or more HA database copies and have a LUN architecture that is equal to 1 LUN / Database. If you select Yes for this input, the calculator will attempt to design the solution so that it can be deployed on JBOD storage. Please note, however, that other factors may alter the viability of JBOD (e.g. deploying a single lagged database copy on the same mailbox servers hosting your HA database copies).
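The two JBOD prerequisites stated above can be expressed as a simple check. This is only a sketch of those two stated conditions with a hypothetical function name; it does not model the other factors the calculator weighs (such as lagged copy placement), so a True result means "worth considering", not "supported".

```python
def jbod_is_candidate(ha_copies: int, luns_per_database: int) -> bool:
    """True when both stated prerequisites hold:
    3 or more HA database copies, and a 1 LUN / Database layout."""
    return ha_copies >= 3 and luns_per_database == 1


print(jbod_is_candidate(ha_copies=3, luns_per_database=1))  # both conditions met
print(jbod_is_candidate(ha_copies=2, luns_per_database=1))  # too few HA copies
print(jbod_is_candidate(ha_copies=3, luns_per_database=2))  # e.g. 2 LUNs / Database
```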
Primary Datacenter Disk Configuration What are the disk capacities and types you plan to deploy? For each type of LUN (database, log, and restore LUN) you plan to deploy, select the appropriate capacity and disk type model. Secondary Datacenter Disk Configuration What are the disk capacities and types you plan to deploy? For each type of LUN (database, log, and restore LUN) you plan to deploy, select the appropriate capacity and disk type model. Processor Configuration Within Step 5, you will define the number of processor cores you have deployed for each mailbox server within your primary and secondary datacenters, as well as enter the SPECint2006 Rate Value for the system you have selected. When you enable virtualization, you must be sure to configure the processor architecture correctly. In particular, you must enter in the correct number of processor cores that the guest machine will support, as well as the correct SPECint2006 rating for these virtual processor cores. To calculate the SPECint2006 rate value, you can utilize the following formula:
X/(N*Y) = per virtual processor SPECint2006 Rate value
Where X = the SPECint2006 rate value for the hypervisor host server
Where N = the number of physical cores in the hypervisor host
Where Y = 1 if you will be deploying 1:1 virtual processor-to-physical processor on the hypervisor host
Where Y = 2 if you will be deploying up to 2:1 virtual processor-to-physical processor on the hypervisor host
For example, let’s say I am deploying an HP ProLiant DL580 G7 (2.27 GHz, Intel Xeon X7560) system which includes four sockets, each containing an 8-core processor, then the SPECint2006 rate value for the system is 757.
If I am deploying my Mailbox server role as a guest machine using Hyper-V, and following best practices where I do not oversubscribe the number of virtual CPUs to physical processors, then 757/(32*1) = 23.66. Since each Mailbox server will have a maximum of 4 virtual processors, the SPECint2006 rate value I would enter into the calculator would be 23.66*4 = 94.64, or approximately 95. In addition, if you are deploying your Exchange servers as guest machines, you can specify the Hypervisor CPU Adjustment Factor to take into account the overhead of deploying guest machines. Server Configuration How many processor cores are you planning to deploy in each server, and what is their megacycle capability? For each server type (primary datacenter, secondary datacenter, and lagged copy server) you plan to deploy, select the number of processor cores and the server’s SPECint2006 rate value. To determine your SPECint2006 Rate Value:
Open a web browser and go to www.spec.org.
Click on Results, highlight CPU2006 and then select Search CPU2006 Results.
Under Available Configurations, select SPECint2006 Rates and click Go.
Under Simple Request, enter the search criteria (e.g., Processor matches x5550).
Find the server and processor you are planning to deploy and take note of the result value.
For example, let's say you are deploying a Dell PowerEdge M710 8-core server with Intel x5550 2.67GHz processors (2670 megahertz); the SPECint_rate2006 results value is 240. Log Replication Configuration Within Step 6, you will define your hourly log generation rate, the network link, and the network link latency you expect to have within your site resilient architecture. How many transaction logs are generated for each hour in the day? Enter in the percentage of transaction logs that are generated for each hour in the day by measuring an existing Exchange 2003 or Exchange 2007 server in your environment.
If the existing messaging environment is not using Exchange, then evaluate the messaging environment and enter in the rate of change per hour here. Now you may be wondering how you can collect this data. We've written a simple VBS script that will collect all files in a folder and output them to a log file. You can use Task Scheduler to execute this script at certain intervals in the day (e.g. every 15 minutes). Once you have generated the log file for a 24 hour period, you can import it into Excel, massage the data (i.e. remove duplicate entries) and determine how many logs are generated for each hour. If you do this for each storage group, you will be able to determine your log generation rate for each hour in the day. This script is named collectlogs.vbsrename (just rename it to collectlogs.vbs) and you can find it here: Collectlogs VBS script. Network Configuration What type of network link will you be using between the servers? Select the appropriate network link you will be using between the two datacenters. What is the latency on the network link? Enter in the latency (in milliseconds) that exists on the network link. Role Requirements This section provides the solution's I/O, capacity, memory, and CPU requirements. Based on the above input factors the calculator will recommend the following architecture, broken down into four sections: Environment Configuration; Active Database Copy Configuration; Server Configuration; and Log, Disk Space, and IO Requirements. Processor Core Ratio Requirements This table identifies the number of processor cores required to support the activated databases. This table is only populated if you populate the processor core megacycle information on the Input tab. The Recommended Minimum Number of Global Catalog Cores identifies the minimum number of processor cores required to sustain the load for global catalog related activities and is based on the number of processor cores required to support the activated databases.
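Because the processor tables are only populated from the megacycle inputs, it helps to double-check the virtualized SPECint2006 value from the Processor Configuration step before entering it. The sketch below reproduces the X/(N*Y) formula and the DL580 G7 example from that step. The function names are hypothetical, and restricting Y to 1 or 2 simply mirrors the two cases the article lists.

```python
def per_vcpu_specint_rate(host_rate: float, physical_cores: int, y: int) -> float:
    """X / (N * Y): per virtual processor SPECint2006 rate value.

    host_rate      -- X, the SPECint2006 rate value of the hypervisor host
    physical_cores -- N, the physical cores in the hypervisor host
    y              -- Y, 1 for 1:1 or 2 for up to 2:1 virtual-to-physical
    """
    if y not in (1, 2):
        raise ValueError("Y must be 1 (1:1) or 2 (up to 2:1)")
    return host_rate / (physical_cores * y)


def guest_specint_rate(host_rate: float, physical_cores: int, y: int,
                       virtual_processors: int) -> float:
    """Rate value for a guest with the given number of virtual processors."""
    return per_vcpu_specint_rate(host_rate, physical_cores, y) * virtual_processors


# Example from the text: host rated 757, 4 sockets x 8 cores, 1:1, 4 vCPUs.
per_vp = per_vcpu_specint_rate(757, 32, 1)
guest = guest_specint_rate(757, 32, 1, 4)
print(round(per_vp, 2))  # 23.66
print(round(guest))      # 95
```

Carrying the unrounded per-processor value through gives 94.625 for the guest, which rounds to the same 95 used in the example.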
Client Access Server Requirements This table identifies the memory and CPU requirements for dedicated Client Access servers if you choose to not co-locate the server roles. This table is only populated if you populate the processor core megacycle information on the Input tab. The Recommended Minimum Number of Client Access Processor Cores identifies the minimum number of processor cores required per server to sustain the load for client related activities. The Recommended Minimum Number of Client Access Server RAM Configuration identifies the minimum amount of memory required per server to sustain the load for client related activities. This number is scaled to a multiple that can be typically installed in a server. Hub Transport Server Requirements This table identifies the memory and CPU requirements for dedicated Hub Transport servers if you choose to not co-locate the server roles. This table is only populated if you populate the processor core megacycle information on the Input tab. The Recommended Minimum Number of Hub Transport Processor Cores identifies the minimum number of processor cores required per server to sustain the load for client related activities. The Recommended Minimum Number of Hub Transport RAM Configuration identifies the minimum amount of memory required per server to sustain the load for client related activities. This number is scaled to a multiple that can be typically installed in a server. Environment Configuration The Environment Configuration table identifies the number of mailboxes being deployed in each datacenter, as well as, how many mailbox servers and lagged copy servers you will deploy in each datacenter. This table will also identify the minimum number of dedicated Hub Transport and Client Access servers you should deploy in each datacenter (taking into account worst case failure mode of two simultaneous server failures). 
User Mailbox Configuration

The Mailbox Configuration table provides you with:

- The Number of Mailboxes that you entered in the Input section (this value includes the projected growth).
- The Number of Mailboxes / Database provides a breakdown of how many mailboxes from each mailbox tier will be stored within a database.
- The User Mailbox Size within Database is the actual mailbox size on disk. It factors in the prohibit send/receive limit, the number of messages the user sends/receives per day, the deleted item retention window (with or without calendar version logging and single item recovery enabled), and the average database daily churn per mailbox. Note that the mailbox size on disk is actually higher than your mailbox size limit; this is to be expected.
- The Transaction Logs Generated / Mailbox value is based on the message profile selected and the average message size, and indicates how many transaction logs will be generated per mailbox per day. The log generation numbers per message profile account for:
  - Message size impact. In our analysis of the databases internally we have found that 90% of the database is the attachments and message tables (message bodies and attachments). So if the average message size doubles (from 75 KB to 150 KB), the worst case scenario would be for the log traffic to increase by 1.9 times (the 90% that scales with message size doubles while the remaining 10% is unchanged: 0.9 x 2 + 0.1 = 1.9). Thereafter, as message size doubles, the impact doubles.
  - Amount of data sent/received.
  - Database health maintenance operations.
  - Records Management operations.
  - Data stored in the mailbox that is not a message (tasks, local calendar appointments, contacts, etc.).
  - Forced log rollover (a mechanism that periodically closes the current transaction log file and creates the next generation).
- The IOPS / Mailbox value is the calculated IOPS / Mailbox value that is based on the number of messages per mailbox, the user memory profile, and desktop search engine choices.
If you chose to enter a specific IOPS / mailbox value rather than allowing the calculator to determine the value based on the above requirements, then this value will be that custom value.
- The Read:Write ratio / Mailbox value defines the ratio of the mailbox's IOPS that are read I/Os. This information is required to accurately design the storage subsystem I/O requirements.

Database Copy Instance Configuration

This table highlights how many HA mailbox database copy instances and lagged database copy instances your solution will have within each datacenter for a given DAG.

Database Configuration

The Database Configuration table provides you with:

- The Number of Databases is the calculated number of databases required to support the mailbox population within a standalone server or DAG.
- The Recommended Number of Mailboxes / Database is the calculated number of mailboxes per database, ensuring that the database size does not go above the recommended database size limit.
- The Available Database Cache / Mailbox value is the amount of database cache memory that is available per mailbox. A larger database cache reduces read I/Os.

Database Copy Configuration

The Database Copy Configuration table provides you with the number of database copies being deployed within each server and the total number of database copies within the DAG.

Server Configuration

The Server Configuration table provides you with the following:

- The Recommended RAM Configuration for the primary datacenter mailbox servers, secondary datacenter mailbox servers, and lagged copy servers. This is the amount of RAM needed to support the maximum number of activated database copies on a given server, in addition to the number of mailboxes based on their memory profile.
- The number of processor cores utilized during the worst failure mode scenario.
- The CPU Utilization value is the expected CPU utilization for a fully utilized server, based on the megacycles associated with the user profile and the number of database copies. Depending on the environment, this will either be for a standalone server hosting 100% active databases, or a server participating in a DAG that is dealing with a single or double server failure event (or secondary datacenter activation). It is recommended that servers not exceed 80% utilization during the peak period. The CPU utilization value is determined by taking the CPU Megacycle Requirements and dividing them by the total number of megacycles available on the server (which is based on the CPU and number of cores). If the calculator highlights the CPU utilization with a red background, the design may not be able to sustain the load: either you must change the design (number of mailboxes, number of copies, etc.) or change the server CPU platform.
- The CPU Megacycle Requirements value defines the number of megacycles the primary datacenter servers must be able to sustain, either when all mailbox databases are active or for the number of mailbox database copies that are activated after a single server or double server failure event. For secondary servers hosting HA copies, this value defines the number of megacycles required to support the activation of all databases after datacenter activation. For lagged copy servers, this value defines the number of megacycles required to support all of the passive lagged copies.
- The Server Total Available Adjusted Megacycles value defines the total available megacycles the server platform is capable of delivering at 100% CPU utilization. This value has been normalized against the baseline server platform (Intel Xeon X5470 2x4 3.33 GHz processors).
- The Possible Storage Architecture outlines whether the solution could utilize RAID or JBOD for the primary datacenter servers, secondary datacenter servers, and lagged copy servers.
JBOD is only considered under the following conditions (this assumes you configured the calculator to consider JBOD): In order to deploy on JBOD in the primary datacenter servers: You need a total of 3 or more HA copies within the DAG. If you are mixing lagged copies on the same server that is hosting your HA copies (i.e. not using dedicated lagged copy servers), then you need at least 2 lagged copies. For the secondary datacenter servers to use JBOD: You should have at least 2 HA copies in secondary datacenter. That way loss of a copy in the secondary datacenter doesn't result in requiring a reseed across the WAN or loss of data (in the datacenter activation case). If you are mixing lagged copies on the same server that is hosting your HA copies (i.e. not using dedicated lagged copy servers), then you need at least 2 lagged copies. For dedicated lagged copy servers: You should have at least 2 lagged copies within a datacenter in order to use JBOD. Otherwise loss of disk results in loss of your lagged copy (and whatever protection mechanism that was providing). Transaction Log Requirements The Transaction Log Requirements table provides you with: The User Transaction Logs Generated / Day indicates how many transaction logs will be generated during the day for each active database, each server, within the DAG, and within the environment. The Average Mailbox Move Transaction Logs Generated / Day indicates how many transaction logs will be generated during the day for active database, each server, within a DAG, and within the environment. This number is an assumption and assumes that an equal percentage of mailboxes will be moved each day, as opposed to moving all mailboxes on the same day. The Average Transaction Logs Generated / Day is the total number of transaction logs that are generated per day for active database, each server, within a DAG, and within the environment (includes user generated logs and mailbox move generated logs). 
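The JBOD conditions listed above can be expressed as a small set of checks. This is a sketch of the rules as stated in the text, not the calculator's actual logic; the function and parameter names are hypothetical, and the text does not say whether the "at least 2 lagged copies" requirement is counted per DAG or per datacenter, so the caller decides what `lagged_copies` represents.

```python
def jbod_ok_primary(ha_copies_in_dag, lagged_mixed_with_ha=False, lagged_copies=0):
    """Primary datacenter servers: 3 or more HA copies within the DAG, and
    at least 2 lagged copies if lagged copies share servers with HA copies."""
    if ha_copies_in_dag < 3:
        return False
    if lagged_mixed_with_ha and lagged_copies < 2:
        return False
    return True


def jbod_ok_secondary(ha_copies_in_secondary, lagged_mixed_with_ha=False, lagged_copies=0):
    """Secondary datacenter servers: at least 2 HA copies in that datacenter,
    so losing one copy does not force a reseed across the WAN."""
    if ha_copies_in_secondary < 2:
        return False
    if lagged_mixed_with_ha and lagged_copies < 2:
        return False
    return True


def jbod_ok_dedicated_lagged(lagged_copies_in_datacenter):
    """Dedicated lagged copy servers: at least 2 lagged copies in the datacenter."""
    return lagged_copies_in_datacenter >= 2


# Example: 3 HA copies in the DAG, only 1 of them in the secondary datacenter.
print(jbod_ok_primary(3), jbod_ok_secondary(1))
```

In the example, the primary datacenter servers qualify for JBOD while the secondary datacenter servers do not, because a single disk loss there would remove the only local copy.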
Disk Space Requirements

The Disk Space Requirements table provides you with:

- The Database Space Required is the amount of space required to support each database and its corresponding copies. This value is derived from the mailbox size on disk, the data overhead factor, and whether a dedicated restore LUN is available. This row also shows you the space requirements for each server (based on the total number of database copies), each DAG, and within the environment.
- The Log Space Required is the amount of space required to support each database log stream and the corresponding copies. This value takes into account the number of mailboxes moved per week (assumes worst case and that all mailboxes are moved on the same day), the type of backup frequency in use, the number of days that can be tolerated without log truncation, and the number of transaction logs generated per day. This row also shows you the space requirements for each server (based on the total number of database copies), each DAG, and within the environment.
- The Database LUN Space Required is the LUN size required to support the database (and potentially its log stream). This calculation takes the total disk space required for the database and adds to it the size of a database plus 110% (if a dedicated restore LUN does not exist) for offline maintenance operations, an additional 10% of the database size for content indexing (if enabled), and includes an amount of free space to ensure the LUN is not 100% utilized (based on LUN Free Space Percentage). This row also shows you the space requirements for each server (based on the total number of database copies), each DAG, and within the environment.
- The Log LUN Space Required is the LUN size required to support the database's log stream. This field lists the amount of space required to support the transaction logs for a given set of databases and includes an amount of free space to ensure the LUN is not 100% utilized (based on LUN Free Space Percentage).
This row also shows you the space requirements for each server (based on the total number of database copies), each DAG, and within the environment. The Restore LUN Space Required is the amount of space needed to support a restore LUN if the option was selected in the Input Factor section; this will include space for up to 7 databases and 7 transaction log sets. Each server will be provisioned with a restore LUN. This row also shows you the space requirements for each server (based on the total number of database copies), each DAG, and within the environment. Host IO and Throughput Performance Requirements The Host IO and Throughput Performance Requirements table provides you with: The Total Required Database IOPS is the amount of read and write host I/O the database disk set must sustain during peak load (this does not factor in any RAID penalties). This row also shows you the IOPS requirements for each server (based on the total number of database copies), each DAG, and within the environment The Total Required Log IOPS is the amount of read and write host I/O that will occur against the transaction log disk set. This row also shows you the IOPS requirements for each server (based on the total number of database copies), each DAG, and within the environment. The Database Read I/O Percentage defines the percentage of database required IOPS that are read I/Os. This information is required to accurately design the storage subsystem I/O requirements. The amount of throughput required for Background Database Maintenance operations. Special Notes The Special Notes table will provide you with additional information about your design: When to use GPT disks (when a LUN size is greater than 2TB). Activation Scenarios If you are deploying a highly available and/or site resilient architecture, then this section will break down the failure scenarios. 
The section is broken up into two scenarios: Scenario 1 - Deploying a DAG architecture within a single datacenter or deploying it in a site resilient Active/Passive user distribution model. Scenario 2 - Deploying a site resilient DAG architecture with an Active/Active user distribution model. Important: For the purposes of this calculator, the term "primary datacenter" refers to the datacenter that is preferred for hosting the active copies for a given set of databases, while the term "secondary datacenter" refers to the disaster recovery datacenter that is used for datacenter activation and cross-site database failover events. Single Datacenter and Active/Passive Environments The DAG Member Layout table identifies the number of Active Mailbox servers (those that are hosting active mailboxes within the primary datacenter), the Disaster Recovery Mailbox Servers (those that host passive database copies in the second datacenter), and any Lagged Copy Mailbox servers you may be deploying. There are two tables that provide data around the Active Database configuration, one for the primary datacenter, which outlines the single or double server events, and one for the secondary datacenter, which outlines the activation of that datacenter when the primary datacenter is lost. Both tables provide you with: The Number of Active Databases (Normal Run Time) value defines the number of active databases hosted on each server when there are no server outages. Unlike Exchange 2007, Exchange 2010 is no longer bound by an active/passive high availability model. Instead, each server within a DAG can host active mailbox database copies. The calculator distributes the number of unique databases across the primary datacenter servers within the DAG, ensuring an equal distribution of mailbox database copies are activated on each server. This row exposes the total number of mailboxes that are accessible on each server as a result of the activated database copies. 
In addition, this row highlights the total number of databases deployed within each datacenter, in the event that cross-site database *overs are allowed to occur. The Number of Active Databases (After First Server Failure) value defines the number of active databases hosted on each server when there is a single server outage. As a result of the single server outage, the database copies that were activated on the failed server are equally redistributed across all remaining server nodes. This row exposes the total number of mailboxes that are accessible on each server as a result of the activated database copies. In addition, this row highlights the total number of databases deployed within each datacenter, in the event that cross-site database *overs are allowed to occur. The Number of Active Databases (After Double Server Failure) value is populated when you have at least 3 HA mailbox copies and at least 4 mailbox servers within your design. It defines the number of active databases hosted on each server when there are two server outages. As a result of the double server outage, the database copies that were activated on the failed servers are equally redistributed across all remaining server nodes. This row exposes the total number of mailboxes that are accessible on each server as a result of the activated database copies. In addition, this row highlights the total number of databases deployed within each datacenter, in the event that cross-site database *overs are allowed to occur. Active/Active Environments This section breaks out the architecture into two perspectives, the layout of Datacenter 1 and the layout of Datacenter 2 with respect to the DAG architecture. Recall that Active/Active (Single DAG) - This model stretches a DAG across the two datacenters and has active mailboxes located in each datacenter. A corresponding passive copy is located in the alternate datacenter. This scenario does have a single point of failure (potentially), the WAN connection. 
Loss of the WAN connection will result in the mailbox servers in one of the datacenters going into a failed state from a failover cluster perspective (due to loss of quorum):

    ---DC1---          ---DC2---
    DAG1               DAG1
    Active Copies      Passive Copies
    Passive Copies     Active Copies

Active/Active (Multiple DAGs) - This model leverages multiple DAGs to remove single points of failure (e.g., the WAN). In this model, there are at least two DAGs, with each DAG having its active copies in the alternate datacenter:

    ---DC1---                         ---DC2---
    DAG-A (Active/Passive Copies)     DAG-A (Passive Copies)
    DAG-B (Passive Copies)            DAG-B (Active/Passive Copies)

The DAG Member Layout table identifies the number of Active Mailbox servers (those that are hosting active mailboxes within the primary datacenter), the Disaster Recovery Mailbox Servers (those that host passive database copies in the second datacenter), and any Lagged Copy Mailbox servers you may be deploying.

There are two tables that provide data around the Active Database configuration: one for the primary datacenter, which outlines the single or double server events, and one for the secondary datacenter, which outlines the activation of that datacenter when the primary datacenter is lost. Both tables provide you with:

The Number of Active Databases (Normal Run Time) value defines the number of active databases hosted on each server when there are no server outages. Unlike Exchange 2007, Exchange 2010 is no longer bound by an active/passive high availability model. Instead, each server within a DAG can host active mailbox database copies. The calculator distributes the number of unique databases across the primary datacenter servers within the DAG, ensuring that an equal number of mailbox database copies is activated on each server. This row exposes the total number of mailboxes that are accessible on each server as a result of the activated database copies.
In addition, this row highlights the total number of databases deployed within each datacenter, in the event that cross-site database *overs are allowed to occur. The Number of Active Databases (After First Server Failure) value defines the number of active databases hosted on each server when there is a single server outage. As a result of the single server outage, the database copies that were activated on the failed server are equally redistributed across all remaining server nodes. This row exposes the total number of mailboxes that are accessible on each server as a result of the activated database copies. In addition, this row highlights the total number of databases deployed within each datacenter, in the event that cross-site database *overs are allowed to occur. The Number of Active Databases (After Double Server Failure) value is populated when you have at least 3 HA mailbox copies and at least 4 mailbox servers within your design. It defines the number of active databases hosted on each server when there are two server outages. As a result of the double server outage, the database copies that were activated on the failed servers are equally redistributed across all remaining server nodes. This row exposes the total number of mailboxes that are accessible on each server as a result of the activated database copies. In addition, this row highlights the total number of databases deployed within each datacenter, in the event that cross-site database *overs are allowed to occur. Distribution The calculator includes a new worksheet, Distribution. Within the Distribution worksheet, you will find the layout we recommend based on the database copy layout principles. The Distribution worksheet includes several new options to help you with designing and deploying your database copies: You can determine what the active copy distribution will be like as server failures occur within your environment. 
You can export a set of CSV files and PowerShell scripts that perform the following actions:

- Diskpart.ps1 (uses Servers.csv) - Formats the physical disks, and mounts them as mount points under an anchor directory on each server.
- CreateMBDatabases.ps1 (uses MailboxDatabases.csv) - Creates the mailbox database copies with activation preference value of 1.
- CreateMBDatabaseCopies.ps1 (uses MailboxDatabaseCopies.csv) - Creates the mailbox database copies with the appropriate activation preference values across the server infrastructure.

Important: The database copy layout the tool provides assumes that each server and its associated database copies are isolated from every other server and its copies. It is important to take failure domains into account when planning your database copy layout architecture so that you can avoid having multiple copy failures for the same database.

LUN Design

The LUN Requirements section is really a continuation of the Storage Requirements section. It outlines what we believe is the appropriate LUN design based on the input factors and the analysis performed in the previous sections.

Note: The term LUN utilized in the calculator refers only to the representation of the disk that is exposed to the host operating system. It does not define the disk configuration.

The LUN Design highlights the LUN architecture chosen for this server solution. The architecture is derived from the backup type, backup frequency, and high availability architecture that were chosen in the Storage Requirements section. There are three types of LUN architecture that can be leveraged within Exchange 2010:

- 1 LUN / Database
- 2 LUNs / Database
- 2 LUNs / Backup Set

1 LUN / Database

A single LUN per Database architecture means that both the database and its corresponding log files are placed on the same LUN.
In order to deploy a LUN architecture that only utilizes a single LUN per database, you must have a Database Availability Group that has 2 or more copies and not be utilizing a hardware based VSS solution. Some of the benefits of this strategy include: Simplified storage administration. Fewer LUNs to manage. Potentially reduce the number of backup jobs. Flexibility to isolate the performance between Databases when not sharing spindles between LUNs. Some of the concerns with this strategy include: Limits the ability to take hardware based VSS backup and restores (e.g., clone snapshots). See Best Practices for Using Volume Shadow Copy Service with Exchange Server 2003 for more VSS details. 2 LUNs / Database With Exchange 2010, in the maximum case of 100 Databases, the number of LUNs you provision will depend upon your backup strategy. If your recovery time objective (RTO) is very small, or if you use VSS clones for fast recovery, it may be best to place each Database on its own transaction log LUN and database LUN. Because doing this will exceed the number of available drive letters, volume mount points must be used. Some of the benefits of this strategy include: Enables hardware-based VSS at a database level, providing single database backup and restore. Flexibility to isolate the performance between databases when not sharing spindles between LUNs. Increased reliability. A capacity or corruption problem on a single LUN will only impact one database. This is an important consideration when you are not leveraging the built-in mailbox resiliency features. Some of the concerns with this strategy include: 100 databases require 200 LUNs which could exceed some storage array maximums. A separate LUN for each database causes more LUNs per server increasing the administrative costs and complexity. 2 LUNs / Backup Set A backup set is the number of databases that are fully backed up in a night. A solution that performs a full backup on 1/7th of the databases nightly (i.e. 
using a weekly or bi-monthly full backup with daily incrementals or differentials) can reduce complexity by placing all of the databases to be backed up on the same log and database LUN. This can reduce the number of LUNs on the server. Some of the benefits of this strategy include: Simplified storage administration. Fewer LUNs to manage. Potentially reduce the number of backup jobs. Some of the concerns with this strategy include: Limits the ability to take hardware based VSS backup and restores (e.g., clone snapshots). See Best Practices for Using Volume Shadow Copy Service with Exchange Server 2003 for more VSS details. A capacity or corruption problem on a single LUN could impact more than one Database. Results Pane Based on the above input factors the calculator will recommend the following architecture: LUN Design The LUN Design table highlights the recommended LUN architecture. LUN Configuration The LUN Configuration table highlights the number of databases that should be placed on a single LUN. This is derived from LUN Architecture model. This section also documents how many LUNs will be required for the entire solution, broken out by Database and Log sets, and the number of restore LUNs per server. Database Configuration The Database Configuration table outlines the number of databases (or copies) per server, the number of mailboxes per database, the size of each database, and the transaction log size required for each database. Database and Log LUN Design The database and log LUN Design table outlines the physical LUN layout and follows the recommended number of databases per LUN approach based on the LUN Architecture model. It also documents the LUN size required to support layout (this is where we factor in the additional capacity for content indexing, the LUN Free Space Percentage, and whether you are using a Restore LUN), as well as the transaction log LUN. Important: The DB and Log LUN Design Table identify databases by a unique number. 
However, databases copies are distributed across the servers, and thus, these numbers hold no significance and are used solely as an example to show a server's LUN layout. Backup Requirements The Backup Requirements section is really a continuation of the Role Requirements section. It outlines what we believe is the appropriate backup design based on the input factors and the analysis performed in the previous sections. Backup Configuration The Backup Configuration table outlines the number of databases that will be placed within a single LUN and the type of backup methodology and frequency in which the backups will occur. Backup Frequency Configuration The Backup Frequency Configuration section will provide you with an outline on how you should perform the backups for each server, utilizing either a daily full backup or weekly or bi-monthly full backup frequency. Log Replication Requirements The Log Replication Requirements section is another continuation of the Role Requirements section. It outlines what we believe is the throughput required to replicate the transaction logs to each target database copy in the secondary datacenter. Peak Log and Content Index Replication Throughput Requirements The Peak Log and Content Index Replication Throughput Requirements table provides you with: The Peak Log & Content Index Throughput Required / Database is the total throughput required for a single log stream and content index. This value is based on the peak log generation hour. The Peak Log & Content Index Throughput Required Between Datacenters / DAG is the total throughput required to replicate the transaction logs and content index to all database copies (lagged and non-lagged) that exist within the alternate datacenter for the database availability group. 
The Peak Log & Content Index Throughput Required Between Datacenters / Environment is the total throughput required to replicate the transaction logs and content index to all database copies (lagged and non-lagged) that exist within the alternate datacenter for all database availability groups. RPO Log and Content Index Replication Throughput Requirements In terms of log replication, RPO means how behind can you get in log shipping? The lower the RPO (a value of 0 or 1 essentially means you want to only lose the open log file), the higher the bandwidth you need because you cannot get behind in log replication. The higher the RPO (approaching 24) less bandwidth is needed as you are expecting to be behind (up to x hours) in log replication and to catch up at some point in the day. The RPO Log and Content Index Replication Throughput Requirements table provides you with: The RPO Log & Content Index Throughput Required / Database is the required throughput necessary to replicate the transaction logs and content index based on the RPO to the mailbox servers that are located within the secondary datacenter per database. The RPO Log & Content Index Throughput Required Between Datacenters / DAG is the RPO total throughput required to replicate the transaction logs and content index to all database copies (lagged and non-lagged) that exist within the alternate datacenter for the database availability group. The RPO Log & Content Index Throughput Required Between Datacenters / Environment is the RPO total throughput required to replicate the transaction logs and content index to all database copies (lagged and non-lagged) that exist within the alternate datacenter for all database availability groups. Chosen Network Link Suitability The Chosen Network Link Suitability table will dictate whether the chosen network link has sufficient capacity to sustain the peak replication throughput requirements and/or the RPO replication throughput requirements. 
If the network link cannot sustain the log replication traffic, then you will need to either upgrade the network link to the recommended network link throughput, or adjust the design appropriately.

Recommended Network Link

The Recommended Network Link table recommends an appropriate network link if the chosen network link does not have sufficient capacity to sustain log replication for the solution for both the peak and RPO throughput requirements.

Note: The Network Link recommendations do not take into account database seeding or any other data that may also utilize the link.

Storage Design

The Storage Design worksheet is designed to take the data collected from the Input worksheet and Storage Requirements worksheet and help you determine the number of physical disks needed to support the databases, transaction logs, and Restore LUN configurations.

Storage Design Input Factors

In order to determine the physical disk requirements, you must enter some basic information about your storage solution.

RAID Parity Configuration

For the Database/Log RAID Parity Configuration table you need to select the type of RAID building block your storage solution utilizes. For example, some storage vendors build the underlying storage in sets of data+parity (d+p) groups. A RAID-5 3+1 configuration means that 3 disks will be used for capacity and 1 disk will be used for parity, even though parity is distributed across all the disks. So if your capacity requirement called for 15 disks' worth of capacity, you would need to deploy five 3+1 groups to build that RAID-5 array.

- RAID-1/0 supports 1d+1p, 2d+2p, and 4d+4p groupings.
- RAID-5 supports 3d+1p through 20d+1p groupings (though storage solutions could support more than that).
- RAID-6 supports 6d+2p groupings.

Database/Log RAID Rebuild Overhead

When a disk is lost, the disk needs to be replaced and rebuilt. During this time, the performance of the RAID group is affected. This impact can in turn affect user actions.
Therefore, to ensure that RAID rebuilds do not affect the overall performance of the mailbox server, Microsoft recommends that you should ensure sufficient overhead is provisioned into the performance calculations when designing for RAID parity. Most RAID-1/0 implementations will suffer a 25% performance penalty during a rebuild. Most RAID-5 and RAID-6 implementations will suffer a 50% performance penalty during a rebuild. The calculator defaults with the following as Microsoft recommendations, but they are adjustable: For RAID-1/0 implementations, ensure that you factor in an additional 35% performance overhead. For RAID-5/RAID-6 implementations, ensure that you factor in an additional 100% performance overhead. In addition, you should consult with your storage vendor to determine the appropriate RAID rebuild penalty. Database RAID Configuration By default, for RAID storage solutions, the calculator will recommend either RAID-1/0 or RAID-5 by evaluating capacity and I/O factors and determining which configuration utilizes the least amount of disks while satisfying the requirements. If you would like to override this and force the calculator to utilize a particular RAID configuration for your databases (e.g., RAID-0 or RAID-6), select "Yes" to this option and then select the appropriate RAID configuration in the cell labeled "Desired RAID Configuration." Note that while you can potentially override the database RAID configuration, you cannot do so for the log RAID configuration - that will always be RAID-1/0. Note: The calculator prevents the use of RAID-5 or RAID-6 with 5.2K, 5.4K, 5.9K and 7.2K disk types, due to performance implications. Restore LUN RAID Configuration You can select the type of parity you will be utilizing and the RAID configuration you will be deploying for your Restore LUN. Results Pane The Storage Design Results section outputs the recommended configuration for the solution. 
The recommendations made are for implementing the solution potentially on RAID and JBOD storage. RAID Storage Architecture The RAID Storage Architecture Table outlines which servers (primary datacenter servers, secondary datacenter servers, or lagged copy servers) should be deployed on RAID storage. The RAID Storage Architecture / Server table recommends the optimum RAID configuration and number of disks for each LUN (database, log and restore LUN) for each mailbox server ensuring that performance and capacity requirements are met within the design. JBOD Storage Architecture The JBOD Storage Architecture Table outlines which servers (primary datacenter servers, secondary datacenter servers, or lagged copy servers) could be deployed on JBOD storage. The JBOD Storage Architecture / Server table recommends the optimum JBOD configuration and number of disks for each LUN (database, log and restore LUN) for each mailbox server ensuring that performance and capacity requirements are met within the design. Total Disks Required By default, the calculator will determine the storage architecture that should be utilized to reduce the total number of disks required to support the design, in addition, to still ensuring you have minimized single points of failure by utilizing RAID and/or JBOD based on the decisions found in the "RAID Storage Architecture" and "JBOD Storage Architecture" tables. However, you can change the storage architecture to be built entirely on RAID or entirely on JBOD (if the design supports JBOD as a possible solution; also keep in mind that certain scenarios (e.g., a single database copy in a datacenter) may result in a single point of failure) by selecting the appropriate value in the "Storage Architecture will be Deployed:" drop-down. 
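As a rough illustration of how the d+p building block and the rebuild overhead inputs feed into a disk count, the arithmetic can be sketched as below. This is not the calculator's actual logic: the function names are hypothetical, and applying the rebuild overhead as a simple multiplier on the required IOPS is an assumption about how "an additional 35% (or 100%) performance overhead" might be factored in.

```python
import math


def raid_groups_needed(capacity_disks, data_per_group, parity_per_group):
    """Return (groups, total_disks) for a d+p RAID building block.

    capacity_disks is the number of disks' worth of usable capacity required;
    whole groups must be deployed, so the group count is rounded up.
    """
    groups = math.ceil(capacity_disks / data_per_group)
    total_disks = groups * (data_per_group + parity_per_group)
    return groups, total_disks


def iops_with_rebuild_overhead(required_iops, overhead_pct):
    """Assumption: the rebuild overhead is applied as a plain multiplier."""
    return required_iops * (1 + overhead_pct / 100)


# The article's example: 15 disks of capacity on RAID-5 3+1 groups.
groups, total = raid_groups_needed(15, 3, 1)
print(groups, total)  # 5 groups, 20 physical disks

# Default overheads quoted in the article, applied to a hypothetical 1000 IOPS.
print(iops_with_rebuild_overhead(1000, 35))   # RAID-1/0 default
print(iops_with_rebuild_overhead(1000, 100))  # RAID-5/RAID-6 default
```

The rounding step matters: a capacity requirement of 16 disks on the same 3+1 building block needs six groups, not five, because partial groups cannot be deployed.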
The Storage Configuration table outputs the total number of disks required for each mailbox server that requires RAID or JBOD storage, and also identifies the total number of disks requiring RAID or JBOD storage in each datacenter.

Conclusion

Hopefully you will find this calculator invaluable in helping to determine your mailbox server role requirements for Exchange 2010 mailbox servers. If you have any questions or suggestions, please email strgcalc AT microsoft DOT com. For the calculator itself, please see the attachment on this blog post.

Ross Smith IV

Database Maintenance in Exchange 2010
Over the last several months there has been significant chatter around what is background database maintenance and why is it important for Exchange 2010 databases. Hopefully this article will answer these questions. What maintenance tasks need to be performed against the database? The following tasks need to be routinely performed against Exchange databases: Database Compaction Database Defragmentation Database Checksumming Page Patching Page Zeroing Database Compaction The primary purpose of database compaction is to free up unused space within the database file (however, it should be noted that this does not return that unused space to the file system). The intention is to free up pages in the database by compacting records onto the fewest number of pages possible, thus reducing the amount of I/O necessary. The ESE database engine does this by taking the database metadata, which is the information within the database that describes tables in the database, and for each table, visiting each page in the table, and attempting to move records onto logically ordered pages. Maintaining a lean database file footprint is important for several reasons, including the following: Reducing the time associated with backing up the database file Maintaining a predictable database file size, which is important for server/storage sizing purposes. Prior to Exchange 2010, database compaction operations were performed during the online maintenance window. This process produced random IO as it walked the database and re-ordered records across pages. This process was literally too good in previous versions – by freeing up database pages and re-ordering the records, the pages were always in a random order. Coupled with the store schema architecture, this meant that any request to pull a set of data (like downloading items within a folder) always resulted in random IO. In Exchange 2010, database compaction was redesigned such that contiguity is preferred over space compaction. 
In addition, database compaction was moved out of the online maintenance window and is now a background process that runs continuously.

Database Defragmentation

Database defragmentation is new to Exchange 2010 and is also referred to as OLD v2 and B+ tree defragmentation. Its function is to compact as well as defragment (make sequential) database tables that have been marked/hinted as sequential. Database defragmentation is important to maintain efficient utilization of disk resources over time (make the IO more sequential as opposed to random) as well as to maintain the compactness of tables marked as sequential.

You can think of the database defragmentation process as a monitor that watches other database page operations to determine if there is work to do. It monitors all tables for free pages, and if a table reaches a threshold where a significantly high percentage of the total B+ tree page count is free, it gives the free pages back to the root. It also works to maintain contiguity within a table set with sequential space hints (a table created with a known sequential usage pattern). If database defragmentation sees a scan/pre-read on a sequential table and the records are not stored on sequential pages within the table, the process will defragment that section of the table by moving all of the impacted pages to a new extent in the B+ tree. You can use the performance counters (mentioned in the monitoring section) to see how little work database defragmentation performs once a steady state is reached.

Database defragmentation is a background process that analyzes the database continuously as operations are performed, and then triggers asynchronous work when necessary. Database defragmentation is throttled under two scenarios:

The max number of outstanding tasks: This keeps database defragmentation from doing too much work on the first pass if massive change has occurred in the database.
A latency throttle of 100ms When the system is overloaded, database defragmentation will start punting defragmentation work. Punted work will get executed the next time the database goes through that same operational pattern. There's nothing that remembers what defragmentation work was punted and goes back and executes it once the system has more resources. Database Checksumming Database checksumming (also known as Online Database Scanning) is the process where the database is read in large chunks and each page is checksummed (checked for physical page corruption). Checksumming’s primary purpose is to detect physical corruption and lost flushes that may not be getting detected by transactional operations (stale pages). With Exchange 2007 RTM and all previous versions, checksumming operations happened during the backup process. This posed a problem for replicated databases, as the only copy to be checksummed was the copy being backed up. For the scenario where the passive copy was being backed up, this meant that the active copy was not being checksummed. So in Exchange 2007 SP1, we introduced a new optional online maintenance task, Online Maintenance Checksum (for more information, see Exchange 2007 SP1 ESE Changes – Part 2). In Exchange 2010, database scanning checksums the database and performs post Exchange 2010 Store crash operations. Space can be leaked due to crashes, and online database scanning finds and recovers lost space. Database checksum reads approximately 5 MB per second for each actively scanning database (both active and passive copies) using 256KB IOs. The I/O is 100 percent sequential. The system in Exchange 2010 is designed with the expectation that every database is fully scanned once every seven days. 
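The quoted scan rate and the seven-day expectation together imply a rough upper bound on how large a database can be and still complete a full checksum pass on time. The sketch below works that out. It assumes the approximate 5 MB/s figure holds for the entire pass, which is a best case since the scan is throttled by IO latency, and the function names are illustrative only.

```python
SCAN_RATE_MB_PER_SEC = 5   # approximate per-database rate quoted in the article
SCAN_WINDOW_DAYS = 7       # expected time to complete one full pass
SECONDS_PER_DAY = 86_400


def full_scan_days(database_size_gb, rate_mb_per_sec=SCAN_RATE_MB_PER_SEC):
    """Best-case days needed to checksum a database of the given size."""
    size_mb = database_size_gb * 1024
    return size_mb / rate_mb_per_sec / SECONDS_PER_DAY


def finishes_within_window(database_size_gb, rate_mb_per_sec=SCAN_RATE_MB_PER_SEC):
    """True if one full pass fits inside the seven-day expectation."""
    return full_scan_days(database_size_gb, rate_mb_per_sec) <= SCAN_WINDOW_DAYS


for size_gb in (512, 1024, 2048, 4096):
    days = full_scan_days(size_gb)
    print(f"{size_gb} GB: {days:.2f} days, on time: {finishes_within_window(size_gb)}")
```

At a sustained 5 MB/s, seven days (604800 seconds) covers 3,024,000 MB, or roughly 2.9 TB. Any latency-induced slowdown lowers that ceiling.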
If the scan takes longer than seven days, an event is recorded in the Application Log:

Event ID: 733
Event Type: Information
Event Source: ESE
Description: Information Store (15964) MDB01: Online Maintenance Database Checksumming background task is NOT finishing on time for database 'd:\mdb\mdb01.edb'. This pass started on 11/10/2011 and has been running for 604800 seconds (over 7 days) so far.

If it takes longer than seven days to complete the scan on the active database copy, the following entry will be recorded in the Application Log once the scan has completed:

Event ID: 735
Event Type: Information
Event Source: ESE
Description: Information Store (15964) MDB01 Database Maintenance has completed a full pass on database 'd:\mdb\mdb01.edb'. This pass started on 11/10/2011 and ran for a total of 777600 seconds. This database maintenance task exceeded the 7 day maintenance completion threshold. One or more of the following actions should be taken: increase the IO performance/throughput of the volume hosting the database, reduce the database size, and/or reduce non-database maintenance IO.

In addition, an in-flight warning will also be recorded in the Application Log when it takes longer than 7 days to complete.

In Exchange 2010, there are now two modes to run database checksumming on active database copies:

Run in the background 24×7: This is the default behavior. It should be used for all databases, especially for databases that are larger than 1TB. Exchange scans the database no more than once per day. This read I/O is 100 percent sequential (which makes it easy on the disk) and equates to a scanning rate of about 5 megabytes (MB)/sec on most systems. The scanning process is single threaded and is throttled by IO latency. The higher the latency, the more database checksum slows down because it is waiting longer for the last batch to complete before issuing another batch scan of pages (8 pages are read at a time).
Run in the scheduled mailbox database maintenance process When you select this option, database checksumming is the last task. You can configure how long it runs by changing the mailbox database maintenance schedule. This option should only be used with databases smaller than 1 terabyte (TB) in size, which require less time to complete a full scan. Regardless of the database size, our recommendation is to leverage the default behavior and not configure database checksum operations against the active database as a scheduled process (i.e., don’t configure it as a process within the online maintenance window). For passive database copies, database checksums occur during runtime, continuously operating in the background. Page Patching Page patching is the process where corrupt pages are replaced by healthy copies. As mentioned previously, corrupt page detection is a function of database checksumming (in addition, corrupt pages are also detected at run time when the page is stored in the database cache). Page patching works against highly available (HA) database copies. How a corrupt page is repaired depends on whether the HA database copy is active or passive. Page patching process On active database copies On passive database copies A corrupt page(s) is detected. A marker is written into the active log file. This marker indicates the corrupt page number and that page requires replacement. An entry is added to the page patch request list. The active log file is closed. The Replication service ships the log file to passive database copies. The Replication service on a target Mailbox server receives the shipped log file and inspects it. The Information Store on the target server replays the log file and replays up to marker, retrieves its healthy version of the page, invokes Replay Service callback and ships the page to the source Mailbox server. 
The source Mailbox server receives the healthy version of the page, confirms that there is an entry in the page patch request list, then writes the page to the log buffer, and correspondingly, the page is inserted into the database cache. The corresponding entry in the page patch request list is removed. At this point the database is considered patched (at some later point the checkpoint will advance and the database cache will be flushed and the corrupt page on disk will be overwritten). Any other copy of this page (received from another passive copy) will be silently dropped, because there is no corresponding entry in the page patch request list. On the Mailbox server where the corrupt page(s) is detected, log replay is paused for the affected database copy. The replication service coordinates with the Mailbox server that is hosting the active database copy and retrieves the corrupted page(s) and the required log range from the active copy’s database header. The Mailbox server updates the database header for the affected database copy, inserting the new required log range. The Mailbox server notifies the Mailbox server hosting the active database copy which log files it requires. The Mailbox server receives the required log files and inspects them. The Mailbox server injects the healthy versions of the database pages it retrieved from the active database copy. The pages are written to the log buffer, and correspondingly, the page is inserted into the database cache. The Mailbox server resumes log replay. Page Zeroing Database Page Zeroing is the process where deleted pages in the database are written over with a pattern (zeroed) as a security measure, which makes discovering the data much more difficult. With Exchange 2007 RTM and all previous versions, page zeroing operations happened during the streaming backup process. 
In addition since they occurred during the streaming backup process they were not a logged operation (e.g., page zeroing did not result in the generation of log files). This posed a problem for replicated databases, as the passive copies never had its pages zeroed, and the active copies would only have it pages zeroed if you performed a streaming backup. So in Exchange 2007 SP1, we introduced a new optional online maintenance task, Zero Database Pages during Checksum (for more information, see Exchange 2007 SP1 ESE Changes – Part 2). When enabled this task would zero out pages during the Online Maintenance Window, logging the changes, which would be replicated to the passive copies. With the Exchange 2007 SP1 implementation, there is significant lag between when a page is deleted to when it is zeroed as a result of the zeroing process occurring during a scheduled maintenance window. So in Exchange 2010 SP1, the page zeroing task is now a runtime event that operates continuously, zeroing out pages typically at transaction time when a hard delete occurs. In addition, database pages can also be scrubbed during the online checksum process. The pages targeted in this case are: Deleted records which couldn’t be scrubbed during runtime due to dropped tasks (if the system is too overloaded) or because Store crashed before the tasks got to scrub the data; Deleted tables and secondary indices. When these get deleted, we don’t actively scrub their contents, so online checksum detects that these pages don’t belong to any valid object anymore and scrubs them. For more information on page zeroing in Exchange 2010, see Understanding Exchange 2010 Page Zeroing. Why aren’t these tasks simply performed during a scheduled maintenance window? 
Requiring a scheduled maintenance window for page zeroing, database defragmentation, database compaction, and online checksum operations poses significant problems, including the following: Having scheduled maintenance operations makes it very difficult to manage 24x7 datacenters which host mailboxes from various time zones and have little or no time for a scheduled maintenance window. Database compaction in prior versions of Exchange had no throttling mechanisms and since the IO is predominantly random, it can lead to poor user experience. Exchange 2010 Mailbox databases deployed on lower tier storage (e.g., 7.2K SATA/SAS) have a reduced effective IO bandwidth available to ESE to perform maintenance window tasks. This is an issue because it means that IO latencies will increase during the maintenance window, thus preventing the maintenance activities to complete within a desired period of time. The use of JBOD provides an additional challenge to the database in terms of data verification. With RAID storage, it's common for an array controller to background scan a given disk group, locating and re-assigning bad blocks. A bad block (aka sector) is a block on a disk that cannot be used due to permanent damage (e.g. physical damage inflicted on the disk particles). It's also common for an array controller to read the alternate mirrored disk if a bad block was detected on the initial read request. The array controller will subsequently mark the bad block as “bad” and write the data to a new block. All of this occurs without the application knowing, perhaps with just a slight increase in the disk read latency. Without RAID or an array controller, both of these bad block detection and remediation methods are no longer available. Without RAID, it's up to the application (ESE) to detect bad blocks and remediate (i.e., database checksumming). Larger databases on larger disks require longer maintenance periods to maintain database sequentiality/compactness. 
Due to the aforementioned issues, it was critical in Exchange 2010 that the database maintenance tasks be moved out of a scheduled process and be performed during runtime continuously in the background.

Won’t these background tasks impact my end users?

We’ve designed these background tasks such that they're automatically throttled based on activity occurring against the database. In addition, our sizing guidance around message profiles takes these maintenance tasks into account.

How can I monitor the effectiveness of these background maintenance tasks?

In previous versions of Exchange, events in the Application Log would be used to monitor things like online defragmentation. In Exchange 2010, there are no longer any events recorded for the defragmentation and compaction maintenance tasks. However, you can use performance counters to track the background maintenance tasks under the MSExchange Database ==> Instances object:

Counter | Description
Database Maintenance Duration | The number of seconds that have passed since the maintenance started for this database. If the value is 0, maintenance has been finished for the day.
Database Maintenance Pages Bad Checksums | The number of non-correctable page checksums encountered during a database maintenance pass
Defragmentation Tasks | The count of background database defragmentation tasks that are currently executing
Defragmentation Tasks Completed/Sec | The rate of background database defragmentation tasks that are being completed

You'll find the following page zeroing counters under the MSExchange Database object:

Counter | Description
Database Maintenance Pages Zeroed | Indicates the number of pages zeroed by the database engine since the performance counter was invoked
Database Maintenance Pages Zeroed/sec | Indicates the rate at which pages are zeroed by the database engine

How can I check whitespace in a database?

You will need to dismount the database and use ESEUTIL /MS to check the available whitespace in a database.
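The space dump reports free space as a count of pages rather than bytes, so a conversion is needed. A minimal sketch of that conversion follows, assuming the 32 KB page size used by Exchange 2010 databases. The page count is a made-up value for illustration, not output from a real database.

```python
PAGE_SIZE_KB = 32  # Exchange 2010 database page size


def whitespace_gb(available_pages, page_size_kb=PAGE_SIZE_KB):
    """Convert a count of available (free) pages into gigabytes."""
    return available_pages * page_size_kb / (1024 * 1024)


# Hypothetical total of available pages taken from a space dump.
print(whitespace_gb(327_680))  # 10.0
```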
For an example, see http://technet.microsoft.com/en-us/library/aa996139(v=EXCHG.65).aspx (note that you have to multiply the number of pages by 32 KB).

Note that there is a status property available on databases within Exchange 2010, but it should not be used to determine the amount of total whitespace available within the database:

Get-MailboxDatabase MDB1 -Status | FL AvailableNewMailboxSpace

AvailableNewMailboxSpace tells you how much space is available in the root tree of the database. It does not factor in the free pages within mailbox tables, index tables, etc. It is not representative of the white space within the database.

How can I reclaim the whitespace?

Naturally, after seeing the available whitespace in the database, the question that always ensues is: how can I reclaim the whitespace?

Many assume the answer is to perform an offline defragmentation of the database using ESEUTIL. However, that's not our recommendation. When you perform an offline defragmentation you create an entirely new database, and the operations performed to create this new database are not logged in transaction logs. The new database also has a new database signature, which means that you invalidate the database copies associated with this database.

In the event that you do encounter a database that has significant whitespace and you don't expect that normal operations will reclaim it, our recommendation is:

1. Create a new database and associated database copies.
2. Move all mailboxes to the new database.
3. Delete the original database and its associated database copies.

A terminology confusion

Much of the confusion lies in the term background database maintenance. Collectively, all of the aforementioned tasks make up background database maintenance. However, the Shell, EMC, and JetStress all refer to database checksumming as background database maintenance, and that's what you're configuring when you enable or disable it using these tools.
Figure 1: Enabling background database maintenance for a database using EMC

Enabling background database maintenance using the Shell:

Set-MailboxDatabase -Identity MDB1 -BackgroundDatabaseMaintenance $true

Figure 2: Running background database maintenance as part of a JetStress test

My storage vendor has recommended I disable Database Checksumming as a background maintenance task, what should I do?

Database checksumming can become an IO tax burden if the storage is not designed correctly (even though it's sequential), as it performs 256KB read IOs and generates roughly 5MB/s per database. As part of our storage guidance, we recommend you configure your storage array stripe size (the size of stripes written to each disk in an array; also referred to as block size) to be 256KB or larger.

It's also important to test your storage with JetStress and ensure that the database checksum operation is included in the test pass.

In the end, if a JetStress execution fails due to database checksumming, you have a few options:

Don’t use striping: Use RAID-1 pairs or JBOD (which may require architectural changes) and get the most benefit from sequential IO patterns available in Exchange 2010.

Schedule it: Configure database checksumming to not be a background process, but a scheduled process. When we implemented database checksum as a background process, we understood that some storage arrays would be so optimized for random IO (or had bandwidth limitations) that they wouldn't handle the sequential read IO well. That's why we built it so it could be turned off (which moves the checksum operation to the maintenance window). If you do this, we do recommend smaller database sizes. Also keep in mind that the passive copies will still perform database checksum as a background process, so you still need to account for this throughput in your storage architecture. For more information on this subject see Jetstress 2010 and Background Database Maintenance.
Use different storage or improve the capabilities of the storage: Choose storage which is capable of meeting Exchange best practices (256KB+ stripe size).

Conclusion

The architectural changes to the database engine in Exchange Server 2010 dramatically improve its performance and robustness, but change the behavior of database maintenance tasks from previous versions. Hopefully this article helps you understand what background database maintenance is in Exchange 2010.

Ross Smith IV
Principal Program Manager
Exchange Customer Experience

Windows Server 2019 Build 17623 SMB1 does no more exists?
Hello,

As a test, I tried to enable SMB1 support on build 17623, but when I run the PowerShell command:

Set-SmbServerConfiguration -EnableSMB1Protocol $true

I get the error:

Set-SmbServerConfiguration : The specified service does not exist.
At line:1 char:1
+ Set-SmbServerConfiguration -EnableSMB1Protocol $true
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (MSFT_SmbServerConfiguration:ROOT/Microsoft/...erConfiguration) [Set-SmbServerConfiguration], CimException
+ FullyQualifiedErrorId : Windows System Error 1243,Set-SmbServerConfiguration

Could you confirm that it is no longer possible to enable SMB1? I know it is deprecated, but we still have a pair of Windows Server 2003 domain controllers, so the server running this build cannot join the domain.

Marco

Troubleshooting Rapid Growth in Databases and Transaction Log Files in Exchange Server 2007 and 2010
A few years back, a very detailed blog post was released on Troubleshooting Exchange 2007 Store Log/Database growth issues. We wanted to revisit this topic with Exchange 2010 in mind. While the troubleshooting steps needed are virtually the same, we thought it would be useful to condense the steps a bit, make a few updates and provide links to a few newer KB articles. The below list of steps is a walkthrough of an approach that would likely be used when calling Microsoft Support for assistance with this issue. It also provides some insight as to what we are looking for and why. It is not a complete list of every possible troubleshooting step, as some causes are simply not seen quite as much as others. Another thing to note is that the steps are commonly used when we are seeing “rapid” growth, or unexpected growth in the database file on disk, or the amount of transaction logs getting generated. An example of this is when an Administrator notes a transaction log file drive is close to running out of space, but had several GB free the day before. When looking through historical records kept, the Administrator notes that approx. 2 to 3 GBs of logs have been backed up daily for several months, but we are currently generating 2 to 3 GBs of logs per hour. This is obviously a red flag for the log creation rate. Same principle applies with the database in scenarios where the rapid log growth is associated to new content creation. In other cases, the database size or transaction log file quantity may increase, but signal other indicators of things going on with the server. For example, if backups have been failing for a few days and the log files are not getting purged, the log file disk will start to fill up and appear to have more logs than usual. In this example, the cause wouldn’t necessarily be rapid log growth, but an indicator that the backups which are responsible for purging the logs are failing and must be resolved. 
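The daily-versus-hourly comparison in the first example above can be made concrete with a small calculation. The sketch below assumes the 1 MB transaction log file size used by Exchange 2007 and 2010. The function names and the alert threshold are hypothetical and would need tuning against your own historical baseline.

```python
LOG_FILE_SIZE_MB = 1  # Exchange 2007/2010 transaction log files are 1 MB each


def growth_factor(baseline_gb_per_day, current_gb_per_hour):
    """How many times faster than the baseline logs are being generated."""
    return current_gb_per_hour * 24 / baseline_gb_per_day


def log_files_per_hour(gb_per_hour):
    """Approximate number of log files created per hour at a given rate."""
    return gb_per_hour * 1024 / LOG_FILE_SIZE_MB


def is_rapid_growth(baseline_gb_per_day, current_gb_per_hour, threshold=5.0):
    """Flag growth once it exceeds a (hypothetical) multiple of the baseline."""
    return growth_factor(baseline_gb_per_day, current_gb_per_hour) >= threshold


# The article's scenario, using the midpoint of "2 to 3 GB".
print(growth_factor(2.5, 2.5))        # 24.0 times the historical rate
print(log_files_per_hour(2.5))        # 2560.0 log files per hour
print(is_rapid_growth(2.5, 2.5))      # True
```

A log drive that used to absorb a full day of logs now fills at 24 times that rate, which is why free space that looked comfortable the day before can disappear within hours.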
Another example is with the database, where retention settings have been modified or online maintenance has not been completing; as a result, the database will begin to grow on disk and eat up free space. These scenarios and a few others are also discussed in the “Proactive monitoring and mitigation efforts” section of the previously published blog.

It should be noted that in some cases, you may run into a scenario where the database size is expanding rapidly, but you do not experience log growth at a rapid rate. (As with new content creation in rapid log growth, we would expect the database to grow at a rapid rate with the transaction logs.) This is often referred to as database “bloat” or database “space leak”. The steps to troubleshoot this specific issue can be a little more invasive, as you can see in some analysis steps listed here (taking databases offline, various kinds of dumps, etc.), and it may be better to utilize support for assistance if a reason for the growth cannot be found.

Once you have established that the rate of growth for the database and transaction log files is abnormal, we would begin troubleshooting the issue by doing the following steps. Note that in some cases the steps can be done out of order, but the below provides general suggested guidance based on our experiences in support.

Step 1

Use Exchange User Monitor (Exmon) server side to determine if a specific user is causing the log growth problems. Sort on CPU (%) and look at the top 5 users that are consuming the most CPU inside the Store process. Check the Log Bytes column to verify whether any of these users accounts for the log growth. If that does not show a possible user, sort on the Log Bytes column to look for any possible users that could be contributing to the log growth. If it appears that the user in Exmon is a ?, then this is representative of a HUB/Transport related problem generating the logs.
Query the message tracking logs using the Message Tracking Log tool in the Exchange Management Consoles Toolbox to check for any large messages that might be running through the system. See #15 for a PowerShell script to accomplish the same task. Step 2 With Exchange 2007 Service Pack 2 Rollup Update 2 and higher, you can use KB972705 to troubleshoot abnormal database or log growth by adding the described registry values. The registry values will monitor RPC activity and log an event if the thresholds are exceeded, with details about the event and the user that caused it. (These registry values are not currently available in Exchange Server 2010) Check for any excessive ExCDO warning events related to appointments in the application log on the server. (Examples are 8230 or 8264 events). If recurrence meeting events are found, then try to regenerate calendar data server side via a process called POOF. See http://blogs.msdn.com/stephen_griffin/archive/2007/02/21/poof-your-calender-really.aspx for more information on what this is. Event Type: Warning Event Source: EXCDO Event Category: General Event ID: 8230 Description: An inconsistency was detected in username@domain.com: /Calendar/<calendar item> .EML. The calendar is being repaired. If other errors occur with this calendar, please view the calendar using Microsoft Outlook Web Access. If a problem persists, please recreate the calendar or the containing mailbox. Event Type: Warning Event ID : 8264 Category : General Source : EXCDO Type : Warning Message : The recurring appointment expansion in mailbox <someone's address> has taken too long. The free/busy information for this calendar may be inaccurate. This may be the result of many very old recurring appointments. To correct this, please remove them or change their start date to a more recent date. 
Important: If 8230 events are consistently seen on an Exchange server, have the user delete/recreate that appointment to remove any corruption.

Step 3

Collect and parse the IIS log files from the CAS servers used by the affected Mailbox Server. You can use Log Parser Studio to easily parse IIS log files. Here, you can look for repeated user account sync attempts and suspicious activity. For example, a user with an abnormally high number of sync attempts and errors would be a red flag. If a user is found and suspected to be a cause for the growth, you can follow the suggestions given in steps 5 and 6.

Once Log Parser Studio is launched, you will see convenient tabs to search per protocol:

Some example queries for this issue would be:

Step 4

If a suspected user is found via Exmon, the event logs, KB972705, or parsing the IIS log files, then do one of the following:

Disable MAPI access to the user's mailbox using the following steps (Recommended):
1. Run Set-CASMailbox -Identity <Username> -MAPIEnabled $False
2. Move the mailbox to another Mailbox Store. Note: This is necessary to disconnect the user from the store due to the Store Mailbox and DSAccess caches. Otherwise you could potentially be waiting for over 2 hours and 15 minutes for this setting to take effect. Moving the mailbox effectively kills the user's MAPI session to the server, and after the move, the user's access to the store via a MAPI enabled client will be disabled.

Disable the user's AD account temporarily.

Kill their TCP connection with TCPView.

Call the client to have them close Outlook or turn off their mobile device while the problem is occurring, for immediate relief.

Step 5

If closing the client/devices, or killing their sessions, seems to stop the log growth issue, then we need to do the following to see if this is OST or Outlook profile related:

Have the user launch Outlook while holding down the Control key, which will prompt whether you would like to run Outlook in safe mode.
If launching Outlook in safe mode resolves the log growth issue, then concentrate on which add-ins could be contributing to this problem.

For a mobile device, consider a full resync or a new sync profile. Also check for any messages in the Drafts folder or Outbox on the device. A corrupted meeting or calendar entry is commonly found to be causing the issue with the device as well.

If you can gain access to the user's machine, then do one of the following:

1. Launch Outlook to confirm the log file growth issue on the server.
2. If log growth is confirmed, do one of the following:
   - Check the user's Outbox for any messages.
     - If the user is running in Cached mode, set the Outlook client to Work Offline. Doing this helps stop the message in the Outbox from being sent and sometimes causes the message to NDR.
     - If the user is running in Online mode, try moving the message to another folder to prevent Outlook or the HUB server from processing the message.
     - After each of the steps above, check the Exchange server to see if log growth has ceased.
   - Call Microsoft Product Support to enable debug logging of the Outlook client to determine possible root cause.
3. Follow the Running Process Explorer instructions in the article below to dump out the DLLs that are running within the Outlook process. Name the file username.txt. This helps check for any third-party Outlook add-ins that may be causing the excessive log growth.
   970920 Using Process Explorer to List dlls Running Under the Outlook.exe Process
   http://support.microsoft.com/kb/970920
4. Check the Sync Issues folder for any errors that might be occurring.

Let's attempt to narrow this down further to see if the problem is truly in the OST or possibly something Outlook profile related:

- Run ScanPST against the user's OST file to check for possible corruption.
- With the Outlook client shut down, rename the user's OST file to something else and then launch Outlook to recreate a new OST file.
If the problem does not occur, we know the problem is within the OST itself. If the problem recurs after renaming the OST, then recreate the user's profile to see if this might be profile related.

Step 6

Ask questions: Is the user using any type of device besides a mobile device? Question the end user, if at all possible, to understand what they might have been doing at the time the problem started occurring. It's possible that the user imported a lot of data from a PST file, which could cause log growth server side, or that a user action triggered some other erratic behavior.

Step 7

Check to ensure file-level antivirus exclusions are set correctly for both files and processes per http://technet.microsoft.com/en-us/library/bb332342(v=exchg.141).aspx

Step 8

If Exmon and the above methods do not provide the data necessary to get root cause, then collect a portion of the Store transaction log files (100 would be a good start) during the problem period and parse them following the directions in http://blogs.msdn.com/scottos/archive/2007/11/07/remix-using-powershell-to-parse-ese-transaction-logs.aspx to look for possible patterns, such as high pattern counts for IPM.Appointment. This will give you a high-level overview of whether something is looping or a high rate of messages is being sent.

Note: This tool may or may not provide any benefit depending on the data that is stored in the log files, but it will sometimes show MIME-encoded data that will help with your investigation.

Step 9

If nothing is found by parsing the transaction log files, we can check for a rogue, corrupted, and large message in transit:

1.
Check current queues against all HUB Transport servers for stuck or queued messages:

    get-exchangeserver | where {$_.IsHubTransportServer -eq "true"} | Get-Queue | where {$_.Deliverytype -eq "MapiDelivery"} | Select-Object Identity, NextHopDomain, Status, MessageCount | export-csv HubQueues.csv

Review queues for any that are in retry or have a lot of messages queued.

Export out message sizes in MB in all Hub Transport queues to see if any large messages are being sent through the queues:

    get-exchangeserver | where {$_.ishubtransportserver -eq "true"} | get-message -resultsize unlimited | sort-object -property size -descending | Select-Object Identity,Subject,status,LastError,RetryCount,queue,@{Name="Message Size MB";expression={$_.size.toMB()}} | export-csv HubMessages.csv

Export out message sizes in bytes in all Hub Transport queues:

    get-exchangeserver | where {$_.ishubtransportserver -eq "true"} | get-message -resultsize unlimited | Select-Object Identity,Subject,status,LastError,RetryCount,queue,size | sort-object -property size -descending | export-csv HubMessages.csv

2. Check users' Outboxes for any large, looping, or stranded messages that might be affecting overall log growth:

    get-mailbox -ResultSize Unlimited | Get-MailboxFolderStatistics -folderscope Outbox | Sort-Object Foldersize -Descending | select-object identity,name,foldertype,itemsinfolder,@{Name="FolderSize MB";expression={$_.folderSize.toMB()}} | export-csv OutboxItems.csv

Note: This does not get information for users that are running in cached mode.

Step 10

Use the MSExchangeIS Client\Jet Log Record Bytes/sec and MSExchangeIS Client\RPC Operations/sec Perfmon counters to see if there is a particular client protocol that may be generating excessive logs. If a particular protocol mechanism is found to be higher than other protocols for a sustained period of time, then consider shutting down the service hosting the protocol.
For example, if Exchange Outlook Web Access is the protocol generating potential log growth, then stop the World Wide Web Publishing Service (W3SVC) to confirm that log growth stops. If log growth stops, then collecting IIS logs from the CAS/MBX Exchange servers involved will help provide insight into what action the user was performing that caused this to occur.

Step 11

Run the following commands from the Exchange Management Shell to export out current user operation rates.

To export to a CSV file:

    get-logonstatistics | select-object username,Windows2000account,identity,messagingoperationcount,otheroperationcount,progressoperationcount,streamoperationcount,tableoperationcount,totaloperationcount | where {$_.totaloperationcount -gt 1000} | sort-object totaloperationcount -descending | export-csv LogonStats.csv

To view realtime data:

    get-logonstatistics | select-object username,Windows2000account,identity,messagingoperationcount,otheroperationcount,progressoperationcount,streamoperationcount,tableoperationcount,totaloperationcount | where {$_.totaloperationcount -gt 1000} | sort-object totaloperationcount -descending | ft

Key things to look for:

In the example below, the Administrator account was storming the testuser account with email. You will notice that there are 2 users active here: one is the Administrator submitting all of the messages, and the other has a Windows2000Account that references a HUB server with an Identity of testuser. The HUB server entry also has *no* UserName, so that is a giveaway right there.
This can give you a better understanding of what parties are involved in these high rates of operations.

UserName : Administrator
Windows2000Account : DOMAIN\Administrator
Identity : /o=First Organization/ou=First Administrative Group/cn=Recipients/cn=Administrator
MessagingOperationCount : 1724
OtherOperationCount : 384
ProgressOperationCount : 0
StreamOperationCount : 0
TableOperationCount : 576
TotalOperationCount : 2684

UserName :
Windows2000Account : DOMAIN\E12-HUB$
Identity : /o=First Organization/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=testuser
MessagingOperationCount : 630
OtherOperationCount : 361
ProgressOperationCount : 0
StreamOperationCount : 0
TableOperationCount : 0
TotalOperationCount : 1091

Step 12

Enable Perfmon/Perfwiz logging on the server. Collect data through the problem times and then review it for any irregular activity. You can reference Perfwiz for Exchange 2007/2010 data collection here: http://blogs.technet.com/b/mikelag/archive/2010/07/09/exchange-2007-2010-performance-data-collection-script.aspx

Step 13

Run ExTRA (Exchange Troubleshooting Assistant) via the Toolbox in the Exchange Management Console to look for any functions (via FCL logging) that may be consuming excessive time within the store process. This needs to be launched during the problem period. http://blogs.technet.com/mikelag/archive/2008/08/21/using-extra-to-find-long-running-transactions-inside-store.aspx shows how to use FCL logging only, but it would be best to include Perfmon, Exmon, and FCL logging via this tool to capture the most data. The steps shown are valid for Exchange 2007 & Exchange 2010.

Step 14

Export out message tracking log data from the affected MBX server.

Method 1

Download the ExLogGrowthCollector script (attached to this blog post) and place it on the MBX server that experienced the issue. Run ExLogGrowthCollector.ps1 from the Exchange Management Shell.
Enter the MBX server name that you would like to trace and the start and end times, and then click the Collect Logs button.

Note: This script exports all mail traffic to/from the specified Mailbox server across all HUB servers between the times specified. This helps provide insight into any large or looping messages that might have been sent and could have caused the log growth issue.

Method 2

Copy and paste the following into Notepad, save it as msgtrackexport.ps1, and then run it on the affected Mailbox server. Open the resulting file in Excel for review. This is similar to the GUI version, but requires manual editing to get it to work.

    #Export Tracking Log data from affected server specifying Start/End Times
    Write-host "Script to export out Mailbox Tracking Log Information"
    Write-Host "#####################################################"
    Write-Host
    $server = Read-Host "Enter Mailbox server Name"
    $start = Read-host "Enter start date and time in the format of MM/DD/YYYY hh:mmAM"
    $end = Read-host "Enter end date and time in the format of MM/DD/YYYY hh:mmPM"
    $fqdn = $(get-exchangeserver $server).fqdn
    Write-Host "Writing data out to csv file..... "
    Get-ExchangeServer | where {$_.IsHubTransportServer -eq "True" -or $_.name -eq "$server"} | Get-MessageTrackingLog -ResultSize Unlimited -Start $start -End $end | where {$_.ServerHostname -eq $server -or $_.clienthostname -eq $server -or $_.clienthostname -eq $fqdn} | sort-object totalbytes -Descending | export-csv MsgTrack.csv -NoType
    Write-Host "Completed!! You can now open the MsgTrack.csv file in Excel for review"

Method 3

You can also use the Process Tracking Log Tool at http://blogs.technet.com/b/exchange/archive/2011/10/21/updated-process-tracking-log-ptl-tool-for-use-with-exchange-2007-and-exchange-2010.aspx to provide some very useful reports.

Step 15

Save off a copy of the application/system logs from the affected server and review them for any events that could contribute to this problem.
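As an alternative to eyeballing the MsgTrack.csv export from Step 14 in Excel, the file can be summarized offline to surface large or looping messages. The following is a minimal Python sketch, not part of the original tooling. It assumes the CSV carries Sender, MessageSubject, MessageId, and TotalBytes columns (the script above already sorts on totalbytes). The function name summarize_tracking and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def summarize_tracking(rows, top=5):
    """Rank (sender, subject) pairs by the total size of their distinct messages.

    One message produces several tracking events (RECEIVE, DELIVER, ...),
    so each MessageId is counted once, using the largest TotalBytes seen.
    Column names are assumptions about the export, not a documented contract.
    """
    per_pair = defaultdict(dict)  # (sender, subject) -> {message_id: bytes}
    for row in rows:
        pair = (row.get("Sender", ""), row.get("MessageSubject", ""))
        message_id = row.get("MessageId", "")
        try:
            size = int(row.get("TotalBytes") or 0)
        except ValueError:
            size = 0  # tolerate blank or malformed size cells
        sizes = per_pair[pair]
        sizes[message_id] = max(sizes.get(message_id, 0), size)

    summary = [
        {
            "sender": sender,
            "subject": subject,
            "messages": len(sizes),
            "total_bytes": sum(sizes.values()),
        }
        for (sender, subject), sizes in per_pair.items()
    ]
    summary.sort(key=lambda item: item["total_bytes"], reverse=True)
    return summary[:top]


# Tiny made-up sample standing in for MsgTrack.csv.
SAMPLE = """Sender,MessageSubject,MessageId,EventId,TotalBytes
admin@contoso.com,Loop test,msg-1@contoso.com,RECEIVE,5242880
admin@contoso.com,Loop test,msg-1@contoso.com,DELIVER,5242880
admin@contoso.com,Loop test,msg-2@contoso.com,DELIVER,5242880
user@contoso.com,Hello,msg-3@contoso.com,DELIVER,2048
"""

for item in summarize_tracking(csv.DictReader(io.StringIO(SAMPLE))):
    print(item)
```

To run it against a real export, pass csv.DictReader(open("MsgTrack.csv", newline="")) instead of the sample reader. A sender and subject pair with many distinct messages or an unusually large total is a candidate for the large or looping mail that Method 1's note describes.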
Step 16

Enable IIS extended logging for the CAS and MBX server roles to add the sc-bytes and cs-bytes fields, both to track large messages being sent via IIS protocols and to track usage patterns (Additional Details).

Step 17

Get a process dump of the store process during the time of the log growth. (Use this as a last measure once all prior activities have been exhausted and prior to calling Microsoft for assistance. These issues are sometimes intermittent, and the quicker you can obtain any data from the server, the better, as this will help provide Microsoft with information on what the underlying cause might be.)

- Download the latest version of Procdump from http://technet.microsoft.com/en-us/sysinternals/dd996900.aspx and extract it to a directory on the Exchange server.
- Open the command prompt and change to the directory where procdump was extracted in the previous step.
- Type procdump -mp -s 120 -n 2 store.exe d:\DebugData

This will dump the data to D:\DebugData. Change this to whatever directory has enough space to dump the entire store.exe process twice. Check Task Manager for the store.exe process and how much memory it is currently consuming for a rough estimate of the amount of space needed to dump the entire store process.

Important: If procdump is being run against a store that is on a clustered server, then you need to make sure that you set the Exchange Information Store resource to not affect the group. If the entire store dump cannot be written out in 300 seconds, the cluster service will kill the store service, ruining any chances of collecting the appropriate data on the server.

- Open a case with Microsoft Product Support Services to get this data looked at.
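The space and timing guidance for the Step 17 dump can be sanity-checked with some quick arithmetic before running procdump. The Python sketch below is only a rough estimate. It treats the store.exe memory figure from Task Manager as the size of one dump, multiplies by the two dumps requested with -n 2, and derives the sustained write rate needed to finish one dump inside the 300-second cluster limit mentioned above. The 10% headroom, the helper names, and the 24 GiB example value are assumptions for illustration, not figures from the article.

```python
import shutil

GIB = 1024 ** 3


def dump_space_needed(store_memory_bytes, dump_count=2, headroom=0.10):
    """Bytes to reserve: one full-size dump per capture plus a safety margin."""
    return int(store_memory_bytes * dump_count * (1 + headroom))


def min_write_rate_mb_per_s(store_memory_bytes, timeout_seconds=300):
    """Sustained write rate (MB/s) needed to finish one dump before the timeout."""
    return store_memory_bytes / (1024 ** 2) / timeout_seconds


def has_room(path, store_memory_bytes, dump_count=2, headroom=0.10):
    """True if the volume holding `path` has enough free space for the dumps."""
    free = shutil.disk_usage(path).free
    return free >= dump_space_needed(store_memory_bytes, dump_count, headroom)


# Example value only; read the real store.exe figure from Task Manager.
store_memory = 24 * GIB
print(f"Reserve about {dump_space_needed(store_memory) / GIB:.1f} GiB")
print(f"Need roughly {min_write_rate_mb_per_s(store_memory):.0f} MB/s per dump")
print("Enough room here:", has_room(".", store_memory))
```

Pass the intended dump directory (for example the D:\DebugData folder used above) to has_room instead of "." when checking a real server. If the required write rate looks higher than the target disk can sustain, that is a hint to pick a faster volume or to be especially careful with the cluster resource setting described in the Important note.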
Most current related KB articles

2814847 - Rapid growth in transaction logs, CPU use, and memory consumption in Exchange Server 2010 when a user syncs a mailbox by using an iOS 6.1 or 6.1.1-based device
2621266 - An Exchange Server 2010 database store grows unexpectedly large
996191 - Troubleshooting Fast Growing Transaction Logs on Microsoft Exchange 2000 Server and Exchange Server 2003

Kevin Carker (based on a blog post written by Mike Lagase)