Exchange 2007 Service Pack 1 introduces several changes in the Extensible Storage Engine (ESE). In my first blog article on the subject, I discussed the removal of page dependencies and the disabling of partial merges. In this blog, I discuss other changes we made to ESE which enhance Exchange.
- Passive Node I/O improvements
- Online Defragmentation
- Database Checksumming
- Page Zeroing
Passive Node I/O Improvements
When we released Exchange 2007, the I/O generated on the passive node in a CCR environment was ~2-3 times the I/O generated on the active node. For example, consider a server which contains 3500 very heavy profile mailboxes. The active node generated about 1300 IOPS, while the passive node generated about 3500 IOPS, even though the only activity occurring on the passive node is log inspection and replay!
The extra I/O on the passive node was the result of the design of the replay function. In RTM, when log replay starts on the passive node, an instance of ESE is started and used to replay the replicated logs. During this replay activity, pages are read from the database, which in turn populates the ESE database cache. When log replay has finished, the Replication Service stops and discards the database instance, thereby throwing away the database cache that was built up during the replay process. As a result, we see spikes in activity. When log replay is keeping up, the cache remains small, and thus more read I/Os occur against the passive node's disks. When log replay gets behind, the instance of ESE remains active longer and builds a larger database cache, which decreases read I/Os.
By itself, the additional disk I/O on the passive node is not a problem. But there are two scenarios where this additional I/O can have a significant impact: backups and storage design. Consider the scenario where you are performing VSS backups on the passive node. The additional disk I/O on the passive node can interfere with your ability to take a backup during the core user activity window, thus forcing you to schedule backups at off-hours. This negates the advantage of being able to back up storage groups on the passive node.
The other scenario where the additional I/O can have an impact is shared storage. Consider these cases:
- You are sharing disk spindles with multiple mailbox servers.
- You are sharing the storage controller with multiple mailbox servers.
In either case, you may have passive nodes mixed with active nodes. If you did not design the storage to handle this additional I/O that occurs on the passive node, the active node and the connecting end users could suffer.
To improve performance in SP1, we changed the design of the system. In SP1, the Replication Service does not discard the database instance between replay batches. Instead, it simply pauses the instance until the next batch of logs is ready to be replayed. This change has several benefits. It:
- Allows the checkpoint to advance during recovery.
- Keeps the database cache "warm", which improves failover times.
- Allows the database cache to grow, which reduces read I/Os.
- Ensures there is no competing I/O that affects when you perform backups against the passive node.
As a result of this change, the passive node's I/O is now 0.5 to 1 times that of the active node! That is a huge reduction in I/O when compared to the passive node in RTM. Here's a graph of the aforementioned server with SP1.
Online Defragmentation
If you read my previous blog article on the ESE SP1 changes, you know that we reduced database churn significantly by disabling partial merges during online defragmentation (OLD). However, one other challenge remained: determining how often you should run online defragmentation. Because Exchange had no metrics that could determine how often OLD should be run, our guidance has always been to make sure that online defragmentation completes every week or every two weeks.
Because each environment is different, this guidance was not optimal for every organization. We could not say with absolute certainty that completing OLD every week or every two weeks would be acceptable in every environment. Furthermore, there was always the question of how much time you should allocate to online maintenance.
To address this in SP1, we added logic that allows you to determine how often you need to complete OLD. SP1 includes new performance counters you can collect to make that determination:
- MSExchangeDatabase -> Online Defrag Pages Freed/sec
- MSExchangeDatabase -> Online Defrag Pages Read/sec
If you collect the values for these two counters, you can determine how often you should perform OLD. If the Read:Freed ratio is greater than 100:1, the OLD window can be reduced. If the ratio is less than 50:1, the OLD window should be increased. If the ratio is between 50:1 and 100:1, you don't need to change your OLD window.
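To make that guidance concrete, here is a minimal sketch (Python, purely illustrative) that applies the Read:Freed thresholds above to averaged counter samples. How you collect and export the counters (for example, with Performance Monitor) is assumed; only the arithmetic comes from the guidance above.

```python
# Illustrative sketch: decide whether to adjust the online defragmentation (OLD)
# window from the SP1 counters listed above. Feed in averaged values from
# 'Online Defrag Pages Read/sec' and 'Online Defrag Pages Freed/sec'.

def old_window_recommendation(pages_read_per_sec: float, pages_freed_per_sec: float) -> str:
    """Apply the Read:Freed ratio guidance from the article."""
    if pages_freed_per_sec == 0:
        return "No pages freed in this sample; collect a longer sample before deciding."
    ratio = pages_read_per_sec / pages_freed_per_sec
    if ratio > 100:
        return f"Read:Freed is {ratio:.0f}:1 (>100:1) - the OLD window can be reduced."
    if ratio < 50:
        return f"Read:Freed is {ratio:.0f}:1 (<50:1) - the OLD window should be increased."
    return f"Read:Freed is {ratio:.0f}:1 (between 50:1 and 100:1) - leave the OLD window as is."

# Example with hypothetical averaged samples:
print(old_window_recommendation(pages_read_per_sec=4200.0, pages_freed_per_sec=35.0))
```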
In addition, we have added some information to event 703, which is logged when OLD successfully completes:
Event Type: Information
Event Source: ESE
Event Category: Online Defragmentation
Event ID: 703
Description:
MSExchangeIS (19052) SG06: Online defragmentation has completed the resumed pass on database 'e:\MDB06\priv06.edb', freeing 42794 pages. This pass started on 6/16/2007 and ran for a total of 124919 seconds, requiring 7 invocations over 4 days. Since the database was created it has been fully defragmented 14 times over 73 days.
The revised event now provides the following new information:
- When the OLD pass started.
- How many passes it took to complete.
- How many times the database has actually had OLD complete since database creation.
This information can also be used to determine if you are completing OLD often enough. For example, if OLD is completing every day or two, you can safely reduce your online maintenance window.
Why would you want to reduce your online maintenance window? Well, for one thing, you can increase your backup window (if you are performing backups on the active copy of a database). In addition, if you decrease your online maintenance window and you are performing VSS backups, this will reduce database churn and produce smaller snapshots or differential backups. Also, by reducing how often you perform online defragmentation, you will gain additional headroom during non-user activity periods so that you can perform database checksumming, which I discuss next.
Database Checksumming
With the release of Exchange 2007, we added a new feature called continuous replication. Continuous replication uses built-in asynchronous log shipping technology to create and maintain a copy of a storage group in a second location. In addition, continuous replication also allows you the freedom to change your backup paradigm. For example:
- You can utilize the streaming backup API to back up the active copy.
- You can utilize Volume Shadow Copy Service (VSS) to back up the active copy.
- You can utilize VSS to back up the passive copy.
Depending on the solution you use, one of the copies may never be checksummed, which is an important process in which the system compares the checksum written on each page of the database to a checksum recalculated in memory. If the two match, then the page is good and no corruption has occurred. If they do not match, then there is a problem. Certain problems can be corrected (such as single-bit errors), but others cannot and will require a restore (or a reseed from a healthy copy).
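As a conceptual illustration only (this is not ESE's actual page checksum/ECC algorithm, and the layout of the checksum field here is an assumption), verifying a page boils down to recomputing a checksum over the page contents and comparing it to the value stored with the page:

```python
# Conceptual sketch only - NOT ESE's real on-disk checksum/ECC format.
import zlib

PAGE_SIZE = 8192          # Exchange 2007 uses 8 KB database pages
CHECKSUM_FIELD_SIZE = 4   # assumption: first 4 bytes of the page hold the stored checksum

def verify_page(page: bytes) -> bool:
    """Return True if the stored checksum matches one recomputed from the page contents."""
    stored = int.from_bytes(page[:CHECKSUM_FIELD_SIZE], "little")
    computed = zlib.crc32(page[CHECKSUM_FIELD_SIZE:]) & 0xFFFFFFFF
    return stored == computed   # a mismatch would indicate a damaged page (e.g., a -1018)
```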
Streaming Backup of Active Copy
The streaming backup API checksums each page of the database as it is backed up. This is typically where the system detects -1018 errors (and similar errors).
Now consider the case where you have continuous replication in your environment. If you are performing a streaming backup, which database are you ensuring is healthy? That's right, the active copy. So while we know the status of the active copy, we do not know the status of the passive copy.
So how can you ensure the health of your passive copy? Well, there are a few ways:
- Perform a handoff (CCR) or activate the passive copy (LCR) and then perform a backup. While this sounds easy, it is operationally a mess because you have to constantly perform handoffs / activations in order to check each copy.
- Utilize an Exchange-aware VSS requestor. We'll discuss the ramifications of this in a later section.
- Take a snapshot of the passive copy.
For now, let's focus on the third option. The process for taking a snapshot of the passive copy is relatively simple, and you do not need an Exchange-aware requestor to do it. You can simply use the built-in Windows Server shadow copy functionality. Here's the process:
- Suspend continuous replication for all storage groups hosted on the volume containing the databases to be checksummed.
- Use vssadmin.exe (which is included in Windows Server 2003) to create a shadow copy of the volume containing the databases to be checksummed. e.g. "vssadmin create shadow /for=<volume>"
- Resume continuous replication for all storage groups hosted on the volume.
- Run eseutil /k against the database(s) on the shadow copy of the volume, e.g. "eseutil /k /p20 <Path for VSS Shadow Copy of Database>"
- After verification has completed successfully, delete the volume shadow copy, e.g. "vssadmin delete shadows /for=<volume>"
Essentially in this case you are performing an offline backup. You can then validate the integrity of the passive copy by verifying this offline backup.
The one downside to this process is that there is a write performance penalty while the checksum runs, because the snapshot is stored on the same LUN as the database (though remember that in SP1, the passive node's I/O requirements are less than or equal to the active node's). You will also want to keep replication suspended only long enough to take the snapshot; otherwise your log copy queue will grow. While log replay will catch up once replication is resumed, a failure during that window could leave you dealing with a lossy failover. A rough scripted sketch of these steps follows below.
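The sketch below (Python, illustrative only) strings the steps together. It assumes the Exchange 2007 management snap-in provides the Suspend-StorageGroupCopy/Resume-StorageGroupCopy cmdlets for the suspend and resume steps, that vssadmin's output can be parsed for the shadow copy ID and device path, and that the shadow copy can be exposed through a directory link so eseutil can read it; treat the exposure method, paths, and snap-in invocation as assumptions to adapt to your environment.

```python
# Rough orchestration sketch of the snapshot-and-verify steps above (assumptions noted inline).
import re
import subprocess

EMS_SNAPIN = "Microsoft.Exchange.Management.PowerShell.Admin"  # Exchange 2007 management snap-in

def run(cmd: str) -> str:
    print(f"> {cmd}")
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=True)
    return result.stdout

def ems(cmdlet: str) -> str:
    """Run an Exchange Management Shell cmdlet from a regular PowerShell process."""
    return run(f'powershell -command "Add-PSSnapin {EMS_SNAPIN}; {cmdlet}"')

def checksum_passive_copy(storage_group: str, volume: str, db_relative_path: str) -> None:
    # Step 1: suspend continuous replication (repeat for every SG hosted on the volume).
    ems(f"Suspend-StorageGroupCopy -Identity '{storage_group}' -Confirm:$false")
    try:
        # Step 2: snapshot the volume and parse the shadow copy ID and device path.
        out = run(f"vssadmin create shadow /for={volume}")
        shadow_id = re.search(r"Shadow Copy ID: (\{[0-9a-fA-F-]+\})", out).group(1)
        device = re.search(r"Shadow Copy Volume Name: (\S+)", out).group(1)
    finally:
        # Step 3: resume replication as soon as the snapshot exists so the log queue stays small.
        ems(f"Resume-StorageGroupCopy -Identity '{storage_group}'")
    try:
        # Step 4: expose the shadow copy under a directory link so eseutil can read it
        # (exposure method is an assumption; adapt to your tooling), then checksum it.
        run(f'mklink /d C:\\esecheck "{device}\\"')
        run(f'eseutil /k /p20 "C:\\esecheck\\{db_relative_path}"')
    finally:
        # Step 5: clean up the link and the shadow copy.
        run("rmdir C:\\esecheck")
        run(f"vssadmin delete shadows /shadow={shadow_id} /quiet")

# Example with hypothetical names:
# checksum_passive_copy("SERVER\\SG06", "E:", "MDB06\\priv06.edb")
```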
VSS Backup of Active Copy
In this scenario, a VSS-based backup is taken against the active copy of the database. This scenario is very similar to the previously described scenario. The only difference is that VSS does not perform the checksum. Instead, the VSS requestor must either utilize ESEUTIL or CHKSGFILES.DLL (an API included in Exchange 2007 that allows requestors to programmatically request a checksum) to checksum the snapshot.
Like the previous scenario, since the active copy is the one being backed up, the passive copy's health is in question. The same process I described previously can be used to check the health of the passive copy.
Note: Though honestly, I don't know why you would be performing a VSS backup against the active copy when you could offload that process onto the passive copy.
VSS Backup of Passive Copy
This backup model is very strongly recommended. Backing up a passive copy allows you to offload the backup process (if using CCR, the backup occurs on the passive node of the cluster, thereby removing backup cycles from the active node altogether).
However, this scenario also has a gap. If you are backing up the passive copy via VSS, then you will only know that the passive copy is healthy. You will not know whether the active copy is healthy. This was a problem in RTM because the only way to determine the health of the active copy was to:
- Perform a handoff/activation and back it up.
- Perform a streaming backup.
- Perform an offline backup and utilize ESEUTIL.
To correct this in SP1, we have introduced a new online maintenance task, Online Maintenance Checksum. It is an optional task that is disabled by default, as it could affect server performance; however, it should be considered if you are performing backups against passive copies because it enables you to ensure that active copies are healthy.
Online Maintenance Checksum utilizes half the online defragmentation time window to scan database pages. Like OLD, database scanning tracks its progress, with updates at regular intervals, so that it can continue where it left off when resuming after an interruption.
Note: For those of you that may be wondering, there are many other tasks executed during the online maintenance window. For a listing of some of those other tasks, please see
http://blogs.msdn.com/jeremyk/archive/2004/06/12/154283.aspx. Online defragmentation typically takes up the majority of the online maintenance window (the other tasks mentioned in the aforementioned blog complete relatively quickly; internally we typically see them complete within 15 minutes).
If the Online Maintenance Checksum process determines that there is corruption, it will notify you via the application event log, utilizing the same events that are generated when you use the streaming backup API.
The Online Maintenance Checksum task can perform very large sequential database reads (320KB). In addition, we have found that you obtain better performance if you stagger online maintenance so that only one database is checked per LUN. Also, in case there are other processes occurring during your online maintenance window, you have the ability to throttle the checksum process so that it only performs the sequential reads every x number of milliseconds. As you can see from the table below, there is a slight processor increase as a result of performing this operation. In addition, while this operation is occurring, you will also see about a 10% increase in RPC Average Latency. As a result, we do not recommend you perform this maintenance task during the peak user activity window.
| DBs Checksummed in Parallel | DB Pages Read/sec | DB Read MB/sec | DB Read Latency (ms) | % Processor |
|---|---|---|---|---|
| 1 | 30000 | 250 | 3.5 | 7 |
| 2 | 8200 | 67 | 7.2 | 2.5 |
As you can see from the above table, when only performing the checksum operation against a single database on the LUN, you can achieve a whopping 250MB/s throughput. In MSIT, the majority of our databases complete Online Maintenance Checksum every night. Servers with our largest mailboxes (1GB + average) complete scanning once every three days (4 hours for Online Maintenance Checksum and 4 hours for OLD).
To track the status of the Online Maintenance Checksum task, we have added several new events to the application log. Two events in particular notify you of status. Event 719 notifies you when the process begins, and event 721 notifies you when it completes. Here's an example of event 721:
Event Type: Information
Event Source: ESE
Event Category: Online Defragmentation
Event ID: 721
Description:
MSExchangeIS (6584) Third Storage Group: Online Maintenance Database Checksumming background task has completed for database 'J:\sg3\priv3.edb'. This pass started on 6/19/2007 and ran for a total of 208 seconds, requiring 2 invocations over 1 days. Operation summary:
5850768 pages seen
0 bad checksums
72682 uninitialized pages
We have also added two new performance counters (which is how we obtained the data in the table above):
- MSExchange Database -> Online Maintenance (DB Scan) Pages Read
- MSExchange Database -> Online Maintenance (DB Scan) Pages Read/sec
The Online Maintenance Checksum task can be enabled via the registry:
Registry Hive: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS\ParametersSystem
DWORD Key: Online Maintenance Checksum
DWORD Value: 1 (enabled), 0 (disabled)
DWORD Key: Throttle Checksum
DWORD Value: <number of milliseconds to sleep between sequential read batches>
For more information, please see:
http://technet.microsoft.com/en-us/library/bb676537.aspx
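If you prefer to script the change rather than edit the registry by hand, here is a minimal sketch using Python's winreg module. The value names and data are the ones documented above; the throttle value shown is just an example.

```python
# Minimal sketch: enable the SP1 Online Maintenance Checksum task by writing the
# registry values documented above. Run on the mailbox server with sufficient rights.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\MSExchangeIS\ParametersSystem"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE) as key:
    # 1 = enabled, 0 = disabled
    winreg.SetValueEx(key, "Online Maintenance Checksum", 0, winreg.REG_DWORD, 1)
    # Optional: milliseconds to sleep between sequential read batches (20 is only an example)
    winreg.SetValueEx(key, "Throttle Checksum", 0, winreg.REG_DWORD, 20)
```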
Online Maintenance Checksum vs ESEUTIL
There is a difference between the checksum process implemented in the Online Checksum task (streaming backup API or the online maintenance task) and the process used by ESEUTIL /K. Fundamentally, they both do the same thing in terms of how they check the pages, but there is a difference in their performance characteristics. The Online Checksum task uses a method known as JetDatabaseScan(). JetDatabaseScan() has a loop that issues a pre-read for 320KB of pages, scans the pages, and optionally sleeps. ESEUTIL /K, on the other hand, issues 1024 64KB read I/Os; when a read completes, the buffer is checksummed and another read is issued. Like the Online Checksum task, ESEUTIL can also optionally sleep for a configurable amount of time after issuing a certain number of reads. The net difference is that the Online Checksum task performs very well and is kinder to the disk subsystem than ESEUTIL.
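To make the difference in I/O patterns more concrete, here is a schematic sketch (Python, not actual ESE or ESEUTIL code) of the two loop structures described above; the read and checksum callables are placeholders you would supply.

```python
# Schematic only - contrasts the two read patterns described above.
import time

PAGE_SIZE = 8 * 1024            # Exchange 2007 database page size
PREREAD_BYTES = 320 * 1024      # online checksum task pre-reads 320 KB at a time

def online_checksum_scan(read_pages, checksum_page, total_pages, sleep_ms=0):
    """Online Maintenance Checksum style: one 320 KB sequential pre-read, scan, optional sleep."""
    pages_per_batch = PREREAD_BYTES // PAGE_SIZE
    offset = 0
    while offset < total_pages:
        for page in read_pages(offset, pages_per_batch):    # single large sequential read
            checksum_page(page)
        offset += pages_per_batch
        if sleep_ms:
            time.sleep(sleep_ms / 1000.0)                   # throttle between batches

def eseutil_k_scan(read_chunk_async, checksum_chunk, total_chunks, outstanding=1024):
    """ESEUTIL /K style: keep up to 1024 64 KB reads in flight; checksum each as it completes."""
    in_flight = [read_chunk_async(i) for i in range(min(outstanding, total_chunks))]
    next_chunk = len(in_flight)
    while in_flight:
        buffer = in_flight.pop(0).result()                  # wait for the oldest outstanding read
        checksum_chunk(buffer)
        if next_chunk < total_chunks:                       # immediately issue a replacement read
            in_flight.append(read_chunk_async(next_chunk))
            next_chunk += 1
```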
Page Zeroing
Page Zeroing is a security option that overwrites empty pages with a pattern based on where the page sits within the B+ tree, so that deleted data cannot be recovered. With Exchange 2007 RTM and all previous versions, page zeroing happened during the streaming backup process. Because it occurred during the streaming backup, it was not a logged operation (i.e., page zeroing did not result in the generation of log files).
This poses a problem in Exchange 2007 when you are using continuous replication, or when you are performing VSS backups. In the continuous replication scenario, the passive copy would never get its empty pages zeroed, and the active copy would only have its pages zeroed if you performed a streaming backup. In the VSS backup scenario, the database (active or passive copy) would never get its pages zeroed.
To address these scenarios in SP1, we have introduced a new online maintenance task, Zero Database Pages During Checksum. It is an optional task that is disabled by default, as it could affect server performance. This is a logged operation that will get replicated to the passive copy, thereby ensuring that both database copies are updated.
Online Page Zeroing, like the Online Maintenance Checksum task, performs large sequential reads (320KB), but is different from the Online Maintenance Checksum process in that it also generates random database writes (160KB).
Fortunately, even if you enable both tasks, there is only a single database scan: the same pass retrieves each page from disk and performs both checksumming and page zeroing. (Incidentally, you have to enable the Online Maintenance Checksum task in order to enable Online Page Zeroing.)
As with Online Maintenance Checksumming, Online Page Zeroing performs better when you stagger online maintenance so that only one database is checked per LUN. However, there are a few things to keep in mind.
- When you initially enable Online Page Zeroing, the scan can place tremendous pressure on the database cache. To ensure this does not affect your server's performance, we recommend that you either implement the Throttle Checksum registry entry (mentioned in the previous section) or stagger your online maintenance window. Once the initial pass is completed, subsequent passes are much less intensive and will not impact the database cache significantly. Therefore, as a best practice, if you require page zeroing, consider enabling page zeroing on the database at creation time so that you will never have this first pass performance spike.
- Online Page Zeroing is very similar to a streaming backup (with page zeroing enabled). It reads from and writes to the database. Reads are sequential, but the writes are random. In addition, there is a slight processor use increase as a result, as well as about a 20% increase in RPC average latency while the database scan is occurring. As always, the best practice is to not execute online maintenance during the peak user activity window, and this is still true for the Online Page Zeroing task.
| DBs Page Zeroed in Parallel | DB Pages Zeroed/sec | DB Read MB/sec | DB Write MB/sec | DB Read Latency (ms) | % Processor |
|---|---|---|---|---|---|
| 1 | 8100 | 68 | 66 | 3.4 | 7.5 |
| 2 | 6800 | 65 | 50 | 7.2 | 2.5 |
Remember that this task is a logged operation, so watch your log capacity. Enabling the Online Page Zeroing task will utilize half of the OLD window (50% for Page Zeroing/Checksumming, 50% for OLD), so the logs generated by OLD per maintenance period will be roughly half of what they were without page zeroing enabled (assuming the same overall maintenance window).
Page Zeroing logs only the page references that need to be zeroed; it does not log the entire page (OLD writes, by contrast, do log the entire page in SP1). We pack anywhere from 2,500 to 10,000 zeroed page references per log. If the database cache is mostly clean, we ratchet up the number of page references per log; if the cache is mostly dirty, we ratchet it down. This design ensures that page zeroing does not overwhelm the database cache (by dirtying too much of it) while reducing the number of logs generated (and subsequently replicated in continuous replication environments).
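The ratcheting behavior can be pictured with a small sketch. The linear interpolation between the bounds is an assumption; the article only gives the 2,500-10,000 range and the direction of the adjustment.

```python
# Illustrative sketch of the ratcheting described above: pack more zeroed-page
# references per log file when the database cache is mostly clean, fewer when dirty.
# The linear interpolation is an assumption, not the actual ESE heuristic.

MIN_REFS_PER_LOG = 2_500
MAX_REFS_PER_LOG = 10_000

def zeroed_refs_per_log(dirty_cache_fraction: float) -> int:
    """dirty_cache_fraction: 0.0 = cache entirely clean, 1.0 = entirely dirty."""
    dirty = min(max(dirty_cache_fraction, 0.0), 1.0)
    return int(MAX_REFS_PER_LOG - dirty * (MAX_REFS_PER_LOG - MIN_REFS_PER_LOG))

print(zeroed_refs_per_log(0.1))   # mostly clean cache -> close to 10,000 references per log
print(zeroed_refs_per_log(0.9))   # mostly dirty cache -> close to 2,500 references per log
```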
Within MSIT we have page zeroing enabled on a server with about 800 mailboxes averaging roughly 1 GB each. On this server we see roughly the same amount of log generation during the page zeroing maintenance period as we do during the OLD period (1000 logs/hour). On servers that follow our memory guidance of 2-5MB/mailbox, the tighter-packed logs for the page zeroing operations and the reduced OLD window mean that the log generation overhead of enabling page zeroing is more or less mitigated.
The one scenario where log generation overhead becomes a factor is when the new tasks are enabled for the first time on an existing database. In this scenario, the log generation overhead can be 2-5x until the initial pass is finished. Afterwards, there should be no overhead.
To track the status of the Online Page Zeroing task, we have added some new events to the application log. Two events in particular notify you of status. Event 718 notifies you when the process begins, and event 722 notifies you when it completes. Here's an example of event 722:
Event Type: Information
Event Source: ESE
Event Category: Online Defragmentation
Event ID: 722
Description:
MSExchangeIS (6544) Third Storage Group: Online Maintenance Database Zeroing background task has completed for database 'J:\sg3\priv3.edb'. This pass started on 6/20/2007 and ran for a total of 369 seconds, requiring 1 invocations over 1 days. Operation summary:
5850768 pages seen
0 bad checksums
72681 uninitialized pages
4379723 pages unchanged since last zero
33759 unused pages zeroed
1210764 used pages seen
57214 deleted records zeroed
0 unreferenced data chunks zeroed
We have also added two new performance counters (which is how we obtained the data in the table above):
- MSExchange Database -> Online Maintenance (DB Scan) Pages Zeroed
- MSExchange Database -> Online Maintenance (DB Scan) Pages Zeroed/sec
The Online Page Zeroing task can be enabled via the registry. Please note that you have to enable the Online Maintenance Checksum task in order to enable the Online Page Zeroing task.
Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS\ParametersSystem
DWORD Key: Online Maintenance Checksum
DWORD Value: 1 (enabled), 0 (disabled)
DWORD Key: Zero Database Pages During Checksum
DWORD Value: 1 (enabled), 0 (disabled)
DWORD Key: Throttle Checksum
DWORD Value: <number of milliseconds to sleep between sequential read batches>
For more information, please see:
http://technet.microsoft.com/en-us/library/bb676537.aspx
Conclusion
To summarize, SP1 enhances Microsoft Exchange manageability by allowing you to ensure that all databases are healthy, to measure when you should perform online defragmentation, and to ensure that page zeroing activity is replicated to each copy of every database.
- Ross Smith IV