Exchange 2007 Service Pack 1 introduces several changes in the Extensible Storage Engine (ESE). Two of these are:
- Removal of page dependencies
- Disablement of partial merges
- Atomic - This means that a transaction state change is all or nothing.
- Consistent - This means that a transaction must preserve the consistency of the data (i.e. transforming it from one consistent state to the next).
- Isolated - This means that a transaction is a unit of operation and each assumes it is the only transaction occurring, even though multiple transactions occur at the same time.
- Durable - This means that committed transactions are preserved in the database, even if the system stops responding.
- Transaction Logs - Transaction log files record the operations that are performed against the database, to ensure durability in case of system failure.
- Checkpoint File - The checkpoint file is a waypoint in the log sequence that indicates where in the log stream ESE needs to start a recovery in case of a failure.
- Checkpoint Depth - The checkpoint depth is a threshold that defines when ESE begins to aggressively flush dirty pages to the database.
- ESE Cache - An area of memory reserved for the Information Store process (Store.exe) that holds database pages in memory. Caching pages reduces read I/Os, especially when a page is accessed many times. The cache also reduces write I/Os in two ways: a dirty page held in memory can absorb multiple changes before it is flushed to disk, and the engine can write multiple pages together (up to 1 MB) using write coalescing.
- Log Buffers - Operations performed against the database are first held in memory before they are written to transaction logs.
- Version Store - An in-memory list of modifications that are made to the database. The version store is used for rollback of transactions, for write-conflict detection, and for other tasks. The version store entry for a specific operation is cleaned up only once both of the following conditions are true:
  - The transaction that owns the operation has been either committed or rolled back.
  - All the transactions that were outstanding at the time that the operation was performed have been either committed or rolled back.
- EDB File - Exchange database (.edb) files are the repository for mailbox (or public folder) data. Within the database, data is stored in 8 KB database pages, which in turn are stored in a collection of balanced trees, known as B+ trees. A B+ tree is a multi-level index where all records are stored in leaf nodes, which are logically linked together to allow efficient searching and sorting of data. This allows a B+ tree to maintain a persistent arrangement of data and to locate and retrieve information from the disk quickly.
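The version store cleanup rule described above can be sketched in a few lines. This is only an illustration of the two conditions, not ESE's actual data structures; the names `Transaction`, `VersionEntry`, and `can_clean_up` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    txn_id: int
    finished: bool = False  # True once committed or rolled back


@dataclass
class VersionEntry:
    owner: Transaction
    # Transactions that were outstanding when the operation was performed.
    outstanding_at_creation: list = field(default_factory=list)


def can_clean_up(entry: VersionEntry) -> bool:
    """Apply both cleanup conditions from the version store description."""
    owner_done = entry.owner.finished
    others_done = all(t.finished for t in entry.outstanding_at_creation)
    return owner_done and others_done


t1, t2 = Transaction(1), Transaction(2)
entry = VersionEntry(owner=t1, outstanding_at_creation=[t2])

t1.finished = True
blocked = can_clean_up(entry)   # t2 is still outstanding

t2.finished = True
released = can_clean_up(entry)  # both conditions now hold
```

In this sketch a single long-running transaction keeps every entry created during its lifetime in the version store, which is why the second condition matters.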
- An operation occurs against the database (e.g., a client sends a new message), and the page that requires updating is read from the file and placed into the ESE cache (if it is not already in memory), while the log buffer is notified and records the operation in memory.
- The changes are recorded by the database engine but are not immediately written to disk; instead, these changes are held in the ESE cache and are known as "dirty" pages because they have not been committed to the database. The version store is used to keep track of these changes, thus ensuring isolation and consistency are maintained.
- As the database pages are changed, the log buffer is notified to commit the change, and the transaction is recorded in a transaction log file (which may or may not require a log roll and a start of a new log generation).
- Eventually the dirty database pages are flushed to the database file.
- The checkpoint is advanced.
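The steps above can be sketched as a toy write-ahead logging loop. This is a simplified model under stated assumptions, not how ESE is implemented: the class `MiniEngine`, its attributes, and the use of a log sequence number (LSN) per record are all hypothetical, and log rolls and the version store are left out.

```python
class MiniEngine:
    def __init__(self):
        self.db_file = {}     # page number -> contents on disk
        self.cache = {}       # page number -> contents in memory
        self.dirty = {}       # page number -> LSN of its oldest unflushed change
        self.last_lsn = {}    # page number -> LSN of its newest change
        self.log_buffer = []  # log records still in memory
        self.log_file = []    # log records written to the transaction log
        self.checkpoint = 0   # recovery would replay records after this LSN
        self.next_lsn = 1

    def update(self, page, value):
        # Step 1: read the page into the cache and record the operation in the log buffer.
        if page not in self.cache:
            self.cache[page] = self.db_file.get(page)
        lsn = self.next_lsn
        self.next_lsn += 1
        self.log_buffer.append((lsn, page, value))
        # Step 2: change the page in memory only; it is now dirty.
        self.cache[page] = value
        self.dirty.setdefault(page, lsn)
        self.last_lsn[page] = lsn

    def commit(self):
        # Step 3: write the buffered log records to the transaction log.
        self.log_file.extend(self.log_buffer)
        self.log_buffer.clear()

    def _durable_lsn(self):
        return self.log_file[-1][0] if self.log_file else 0

    def flush(self, page):
        # Step 4: a dirty page may reach the database file only after its log records.
        if self.last_lsn[page] > self._durable_lsn():
            raise RuntimeError("log record must be written before the page")
        self.db_file[page] = self.cache[page]
        del self.dirty[page]

    def advance_checkpoint(self):
        # Step 5: the checkpoint stops just before the oldest change still only in memory.
        oldest_dirty = min(self.dirty.values(), default=self._durable_lsn() + 1)
        self.checkpoint = oldest_dirty - 1


engine = MiniEngine()
engine.update(13, "new message")
engine.update(14, "index entry")
engine.commit()

engine.flush(13)
engine.advance_checkpoint()
checkpoint_after_first_flush = engine.checkpoint  # held back by dirty page 14

engine.flush(14)
engine.advance_checkpoint()
```

The point of the sketch is the last step: the checkpoint can only move forward as far as the oldest dirty page allows, so anything that prevents a page from being flushed also holds the checkpoint back.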

| Page 13 | Page 14 | Recovery Action |
|---|---|---|
| A | (blank) | Move A from 13 to 14 |
| (blank) | A | Nothing - A has already been moved |
| A | A | Delete A from page 13 |
| (blank) | (blank) | Disaster - record A has been lost! |

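The recovery table can be read as a small decision function, and simulating the two possible flush orders shows why the order matters. This is a sketch for illustration only; `recovery_action` and `crash_states` are hypothetical helpers, and the model assumes the log records the move but not the data moved.

```python
def recovery_action(page13_has_a: bool, page14_has_a: bool) -> str:
    """Return the recovery action for a logged 'move A from 13 to 14'."""
    if page13_has_a and not page14_has_a:
        return "Move A from 13 to 14"
    if not page13_has_a and page14_has_a:
        return "Nothing - A has already been moved"
    if page13_has_a and page14_has_a:
        return "Delete A from page 13"
    return "Disaster - record A has been lost!"


def crash_states(flush_order):
    """List the on-disk states a crash could leave behind for a given flush order."""
    on_disk = {13: True, 14: False}    # before the move reaches disk
    in_memory = {13: False, 14: True}  # after the move, in the cache
    states = [(on_disk[13], on_disk[14])]
    for page in flush_order:
        on_disk[page] = in_memory[page]
        states.append((on_disk[13], on_disk[14]))
    return states


# Flushing the destination page first never exposes the lost-record state.
safe = [recovery_action(*state) for state in crash_states([14, 13])]

# Flushing the source page first can leave record A on neither page.
unsafe = [recovery_action(*state) for state in crash_states([13, 14])]
```

In this model the destination page (14) has to reach disk before the source page (13), which is exactly the kind of ordering constraint a page dependency expresses.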
- Pages must be written to disk serially and the I/O cannot be coalesced. Not only does this have implications in terms of maintaining ACID, but it also has implications in terms of I/O. Each preceding page I/O must complete before the next one begins. It also means that if two (or more) consecutive pages have dependencies between them, they cannot be written in a single I/O. Multiple separate I/Os have to be performed serially.
- During streaming backup, page dependencies prevent some pages from being flushed to disk. Suppose record A is moved from page 13, which has not been backed up yet, to page 14 which has been backed up. The backed-up copy of page 14 doesn't contain the record so page 13 should not be written to disk (remember the logged operation only records the move, not the data moved). If page 13 was written, the backup would copy that image of page 13 and end up containing an image of page 13 without the record, making recovery impossible. Older versions of Exchange dealt with this problem by writing pages to the patch file, but the current version simply avoids flushing page 13 until the backup is finished. This in turn prevents the checkpoint from advancing, causing JET_errCheckpointDepthTooDeep problems if a backup job hangs.
- Recovering a single page cannot be done in isolation. At first glance, restoring one page from a tape backup of an Exchange server looks easy: read the desired page from tape and then apply all the logged updates to that page. Unfortunately, page dependencies make that impossible. Imagine a case where page 14 is being restored from tape and ESE encounters a logged update which says "Move record A from page 13 to page 14". Page 13 is now required to do the recovery - not the current version of page 13 (which doesn't contain record A) but an image of page 13 as it was at that point in time. This requires simultaneously restoring page 13 and page 14. In turn, page 13 might be the target of a record move, so other pages will have to be restored as well. In general, all the log files would have to be restored and analyzed before a single-page restore could be attempted. This would be incredibly complicated, would require a lot of time and disk space, and is currently not something that can be done with Exchange 2007 (or any previous version).
- High memory requirements per storage group. During online defragmentation, ESE reorganizes the database structure, performing thousands of page merges and page splits in an attempt to free up pages. Remember that the data is not moved; we just reference the location. As a result, huge dependency trees are generated from the thousands upon thousands of operations that are performed during the maintenance cycle. This requires additional memory to hold all the dirty cache. To account for this, our RTM memory requirements were a minimum of 512 MB per storage group.
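To illustrate the first point in the list above, the following sketch derives a serial flush order from a set of page dependencies. It uses a plain topological sort from the Python standard library as a stand-in; the dependency map and the helper `serial_flush_order` are hypothetical and say nothing about how the ESE buffer manager actually tracks dependencies.

```python
from graphlib import TopologicalSorter


def serial_flush_order(must_flush_first):
    """Order pages so that every page is written after the pages it depends on.

    must_flush_first maps a page number to the set of pages that must
    reach disk before it.
    """
    return list(TopologicalSorter(must_flush_first).static_order())


# Record A moved from 13 to 14, and another record moved from 12 to 13:
# page 14 must be written before 13, and page 13 before 12.
dependencies = {13: {14}, 12: {13}}

order = serial_flush_order(dependencies)

# Each write has to complete before the next begins, so three consecutive
# pages cost three separate I/Os instead of one coalesced write.
io_count = len(order)
```

Even though pages 12, 13, and 14 are adjacent on disk, the chain of dependencies forces them out one at a time, and the chain grows with every additional move recorded during maintenance.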
- By removing page dependencies we have improved database I/O characteristics. No longer do we have to write operations in a serial order since pages do not depend upon one another. This improves our database write I/O - we can now coalesce multiple writes together, thus reducing the number of I/Os, while increasing the write size.
- By removing page dependencies, we have reduced our minimum memory requirements per storage group. Because we no longer generate large dependency trees during online defragmentation, we have changed our memory guidance to require about 300 MB per storage group. For more information on the minimum memory requirements, see http://technet.microsoft.com/en-us/library/bb738124.aspx.
- By removing page dependencies, we have massively reduced the likelihood of hitting the JET_errCheckpointDepthTooDeep condition associated with hung backups. Suppose record A is moved from page 13, which has not been backed up yet, to page 14, which has been backed up. The backed-up copy of page 14 doesn't contain the record, but since we logged page 13 to perform the move to page 14, we can flush the updated page 13 to disk. Even though the backup will only contain a copy of page 13 without the record, because the logs contain the actual data moved, we can now ensure that recovery can happen and not lose any data. As a result, the checkpoint can advance because the pages can be written to disk.
- Removing page dependencies means there are fewer dirty pages pinned in the cache, which saves CPU resources and simplifies the job of the buffer manager. With page dependencies gone, the buffer manager can flush dirty pages when it sees fit, instead of having to walk the dependency tree to see if the flush can actually occur (in other words, we do not have to check whether another page has to be written first due to a page dependency). Walking the dependency tree is processor-expensive and gets even more expensive as the dependency tree gets deeper (its depth is a function of the Lost Log Resilience (LLR) waypoint depth).
- Removing page dependencies allows us to further evolve the product in ways that are particularly beneficial for continuous replication and reseed operations. Future versions (beyond Exchange 2007) may be able to take advantage of database page patching.
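Once pages no longer depend on one another, consecutive dirty pages can be grouped into larger writes. The sketch below shows the idea using the 8 KB page size and the 1 MB coalescing limit mentioned earlier; the function `coalesce` and its grouping policy are assumptions made for illustration, not the buffer manager's real algorithm.

```python
PAGE_SIZE = 8 * 1024                          # 8 KB database pages
MAX_IO_BYTES = 1024 * 1024                    # writes are coalesced up to 1 MB
MAX_PAGES_PER_IO = MAX_IO_BYTES // PAGE_SIZE  # 128 pages per write


def coalesce(dirty_pages):
    """Group dirty page numbers into runs of consecutive pages, one run per I/O."""
    runs = []
    for page in sorted(set(dirty_pages)):
        last = runs[-1] if runs else None
        if last and page == last[-1] + 1 and len(last) < MAX_PAGES_PER_IO:
            last.append(page)    # extend the current write
        else:
            runs.append([page])  # start a new write
    return runs


# Six dirty pages become three writes instead of six.
writes = coalesce([14, 12, 13, 40, 41, 90])
```

Fewer, larger writes are the I/O pattern described in the first benefit above: the number of I/Os goes down while the write size goes up.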
- Backup times. The more logs generated, the longer it will take to complete backups.
- Log capacity requirements. The more logs generated, the more capacity is required.

| 4-week stress test | Partial Merges Enabled | Partial Merges Disabled |
|---|---|---|
| DB Size pre-defrag | 162,588,409,856 | 171,188,961,280 |
| % Available Pages | 1.80% | 1.33% |
| DB Size Post-defrag | 158,247,026,688 | 163,652,911,104 |
| % Difference | 2.65% | 4.45% |

| 3000 Mailbox server (8 hours) | Partial Merges Enabled | Partial Merges Disabled |
|---|---|---|
| OLD Page Reads | 56,160,000 | 57,830,400 |
| OLD Pages Dirtied | 3,830,400 | 691,200 |
| OLD Pages Freed | 392,457 | 201,600 |
| BTree Partial Merges | 2,494,080 | 0 |
| Database Churn (GB) | 30 | 5.5 |
| Log Files Generated | 18,269 | 13,701 |

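To put the 3000 mailbox server results in proportion, the relative reductions can be computed directly from the table values. The helper `percent_reduction` is just illustrative arithmetic on the published numbers.

```python
def percent_reduction(before, after):
    """Percentage by which a counter dropped once partial merges were disabled."""
    return (before - after) / before * 100


# (partial merges enabled, partial merges disabled), taken from the table above.
results = {
    "OLD Pages Dirtied": (3_830_400, 691_200),
    "Database Churn (GB)": (30, 5.5),
    "Log Files Generated": (18_269, 13_701),
}

reductions = {
    name: round(percent_reduction(before, after), 1)
    for name, (before, after) in results.items()
}

for name, value in reductions.items():
    print(f"{name}: {value}% lower with partial merges disabled")
```

On these numbers, online defragmentation dirtied roughly 82% fewer pages, and the server generated about 25% fewer log files over the 8 hour run.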

| Mailbox profile | Message profile | Logs generated / mailbox / day |
|---|---|---|
| Light | 5 sent/20 received | 7 |
| Average | 10 sent/40 received | 14 |
| Heavy | 20 sent/80 received | 28 |
| Very heavy | 30 sent/120 received | 42 |

| Mailbox profile | Message profile | Logs generated / mailbox / day |
|---|---|---|
| Light | 5 sent/20 received | 6 |
| Average | 10 sent/40 received | 12 |
| Heavy | 20 sent/80 received | 24 |
| Very heavy | 30 sent/120 received | 36 |

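The per-mailbox figures translate into daily log volume with simple multiplication. The sketch below uses the values from the second table above and assumes 1 MB transaction log files; the mailbox count of 3000 is borrowed from the test server for illustration, and `daily_log_volume` is a hypothetical helper rather than an official sizing tool.

```python
LOG_FILE_MB = 1  # assumption: each transaction log file is 1 MB

# Logs generated per mailbox per day, from the second table above.
LOGS_PER_MAILBOX_PER_DAY = {
    "Light": 6,
    "Average": 12,
    "Heavy": 24,
    "Very heavy": 36,
}


def daily_log_volume(mailboxes, profile):
    """Return (log files per day, log volume in GB per day) for a mailbox count."""
    logs = mailboxes * LOGS_PER_MAILBOX_PER_DAY[profile]
    return logs, logs * LOG_FILE_MB / 1024


logs, gigabytes = daily_log_volume(3000, "Average")
print(f"{logs} log files per day, about {gigabytes:.1f} GB")
```

Multiplying the daily figure by the number of days between full backups gives a rough starting point for log capacity planning, before any safety margin is added.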
- No more checkpoint too deep errors as a result of hung backups.
- Reduced memory requirements per storage group.
- Smaller snapshot / differential backups.
- A more efficient dirty page flushing model for the buffer manager, which improves write I/Os and reduces CPU cycles.
- Future versions of Exchange may be able to evolve recovery and/or reseed operations.
- Updated guidance on log generation numbers per message profile.