Exchange Server 2007 SP1 ESE Changes - Part 1


Nov 30, 2007

Exchange 2007 Service Pack 1 introduces several changes in the Extensible Storage Engine (ESE). Two of these are:

Removal of page dependencies
Disablement of partial merges

But first, to understand the discussion below, a brief overview of ESE is in order.

ESE Architecture

As you are well aware, operations that occur within Exchange are written to the transaction logs, changes are made to the database pages in memory, and eventually those changes are flushed to the database file. ESE therefore uses a transactional architecture; specifically, it follows the ACID model. ACID is an acronym that stands for:

Atomic - This means that a transaction state change is all or nothing.
Consistent - This means that a transaction must preserve the consistency of the data (i.e. transforming it from one consistent state to the next).
Isolated - This means that a transaction is a unit of operation and each assumes it is the only transaction occurring, even though multiple transactions occur at the same time.
Durable - This means that committed transactions are preserved in the database, even if the system stops responding.

The Extensible Storage Engine utilizes several components:

Transaction Logs - Transaction log files record the operations that are performed against the database, to ensure durability in case of system failure.
Checkpoint File - The checkpoint file is a waypoint in the log sequence that indicates where in the log stream ESE needs to start a recovery in case of a failure.
Checkpoint Depth - The checkpoint depth is a threshold that defines when ESE begins to aggressively flush dirty pages to the database.
ESE Cache - An area of memory reserved for the Information Store process (Store.exe) that holds database pages. Keeping pages in memory reduces read I/Os, especially when a page is accessed many times. The cache also reduces write I/Os in two ways: a dirty page held in memory can absorb multiple changes before it is flushed to disk, and the engine can write multiple pages together (up to 1 MB) using write coalescing.
Log Buffers - Operations performed against the database are first held in memory before they are written to transaction logs.
Version Store - An in-memory list of modifications that are made to the database. The version store is used for roll-back of transactions, for write-conflict detection, and for other tasks. The version store entry for a specific operation is cleaned up only when both of the following conditions are true:
- The transaction that owns the operation has been either committed or rolled back.
- All the transactions that are outstanding at the time that the operation was performed have been either committed or rolled back.
EDB File - Exchange database (.edb) files are the repository for mailbox (or public folder) data. Within the database, data is stored in 8 KB database pages, which in turn are stored in a collection of balanced trees, known as B+ trees. A B+ tree is a multi-level index in which all records are stored in leaf nodes, which are logically linked together to allow efficient searching and sorting of data. This allows a B+ tree to keep data in a persistent arrangement and to locate and retrieve information from disk quickly.

Note: For more information on B+ trees, please see http://en.wikipedia.org/wiki/B%2B_tree.

So how do the above components work together?

An operation occurs against the database (e.g. client sends a new message), and the page that requires updating is read from the file and placed into the ESE cache (if it is not already in memory), while the log buffer is notified and records the operation in memory.
The changes are recorded by the database engine but are not immediately written to disk; instead, these changes are held in the ESE cache and are known as "dirty" pages because they have not been committed to the database. The version store is used to keep track of these changes, thus ensuring isolation and consistency are maintained.
As the database pages are changed, the log buffer is notified to commit the change, and the transaction is recorded in a transaction log file (which may or may not require a log roll and a start of a new log generation).
Eventually the dirty database pages are flushed to the database file.
The checkpoint is advanced.
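The five steps above can be sketched as a toy write-ahead-logging engine. This is a minimal illustration, not ESE code: the class name, its fields, and the rule used to advance the checkpoint (the oldest change that still lives only in memory) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MiniEngine:
    """Toy write-ahead-logging engine; all names are illustrative, not ESE APIs."""
    disk: dict = field(default_factory=dict)        # database file: page -> contents
    cache: dict = field(default_factory=dict)       # in-memory pages: page -> contents
    dirty: dict = field(default_factory=dict)       # page -> LSN of oldest unflushed change
    log_buffer: list = field(default_factory=list)  # records not yet in a log file
    log_file: list = field(default_factory=list)    # records written to the transaction log
    checkpoint: int = 0                             # recovery replays records from this LSN on
    next_lsn: int = 0

    def update(self, page, value):
        # Step 1: bring the page into the cache and record the operation in the log buffer.
        self.cache.setdefault(page, self.disk.get(page, ""))
        self.log_buffer.append((self.next_lsn, page, value))
        # Step 2: change the cached page only; it is now dirty.
        self.cache[page] = value
        self.dirty.setdefault(page, self.next_lsn)
        self.next_lsn += 1

    def commit(self):
        # Step 3: move buffered records into the transaction log.
        self.log_file.extend(self.log_buffer)
        self.log_buffer.clear()

    def flush(self, page):
        # Step 4: write a dirty page to the database file, but only after its
        # log records are in the log file (the write-ahead rule).
        if any(p == page for _, p, _ in self.log_buffer):
            raise RuntimeError("log records for this page are not yet written")
        self.disk[page] = self.cache[page]
        self.dirty.pop(page, None)
        # Step 5: the checkpoint advances to the oldest change still only in memory.
        self.checkpoint = min(self.dirty.values(), default=self.next_lsn)


engine = MiniEngine()
engine.update(13, "record A")   # LSN 0
engine.update(14, "record B")   # LSN 1
engine.commit()
engine.flush(14)
checkpoint_after_first_flush = engine.checkpoint  # page 13 is still dirty
engine.flush(13)
print(checkpoint_after_first_flush, engine.checkpoint)
```

In this model the checkpoint cannot move past LSN 0 while page 13 is still dirty, which is the same mechanism that stalls the checkpoint when pages cannot be flushed.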

For more information, please see Understanding the Exchange 2007 Store at http://technet.microsoft.com/en-us/library/bb331958.aspx.

Page Dependencies Removal

As an ESE database is updated, B+ tree splits and merges mean that records have to be moved from one location in the database to another. The easiest way to log the record move is to log a deletion ("record A was removed from page 13") and an insertion ("record A was inserted into page 14"). The disadvantage of this approach is that the amount of data logged is proportional to the size of the records being moved. Because the original insertion of the record ("record A was inserted into page 13") was already logged, re-logging the same data is redundant.

To avoid logging multiple copies of the data, ESE uses page dependencies, which reduce the amount of data logged during B+ tree splits and merges by forcing pages to be written to disk in a specific order. Instead of logging a deletion and an insertion, ESE logs a move operation ("record A was moved from page 13 to page 14").

Suppose there is a crash after record A is moved from page 13 to page 14. If pages 13 and 14 can be flushed to disk independently, then there are four possible states for pages 13 and 14 in the database:

Page 13	Page 14	Recovery Action
A	<blank>	Move A from 13 to 14
<blank>	A	Nothing - A has already been moved
A	A	Delete A from page 13
<blank>	<blank>	Disaster - record A has been lost!

In the last case, page 13 was written to disk after record A was removed, but page 14 (which now contains record A) has not been written to disk. To avoid this, a page dependency is created between pages 13 and 14 so that page 14 must be written to disk before page 13. This means that after a crash only the first three cases above are possible.

What are the benefits of this approach? Page dependencies are a cunning way to reduce the amount of data logged when records are moved between pages in the database. If we only have to log the record's location when moving it, instead of the actual record itself, then we are reducing log generation, which ultimately reduces capacity requirements.

But this poses several problems:

Pages must be written to disk serially and the I/O cannot be coalesced. Not only does this have implications in terms of maintaining ACID, but it also has implications in terms of I/O. Each preceding page I/O must complete before the next one begins. It also means that if two (or more) consecutive pages have dependencies between them, they cannot be written in a single I/O. Multiple separate I/Os have to be performed serially.
During streaming backup, page dependencies prevent some pages from being flushed to disk. Suppose record A is moved from page 13, which has not been backed up yet, to page 14 which has been backed up. The backed-up copy of page 14 doesn't contain the record so page 13 should not be written to disk (remember the logged operation only records the move, not the data moved). If page 13 was written, the backup would copy that image of page 13 and end up containing an image of page 13 without the record, making recovery impossible. Older versions of Exchange dealt with this problem by writing pages to the patch file, but the current version simply avoids flushing page 13 until the backup is finished. This in turn prevents the checkpoint from advancing, causing JET_errCheckpointDepthTooDeep problems if a backup job hangs.
Recovering a single page cannot be done in isolation. At first glance restoring one page from a tape backup of an Exchange server looks easy - read the desired page from tape and then apply all the logged updates to that page. Unfortunately page dependencies make that impossible. Imagine a case where page 14 is being restored from tape and ESE encounters a logged update which says "Move record A from page 13 to page 14". Page 13 is now required to do the recovery - not the current version of page 13 (which doesn't contain record A) but an image of page 13 as it was at that point in time. This requires simultaneously restoring page 13 and page 14. In turn, page 13 might be the target of a record move so other pages will have to be restored as well. In general all the log files would have to be first restored and analyzed before single-page restore could be attempted. This would be incredibly complicated and require a lot of time and disk space, and is currently not something that can be done with Exchange 2007 (or any previous version).
High memory requirements per storage group. During online defragmentation, ESE reorganizes the database structure, performing thousands of page merges and page splits in an attempt to free up pages. Remember that the moved data is not logged; only the move itself is. As a result, huge dependency trees are generated from the thousands upon thousands of operations that are performed during the maintenance cycle. This requires additional memory to hold all the dirty cache. To account for this, our RTM memory requirements were a minimum of 512 MB per storage group.
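The four crash states tabulated earlier in this section, and the effect of the write-ordering rule on them, can be enumerated mechanically. The sketch below is only a model of that table; the function names and the string encoding of page contents are assumptions made for illustration.

```python
from itertools import product

# Recovery actions copied from the table of crash states.
RECOVERY_ACTION = {
    ("A", "<blank>"): "Move A from 13 to 14",
    ("<blank>", "A"): "Nothing - A has already been moved",
    ("A", "A"): "Delete A from page 13",
    ("<blank>", "<blank>"): "Disaster - record A has been lost!",
}


def crash_states(dependency_enforced):
    """On-disk (page 13, page 14) states after 'move A from 13 to 14' and a crash."""
    states = []
    for p13_flushed, p14_flushed in product([False, True], repeat=2):
        # With the dependency, page 13 may reach disk only after page 14 has.
        if dependency_enforced and p13_flushed and not p14_flushed:
            continue
        page13 = "<blank>" if p13_flushed else "A"   # flushed page 13 no longer holds A
        page14 = "A" if p14_flushed else "<blank>"   # flushed page 14 now holds A
        states.append((page13, page14))
    return states


without_dependency = crash_states(dependency_enforced=False)
with_dependency = crash_states(dependency_enforced=True)

for state in with_dependency:
    print(state, "->", RECOVERY_ACTION[state])
```

Without the ordering rule all four states are reachable; with it, the state where both pages are blank on disk is excluded, which is exactly why the move can be logged without the data.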

So how can we address this situation? The simple answer is to remove page dependencies. Instead of creating page dependencies, we can simply log the entire source page when performing a record move or split. At recovery time, if the destination page was not flushed (i.e., it doesn't contain the data) but the source page was flushed (i.e., it doesn't contain the data either), the logged page image can be used to redo the data move.

So what effect does removing page dependencies in Exchange 2007 SP1 have?

By removing page dependencies we have improved database I/O characteristics. No longer do we have to write operations in a serial order since pages do not depend upon one another. This improves our database write I/O - we can now coalesce multiple writes together, thus reducing the number of I/Os, while increasing the write size.
By removing page dependencies, we have reduced our minimum memory requirements per storage group. Now that we have removed page dependencies, we no longer generate large dependency trees during online defragmentation, and as a result, we have changed our memory guidance to require about 300MB per storage group. For more information on the minimum memory requirements, please take a look at http://technet.microsoft.com/en-us/library/bb738124.aspx.
By removing page dependencies, we have massively reduced the likelihood of hitting the JET_errCheckpointDepthTooDeep condition associated with hung backups. Suppose record A is moved from page 13, which has not been backed up yet, to page 14, which has been backed up. The backed-up copy of page 14 doesn't contain the record, but since we logged page 13 to perform the move to page 14, we can flush the updated page 13 to disk. Even though the backup will only contain a copy of page 13 without the record, because the logs contain the actual data moved, we can now ensure that recovery can happen and not lose any data. As a result, the checkpoint can advance because the pages can be written to disk.
Removing page dependencies means there are fewer dirty pages pinned in the cache which saves CPU resources and simplifies the job of the buffer manager. With page dependencies gone, the buffer manager can flush dirty pages when it sees fit, instead of having to walk the dependency tree to see if the flush can actually occur (in other words, we do not have to check to see if another page has to be written first due to page dependency). Walking the dependency tree is processor-expensive and gets even more expensive based on dependency tree depth (which is a function of the LLR waypoint depth).
Removing page dependencies allows us to further evolve the product in ways that are particularly beneficial for continuous replication and reseed operations. Future versions (beyond Exchange 2007) may be able to take advantage of database page patching.
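The recovery rule behind the SP1 change (log the source page image so that the move can be redone from the log alone) can be sketched as follows. This is a simplified model, not ESE's redo logic: the record layout and the `redo_move` function are hypothetical, and the sketch decides what to do by inspecting page contents, whereas a real engine would compare page versions.

```python
def redo_move(disk, record):
    """Replay one logged move; the record carries the pre-move image of the source page."""
    src, dst, key = record["src"], record["dst"], record["key"]
    if key not in disk[dst]:
        # Destination was never flushed. Use the source page on disk if it still
        # holds the record, otherwise fall back to the logged page image.
        source = disk[src] if key in disk[src] else record["src_image"]
        disk[dst][key] = source[key]
    # The record must not remain on the source page after the move.
    disk[src].pop(key, None)
    return disk


record = {"src": 13, "dst": 14, "key": "A", "src_image": {"A": "payload"}}

# All four on-disk states that are possible once pages can be flushed in any order.
states = [
    {13: {"A": "payload"}, 14: {}},                # neither page flushed
    {13: {}, 14: {"A": "payload"}},                # both pages flushed
    {13: {"A": "payload"}, 14: {"A": "payload"}},  # only page 14 flushed
    {13: {}, 14: {}},                              # only page 13 flushed
]

results = [redo_move(state, record) for state in states]
print(results)
```

Every starting state, including the one that was a disaster under a data-free move record, ends with record A on page 14 only. The cost is that the log record now carries the page image, which is the log growth discussed next.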

However, removing page dependencies does have an impact. Page dependencies were originally conceived as a log optimization technique. By removing them, we now have to log the data being moved, which means that log generation increases. Internally, we saw a 33% increase in log generation after we disabled page dependencies. The increase in log generation affects other things as well:

Backup times. The more logs generated, the longer it will take to complete backups.
Log capacity requirements. The more logs generated, the more capacity is required.

Right now, many of you may be thinking "Yikes, SP1 is going to pwn my log drives and since I didn't account for a 33% increase... darn you Exchange!" Relax; we have you covered. Here's how we addressed this situation.

Disabling Partial Merges

While removing page dependencies provides many benefits, the increase in log generation and its repercussions are not ideal. So we had to find a way to reduce log generation without breaking the ACID rules. To solve this issue, we went back and analyzed the logs generated on servers. What we found was that, after disabling page dependencies, a significant portion of the log generation increase was due to the automatic online defragmentation of the database.

Online defragmentation (OLD) is a process used to free up pages in the EDB file. This reduces the number of pages that have to be visited in order to locate or insert data into the appropriate place. Essentially, the OLD process navigates to the end of a B+ tree and starts moving records from the leftmost pages to the rightmost pages, collapsing the B+ tree as much as possible. In many cases, the engine merges the records from one page to another without actually freeing a page; this is known as a partial merge. The hope with partial merges is that the page can be freed during the next OLD pass. Partial merges were useful back when disk sizes were very small (think back to the Exchange 5.5 days), since it was important to make the database as dense as possible and use every last byte of storage.

However, partial merges have consequences. As many of you have probably witnessed using Performance Monitor, OLD is an extremely disk write I/O intensive process. In addition, since we are moving data around within the database, the operations need to be logged, which makes OLD log generation intensive as well.
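To make the distinction concrete, here is a small sketch of a merge decision between two neighboring leaf pages, with and without partial merges allowed. It is a toy model under stated assumptions: records are represented only by their sizes in bytes, the full 8 KB page is treated as usable space (page header overhead is ignored), and the `try_merge` function is hypothetical rather than anything in ESE.

```python
PAGE_CAPACITY = 8192  # bytes; 8 KB pages, ignoring header overhead (assumption)


def try_merge(src, dst, allow_partial):
    """Move record sizes from page src into neighbor dst.

    Returns (records left on src, records on dst, whether src was freed).
    """
    free = PAGE_CAPACITY - sum(dst)
    src_left, moved = list(src), []
    if sum(src) <= free:
        # Full merge: everything fits, so the source page can be freed.
        moved, src_left = src_left, []
    elif allow_partial:
        # Partial merge: move whatever fits, even though src stays allocated.
        while src_left and src_left[-1] <= free:
            free -= src_left[-1]
            moved.append(src_left.pop())
    return src_left, list(dst) + moved, not src_left


partial = try_merge([3000, 3000], [3000], allow_partial=True)
skipped = try_merge([3000, 3000], [3000], allow_partial=False)
full = try_merge([2000, 2000], [3000], allow_partial=False)

print(partial)  # one record moved, both pages modified, no page freed
print(skipped)  # nothing moved, nothing modified
print(full)     # everything moved, source page freed
```

In the first call both pages are modified, and the move has to be logged, without any page being freed. That is the churn a policy of merging only when the source page can be freed avoids.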
The database churn that occurs during OLD also has another side effect that customers saw with the release of Exchange Server 2003: snapshots via VSS are rather large because a significant portion of the database changes each time OLD executes.

So what would happen if we disabled partial merges? We disabled them and found two things:

1. With partial merges disabled, databases are not compacted as tightly. We will only move records from one page to another if we can free up the entire source page. As a result, there is some bloat in the database; however, the bloat is small and does not increase drastically over time. For example, consider the following server, which had a 162 GB store and a 171 GB store. A stress test was performed and we analyzed the difference between having partial merges enabled and disabled. The end result is that, after four weeks of having partial merges disabled, the database file only increased in size by about 2%.

4-week stress test	Partial Merges Enabled	Partial Merges Disabled
DB size pre-defrag (bytes)	162,588,409,856	171,188,961,280
% Available Pages	1.80%	1.33%
DB size post-defrag (bytes)	158,247,026,688	163,652,911,104
% Difference	2.65%	4.45%

2. With partial merges disabled, the database churn and log generation numbers decrease significantly when OLD runs. The following example compares two storage groups, one with partial merges enabled and the other with partial merges disabled. On the storage group that had partial merges enabled, 30 GB of the database was manipulated due to partial merges and about 18 thousand log files were generated. On the storage group without partial merges, only 5.5 GB of the database changed and only about 13 thousand log files were generated. That's a reduction in database churn of ~80% and a reduction of ~25% in log generation.

3000-mailbox server (8 hours)	Partial Merges Enabled	Partial Merges Disabled
OLD Page Reads	56,160,000	57,830,400
OLD Pages Dirtied	3,830,400	691,200
OLD Pages Freed	392,457	201,600
BTree Partial Merges	2,494,080	0
Database Churn (GB)	30	5.5
Log Files Generated	18,269	13,701
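The percentages quoted for this comparison can be checked directly from the table. The short calculation below uses only the table's figures. The one assumption is the conversion in `churn_gb`, which treats churn as pages dirtied times the 8 KB page size with 1 GB taken as 1,000,000 KB; that happens to reproduce the table's churn row, but the source does not say how the row was derived.

```python
def reduction_pct(before, after):
    """Percentage reduction from before to after."""
    return (before - after) / before * 100


def churn_gb(pages_dirtied, page_kb=8):
    """Approximate churn in GB, assuming 1 GB = 1,000,000 KB (assumption)."""
    return pages_dirtied * page_kb / 1_000_000


churn_reduction = reduction_pct(30, 5.5)                   # Database Churn (GB) row
log_reduction = reduction_pct(18_269, 13_701)              # Log Files Generated row
dirtied_reduction = reduction_pct(3_830_400, 691_200)      # OLD Pages Dirtied row

print(f"churn reduction:         {churn_reduction:.1f}%")
print(f"log file reduction:      {log_reduction:.1f}%")
print(f"pages dirtied reduction: {dirtied_reduction:.1f}%")
print(f"churn from pages dirtied: {churn_gb(3_830_400):.1f} GB vs {churn_gb(691_200):.1f} GB")
```

The churn and pages-dirtied reductions both come out a little above 80%, and the log file reduction at about 25%, in line with the rounded figures in the text.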

In addition to OLD, we also found that partial merges were performed during normal runtime. Continuing with the 3000-mailbox server, we noticed that there was an average of 3 B+ tree partial merges/sec over a 24-hour period after we disabled partial merges in OLD. Each partial merge equated to roughly 3 page touches (pages dirtied), which over a 24-hour period resulted in 6000 logs being generated (the server generates around 110 thousand logs a day). By removing partial merges during normal runtime, we saw an additional 5% reduction in log generation.

The benefit that partial merges provide in terms of database compactness is heavily outweighed by the cost of achieving that compactness (database churn and log generation). In the end, disabling partial merges netted us a 30% reduction in log generation and substantially reduced our database churn during OLD.

Log Generation Numbers & Message Profiles

Even before we started coding SP1, we knew we were going to remove page dependencies. At the time we knew there would be a growth in log generation, and we didn't know how we would curb it. We assumed the worst case, namely that we would ship SP1 with a growth in log generation. So back in January of 2007, we released the storage calculator (http://go.microsoft.com/fwlink/?linkid=84202) and updated the storage design article (http://technet.microsoft.com/en-us/library/bb738147.aspx). One of the guidance changes included this table, which associated log generation with the message profile:

Mailbox profile	Message profile	Logs generated / mailbox / day
Light	5 sent/20 received	7
Average	10 sent/40 received	14
Heavy	20 sent/80 received	28
Very heavy	30 sent/120 received	42

What we did not tell you at the time was that the values in the Logs generated / mailbox / day column included an increase for the page dependency removal. Now the good news is that since we disabled partial merges, the log generation growth caused by removing page dependencies was canceled out. As a result we are changing our log generation guidance to be as follows:

Mailbox profile	Message profile	Logs generated / mailbox / day
Light	5 sent/20 received	6
Average	10 sent/40 received	12
Heavy	20 sent/80 received	24
Very heavy	30 sent/120 received	36
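As a rough illustration of what the two guidance tables mean for capacity planning, the sketch below converts logs per mailbox per day into daily log volume. The mailbox count is hypothetical, and the 1 MB log file size is an assumption stated here rather than in the article; adjust both to match your environment.

```python
LOG_FILE_MB = 1    # assumed transaction log file size in MB
MAILBOXES = 3000   # hypothetical mailbox count for the example

# Logs generated / mailbox / day, copied from the two guidance tables.
EARLIER_GUIDANCE = {"Light": 7, "Average": 14, "Heavy": 28, "Very heavy": 42}
REVISED_GUIDANCE = {"Light": 6, "Average": 12, "Heavy": 24, "Very heavy": 36}


def daily_log_gb(logs_per_mailbox, mailboxes=MAILBOXES):
    """Daily log volume in GB (1 GB = 1024 MB) for one mailbox profile."""
    return logs_per_mailbox * mailboxes * LOG_FILE_MB / 1024


for profile in EARLIER_GUIDANCE:
    earlier = daily_log_gb(EARLIER_GUIDANCE[profile])
    revised = daily_log_gb(REVISED_GUIDANCE[profile])
    saved = (earlier - revised) / earlier * 100
    print(f"{profile:<10} {earlier:7.1f} GB/day -> {revised:7.1f} GB/day ({saved:.1f}% less)")
```

Each profile drops by one seventh (about 14%) under the revised guidance. Remember that log drives also need headroom for the interval between backups, which this per-day figure does not include.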

Note: We will be updating the storage calculator and our storage guidance documentation on TechNet as a result.

Conclusion

To summarize, while removing page dependencies caused a 33% increase in log generation when compared with RTM, we were able to mitigate it by disabling partial merges. The end result is that we now have the following benefits:

A massively reduced likelihood of checkpoint-too-deep errors as a result of hung backups.
Reduced memory requirements per storage group.
Smaller snapshot / differential backups.
A more efficient dirty page flushing model for the buffer manager, which improves write I/Os and reduces CPU cycles.
Future versions of Exchange may be able to evolve recovery and/or reseed operations.
Updated guidance on log generation numbers per message profile.

- Ross Smith IV

Updated Jul 01, 2019


Exchange Team Blog

You Had Me at EHLO.