In Exchange Server 2010, there is no more single instance storage (SIS). To help understand why SIS is gone, let's review a brief history of Exchange.
During the development of Exchange 4.0, we had two primary goals in mind, and SIS was borne out of these goals:
- Ensure that messages were delivered as fast and as efficient as possible.
- Reduce the amount of disk space required to store messages, as disk capacity was premium.
Exchange 4.0 (and, to a certain extent, Exchange 5.0 and Exchange 5.5) was really designed as a departmental solution. Back then, users were typically placed on an Exchange server based on their organization structure (often, the entire company was on the same server). Since there was only one mailbox database, we maximized our use of SIS for both message delivery (only store the body and attachments once) and space efficiency. The only time we created another copy within the store was when the user modified their individual instance.
For almost 19 years, the internal Exchange database table structure has remained relatively the same:
Then came Exchange 2000. In Exchange 2000, we evolved considerably - we moved to SMTP for server-to-server connectivity, we added storage groups, and we increased the maximum number of databases per server. The result was a shift away from a departmental usage of Exchange to enterprise usage of Exchange. Moreover, the move to 20 databases reduced SIS effects on space efficiency, as the likelihood that multiple recipients were on the same database decreased. Similarly, message delivery was improved by our optimizations in transport, so transport no longer benefited as much from SIS either.
With Exchange 2003, consolidation of servers took off in earnest due to features like Cached Exchange Mode. Again the move away from departmental usage continued. Many customers moved away from distributing mailboxes based on their organization structure to randomization of the user population across all databases in the organization. Once again, the space efficiency effects of SIS were further reduced.
In Exchange 2007, we increased the number of databases you could deploy, which again reduced the space efficiency of SIS. We further optimized transport delivery and completely removed the need for SIS from a transport perspective. Finally, we made changes to the information store that removed the ability to single instance message bodies (but allowed single instancing of attachments). The result was that SIS no longer provided any real space savings - typically only about 0-20%.
One of our main goals for Exchange 2010 was to provide very large mailboxes at a low cost. Disk capacity is no longer a premium; disk space is very inexpensive and IT shops can take advantage of larger, cheaper disks to reduce their overall cost. In order to leverage those larger capacity disks, you also need to increase mailbox sizes (and remove PSTs and leverage the personal archive and records management capabilities) so that you can ensure that you are designing your storage to be both IO efficient and capacity efficient.
During the development of Exchange 2010, we realized that having a table structure optimized for SIS was holding us back from making the storage innovations that were necessary to achieve our goals. In order to improve the store and ESE, to change our IO profile (from many, small, random IOs to larger, fewer, more sequential IOs), and to resolve our inefficiencies around item count, we had to change the store schema. Specifically, we moved away from a per-database table structure to a per-mailbox table structure:
This architecture, along with other changes to the ESE and store engines (lazy view updates, space hints, page size increase, b+ tree defrag, etc.), netted us not only a 70% reduction in IO over Exchange 2007, but also substantially increased our ability to store more items in critical path folders.
As a result of the new architecture and the other changes to the store and ESE, we had to deal with an unintended side effect. While these changes greatly improved our IO efficiency, they made our space efficiency worse. In fact, on average they increased the size of the Exchange database by about 20% over Exchange 2007. To overcome this bloating effect, we implemented a targeted compression mechanism (using either 7-bit or XPRESS, which is the Microsoft implementation of the LZ77 algorithm) that specifically compresses message headers and bodies that are either text or HTML-based (attachments are not compressed as typically they exist in their most compressed state already). The result of this work is that we see database sizes on par with Exchange 2007.
The below graph shows a comparison of database sizes for Exchange 2007 and Exchange 2010 with different types of message data:
As you can see, Exchange 2007 databases that contained 100% Rich Text Format (RTF) content was our baseline goal when implementing database compression in Exchange 2010. What we found is that with a mix of messaging data (77% HTML, 15% RTF, 8% Text, with an average message size of 50KB) that our compression algorithms are on par with Exchange 2007 database sizes. In other words, we mitigated most of the bloat caused by the lack of SIS.
Is compression the answer to replacing single instancing all together? The answer to that question is that it really does depend. There are certain scenarios where SIS may be viable:
- Environments that only send Rich-Text Format messages. The compression algorithms in Exchange 2010 do not compress RTF message blobs because they already exist in their most compressible form.
- Sending large attachments to many users. For example, sending a large (30 MB+) attachment to 20 users. Even if there were only 5 recipients out of the 20 on the same database, in Exchange 2003 that meant the 30MB attachment was stored once instead of 5 times on that database. In Exchange 2010, that attachment is stored 5 times (150 MB for that database) and isn't compressed. But depending on your storage architecture, the capacity to handle this should be there. Also, your email retention requirements will help here, by forcing the removal of the data after a certain period of time.
- Business or organizational archives that are used to maintain immutable copies of messaging data benefit from single instancing because the system only has to keep one copy of the data, which is useful when you need to maintain that data indefinitely for compliance purposes.
If you go back through our guidance over the past 10 years, you will never find a single reference to using SIS around capacity planning. We might mention it has an impact in terms of the database size, but that's it. All of our guidance has always dictated designing the storage without SIS in mind. And for those that are thinking about thin provisioning, SIS isn't a reason to do thin provisioning, nor is SIS a means to calculate your space requirements. Thin provisioning requires an operational maturity that can react quickly to changes in the messaging environment , as well as, a deep understanding of the how the user population behaves and grows over time to sufficiently allocate the right amount of storage upfront.
In summary, Exchange 2010 changes the messaging landscape. The architectural changes we have implemented enable the commoditization of email - providing very large mailboxes at a low cost. Disk capacity is no longer a premium. Disk space is cheap and IT shops can take advantage of larger, cheaper disks to reduce their overall cost. With Exchange 2010 you can deploy a highly available system with a degree of storage efficiency without SIS at a fraction of the cost that was required with previous versions of Exchange.
So, there you have it. SIS is gone.
- Ross Smith IV
You Had Me at EHLO.