First published on MSDN on Sep 05, 2011 by Martin Merdes.
SAP encourages their customers to convert existing non-Unicode systems to Unicode. However, several SQL Server customers hesitated to perform a Unicode Conversion, because of the increased storage requirements of UCS-2. This issue has been finally solved with Unicode-compression, which is a new feature of SQL Server 2008 R2. A Unicode Conversion running on SQL Server 2008 R2 even results in a decreased database size. Therefore hardware restrictions are no longer an excuse to stay on non-Unicode. The space reduction is also documented in SAP note 1139642 (Hardware Requirements in Unicode Systems) .
In recent conversations with customers and partners it turned out, that the SQL Server feature “Unicode-compression” was not fully understood. Although the feature contains the word “compression”, it is rather an improved encoding than a typical database compression feature (like page-compression). However, when doing a system copy, the database will be page-compressed by default. Therefore you will see space savings caused by page-compression in addition to savings caused by the improved encoding of Unicode-compression. As a result, the (used space inside the) database will shrink more than 50% during a Unicode Conversion.
To better understand Unicode-compression, let’s first have a look at the space requirements of UCS-2 in the past:
Space requirement for an SAP database up to SQL Server 2008
SAP always used the UCS-2 character encoding for SQL Server, which uses a 2 byte fixed length encoding per character. This resulted in a space usage of 18 bytes to store the string “Microsoft” (which consists of 9 characters) in a column with data type NVARCHAR(40). In the Stone Age SQL Server releases 6.0 and 6.5 it was even worse, since all strings were stored in fixed length data types. NCHAR(40) results in 80 bytes on the disk to store the string “Microsoft”. With SQL Server 7.0 and 2000 a major improvement was to use variable length strings for SAP (but still fixed length 2 byte encoding for Unicode characters). In SQL Server 2005 we introduced the vardecimal storage format, which decreased the space usage for decimal data, but did not change anything for character data (the vardecimal storage format was only used by SAP BW). SQL Server 2008 was a huge step forward to save disk space. The data compression types NONE, ROW and PAGE were introduced. The strongest compression type PAGE is a real database compression, which compresses whole database pages and therefore works for all data types. The compression type ROW implemented variable length storage of all numeric data types (including INT, not only DECIMAL). It had no effect on character data types. Finally, the compression type NONE guaranteed the exact same storage format as in older SQL Server releases. The only reason for having the compression type NONE is to enable an upgrade of SQL Server without changing the content of the user databases. There is no other benefit of using compression type NONE in comparison to ROW.
Unicode improvements as of SQL Server 2008 R2
One of the main demands for SQL Server 2008 R2 was to reduce the storage requirements for an SAP Unicode database. The natural way would have been to introduce UTF-8 character encoding. However, this would have resulted in a different data type (NVARCHAR for UCS-2 and VARCHAR for UTF-8). This was simply not acceptable for SAP and their customers. Changing the data type of existing SAP Unicode systems would have resulted in a complete data export and import of the SAP database. The downtime for this procedure could easily be several days for huge SAP databases. SAP required a transparent way to store Unicode data with UCS-2 as efficient as with UTF-8. The new feature in SQL 2008 R2 called Unicode-compression meets these demands by SAP. The feature name contains the word “compression”, because it only works with compression type ROW and PAGE. As already mentioned, compression type NONE was merely introduced in order to keep the old storage format during a SQL Server upgrade.
Technical details of Unicode-compression
Unicode-compression is totally transparent for a SQL Server administrator. There is no switch to turn this feature on or off. There is no different data type. Strings are encoded much more efficient, in the best case 1 byte per character. In the worst case, the space usage for a string will never be higher than before (without Unicode-compression). The trick is to have 2 different encoding schemes. Each string can be stored in one of the 2 schemes. After upgrading from SQL Server 2008 to SQL Server 2008 R2, all Unicode strings are stored in the old scheme using 2 bytes per character. Once you insert or update a row, the containing strings are encoded using the new scheme, having a variable length of bytes per character (1 byte for ASCII characters, up to 4 bytes for extremely seldom characters). For strings resulting in an even number of bytes, an additional zero byte is added. The new scheme is used, if it needs less bytes than the old one. Otherwise the string is stored using the old scheme. When reading a row, you can identify the encoding scheme simply by looking at the storage size of each string. If the storage size is an even number of bytes, then the old encoding scheme was used. For an odd number of bytes, it was the new encoding scheme. You can directly see the improvements of the new encoding scheme by using DBCC PAGE: