This blogging thing sucks you in, doesn't it? Not content with having an ongoing series on disaster recovery and CHECKDB (with another 6 and 25 more posts planned respectively), I'm starting a new series on fragmentation. This will begin from first principles and work up, in approximately 18 posts over the next few months. The first few posts could be skipped by some people and will cover:
- What are records?
- What are pages?
- What are extents?
- What is a heap?
- What is a clustered index?
- What is a non-clustered index?
Bear with me as I build up the terminology. You could also check out Kalen's excellent book Inside SQL Server 2000 for details on these topics. (Her upcoming volume on the Storage Engine for Inside SQL Server 2005 should be out sometime this summer - buy it!).
The idea for this series came to mind last Friday at TechEd when I spent about 3 hours repeating a deck on fragmentation I gave to the North Texas SSUG in April 2004, people kept wandering by and stopping to listen. If that many people are interested in this stuff, it should make good blog material, and it also seems to be frequently misunderstood. The level of sophistication here ranges from not having any idea what fragmentation is up to defragging or rebuilding all indexes every night. I didn't see anyone there who knew just when it was worth removing fragmentation so I decided to explain here. The first 6 posts will also be useful as background for the CHECKDB internals series I'm doing.
So, what are records? At the simplest level, a record is the physical storage associated with a table or index row. Of course, it gets much more complicated than that...
Data records
These store table rows in a heap, or in the leaf level of a clustered index. They store all the columns of the table row that will fit in a single 8Kb page.
Index records
There are two types of index records (which differ only in what columns they store):
- Those that store non-clustered index rows
- Those that comprise the b-tree that make up clustered and non-clustered indexes.
Text records
- There are various types of text records that comprise the tree structure that stores LOB values, stored on two types of text page.
- An in-depth explanation of the various types and how they hang together is beyond the scope of this post and discussion.
- They are also used to store variable-length column values that have been pushed out of data or index records as part of the row-overflow capability.
Ghost records
- These are records that have been logically deleted but not physically deleted from the leaf level of an index.
- The reasons for this are complicated, but basically having ghost records simplfies key-range locking and transaction rollback.
- The record is marked with a bit that indicates it's a ghost record and cannot be physically deleted until the transaction that caused it to be ghosted commits. Once this is done, it is deleted by an asynchronous background proces (called the ghost-cleanup task ) or it is converted back to a real record by an insert of a record with the exact same set of keys.
- Ghost records will be mentioned later in the series when I discuss page compaction.
Other record types
- There are also records that are used to store various allocation bitmaps, intermediate results of sort operations, and file and database metadata (e.g. in the per-file fileheader page and database boot page).
- These are not relevant to a discussion of fragmentation so won't be considered further here.
Record structure
All records have the same structure, regardless of their type and use, but the number and type of columns will be different. For instance, a data record from a table with a complex schema may have hundreds of columns of various types whereas an allocation bitmap record will have a single column, filling up the whole page.
The record structure isn't relevant to a discussion on fragmentation but is for CHECKDB internals, so here it is:
- record header
- 4 bytes long
- two bytes of record metadata (record type)
- two bytes pointing forward in the record to the NULL bitmap
- fixed length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint , char(10) , datetime )
- NULL bitmap
- two bytes for count of columns in the record
- variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only)
- this allows an optimization when reading columns that are NULL
- variable-length column offset array
- two bytes for the count of variable-length columns
- two bytes per variable length column, giving the offset to the start of the column value
- versioning tag
- this is in SQL Server 2005 only
- this is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb
If you have any questions on this stuff - put them in the comments of drop me an email.
Next time - what are pages?