In this post, I’m going to take a look at how SQL Server parallelizes scans. The scan operator is one of the few operators that is parallel “aware.” Most operators neither need to know nor care whether they are executing in parallel; the scan is an exception.
How does parallel scan work?
The threads that compose a parallel scan work together to scan all of the rows in a table. There is no a-priori assignment or rows or pages to a particular thread. Instead, the storage engine dynamically hands out pages to threads. A parallel page supplier coordinates access to the pages of the table. The parallel page supplier ensures that each page is assigned to exactly one thread and, thus, is processed exactly once.
At the beginning of a parallel scan, each thread requests a set of pages from the parallel page supplier. The threads then begin processing these pages and returning rows. When a thread finishes with its assigned set of pages, it requests the next set of pages from the parallel page supplier.
This algorithm has a couple of advantages:
It is independent of the number of threads. We can add and remove threads from a parallel scan and it automatically adjusts. If we double the number of threads, each thread processes (approximately) half as many pages. And, if the I/O system can keep up, the scan runs twice as fast.
It is resilient to skew or load imbalances. If one thread runs slower than the other threads, that thread simply requests fewer pages while the other faster threads pick up the extra work. The total execution time degrades smoothly. (Compare this scenario to what would happen if we statically assigned pages to threads: the slow thread would dominate the total execution time.)
Let’s begin with a simple example. To get a parallel plan, we’ll need a fairly big table; if the table is too small, the optimizer will conclude that a serial plan is perfectly adequate. The following script creates a table with 1,000,000 rows and (thanks to the fixed length char(200) column) about 27,000 pages. Warning: If you decide to run this example, it could a few minutes to populate this table.
create table T (a int, x char(200))
set nocount on
declare @i int
set @i = 0
while @i < 1000000
insert T values(@i, @i)
set @i = @i + 1
Now, for the simplest possible query:
select * from T
We get a serial plan! Why don’t we get a parallel plan? Parallelism is really about speeding up queries by applying more CPUs to the problem. The cost of this query is dominated by the cost of reading pages from disk (which is mitigated by read ahead rather than parallelism) and returning rows to the client. The query uses relatively few CPU cycles and, in fact, would probably run slower if we parallelized it.
If we add a fairly selective predicate to the query, we can get a parallel plan:
By running this query in parallel, we can distribute the cost of evaluating this predicate across multiple CPUs. (In this case, the predicate is so cheap that it probably does not make much difference whether or not we run in parallel.)
As I mentioned above, the parallel scan algorithm dynamically allocates pages to threads. We can see this in action. Consider this query which returns every row of the table:
select * from T where a % 2 = 0 or a % 2 = 1
The peculiar predicate confuses the optimizer which underestimates the cardinality and generates a parallel plan:
|--Table Scan(OBJECT:([T]), WHERE:([T].[a]%(2)=(0) OR [T].[a]%(2)=(1)))
On SQL Server 2005, using “SET STATISTICS XML ON” we can see exactly how many rows each thread processes. Here is an excerpt of the XML output on a two processor system:
We can see that both threads (threads 1 and 2) processed approximately half of the rows. (Thread 0 is the coordinator or main thread. It only executes the portion of the query plan above the topmost exchange. Thus, we do not expect it to process any rows for any parallel operators.)
Now let’s repeat the experiment, but let’s run an expensive serial query at the same time. This cross join query will run for a really long time (it needs to process one trillion rows) and use plenty of CPU cycles:
select min(T1.a + T2.a) from T T1 cross join T T2 option(maxdop 1)
This serial query will consume cycles from only one of the two CPUs. While it is running, let’s run the other query again:
This time thread 1 processed more than 90% of the rows while thread 2 which was busy executing the above serial plan processed far fewer rows. The parallel scan automatically balanced the work across the two threads. Since thread 1 had more free cycles (it wasn’t competing with the serial plan), it requested and scanned more pages.
If you try this experiment, don’t forget to kill the serial query when you are done! Otherwise, it will continue to run and waste cycles for a really long time.
The same load balancing that we just observed applies equally whether a thread is slowed down because of an external factor (such as the serial query in this example) or because of an internal factor. For example, if it costs more to process some rows than others, we will see the same behavior.