SQL Server Blog

Data Loading performance considerations with Clustered Columnstore indexes

Microsoft

Dec 20, 2018

First published on MSDN on Mar 11, 2015

This article describes data loading strategies specific to tables with a Clustered Columnstore index on SQL Server 2014. For fundamental data loading strategies, the whitepaper Data Loading Performance Guide is an excellent read and is highly recommended. Though that whitepaper doesn’t cover Columnstore indexes, many of the concepts presented there still hold true for any data loading into SQL Server.

Overview of Loading Data into Columnstore Indexes

When you insert or load data into a table that has a Clustered Columnstore index, the data lands in one of two places:

The delta store
Compressed Columnstore

If you insert a small number of rows (< 102,400 rows) into a table with a Clustered Columnstore index, the rows end up in the “delta store”. The delta store is a row store. When a delta store rowgroup reaches 1,048,576 rows, it is marked as CLOSED, and a background process called the Tuple Mover encodes and compresses the data and stores it in columnar format.

If you are bulk loading data, whether the data lands in the delta store (row store) or directly in a compressed rowgroup depends on the batch size specified during the bulk insert. If the batch size is at least 102,400 rows, the data bypasses the delta store and is inserted directly into a compressed rowgroup in columnar format. This is true for BULK INSERT, BCP, or anything else that uses the Bulk API, and also for INSERT ... SELECT operations that insert a large amount of data, as long as a single batch inserts at least 102,400 rows.

The size of a rowgroup is capped at 1,048,576 rows. Ideally we want each compressed rowgroup to be as close to that upper limit as possible, because fuller rowgroups have the potential for better compression. An added benefit of higher-density rowgroups is that queries have fewer rowgroups to scan.
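As a rough illustration of the batch size threshold, the sketch below bulk loads one file with a batch size equal to the rowgroup cap. The table name dbo.FactSales, the file path, and the delimiters are hypothetical and are not taken from the tests in this article.

```sql
-- Hypothetical target table (with a clustered columnstore index)
-- and a hypothetical pipe-delimited input file.
BULK INSERT dbo.FactSales
FROM 'C:\load\sales_01.dat'
WITH (
    BATCHSIZE       = 1048576,  -- batches of at least 102,400 rows bypass the delta store
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n'
);
```

With a BATCHSIZE below 102,400, the same load would go through the delta store instead. Note that the final batch of a file can be smaller than the specified batch size, so a remainder of fewer than 102,400 rows would still be expected to land in the delta store.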

The rowgroup size is determined not just by the maximum limit but also by the following factors:

The dictionary size limit, which is 16 MB
The insert batch size specified
The partition scheme of the table, since a rowgroup doesn’t span partitions
Memory grants or memory pressure causing rowgroups to be trimmed
An index REORGANIZE being forced or an index REBUILD being run
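To see what rowgroup sizes a load actually produced, one option is to query the sys.column_store_row_groups catalog view. This is a minimal sketch; the table name dbo.FactSales is a hypothetical placeholder.

```sql
-- One row per rowgroup for the (hypothetical) table dbo.FactSales.
SELECT rg.partition_number,
       rg.row_group_id,
       rg.state_description,  -- e.g. OPEN or CLOSED (delta store), COMPRESSED
       rg.total_rows,         -- compare against the 1,048,576 cap
       rg.size_in_bytes
FROM sys.column_store_row_groups AS rg
WHERE rg.object_id = OBJECT_ID(N'dbo.FactSales')
ORDER BY rg.partition_number, rg.row_group_id;
```

Many COMPRESSED rowgroups with a total_rows value well below the cap suggest that one of the factors listed above is limiting the rowgroup size.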

Transaction Logging implications

In a data warehouse scenario, assuming that the recovery model is SIMPLE or BULK_LOGGED, trace flag 610 can have implications in certain cases, even with Columnstore indexes. There is a detailed explanation of trace flag 610 in the Data Loading Performance Guide, in the section I/O Impact of Minimal Logging Under Trace Flag 610.

The delta store is a page-compressed B-tree, so the minimal logging considerations still apply. Below is an example of a bulk load of 50,000 rows into a table. Because the number of rows is < 102,400, the data lands in the delta store. Logging of any inserts into the delta store is affected by trace flag 610.

Looking at the Log Records:

SELECT COUNT(*) AS CountLogRecords FROM fn_dblog(NULL, NULL);
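One way to compare logging with and without trace flag 610 is sketched below. It assumes a test database in the SIMPLE recovery model with no other activity, and it uses a hypothetical table (dbo.FactSales) and a hypothetical 50,000-row input file. fn_dblog is an undocumented function, so treat this as a test-system technique only.

```sql
-- Assumption: test database in SIMPLE recovery, no concurrent activity.
CHECKPOINT;              -- lets the inactive log be cleared so the count starts from a small baseline

DBCC TRACEON (610);      -- enable trace flag 610 for this session; skip this line for the comparison run

BULK INSERT dbo.FactSales            -- hypothetical table and file
FROM 'C:\load\sales_50k.dat'
WITH (BATCHSIZE = 50000, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

-- Count the log records and total log bytes generated since the checkpoint.
SELECT COUNT(*)                 AS CountLogRecords,
       SUM([Log Record Length]) AS LogBytes
FROM fn_dblog(NULL, NULL);

DBCC TRACEOFF (610);
```

Running the script once with the DBCC TRACEON line and once without it, against an identically prepared table, gives two counts that can be compared directly.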

Ordering Data at Initial Load Time

When loading data into a clustered Columnstore index, consideration should be given not only to data load performance but also to query performance. Unlike a B-tree clustered index, where rows are ordered, a clustered Columnstore index doesn’t order the data it loads. Rowgroups are compressed in the order in which the data is loaded, and no ordering is done after rowgroup compression.

Say, for example, we have a table in which one of the columns is MarketID, and a daily load that happens for all markets. If the input data is not ordered by MarketID, then queries that use MarketID as a GROUP BY column or predicate may have to scan all the segments. Because of how the data was inserted, rowgroups may not be able to be eliminated.
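A rough way to judge how well rowgroups could be eliminated for MarketID is to look at the per-segment minimum and maximum metadata in sys.column_store_segments. The sketch below assumes a hypothetical table dbo.FactSales, and it assumes that the segment column_id matches the table's column ID, which should hold for a clustered Columnstore index on a table without dropped columns.

```sql
-- Per-segment min/max metadata for the MarketID column of the
-- hypothetical table dbo.FactSales.
SELECT p.partition_number,
       s.segment_id,
       s.row_count,
       s.min_data_id,   -- for integer columns these usually reflect the value range;
       s.max_data_id    -- for dictionary-encoded columns they are dictionary IDs
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
  ON p.partition_id = s.partition_id
WHERE p.object_id = OBJECT_ID(N'dbo.FactSales')
  AND s.column_id = COLUMNPROPERTY(OBJECT_ID(N'dbo.FactSales'), N'MarketID', 'ColumnId')
ORDER BY p.partition_number, s.segment_id;
```

If nearly every segment shows the same wide min/max range, a predicate on MarketID cannot rule out many rowgroups. Narrow, mostly non-overlapping ranges indicate that the data was loaded in MarketID order.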

In a case such as this, for the initial data load into a clustered Columnstore table, it may be beneficial to sacrifice some load speed in favor of query speed by implementing the approach below:

Load data into a heap (can be loaded concurrently with TABLOCK)
Create a Clustered index on the required column (MarketID in this example)
Create a Clustered Columnstore index with the DROP_EXISTING = ON clause.

This will increase your load time but the trade-off is better rowgroup elimination and query execution time. This strategy can be used only on the initial load. Incremental data loads would benefit from the loaded data being ordered or partitioned by MarketID if the application or ETL process has the ability to sort the data prior to inserting or produce sorted data when reading from the data source.
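The three-step initial load described above can be sketched as follows. The table definition, index name, file path, and delimiters are hypothetical. MAXDOP = 1 on the final step is an assumption that is not part of the steps above; it is included because a single-threaded build is often suggested to keep the sorted rows from being interleaved across threads, at the cost of a longer build.

```sql
-- 1. Load into a heap (hypothetical table and columns).
CREATE TABLE dbo.FactSales (
    MarketID int           NOT NULL,
    SaleDate date          NOT NULL,
    Amount   decimal(18,2) NOT NULL
);

BULK INSERT dbo.FactSales
FROM 'C:\load\sales_01.dat'          -- hypothetical file; repeat per file, possibly concurrently
WITH (TABLOCK, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

-- 2. Order the data by building a rowstore clustered index on MarketID.
CREATE CLUSTERED INDEX CCI_FactSales
    ON dbo.FactSales (MarketID);

-- 3. Convert it in place. The index name must match the existing clustered index.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
    ON dbo.FactSales
    WITH (DROP_EXISTING = ON, MAXDOP = 1);  -- MAXDOP = 1 is an assumption, see the note above
```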

Data Loading Scenarios: Non-Partitioned table

In this scenario we will look at loading data into a non-partitioned table, and specifically at the impact of loading data streams in parallel and the effect of batch size on loading data.

Concurrent Loading

Loading in concurrent streams can increase the load throughput into a single table. When we load concurrently, multiple rowgroups are created, whether delta rowgroups or compressed rowgroups. Concurrent loading enables us to utilize multiple cores on the system, and in an ideal scenario with no other bottleneck you can spawn one bulk load per core. In reality, how many parallel streams to spawn depends on both the size of the data and the resources on the server, because at some point you will hit either a physical resource bottleneck (CPU, memory, or disk I/O) or a logical resource bottleneck, such as contention on a ROWGROUP_FLUSH lock, which can be induced by slower disk I/O on the log drive. If throughput stops scaling close to linearly as you increase the number of streams, you can use the Waits and Queues methodology alongside Performance Monitor to find your primary bottleneck. A good tool for analyzing the waits is SQL Nexus.
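As a starting point for a Waits and Queues analysis, the top waits on the instance can be read from sys.dm_os_wait_stats. This is a minimal sketch: the cutoff of 10 rows is arbitrary, and benign background waits are not filtered out.

```sql
-- Top waits by total wait time since the counters were last reset.
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms   -- time spent waiting for CPU after the resource became available
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0
ORDER BY wait_time_ms DESC;
```

The values are cumulative, so capturing the output before and after a load run and comparing the two snapshots gives a clearer picture of what the load itself waited on.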

The data below is from some tests run on a particular table on a particular server (see appendix). The objective of this is to demonstrate that parallel streams do improve throughput.

Data Input: 32 files for a total of 340 GB, 3 billion rows inserted into a single non-partitioned table. All 32 files are processed for each run; the number of concurrent streams indicates how many files are processed in parallel.

Table Type

Concurrent streams

Duration (hh:mm:ss)

DBSize_GB

Throughput (GB/hour)

AvgCPU

MaxCPU

CCI

2:05:01

164

135

CCI