Real-Time Operational Analytics: Simple example using nonclustered clustered columnstore index (NCCI) in SQL Serve...

Sunil_Agarwal · ‎Mar 23 2019

First published on MSDN on Feb 29, 2016
The previous blog https://blogs.msdn.microsoft.com/sqlserverstorageengine/2016/02/29/real-time-operational-analyti... described how you can use updateable nonclustered columnstore index (NCCI) for real-time operational analytics. Let us now take an example of how NCCI can be used on a regular rowstore table in the context of an Order Management application. The key table for this application is ‘orders’ that tracks customer information, purchase price and the order status. It is expected that over time a large number of rows will have the order status as ‘Order Received’ (i.e. the order has been received by the customer. We don’t expect any more changes to the order once it has been received by the customer except when customer is not happy with the purchase and wants to return it. So for all practical purposes, once the order has been received by the customer, it can be considered that this row is now ‘cold’.


   

   create table orders (
   

   AccountKey int not null,
   

   customername nvarchar (50),
   

   OrderNumber bigint,
   

   PurchasePrice decimal (9,2),
   

   OrderStatus smallint not NULL,
   

   OrderStatusDesc nvarchar (50))

-- OrderStatusDesc
-- 0 => 'Order Started'
-- 1 => 'Order Closed'
-- 2 => 'Order Paid'
-- 3 => 'Order Fullfillment Wait'
-- 4 => 'Order Shipped'
-- 5 => 'Order Received'

Let us also create a clustered index on ‘OrderStatus’


   

   create clustered index orders_ci on orders(OrderStatus)

Now we load 3 million rows with the data pattern that 95% of the orders have already been received by the customer.


   

   -- insert into the main table load 3 million rows
   

   declare @outerloop int = 0
   

   declare @i int = 0
   

   declare @purchaseprice decimal (9,2)
   

   declare @customername nvarchar (50)
   

   declare @accountkey int
   

   declare @orderstatus smallint
   

   declare @orderstatusdesc nvarchar(50)
   

   declare @ordernumber bigint
   

   while (@outerloop < 3000000)
   

   begin
   

   Select @i = 0
   

   begin tran
   

   while (@i < 2000)
   

   begin
   

   set @ordernumber = @outerloop + @i
   

   set @purchaseprice = rand() * 1000.0
   

   set @accountkey = convert (int, RAND ()*1000)
   

   set @orderstatus = convert (smallint, RAND()*100)
   

   if (@orderstatus >= 5) set @orderstatus = 5

set @orderstatusdesc =
case @orderstatus
WHEN 0 THEN 'Order Started'
WHEN 1 THEN 'Order Closed'
WHEN 2 THEN 'Order Paid'
WHEN 3 THEN 'Order Fullfillment'
WHEN 4 THEN 'Order Shipped'
WHEN 5 THEN 'Order Received'
END
insert orders values (@accountkey,
(convert(varchar(6), @accountkey) + 'firstname'),@ordernumber, @purchaseprice,@orderstatus, @orderstatusdesc)

set @i += 1;
end
commit

set @outerloop = @outerloop + 2000
set @i = 0
end
go

Now let us create a nonclustered columnstore index as follows. Note, that it is just a DDL operation and similar to any other btree index that you would create on a rowstore table. No changes to the application needed. Unlike in earlier releases of SQL Server, the NCCI is updateable in SQL Server 2016 so your transaction workload will continue to run.


   

   --create NCCI
   

   CREATE NONCLUSTERED COLUMNSTORE INDEX orders_ncci ON orders  (accountkey, customername, purchaseprice, orderstatus, orderstatusdesc)

Let us load another 200k orders but these orders are not ‘yet’ received by the customers. Here is the script that I used


   

   --insert additional 200k rows
   

   declare @outerloop int = 3000000
   

   declare @i int = 0
   

   declare @purchaseprice decimal (9,2)
   

   declare @customername nvarchar (50)
   

   declare @accountkey int
   

   declare @orderstatus smallint
   

   declare @orderstatusdesc nvarchar(50)
   

   declare @ordernumber bigint
   

   while (@outerloop < 3200000)
   

   begin
   

   Select @i = 0
   

   begin tran
   

   while (@i < 2000)
   

   begin
   

   set @ordernumber = @outerloop + @i
   

   set @purchaseprice = rand() * 1000.0
   

   set @accountkey = convert (int, RAND ()*1000)
   

   set @orderstatus = convert (smallint, RAND()*5)
   

   set @orderstatusdesc =
   

   case @orderstatus
   

   WHEN 0 THEN 'Order Started'
   

   WHEN 1 THEN 'Order Closed'
   

   WHEN 2 THEN 'Order Paid'
   

   WHEN 3 THEN 'Order Fullfillment'
   

   WHEN 4 THEN 'Order Shipped'
   

   WHEN 5 THEN 'Order Received'
   

   END
   

   insert orders values (@accountkey,(convert(varchar(6), @accountkey) + 'firstname'),
   

   @ordernumber, @purchaseprice, @orderstatus, @orderstatusdesc)
   

   set @i += 1;
   

   end
   

   commit
   

   set @outerloop = @outerloop + 2000
   

   set @i = 0
   

   end
   

   go

If we look at the details on the columnstore index, you will see that 3 million rows are compressed while the new 200k rows are in the delta rowgroup


   

   -- look at the rowgroups
   

   select object_name(object_id), index_id, row_group_id, delta_store_hobt_id, state_desc, total_rows, trim_reason_desc, transition_to_compressed_state_desc
   

   from sys.dm_db_column_store_row_group_physical_stats
   

   where object_id = object_id('orders')

We will now run couple of analytics queries and compare the performance if NCCI was not there. These queries are simple and are not representative of the common real-time operational scenario involving multi-table joins but the intent here is to show you that query optimizer picks NCCI for analytics queries and provides significant speed up over traditional btree indexes. Also note, that the difference in performance will only widen as you add more data.


   

   
    Example-1:
   
   --run simple query
   

   select max (PurchasePrice)
   

   from orders

The query plan is as follows showing that query optimizer chose NCCI.

When running the same query without using NCCI, the query took much longer to run (on my machine, it was 12x slower)


   

   -- run the query without using NCCI
   

   select max (PurchasePrice)
   

   from orders
   

   option (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX)

Example-2: Let us now consider a more complex query


   

   -- a more complex query
   

   select top 5 customername, sum (PurchasePrice), Avg (PurchasePrice)
   

   from orders
   

   where purchaseprice > 90.0 and OrderStatus=5
   

   group by customername

Here is the query plan for the query using NCCI. One thing to note the aggregate got pushed down to the scan node, More on this when we cover columnstore index performance improvements in SQL Server 2016

-- a more complex query without NCCI
select top 5 customername, sum (PurchasePrice), Avg (PurchasePrice)
from orders
where purchaseprice > 90.0 and OrderStatus = 5
group by customername
option (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX)

Here is the run time information for both queries. Note that the query with NCCI ran 25x faster and took 1/60th CPU resources

Key take aways from this blog are (1) you can enable real-time operational analytics without any changes to your transactional workload (2) SQL Query Optimizer will choose NCCI for analytics queries automatically based on the cost model. In the next blog, I will show you filtered NCCI that minimizes the impact on the OLTP workload but still provides real-time operational analytics

Thanks

Sunil Agarwal

Products (50)

Special Topics (27)

Video Hub (462)

Most Active Hubs

Most Active Hubs

Video Hub

Real-Time Operational Analytics: Simple example using nonclustered clustered columnstore index (NCCI) in SQL Serve...