Optimizing BW Query Performance

Iron Contributor

Mar 13, 2019

First published on MSDN on Mar 19, 2013 by Martin Merdes.
This blog post might be outdated by now. The latest documentation regarding SQL Server Columnstore on SAP is available in https://www.sap.com/documents/2019/04/023e5928-487d-0010-87a3-c30de2ffd8ff.html and in https://launchpad.support.sap.com/#/notes/2116639
The performance impact of SQL Server Columnstore is described in https://techcommunity.microsoft.com/t5/Running-SAP-Applications-on-the/Performance-evolution-of-SAP-BW-on-SQL-Server/ba-p/368073

SAP BW query performance depends on many factors: hardware, database configuration, BW configuration and last but not least BW cube design and BW query design. When running a query, there are several caches involved: disk caches, database caches and the BW OLAP cache. You should keep this in mind when comparing BW query performance of different configurations or even different systems. In the following we are discussing the configuration options, which are specific for Microsoft SQL Server.

Prerequisites

First of all you have to make sure that there is no bottleneck in the most important system resources: CPU, memory and I/O. You can configure the maximum number of CPU threads, which are used for a single database query (see below). However, SQL Server may reduce the actual number of used threads, if there are not enough free worker threads. Therefore the runtime of the same query can vary greatly, dependent on the current system load. A memory bottleneck on SQL Server may result in additional I/O. When there are sufficient CPU and memory resources, then repeatedly running queries are fully cached. In this case the performance of the I/O system is not crucial.

A huge part of the overall runtime of a BW query can be consumed on the SAP application server, not on the database server. Therefore the system resources on the application server are important, too. A simple BW query typically consists of 3 parts:

A database query running against the f-fact table of the cube

A parallel running database query against the e-fact table of the cube

An aggregation of the two result sets running in the OLAP processor on the application server (this process has nothing to do with "BW Aggregates", see below)

If you have never run BW cube compression (see below), then all data is in the f-fact table. If you do run BW cube compression after each data load, then all data is in the e-fact table. In both cases, there is no need to aggregate the two results sets, which reduces the BW query runtime on the application server.

In SAP BW data is typically loaded into cubes using BW process chains. These chains contain steps for dropping/re-creating indexes and updating the database statistics (for small dimension tables). If one of these steps fails, then you may clearly see a BW query performance issue.

1. Number of requests in f-fact table

The f-fact table of a cube is optimized for data load (DTP: Data Transfer Process), request deletion and BW cube compression. On SQL Server, the f-fact table is partitioned by the data load request ID (as it is the case for most other database systems supported by SAP BW). Having a separate partition per request ID, the request deletion is very fast and does not consume considerable transaction log space. The BW cube compression also benefits from this, because part of the BW cube compression is the deletion of the compressed requests. However, BW query performance does not benefit from this kind of partitioning. Quite the contrary, having more than a few hundred partitions results in a decreased BW query performance. We ran a suite of queries, running against f-fact tables with the same data but different number of partitions:

Number of partitions on f-fact table	10	100	200	500
cold (empty SQL data cache)	100%	125%	152%	202%
warm (table already in data cache)	100%	110%	135%	142%

Average query runtime compared with 10 partitions (in percent, lower number is better)

Since the partitions are created and dropped automatically on the f-fact table, a BW administrator has no direct influence on the number of partitions. However, there are two ways to reduce the number of partitions. First of all, you should keep the number of data loads (DTPs) into a cube as low as possible by avoiding small request. You can combine many small requests to a single, large one by loading the requests first into a Data Store Object (DSO). The following DTP from the DSO to the cube creates then fewer but larger requests in the cube.

Secondly, you can perform BW cube compression, which reduces the number of partitions again. For best query performance you should compress all requests anyway (which deletes all rows in the f-fact table). BW query performance is still good, if just most of the requests are compressed. Some customers keep new requests for at least one week uncompressed in the f-fact table. Thereby they can easily delete faulty requests, which were loaded into the cube by mistake.

2. BW cube compression

The e-fact table of a cube is optimized for query performance. The process "BW cube compression" moves (aggregates) single requests from the f-fact table to the e-fact table. Dependent on the kind of data, the total number of rows can dramatically be reduced by this aggregation. SAP generally recommends BW cube compression for performance reasons, see http://help.sap.com/saphelp_nw73/helpdata/en/4a/8f4e8463dd3891e10000000a42189c/content.htm . BW cube compression has further advantages for inventory cubes. As a side effect, it reduces the number of partitions on the f-fact tables.

For Microsoft SQL Server, we did not always benefit from BW cube compression in the past. Reason was the index layout using a heap for the e-fact table when having conventional b-tree indexes.

However, when using SQL Server 2012 column-store index, we strongly benefit from BW cube compression for BW query performance. The process of cube compression became much faster, although it contains an additional step: It fully re-organizes the column-store index. Since the creation of a column-store index scales up very well, we use 8 CPU threads for this by default. You can change the default by setting the RSADMIN parameter MSS_MAXDOP_INDEXING using report SAP_RSADMIN_MAINTAIN.

3. SQL Server compression

In typical customer scenarios, with sufficient CPU resources and a fast I/O system, the performance impact of using SQL Server PAGE compression is minimal. When CPU is a bottleneck, PAGE compression may result in reduced query performance. On the other hand, query performance increases by PAGE compression, if I/O is a bottleneck. Therefore you should simply keep SAP's default: PAGE compression.

In our suite of test queries we have clearly seen this expected behavior of PAGE compression. When the table already was in the SQL Server data cache, then the increased CPU usage resulted in slightly slower BW queries. On the other hand, the query performance was better, if the data head to be read from disk first.

SQL Server compression	NONE	ROW	PAGE
cold (empty SQL data cache)	100%	76%	75%
warm (table already in data cache)	100%	101%	110%

Average query runtime compared with NONE compressed f-fact table (in percent, lower number is better)

The disk space savings of ROW compression was as expected. The additional space savings of PAGE compression was only moderate, because the f-fact table only contains numeric fields. Best compression ratios have been seen with string fields. A large number of partitions also results in an increased space usage per table.

Average disk space usage compared with NONE compressed f-fact table, 100 partitions
(in percent, lower number is better)

4. Degree of parallelism

Per default, SAP BW requests two SQL Server threads per query (by using a MaxDop hint). You can change this default behavior by setting the RSADMIN parameter MSS_MAXDOP_QUERY using report SAP_RSADMIN_MAINTAIN.

We measured the performance when running BW queries on BW compressed cubes with column-store index, dependent on the used degree of parallelism. Two suites of queries were running against two different cubes. The biggest improvement was seen when moving from 1 CPU thread to 2 threads (This is typically only the case for column-store indexes, not for conventional b-tree indexes). Increasing to 4 threads improved performance noticeably. A further increase to 8 threads did not have any impact in many cases:

MaxDop	1	2	4	8
Cube 1 (100,000,000 rows)	1.00	8.13	12.26	15.61
Cube 2 (10,000,000 rows)	1.00	4.11	4.62	4.39

Average speed increase of column-store indexes compared with Maxdop 1
(in factors, higher number is better)

The impact of MaxDop depends on many factors like cube design, query design, actual data and used hardware. However, in all cases we have cleary seen the negative impact when using only a single CPU thread per query. That is why you should never set MSS_MAXDOP_QUERY to 1.

When there is a (temporary) CPU bottleneck, SQL Server can reduce the actual number of used threads for a query. In extreme cases, this could end up in one thread, even when MSS_MAXDOP_QUERY is set to 4 or higher. In particular when running BW queries against multi-providers you can run into this issue. A BW multi-provider is a logical cube, which retrieves the data from multiple basis cubes at the same point in time. This results in many simultaneously running SQL queries: 2 SQL queries (one on the f-fact table and one on the e-fact table) per basis cube. When using MaxDop 4 for a multi-provider consisting of 4 basis cubes, SQL Server may need up to 32 threads (2 tables * 4 cubes * MaxDop 4 = 32 threads). On a database server with less than 32 CPU threads this can result in actually using MaxDop 1 for at least one SQL query. Keep in mind, that the response time of a BW query is determined by the slowest participating SQL query.

That is why the default value of MSS_MAXDOP_QUERY (=2) was chosen relatively low. On a database server with more than 64 CPU threads, you may increase MSS_MAXDOP_QUERY to 3 or 4 (dependent on your workload while BW queries are running).

For best BW query performance, one should avoid running BW queries during high workload. For example, BW cube compression can be very CPU intensive, since it includes the re-creation of the column-store index using 8 CPU threads by default.

5. SQL Server 2012 Column-store

Using SQL Server 2012 column-store results in 50% to 80% space savings on e-fact tables, even compared with SQL Server PAGE compression. Customers report an increased query performance of factor 3 to 7 in average (compared with conventional b-tree indexes on the same system). Some BW queries hardly benefit from the column-store, while a few queries are even 20 times faster. However, for SAP BW and SQL Server 2012, the column-store is only available for the e-fact table. Therefore you must perform BW cube compression, if you want to benefit from the column-store.

A column-store index is optimized for large tables having some million rows. It is described in detail in a white paper ( http://scn.sap.com/docs/DOC-33129 ) and in https://service.sap.com/sap/support/notes/1771177 . Cubes with small fact tables, having only a few thousand rows, do hardly benefit from the column-store. However, the query performance of such small cubes is typically not an issue. Therefore we recommend using the column-store for all of your BW cubes, independent of their (current) size. Keep in mind that BW cubes are typically growing by and by.

6. Partitioning of e-fact table

In SAP BW you can optionally (and manually) partition the e-fact table by a time characteristics. In contrast to the partitioning of the f-fact table, this partitioning is intended to increase query performance. In the past we have seen mixed results when testing the impact of partitioning for BW query performance on SQL Server. Therefore it was typically not worth to re-partition a cube. However, when using SQL Server column-store, we benefit from partitioning of large e-fact tables.

Our test suite running on a 100,000,000 row cube showed the following results with MaxDop 4:

Index type	b-tree	column-store
Partition type	non-part.	partitioned	not-part.	partitioned
cold (empty SQL buffer cache)	1.00	1.06	4.77	9.83
warm (data in cache)	1.00	0.92	6.78	7.64
cold (empty SQL buffer cache)			1.00	2.03
warm (data in cache)			1.00	1.14

Average speed increase (in factors, higher number is better )
compared with b-trees, non-partitioned (lines 1 and 2)
and compared with column-store, non-partitioned (lines 3 and 4)

For SQL Server column-store indexes we see consistent query performance improvements when using partitioning of the e-fact table. The performance improvements compared with non-partitioned column-store are moderate (factor 1.14 if data is already in cache). However, this is an additional performance increase compared with conventional b-trees. For example, when a non-partitioned column-store index was 6.78 times faster than a b-tree index, then the partitioned column-store index was 7.64 times faster.

A column-store index is optimized for large tables having some million rows. Internally, a column-store index is divided into segments. Each segment contains up to one million rows. When using 8 CPU threads for creating the column-store index, you typically see 8 segments (per column), which are not fully filled with one million rows. When using partitioning for small tables, you further decrease the average segment size of the column-store index. Having too small segments decreases the query performance. Therefore you should consider partitioning only for tables with at least some dozen of million rows.

To fully benefit from the performance improvements of the partitioned column-store, you have to apply the latest version of SAP note 1771177 first (which will be released in April 2013). Then you should re-create the indexes of existing, partitioned cubes. The new code improvements optimize column-store segment elimination, in addition to having partitions. Therefore, you have a performance benefit on BW cubes (containing a filter on a time characteristics), even when creating only a single partition.

7. BW Aggregates

A BW aggregate is neither a configuration option nor is it SQL Server specific. However, we want to discuss aggregates here, since they were the preferred means to increase BW query performance in the past. A BW aggregate is a copy of an existing BW basis cube with a restricted number of characteristics and/or applied filters. BW aggregates are optimized for one or a few BW queries. Therefore you typically have to create many aggregates in order to support all BW queries running against a single basis cube. Technically, a BW aggregate looks like a conventional cube. It has two fact tables, each of them having its own database indexes. Since BW aggregates are logical copies of the cube, they have to be manually loaded and compressed each time data is loaded into the basis cube.

On the contrary, a column-store index on a basis cube is maintained automatically. The size of the cube is decreasing when using the column-store, not increasing by creating additional copies of the cube. There is no need to create new aggregates to support new or ad-hoc BW queries when using the column-store. Therefore the column-store is the new preferred means to increase BW query performance. Once you define a column-store index on a BW cube for Microsoft SQL Server, all existing aggregates of this cube are deactivated. This is done due to the fact that the BW OLAP processor is not aware of the column-store. Therefore using an aggregate (which never has a column-store index) is avoided, once the basis cube has a column-store index.

Comparing BW query performance

The easiest way to measure the runtime of a BW query is using SAP transaction RSRT. The runtime of a BW query consist of many components: SQL Query runtime, BW OLAP processor runtime, time for transferring the results set to the BW application server and the BW client (for example BEx or BOBJ) and the rendering of the results on the client.

When comparing BW query runtime between different systems or different configurations you have to ensure, that you measure under the same conditions. For example, when running a SQL query (as part of a BW query), SQL Server may or may not perform physical I/O. By repeating the same BW query you can ensure that all data is already in the SQL Server buffer cache. Alternatively you could run the SQL command DBCC DROPCLEANBUFFERS to ensure that no data is in the buffer cache.

If you want to measure the database performance, then you should turn off the SAP OLAP cache. Otherwise you typically measure the cache performance (single threaded performance of SAP application server) rather than the database performance. In the worst case, the database is not accessed at all, if the OLAP cache can be used. You can turn off the OLAP cache on BW cube level using "InfoProvider Properties" in SAP transaction RSA1:

Alternatively you can turn off the OLAP cache on BW query level using SAP transaction RSRT. However, in productive customer systems the OLAP cache is typically turned on. So why should you turn it off for performance tests? There are two reasons for this: Firstly, the likelihood of a fully filled OLAP cache is in a test environment much higher than in a productive system. Therefore you would benefit much more from the OLAP cache in a test system, which results in unrealistic measurements. Secondly, you typically want to tune the slowest BW queries running under the worst conditions (when the OLAP cache does not contain fitting entries by chance).

Some BW queries are independent from database performance by nature. When there is a big result set with millions of rows, then a huge part of the runtime is consumed for transferring the result set from the database server to the SAP application server, and finally to the BW client. In this case you are measuring the network throughput rather than the database performance.