This is the fifth post in a series detailing the capabilities of OLAP cubes in the Data Warehouse, a new feature in SCSM 2012 that enables self-service reporting.
OLAP Cube Processing
When a cube has been deployed and all its partitions have been created, it is ready to be processed so that it can be browsed. Processing a cube is the final step after ETL has been executed. The Data Flow Diagram below illustrates the entire process:
Figure: Data Flow Diagram illustrating how source data ultimately flows into an OLAP cube
Processing a cube means calculating all of its aggregations and loading the cube with those aggregations and data: the dimension and fact tables are read, and the resulting data is calculated and loaded into the cube.
Processing must be carefully thought through due to the potentially significant ramifications it could have in a production environment where millions of records may exist. A full process of all partitions in such an environment could take anywhere from days to weeks, rendering the system and cubes unusable to end users. One suggestion is to disable the processing schedule of any cubes that are not being used to reduce the overhead on the system.
Processing of cubes is broken down into two separate tasks:
1. Dimension processing
2. Partition processing
Each cube will have a processing job added in the DW UI and will be performed on a user-configurable schedule. Let’s take a look at how we will handle each type of processing task.
Whenever a new dimension has been added to the AS DB, a ProcessFull MUST be executed on the dimension to bring it to a fully processed state. Once a dimension has been processed, however, it is not guaranteed to be processed again when another cube that targets the same dimension is processed. This avoids unnecessary overhead: we would like to avoid reprocessing every dimension for every cube, especially if the dimension has recently been processed, since it is unlikely that new data exists that has not yet been processed. To optimize processing efficiency, there is a singleton class defined in the MP Microsoft.SystemCenter.Datawarehouse.OLAP.Base called Microsoft.SystemCenter.Warehouse.Dimension.ProcessingInterval:
<!-- This singleton class defines the minimum interval of time in minutes that must elapse before a shared dimension is reprocessed. -->
<ClassType ID="Microsoft.SystemCenter.Warehouse.Dimension.ProcessingInterval" Accessibility="Public" Abstract="false" Base="AdminItem!System.AdminItem" Singleton="true">
<Property ID="IntervalInMinutes" Type="int" Required="true" DefaultValue="60"/>
</ClassType>
This singleton class contains a property, "IntervalInMinutes", which describes how often to process a dimension; by default it is set to 60 minutes. For example, if a dimension was processed at 3:05 pm and another cube that targets the same dimension is processed at 3:45 pm, the dimension will not be reprocessed. One drawback to this approach, however, is the increased likelihood of dimension key errors. A retry mechanism is implemented to handle dimension key errors that will essentially reprocess the dimension and then the cube partition. For more information, please refer to the Processing Failures topic in the "Common issues in debugging/troubleshooting" section.
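The interval check described above can be sketched as follows. This is a minimal illustration of the rule, not the actual SCSM implementation; the function and variable names are hypothetical.

```python
from datetime import datetime, timedelta

# Matches the IntervalInMinutes default from the singleton class above.
DEFAULT_INTERVAL_MINUTES = 60

def should_reprocess_dimension(last_processed, now,
                               interval_minutes=DEFAULT_INTERVAL_MINUTES):
    """Return True if enough time has elapsed since the shared dimension
    was last processed to warrant reprocessing it."""
    return now - last_processed >= timedelta(minutes=interval_minutes)

# Example from the text: dimension processed at 3:05 pm,
# another cube targeting it runs at 3:45 pm.
last = datetime(2012, 1, 1, 15, 5)
now = datetime(2012, 1, 1, 15, 45)
print(should_reprocess_dimension(last, now))  # False: only 40 minutes elapsed
```

Only once the configured interval has fully elapsed would the dimension be reprocessed as part of the next cube's run.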
Once a dimension has been fully processed, incremental processing via ProcessUpdate will be executed. The only other time a ProcessFull will be executed on a dimension is after a dimension schema change, since that returns the dimension to an unprocessed state. Be aware that if a ProcessFull is performed on a dimension, all affected cubes (and their partitions) will subsequently be in an unprocessed state and will need to be fully processed on their next scheduled run.
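The dimension processing rules above reduce to a simple decision. The sketch below is illustrative only; the names are hypothetical and do not reflect the actual SCSM code.

```python
def choose_dimension_processing(is_new_dimension, schema_changed):
    """Pick the Analysis Services processing type for a dimension,
    following the rules described in the text."""
    if is_new_dimension or schema_changed:
        # A new or schema-changed dimension is in an unprocessed state and
        # requires a full process; this also leaves dependent cube
        # partitions unprocessed until their next scheduled run.
        return "ProcessFull"
    # Otherwise the dimension is already processed and can be
    # incrementally updated.
    return "ProcessUpdate"
```

For example, `choose_dimension_processing(False, True)` returns `"ProcessFull"`, reflecting that a schema change forces a full reprocess.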
Processing of partitions must be carefully thought out, because reprocessing a large partition is very slow and consumes significant CPU resources on the AS machine. Partition processing will generally take longer than dimension processing. Unlike dimension processing, however, processing a partition has no side effects on other objects. The only two types of processing performed on SCSM 2012 cubes are ProcessFull and ProcessAdd.
Similar to dimensions, the creation of any new partition in the cube requires a ProcessFull for the partition to be in a queryable state.
Since a ProcessFull is an expensive operation, the idea was to perform one only when necessary (i.e., on creation of a partition, or when a row has been updated). In scenarios where we know that rows have only been added and none have been updated, we can perform a ProcessAdd instead. To make this determination, watermarks and other metadata are leveraged; more specifically, the etl.CubePartition and etl.TablePartition tables are queried to determine which type of processing to perform. The figures below show some sample data from these tables:
Figure: Sample data from the etl.TablePartition table with InsertedBatchId and UpdatedBatchId columns used by cube watermarks
Figure: Sample data from the etl.CubePartition table with WatermarkBatchId
The following diagram illustrates how we determine which type of processing to perform based on the watermark data:
Note that when a ProcessAdd is performed, we will limit the scope of the query via the watermarks. For example, if the InsertedBatchId is 100 and the WatermarkBatchId is 50, the query will only load data from the datamart where the InsertedBatchId is greater than 50 and less than 100.
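Putting the watermark comparison and the scoped query together, the decision might be sketched like this. The column names follow the etl.TablePartition and etl.CubePartition samples above, but the function itself is a hypothetical illustration, not the actual SCSM logic.

```python
def choose_partition_processing(inserted_batch_id, updated_batch_id,
                                watermark_batch_id):
    """Decide how to process a cube partition from its batch watermarks.

    Returns a (processing_type, datamart_filter) pair, where the filter
    scopes a ProcessAdd to only the not-yet-loaded batches.
    """
    if updated_batch_id > watermark_batch_id:
        # Existing rows changed since the last processing run:
        # a full reprocess of the partition is required.
        return "ProcessFull", None
    if inserted_batch_id > watermark_batch_id:
        # Only new rows were added: incrementally load the batch range,
        # matching the example bounds given in the text.
        datamart_filter = (f"InsertedBatchId > {watermark_batch_id} "
                           f"AND InsertedBatchId < {inserted_batch_id}")
        return "ProcessAdd", datamart_filter
    # Nothing new since the watermark: no processing needed.
    return "None", None

action, flt = choose_partition_processing(100, 40, 50)
print(action)  # ProcessAdd
print(flt)     # InsertedBatchId > 50 AND InsertedBatchId < 100
```

With the sample values from the text (InsertedBatchId of 100 and WatermarkBatchId of 50, with no updates past the watermark), the partition gets a ProcessAdd scoped to the unloaded batches.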
Finally, it is important to note that at this point in time we do NOT support manual processing of the cubes via AS or BIDS. Processing cubes outside of the methods provided by SCSM 2012 (UI or cmdlets) will not update the watermark tables, so data integrity issues may arise. If someone accidentally processes a cube manually, a possible workaround is to manually unprocess the cube in the same manner; the next time SCSM processes the cube, it will automatically perform a ProcessFull, since the partitions will be in an unprocessed state. This will correctly update all watermarks and metadata, resolving any potential data integrity issues.