In this blog, we discuss different strategies for archiving data within a database. These strategies are independent of and not related to high availability and disaster recovery strategies.
Queries on very large tables can cause performance issues, both for the individual query and for other users, depending on the available indexes, whether statistics on those indexes are up to date, and the plan selected. To mitigate these potential issues, removing stale or unused data is recommended. Removing data can be accomplished in several ways, depending on whether the data needs to be kept and whether ongoing access is required.
Removing Old Data
Deleting old data is one way of improving the performance of large tables, but generally there is some data retention policy that prevents this. Even a simple delete operation from a very large table can cause significant performance issues, so the recommended way of mitigating this issue is to use table and index partitioning.
If you need to delete data from a non-partitioned table, the best approach is to delete in smaller batches of rows, so the delete operation does not escalate to a table lock. If you are deleting historical records, the data inconsistencies created by these partial deletes need to be managed, unless there is some guarantee that the data is not being accessed. Depending on the tool you are using, a delete can be turned into a move of data using the DELETE command's OUTPUT clause. Updating statistics at the end of a large delete operation is also recommended – for large tables, our recommendation is to not specify the sampling rate for the statistics update.
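A minimal sketch of this batched delete-and-move pattern, assuming hypothetical tables dbo.Orders and dbo.OrdersArchive with matching schemas (lock escalation is typically triggered at around 5,000 locks, hence a batch size comfortably below that):

```sql
-- Batched delete: repeat until no rows remain, keeping each batch
-- small enough to avoid escalation to a table lock.
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (4000) FROM dbo.Orders
    OUTPUT deleted.*            -- OUTPUT turns the delete into a move
    INTO dbo.OrdersArchive      -- archive table with the same schema
    WHERE OrderDate < '2018-01-01';
    SET @rows = @@ROWCOUNT;
END;

-- Refresh statistics afterwards; for large tables, let SQL Server
-- choose the sampling rate rather than specifying one.
UPDATE STATISTICS dbo.Orders;
```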
A delete operation localized to a partition will be much faster than deleting from a large unpartitioned table. As a side effect of partitioning, you may also get performance benefits in certain queries via partition elimination, which isolates queries to only specific partitions; since the query operates on a smaller part of the table, it should run faster. If you align your archiving and partitioning strategies, you can make the removal of a whole table partition a metadata-only operation, which completes almost instantaneously. Azure SQL DB does not allow the addition of filegroups but does support partitioning on the primary filegroup. Be aware that a single filegroup does limit the flexibility possible with partitioning.
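Two ways to remove a whole partition as a metadata operation, sketched against a hypothetical dbo.Orders table (partition numbers are assumptions; the staging table must match the source's schema and indexes):

```sql
-- SQL Server 2016+ : truncate just the oldest partition.
TRUNCATE TABLE dbo.Orders WITH (PARTITIONS (1));

-- Alternatively, switch the partition out to a staging table on the
-- same filegroup, then archive or drop the staging table at leisure.
ALTER TABLE dbo.Orders SWITCH PARTITION 1 TO dbo.OrdersStage;
```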
Using the various forms of compression does not reduce the number of rows in a table, but it can improve database performance, because the row density of database pages increases and therefore fewer pages need to be read to satisfy a query. Note that the reduced size comes at the cost of increased CPU usage. In most SQL Server systems, CPU consumption is not the bottleneck, so turning on compression generally has a net positive effect. Still, it is best to test the performance impact of compression before enabling it in a production system, since every workload reacts differently to the overhead of compression and decompression.
There are three forms of compression: row, page, and clustered columnstore (CCI). Row and page compression settings can differ between partitions of the same table, while for unpartitioned tables the setting applies to the whole heap or clustered index. Note that for row compression, certain data types are not compressed, and off-row column values are never compressed. Page compression uses a few additional techniques to compress data at the page level and will likely give you better compression. In addition to these forms of compression, with some extra T-SQL code, large character and binary column values can be stored gzip-compressed using the COMPRESS function and retrieved using the DECOMPRESS function.
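A short sketch of both techniques, assuming hypothetical dbo.Orders and dbo.Documents tables:

```sql
-- Page-compress only an older partition of a partitioned table.
ALTER TABLE dbo.Orders REBUILD PARTITION = 1
    WITH (DATA_COMPRESSION = PAGE);

-- Store a large text value gzip-compressed with COMPRESS; DECOMPRESS
-- returns varbinary, so cast to recover the original text.
INSERT INTO dbo.Documents (DocID, Body)
VALUES (1, COMPRESS(N'... large document text ...'));

SELECT CAST(DECOMPRESS(Body) AS nvarchar(max)) AS Body
FROM dbo.Documents
WHERE DocID = 1;
```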
Clustered columnstore indexes compress the data by column; since the values in a column come from the same domain, the resulting compression is generally even better. There is an additional level of columnstore compression called archival compression, which achieves even better compression rates at the cost of additional CPU usage at compression and access time. Note that not all column data types are supported in a CCI, so this may be a limiting factor. The compression you achieve in a CCI also depends on how you build the CCI or load the data, since rowgroups are compressed in groups of up to roughly one million rows (1,048,576). Be aware of this and check your segment density/quality, and keep in mind how CCIs interact with any table partitioning scheme, including the minimum number of rows per partition.
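A sketch of building a CCI with archival compression and then checking rowgroup quality, assuming a hypothetical dbo.OrdersArchive table:

```sql
-- Clustered columnstore index with archival compression; the setting
-- can also be applied per partition on a partitioned table.
CREATE CLUSTERED COLUMNSTORE INDEX cci_OrdersArchive
ON dbo.OrdersArchive
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

-- Inspect rowgroup density/quality after loading: small or trimmed
-- rowgroups indicate a suboptimal load pattern.
SELECT partition_number, state_desc, total_rows, trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.OrdersArchive');
```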
For the SQL Server box product (on premises or in a VM), creating a tiered storage structure for tables is possible and should provide some cost savings. Using a separate filegroup, you can create one or more files on slower/cheaper storage and reduce storage costs for older data by explicitly placing that data on the filegroup. In-row data for a table is stored in one filegroup, but large objects (e.g., varchar(max)) and each index can be stored on different filegroups. Recall that the clustered index of a table contains the actual data. Although you can put indexes and data on filegroups with different performance characteristics, doing so is likely a bad idea, because indexes are updated on every insert, update, and delete operation. If large object column values are not accessed frequently, putting them on cheaper storage would likely result in cost savings.
Using tiered storage within a single table requires moving specific rows to a different filegroup, which means creating a new table (say, "archive") on the other filegroup with the same schema as the source table and moving data from the base table to the archive table. If necessary, you can even put a view over both tables to present a combined, single table to users. The archive table can also use a different indexing mechanism, e.g. a CCI, to improve compression and access performance and compensate for slower storage.
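This archive-table-plus-view pattern can be sketched as follows; the table names, columns, cutoff date, and the FG_SLOW filegroup are all assumptions:

```sql
-- Archive table on a filegroup backed by cheaper storage.
CREATE TABLE dbo.OrdersArchive (
    OrderID   bigint NOT NULL,
    OrderDate date   NOT NULL,
    Amount    money  NOT NULL
) ON FG_SLOW;

-- A view presenting both tables as one; the disjoint WHERE predicates
-- let the optimizer skip the table that cannot hold the requested rows.
CREATE VIEW dbo.OrdersAll AS
SELECT OrderID, OrderDate, Amount
FROM dbo.Orders
WHERE OrderDate >= '2022-01-01'
UNION ALL
SELECT OrderID, OrderDate, Amount
FROM dbo.OrdersArchive
WHERE OrderDate < '2022-01-01';
```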
If you are partitioning your table, you can explicitly put different partitions on different filegroups, tiering storage so that older data sits on slower, cheaper disks. When using managed disks with SQL Server on IaaS in Azure, disks can easily be converted from Premium SSD to Standard SSD or even HDD. This does require some maintenance, since moving older partitions from faster storage to slower storage is not done automatically. This page and the related page have descriptions of doing exactly this.
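A minimal sketch of a partition function and scheme that maps older partitions to a cheaper filegroup; the yearly boundaries and the FG_SLOW/FG_FAST filegroup names are assumptions:

```sql
-- Yearly partition boundaries on the partitioning column.
CREATE PARTITION FUNCTION pfOrderYear (date)
AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');

-- Map the two oldest partitions to slow storage, recent ones to fast.
CREATE PARTITION SCHEME psOrderYear
AS PARTITION pfOrderYear TO (FG_SLOW, FG_SLOW, FG_FAST, FG_FAST);

-- Create the table on the partition scheme.
CREATE TABLE dbo.Orders (
    OrderID   bigint NOT NULL,
    OrderDate date   NOT NULL,
    Amount    money  NOT NULL
) ON psOrderYear (OrderDate);
```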
With Azure SQL DB and MI, if you need to retain data in a database but do not need continuous access to it, you can use serverless (SQL DB) or start/stop (MI, in preview) to save compute costs by pausing the instance. Note that you still pay for storage.
For the SQL Server box product running on a virtual machine, either on premises or in the cloud, you can pause the virtual machine to save compute costs.
Cold archiving involves extracting the data from the table and storing it on inexpensive archival media or in Azure Data Lake Storage (ADLS). Since you are extracting data and then deleting it, the most efficient way to do this is using partitions, as described above. There are many different ways to extract data and store it in text and other format files, including:
For the SQL Server box product - a simple strategy is to copy the data to be archived into another database, detach that database's data and log files, and copy them to inexpensive storage such as ADLS. Accessing the data again requires re-attaching the files to a SQL Server instance (which could be a SQL Server running in an Azure VM).
If PolyBase is enabled on the SQL Server box product, it is possible to write data to external tables and thereby write directly into alternative file formats. Data virtualization on MI does not yet have the capability to write to an external table.
Bcp the data to a file, using a query to generate the result set. Depending on the format you write (e.g., native -n format), you will need to load the data back into SQL Server to read it.
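Sketches of both variants; the server, database, and query are placeholders, and -T assumes Windows integrated authentication:

```shell
# Native (-n) format: compact, but must be loaded back into
# SQL Server to be read.
bcp "SELECT * FROM dbo.Orders WHERE OrderDate < '2018-01-01'" \
    queryout orders_archive.dat -n -S myserver -d MyDb -T

# Character (-c) format: a plain text file readable by other tools.
bcp "SELECT * FROM dbo.Orders WHERE OrderDate < '2018-01-01'" \
    queryout orders_archive.txt -c -S myserver -d MyDb -T
```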
SqlPackage can extract both the schema and the data for a whole database (either SQL DB or SQL Server box running on premises or in the cloud).
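A sketch of a whole-database export to a .bacpac file (schema plus data); the server and database names are placeholders:

```shell
sqlpackage /Action:Export \
    /SourceServerName:myserver.database.windows.net \
    /SourceDatabaseName:MyDb \
    /TargetFile:MyDb.bacpac
```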
SSIS can extract specific data and write it to several file formats. Since SSIS is included with SQL Server box, this strategy makes the most sense with SQL Server on premises or in a cloud based VM.
ADF has the ability to extract data from SQL Server and write it into Azure Data Lake directly, and since it is a cloud service, it makes the most sense for archiving data from any cloud-based SQL Server. Storing the data in Parquet (or ORC) format takes advantage of column-level compression, while Avro is a row-based format that is a bit easier to use for row-oriented datasets.
Accessing Data in Cold Storage
Data stored in ADLS in certain file formats (csv, parquet, orc) can be accessed as external tables by the SQL Server box product using PolyBase. Azure Synapse can also use external tables to access files in ADLS.
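A sketch of a PolyBase external table over Parquet files in ADLS; the storage account, container, paths, credential, and table schema are all assumptions (older SQL Server versions additionally require TYPE = HADOOP on the data source):

```sql
-- External data source pointing at an ADLS container.
CREATE EXTERNAL DATA SOURCE AdlsArchive
WITH (LOCATION = 'abfss://archive@mystorage.dfs.core.windows.net',
      CREDENTIAL = AdlsCredential);

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- External table over the archived files; queries read from ADLS.
CREATE EXTERNAL TABLE dbo.OrdersCold (
    OrderID   bigint,
    OrderDate date,
    Amount    money
)
WITH (LOCATION = '/orders/',
      DATA_SOURCE = AdlsArchive,
      FILE_FORMAT = ParquetFormat);

SELECT COUNT(*) FROM dbo.OrdersCold;
```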
For both Azure SQL DB and MI, an external data source can be used to access files in blob storage/ADLS, but only to bulk load them back into a database (not to create an external table, although this feature is coming to MI).
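A sketch of bulk loading an archived file back in; the storage account, container, file name, and target table are assumptions, and the data source would normally carry a credential:

```sql
-- Blob storage data source usable by BULK INSERT on SQL DB and MI.
CREATE EXTERNAL DATA SOURCE BlobArchive
WITH (TYPE = BLOB_STORAGE,
      LOCATION = 'https://mystorage.blob.core.windows.net/archive');

-- Load an archived csv back into a table (FIRSTROW = 2 skips a
-- header row).
BULK INSERT dbo.OrdersRestored
FROM 'orders_2018.csv'
WITH (DATA_SOURCE = 'BlobArchive', FORMAT = 'CSV', FIRSTROW = 2);
```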
Another strategy for exposing ADLS data to MI might be to create an Azure SQL virtual machine, use PolyBase there to expose the data to the SQL Server engine, and use the linked server capability from MI to connect to the SQL VM.