In this post, we’re challenging traditional approaches to SQL deployments and rethinking some long-standing beliefs. We propose a transformative approach inspired by modern DevOps tooling, such as Terraform we aim to align database deployments with the rest of the Software Delivery Lifecycle (SDLC).
Introduction
This is the next post in our series on CI/CD for SQL projects. In this post we will challenge some long held beliefs on how we should manage SQL Deployments. Traditionally we've always had this notion that we should never drop data in any environment. Deployments should almost extensively be done via SQL scripts and manually ran to ensure completion and to prevent any type of data loss. We will challenge this and propose a solution that falls more in line with other modern DevOps tooling and practices. If this sounds appealing to you then let's dive into it.
Why
We've always approached the data behind our applications as the differentiating factor when it comes to Intellectual Property (IP). No one wants to hear the words that we've lost data or that the data is unrecoverable. Let me be clear and throw a disclaimer on what I am going to propose, this is not a substitute for proper data management techniques to prevent data loss. Rather we are going to look at a way to thread the needle on keeping the data that we need while removing the data that we don't.
Shadow Data
We've all heard about "shadow IT", well what about "shadow data"? I think every developer has been there. For example, taking a backup of a table/database to ensure we don't inadvertently drop it during a deployment. Heck sometimes we may even go a step further and backup this up into a lower environment. The caveat is that we very rarely ever go back and clean up that backup. We've effectively created a snapshot of data which we kept for our own comfort. This copy is now in an ungoverned, unmanaged, and potentially in an insecure state. This issue then gets compounded if we have automated backups or restore to QA operations. Now we keep amplifying and spreading our shadow data.
Shouldn't we focus on improving the Software Delivery Lifecycle (SDLC), ensuring confidence in our data deployments? Let's take it a step further and shouldn't we invest in our data protection practice? Why should we be doing this when we have technology that backs up our SQL schema and databases?
Another consideration, what about those quick "hot fixes" that we applied in production. The ones where we changed a varchar() column length to accommodate the size of a field in production. I am not advocating for making these changes in production...but when your CIO or VP is escalating since this is holding up your businesses Data Warehouse and you so happen to have the SQL admin login credentials...stuff happens.
Wouldn't it be nice if SQL had a way to report back that this change needs to be accommodated for in the source schema? Again, the answer is in our SDLC process.
So, where is the book of record for our SQL schemas? Well, if this is your first read in this series or if you are unfamiliar with source control I'd encourage you to read Leveraging DotNet for SQL Builds via YAML | Microsoft Community Hub where I talk about the importance of placing your database projects under source control. The TL/DR...your database schema definitions should be defined under source control, ideally as a .sqlproj.
Where Terraform Comes In
At this point I've already pointed out a few instances on how our production database instance can differ from what we have defined in our source project. This certainly isn't anything new in software development. So how does other software development tooling and technologies account for this?
Generally, application code simply gets overwritten, and we have backup versions either via release branches, git tags, or other artifacts. Cloud infrastructure can be defined as Infrastructure as Code (IaC) and as such still follow something similar to our application code workflow. There are two main flavors of IaC for Azure: Bicep/ARM and Terraform.
Bicep/ARM adheres to an incremental deployment, which has its pros and cons. The quick version is Azure Resource Manager (ARM) deployments will not delete resources that are not defined in its template. Part of this has led to Azure Deployment Stacks which can help enforce resource deletion when it's been removed from a template.
If interested in understanding a Terraform workflow I will point you to one of my other posts on the topic.
At a high level Terraform evaluates your IaC definition and determines what properties need to be updated, and more importantly, what resources need to be removed. Now how does Terraform do this and more importantly, how can we tell what properties will be updated and/or removed?
Terraform has a concept known as a plan. This plan will run your deployment against what is known as the state file, in Bicep/ARM this is the Deployment Stack, and produce a summary of changes that will occur. This includes new resources to be created, modification of existing resources, and deletion of resources previously deployed to the same state file.
Typically, I recommend running a Terraform plan across all environments at CI. This ensures one can evaluate changes being proposed across all potential environments and summarize these changes at the time of the Pull Request (PR). I then advise re-executing this plan prior to deployment as a way to confirm/re-evaluate if anything has been updated since the original plan ran. Some will argue the previous plan can be "approved" to deploy to the next environment; however, there is little overhead in running a second plan and I prefer this option.
Here's the thing....SQL actually has this same functionality.
Deploy Reports
Via SqlPackage there is additional functionality we can leverage with our .dacpacs. We are going to dive a little more on Deploy Reports. If you have followed this series, you may know we use the SqlPackage Publish command wrapped behind the SqlAzureDacpacDeployment@1 task. More information on this can be found at Deploying .dapacs to Azure SQL via Azure DevOps Pipelines | Microsoft Community Hub .
So, what is a Deploy Report? A Deploy Report is the XML representation of the changes your .dacpac will make to a database. Here is an example of one denoting that there is a risk of potential data loss:
This report is the key to our whole argument for modeling a SQL Continuous Integration/Continous Delivery after one that Terraform uses. We already will have a separate .dacpac file, built from the same .sqlproj, for each environment when leveraging pre/post scripts as we saw in Deploying .dacpacs to Multiple Environments via ADO Pipelines | Microsoft Community Hub.
So now we need to take each one of those and run a Deploy Report against the appropriate target. This is the same as effectively running a `tf plan` with a different variable file against each environment to determine what actions a Terraform `apply` will execute.
These Deploy Reports are then what we will include in our PR approval to validate and approve any changes we will make to our SQL database.
Dropping What's Not in Source Control
This is the controversial part and the biggest sell in our adoption of a Terraform like approach to SQL deployments. It has long been considered a best practice to have whatever is deployed match what is under source control. This provides for a consistent experience when developing and then deploying across multiple environments.
Within IaC, we have our cloud infrastructure defined in source control and deployed across environments. Typically, it is seen as a good practice to delete resources which have been removed from source control. This helps simplify the environment, reduces cost, and reduces potential security surface areas.
So why not the same for databases? Typically, it is due to us having the fear of losing data. To prevent this, we should have proper data protection and recovery processes in place. Again, I am not addressing that aspect. If we have those accounted for, then by all means, our source control version of our databases should match our deployed environments.
What about security and indexing? Again, this can be accounted for in Deploying .dacpacs to Multiple Environments via ADO Pipelines | Microsoft Community Hub. Where we have two different post deployment security scripts, and these scripts are under source control! How can we see if data loss will occur? Refer back to the Deploy Reports for this!
There potentially is some natural hesitation as the default method for deploying a .dacpac has safeguards to prevent deployments in the event of potential data loss. This is not a bad thing as it prevents a destructive activity from automatically occurring; however, we by no means need to accept the default behavior.
We will need to refer to SqlPackage Publish - SQL Server | Microsoft Learn. From this list we will be able to identify and explicitly set the value for various parameters. These will enable our package to deploy even in the event of potential data loss.
Conclusion
This post hopefully challenges the mindset we have when it comes to database deployments. By taking an approach that more closely relates to modern DevOps practices, we can gain confidence that our source control and database match, increased reliability and speed with our deployments, and we are closing potential security gaps in our database deployment lifecycle.
This content was not designed to be technical. In our next post we will demo, provide examples, and talk through how to leverage YAML Pipelines to accomplish what we have outlined here. Be sure to follow me on LinkedIn for the latest publications.
For those who are technically sound and want to skip ahead feel free to check out my code on my GitHub : https://github.com/JFolberth/cicd-adventureWorks and https://github.com/JFolberth/TheYAMLPipelineOne