The Best Defense is a Good Offense: Security Tips for Azure Machine Learning Solutions

Microsoft

Feb 21, 2023

As cyberattacks grow more sophisticated and cloud solutions more complex, how does an engineering team prioritize security? A good offense.

The tips shared in this article are grounded in the three guiding principles at the core of the Zero Trust security model:

Principle of least privilege - Grant the minimum level of access rights needed to perform a particular task. This applies to both users and processes. When access is limited, attack surface is reduced, and potential damage is minimized.
Verify explicitly - Authenticate the identity of users or services requesting access to resources. Once authenticated, authorize by verifying permissions and access rights. Implement both to prevent unauthorized access to data, services, and resources.
Assume breach - This principle is based on the idea that a solution, service, or data could be - or already has been - breached. The practice of assuming breach can proactively help guide design decisions, priorities, and operations to minimize the impact of a compromise.

Principle of least privilege: Who has access to the Azure Machine Learning workspace and resources? Where is the training data? And who has access?

Azure Machine Learning relies on multiple Azure resources, including the workspace, compute platforms, and data storage services. It is important to be familiar with the different types of Azure identities (user, security group, service principal, and managed identity) and how they integrate with Azure resources through Azure role-based access control (Azure RBAC). The recommended practice is to manage access through built-in roles where possible, and through custom roles where needed. This applies to resources within the workspace as well as to resource dependencies outside the workspace, such as Azure SQL Database, ADLS Gen2, or Azure Storage. For users with privileged accounts, add a further layer of access control using mechanisms like Just-In-Time (JIT) access, Just-Enough-Access (JEA), and risk-based adaptive policies.

Resources to help:
- Use built-in Azure role-based access control: Manage access to an Azure Machine Learning workspace
- Restrict access to resources and operations
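As a sketch of least privilege applied to a custom role, the snippet below builds a role definition in the JSON shape Azure RBAC uses for custom roles (the kind of file passed to `az role definition create --role-definition role.json`) and flags any action that grants everything under a provider. The role name, subscription ID, resource group, and the specific action strings are hypothetical placeholders; check them against the Azure Machine Learning operations list before use.

```python
import json

# Hypothetical scope: replace with your own subscription and resource group.
SCOPE = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/ml-rg"


def build_role(name: str, actions: list[str], not_actions: list[str], scope: str) -> dict:
    """Return a custom role definition in the JSON shape used by Azure RBAC."""
    return {
        "Name": name,
        "IsCustom": True,
        "Description": "Least-privilege role for submitting ML jobs (illustrative).",
        "Actions": actions,
        "NotActions": not_actions,
        "AssignableScopes": [scope],
    }


def overly_broad(actions: list[str]) -> list[str]:
    """Flag "*" and "<Provider>/*", which grant every operation in that scope."""
    flagged = []
    for action in actions:
        parts = action.split("/")
        if action == "*" or (len(parts) == 2 and parts[1] == "*"):
            flagged.append(action)
    return flagged


# Hypothetical action list: read the workspace and work with jobs,
# but never delete or reconfigure the workspace itself.
role = build_role(
    name="ML Job Submitter (example)",
    actions=[
        "Microsoft.MachineLearningServices/workspaces/read",
        "Microsoft.MachineLearningServices/workspaces/jobs/*",
    ],
    not_actions=[
        "Microsoft.MachineLearningServices/workspaces/delete",
        "Microsoft.MachineLearningServices/workspaces/write",
    ],
    scope=SCOPE,
)

assert overly_broad(role["Actions"]) == []
print(json.dumps(role, indent=2))
```

Keeping the check separate from the definition makes it easy to run in a review pipeline before any role is created.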

Verify explicitly: Who can access a model endpoint? How can a model endpoint be protected?

For machine learning solutions, an endpoint is the entry point to the service. A client invokes it by submitting a request, and the model returns a response, such as a prediction or scored value, through the endpoint. Azure Machine Learning supports three types of endpoints: real-time endpoints, batch endpoints, and managed online endpoints. To protect endpoints, consider the following:

Always configure authentication for an endpoint. Azure Machine Learning supports both key-based and token-based authentication; see the documentation for real-time and batch endpoints.
For managed online endpoints, add protection with a private endpoint, which provides network isolation for both inbound and outbound communication.
Encrypt endpoint traffic with TLS 1.2 or later. Encryption in transit mitigates man-in-the-middle attacks, and TLS 1.2 is safer than earlier versions.
For public facing endpoints:
- Avoid exposing endpoints on port 80, because requests and responses are sent in clear text. A safer option is HTTPS on port 443.
- Use firewall capabilities to filter traffic, and rate limiting or throttling to protect the service against Denial of Service (DoS) and brute-force attacks.
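On the client side, these endpoint rules can be enforced in a few lines. This is a minimal standard-library sketch: it refuses non-HTTPS scoring URLs, requires TLS 1.2 or later with certificate verification, and sends the key as a `Bearer` credential in the `Authorization` header, which is how the consumption samples for Azure Machine Learning online endpoints pass it (confirm for your endpoint type). The environment variable name `ENDPOINT_KEY` and the example URL are hypothetical.

```python
import json
import os
import ssl
import urllib.request
from urllib.parse import urlparse


def tls12_context() -> ssl.SSLContext:
    """TLS context that verifies certificates and refuses anything below TLS 1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def build_scoring_request(scoring_uri: str, key: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated POST to a scoring endpoint; reject plain HTTP."""
    if urlparse(scoring_uri).scheme != "https":
        raise ValueError("refusing to send credentials over a non-HTTPS URL")
    if not key:
        raise ValueError("an endpoint key or token is required")
    return urllib.request.Request(
        scoring_uri,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {key}",  # key or token as a bearer credential
        },
    )


def score(scoring_uri: str, payload: dict) -> dict:
    """Call the endpoint; the key comes from the environment, never from source code."""
    key = os.environ.get("ENDPOINT_KEY", "")  # hypothetical variable name
    request = build_scoring_request(scoring_uri, key, payload)
    with urllib.request.urlopen(request, context=tls12_context(), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage would look like `score("https://<endpoint-name>.<region>.inference.ml.azure.com/score", {"data": [[1, 2, 3]]})`, where the URL is a placeholder for the scoring URI shown for your endpoint. Network-level controls such as firewalls and rate limiting still belong on the service side.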

Assume breach: What mechanisms are in place to detect a compromise? To reduce the impact of a compromise? Where are secrets stored? How quickly can the service be recovered if compromised?

With the assume breach principle, defense focuses on protecting the service and resources through containment, response, and remediation. For Azure Machine Learning, this includes the following considerations:

Networking is one of the most important security controls. By default, Azure Machine Learning workspace endpoints are public. A recommended practice is to adopt a network security architecture that (1) disables public access to the workspace and compute clusters and (2) segments the machine learning solution's components so that a compromise of one component has minimal impact on the others.

References to help:
- Tutorial: How to create a secure workspace
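To check whether an environment actually matches the network architecture recommended above, a small audit can run over exported resource properties (for example, JSON saved from `az resource show`). This sketch assumes the ARM property names `publicNetworkAccess` on the workspace and `enableNodePublicIp` nested under a compute's inner `properties`; treat those names and the sample resources as assumptions to verify against your own exports.

```python
WORKSPACE_TYPE = "Microsoft.MachineLearningServices/workspaces"
COMPUTE_TYPE = "Microsoft.MachineLearningServices/workspaces/computes"


def public_exposure(resources: list[dict]) -> list[str]:
    """Return findings for Azure ML resources that appear reachable from the internet."""
    findings = []
    for res in resources:
        name = res.get("name", "<unnamed>")
        props = res.get("properties", {})
        if res.get("type") == WORKSPACE_TYPE:
            # A missing value is treated as "Enabled", since workspaces are public by default.
            if props.get("publicNetworkAccess", "Enabled") != "Disabled":
                findings.append(f"{name}: workspace allows public network access")
        elif res.get("type") == COMPUTE_TYPE:
            # Compute-specific settings are nested one level deeper; missing means public.
            compute = props.get("properties", {})
            if compute.get("enableNodePublicIp", True):
                findings.append(f"{name}: compute nodes receive public IP addresses")
    return findings


# Hypothetical exported resources.
resources = [
    {"name": "ml-ws", "type": WORKSPACE_TYPE,
     "properties": {"publicNetworkAccess": "Enabled"}},
    {"name": "ml-ws/cpu-cluster", "type": COMPUTE_TYPE,
     "properties": {"computeType": "AmlCompute",
                    "properties": {"enableNodePublicIp": False}}},
]
for finding in public_exposure(resources):
    print(finding)
```

A check like this complements, rather than replaces, Azure Policy and the secure-workspace configuration itself.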

Infrastructure-as-code (IaC) is the practice of automating the configuration and deployment of Azure resources. With tools like Terraform or Azure Bicep, teams create templates that produce consistent, automated deployments of Azure resources. In the event of a compromise, IaC can minimize downtime because the same templates can be used to redeploy and restore services.

References to help:
- For guidance on developing a Terraform template for deployment of a secure Azure ML environment: aml-secure-terraform.
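Part of what makes IaC useful after a compromise is that the template is a known-good description of the environment, so any drift from it stands out. Tools like `terraform plan` perform this comparison against real Azure resources; the toy function below only illustrates the idea on plain dictionaries. The setting names in the example are hypothetical.

```python
def diff_state(desired: dict, actual: dict, path: str = "") -> list[str]:
    """Recursively list settings where the deployed state differs from the template."""
    drift = []
    for key in sorted(set(desired) | set(actual)):
        where = f"{path}.{key}" if path else key
        if key not in actual:
            drift.append(f"missing: {where}")
        elif key not in desired:
            drift.append(f"unexpected: {where}")
        elif isinstance(desired[key], dict) and isinstance(actual[key], dict):
            drift.extend(diff_state(desired[key], actual[key], where))
        elif desired[key] != actual[key]:
            drift.append(f"changed: {where} {desired[key]!r} -> {actual[key]!r}")
    return drift


# Hypothetical template vs. what is deployed after an attacker opened things up.
desired = {"workspace": {"publicNetworkAccess": "Disabled", "sku": "Basic"}}
actual = {"workspace": {"publicNetworkAccess": "Enabled", "sku": "Basic"},
          "firewall_rule": "allow-all"}
for line in diff_state(desired, actual):
    print(line)
```

Any non-empty result signals that redeploying from the template would restore the intended configuration.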

Use Azure Key Vault to store keys, secrets, and application artifacts such as database connection strings, endpoint URLs, and API keys. Do not store secrets in notebooks, scripts, or other code.
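A minimal sketch of reading a secret at run time instead of embedding it in code. `SecretClient` and `DefaultAzureCredential` come from the `azure-keyvault-secrets` and `azure-identity` packages; they are imported only inside `key_vault_client`, so the rest can be exercised with a stand-in client. The vault URL and the secret name `sql-connection-string` are hypothetical.

```python
from typing import Any, Protocol


class SecretReader(Protocol):
    """Anything with a Key Vault-style get_secret(name) method."""

    def get_secret(self, name: str) -> Any: ...


def key_vault_client(vault_url: str) -> SecretReader:
    """Create a real Key Vault client (requires azure-identity and azure-keyvault-secrets)."""
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    return SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())


def connection_string(client: SecretReader, secret_name: str = "sql-connection-string") -> str:
    """Fetch the secret at run time; nothing sensitive lives in the notebook or script."""
    return client.get_secret(secret_name).value


# Usage (hypothetical vault name):
#   client = key_vault_client("https://<your-vault-name>.vault.azure.net")
#   conn = connection_string(client)
```

When the code runs on Azure compute with a managed identity, `DefaultAzureCredential` can pick up that identity, so no credential has to appear in the code at all.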

References to help:
- Use authentication credential secrets in Azure Machine Learning jobs.
- Scan code repositories to detect secrets that were committed unintentionally. For code repositories in GitHub, use the native secret scanning feature to flag credentials and other secrets: GitHub secret scanning. Credential Scanner (CredScan) is a Microsoft tool that identifies credential leaks in source code and configuration files.
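To show what secret scanners look for, here is a deliberately tiny pattern-based scanner. It is not a substitute for GitHub secret scanning or CredScan, which use much larger, tuned rule sets; the two patterns below (a storage `AccountKey=` fragment and a quoted password/secret/API-key assignment) are illustrative assumptions.

```python
import re
from pathlib import Path

# Illustrative rules only; real scanners ship far larger and more precise sets.
PATTERNS = {
    "storage-account-key": re.compile(r"AccountKey=[A-Za-z0-9+/=]{20,}"),
    "generic-assignment": re.compile(
        r"""(?i)(password|secret|api[_-]?key)\s*[:=]\s*['"][^'"\s]{8,}['"]"""
    ),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line number, rule name) for every line that matches a rule."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, rule))
    return hits


def scan_tree(root: str, suffixes=(".py", ".ipynb", ".json", ".yml", ".yaml")) -> list[tuple[str, int, str]]:
    """Scan matching files under root and return (path, line number, rule name)."""
    results = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            results.extend((str(path), n, rule) for n, rule in scan_text(text))
    return results
```

Running `scan_tree(".")` in a pre-commit hook or CI step gives fast feedback, while the dedicated tools above remain the authoritative check.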

Audit logs are essential for early visibility into activity and behavior that can indicate a compromise. By default, Azure captures platform-level activity for every Azure resource, but resource-level auditing requires additional configuration. For Azure Machine Learning, create a diagnostic setting that routes the data to a central destination such as an Azure Log Analytics workspace. The audit data can then be analyzed and combined with other security signals to identify anomalies or potential threats to the service.
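A diagnostic setting is a small JSON resource under `Microsoft.Insights/diagnosticSettings` that can be created through the portal, `az monitor diagnostic-settings create`, or the REST API. This sketch only builds the request body; the Log Analytics resource ID is hypothetical, and the Azure Machine Learning log category names are an assumed subset, so list the categories your workspace actually exposes before using them.

```python
import json


def diagnostic_setting_body(log_analytics_id: str, categories: list[str]) -> dict:
    """Request body for a diagnostic setting that sends workspace logs to Log Analytics."""
    if "/providers/microsoft.operationalinsights/workspaces/" not in log_analytics_id.lower():
        raise ValueError("expected a Log Analytics workspace resource ID")
    return {
        "properties": {
            "workspaceId": log_analytics_id,
            "logs": [{"category": name, "enabled": True} for name in categories],
            "metrics": [{"category": "AllMetrics", "enabled": True}],
        }
    }


# Hypothetical destination and an assumed subset of Azure ML log categories.
LA_ID = ("/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/sec-rg"
         "/providers/Microsoft.OperationalInsights/workspaces/sec-logs")
body = diagnostic_setting_body(
    LA_ID, ["AmlComputeJobEvent", "AmlRunStatusChangedEvent", "ModelsChangeEvent"]
)
print(json.dumps(body, indent=2))
```

Validating the destination up front avoids silently routing audit data to the wrong resource.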

Keep packages, libraries, and the hosting environment (Docker images and other compute technologies) updated with current releases. This includes the code packages used in the machine learning development, pipeline, and deployment environments. An Azure Machine Learning compute instance (a managed cloud-based workstation) is provisioned with the most current VM image when it is created; to keep the image current, recreate the compute instance on a regular basis. For an Azure Machine Learning compute cluster, configure it with min nodes = 0 so that the nodes are automatically upgraded to the most current VM image.

References to help:
- For suggestions about private package repositories and managing environments, see Vulnerability Management for Azure Machine Learning.
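The min nodes = 0 setting for compute clusters maps to `min_instances=0` on `AmlCompute` in the Azure Machine Learning Python SDK v2 (`azure-ai-ml`). In this sketch the cluster name, VM size, node count, and idle timeout are hypothetical choices, and `ml_client` is assumed to be an authenticated `MLClient`; the SDK import happens inside `create_cluster` so the settings can be inspected without the package installed.

```python
def cluster_spec(name: str, size: str = "STANDARD_DS3_v2", max_nodes: int = 4) -> dict:
    """Settings for a cluster that scales to zero when idle (values are illustrative)."""
    return {
        "name": name,
        "size": size,
        # With min_instances=0 every node is released when idle, so new nodes
        # are provisioned on the current VM image the next time the cluster scales out.
        "min_instances": 0,
        "max_instances": max_nodes,
        "idle_time_before_scale_down": 120,  # seconds
    }


def create_cluster(ml_client, name: str):
    """Create or update the cluster with azure-ai-ml; ml_client is an authenticated MLClient."""
    from azure.ai.ml.entities import AmlCompute

    cluster = AmlCompute(type="amlcompute", **cluster_spec(name))
    return ml_client.compute.begin_create_or_update(cluster).result()
```

Scaling to zero also reduces cost, but expect a short delay while nodes start for the first job after an idle period.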

Pickle files are a mechanism for storing Python objects, such as a model or data. They are convenient to use but not secure: a pickle file does not know the difference between a machine learning model and malicious shell code. Loading (unpickling) a file can execute code embedded in it, which could leave the environment open to remote code execution and give an attacker access to code, models, and secrets. To minimize this risk, don't download or load pickle files from the internet or other untrusted sources.
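The risk is easy to demonstrate with the standard library. Any object can define `__reduce__` to tell pickle which callable to invoke at load time; here it is a harmless `print`, where an attacker would choose something like `os.system`. The `RestrictedUnpickler` adapts the "Restricting Globals" pattern from the Python `pickle` documentation with an illustrative allow-list. It narrows what a pickle can reach, but it does not make untrusted pickles safe, so the advice in the text still applies.

```python
import builtins
import io
import pickle

# Illustrative allow-list: only these harmless built-ins may be referenced.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Every class or function a pickle asks to import is resolved here.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    """Stand-in for a malicious object: unpickling it calls a function of its author's choosing."""

    def __reduce__(self):
        # An attacker would return something like (os.system, ("<command>",)).
        return (print, ("code executed during unpickling",))


evil = pickle.dumps(Payload())
pickle.loads(evil)  # plain pickle runs the embedded call and prints the message
try:
    restricted_loads(evil)
except pickle.UnpicklingError as err:
    print("blocked:", err)
```

Plain containers of numbers and strings still load through the restricted unpickler, because they need no imported globals.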

References to help:
- Importing third-party packages and dependencies could introduce vulnerabilities into your environment and software supply chain. Adopt a practice of scanning packages for security vulnerabilities using a tool like pip-audit.
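One way to make package scanning routine is to run `pip-audit` as a pipeline step and stop the build when it fails. This sketch shells out to the `pip-audit -r <requirements file>` command; it treats any non-zero exit code as a failure, which covers both reported vulnerabilities and errors in the audit itself. The default file name `requirements.txt` is an assumption about the project layout.

```python
import shutil
import subprocess
import sys


def audit_command(requirements: str = "requirements.txt") -> list[str]:
    """Command line that audits the dependencies listed in a requirements file."""
    return ["pip-audit", "-r", requirements]


def audit_or_fail(requirements: str = "requirements.txt") -> None:
    """Run pip-audit and exit non-zero if it reports problems."""
    if shutil.which("pip-audit") is None:
        sys.exit("pip-audit not found; install it with `pip install pip-audit`")
    result = subprocess.run(audit_command(requirements))
    if result.returncode != 0:
        sys.exit(f"pip-audit reported vulnerabilities or failed (exit code {result.returncode})")
```

Calling `audit_or_fail()` from a CI script makes a vulnerable dependency block the build instead of reaching a training or deployment environment.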

Enable Microsoft Defender for Cloud to scan for vulnerable service configurations and provide early warning of potential security problems. Monitor your Secure Score and regularly review its recommendations for locking down services and the environment. Consider Defender for Cloud add-ons to protect specific workloads:

Defender for DevOps scans for and detects secrets and vulnerabilities in infrastructure-as-code, Azure DevOps, and GitHub repos.

Defender for Containers provides vulnerability assessments and run-time threat protection.

Safety, security, and reliability are important design considerations for any digital solution today, including machine learning solutions. The purpose of this post is to raise awareness of what is available to help secure Azure Machine Learning solutions. Please let us know how we did!

For more information on Zero Trust: Embrace proactive security with Zero Trust (Microsoft white paper). For more information on reported Python vulnerabilities, see CVE advisories issued for Python modules.


KateB

Microsoft


FastTrack for Azure
