01 Getting Started with Data in Azure Machine Learning
Published Apr 09 2024

The Azure Machine Learning Datastore 

Azure Machine Learning (AML) has the concept of a datastore: a reference to an existing storage account that you register once and then reuse through the AML API. Datastores support several storage types, such as Azure Blob Storage, Azure Files, and ADLS, and they make it easy for a team to discover and share the storage locations it already relies on, improving operational efficiency.
 
One salient feature of this API is its secure approach to managing connection information. You can use credential-based access, whether through a service principal, SAS (shared access signature), or account key, to protect the confidentiality and integrity of your data. Because AML stores these secrets for you, there is no need to embed sensitive connection details within scripts, which mitigates potential security risks. Datastores become very useful when you start to set up automation using AML Pipelines or begin your journey into MLOps (Machine Learning Operations).

 

Datastores give you the ability to access Azure Blob storage, ADLS Gen 2, Azure Files, and Microsoft Fabric OneLake. The following repo will get you started on creating your first AML Datastore. 
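To make this concrete, here is a minimal sketch of registering a Blob datastore with the Python SDK (azure-ai-ml); the subscription, workspace, storage account, and key values are placeholders you would replace with your own:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration

# Connect to the AML workspace (placeholder identifiers).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Register a datastore that points at an existing Blob container.
# Credential-based access with an account key is shown; SAS tokens and
# identity-based access are also supported.
blob_datastore = AzureBlobDatastore(
    name="my_blob_datastore",
    description="Blob container holding training data",
    account_name="<storage-account-name>",
    container_name="<container-name>",
    credentials=AccountKeyConfiguration(account_key="<account-key>"),
)

ml_client.create_or_update(blob_datastore)
```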

GitHub Repo 

https://github.com/Azure/azureml-examples/blob/main/sdk/python/resources/datastores/datastore.ipynb 

 

   

Azure Machine Learning “Connections” 

In certain scenarios, data may not be housed within Azure Blob Storage or OneLake at all, but in an external system such as Snowflake, Azure SQL DB, or AWS S3. For these cases, AML lets you create a connection, seamlessly bridging the gap between the external data source and AML's analytical capabilities. By establishing a connection, you can access and analyze data from these disparate sources without the burden of intricate configuration or manual intervention.
 
AML prioritizes data security and stores connection credentials securely in the Workspace Key Vault. Once credentials are stored, you no longer need to interact with the Key Vault directly; AML handles authentication and authorization in the background. This approach protects the confidentiality and integrity of your credentials and strengthens the overall security posture of your data.
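As an illustration, a Snowflake connection could be created with the Python SDK roughly as follows, reusing an MLClient like the one in the datastore sketch above; the connection name, JDBC target, and credentials are placeholders:

```python
from azure.ai.ml.entities import WorkspaceConnection, UsernamePasswordConfiguration

# JDBC-style target for the Snowflake account (placeholder values).
target = (
    "jdbc:snowflake://<account>.snowflakecomputing.com/"
    "?db=<database>&warehouse=<warehouse>&role=<role>"
)

snowflake_connection = WorkspaceConnection(
    name="my_snowflake_connection",
    type="snowflake",
    target=target,
    # The credentials are stored in the Workspace Key Vault, not in your script.
    credentials=UsernamePasswordConfiguration(
        username="<snowflake-user>",
        password="<snowflake-password>",
    ),
)

ml_client.connections.create_or_update(workspace_connection=snowflake_connection)
```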
GitHub Repo 

https://github.com/Azure/azureml-examples/blob/main/sdk/python/resources/connections/connections.ipy... 

 

Azure Machine Learning Data Asset 

An Azure Machine Learning data asset can be likened to web browser bookmarks or favorites. Rather than having to recall lengthy storage paths (URIs) that direct you to your frequently accessed data, you have the option to create a data asset and conveniently access it using a friendly name. 
 
When a data asset is created, it establishes a reference to the data source location and also retains a copy of its metadata. Because the data remains in its original location, no additional storage costs are incurred and the integrity of the data source is not compromised. Data assets can be created from various sources such as Azure Machine Learning datastores, Azure Storage, public URLs, or local files, and they can be created using the Azure CLI, the Python SDK, or the AML Studio.
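For example, a URI-file data asset pointing at a publicly readable CSV could be registered like this; the asset name and path are illustrative:

```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Register a friendly-named reference to a file; the data stays where it is.
titanic_asset = Data(
    name="titanic-csv",
    version="1",
    type=AssetTypes.URI_FILE,
    description="Titanic passenger data (sample CSV)",
    path="https://azuremlexamples.blob.core.windows.net/datasets/titanic.csv",
)

ml_client.data.create_or_update(titanic_asset)

# Later, retrieve the asset by name and version instead of remembering the URI.
retrieved = ml_client.data.get(name="titanic-csv", version="1")
print(retrieved.path)
```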

 

Azure Machine Learning MLTable 

With Azure Machine Learning, you can utilize a Table type (mltable) to create a blueprint that specifies the process of loading data files into memory as either a Pandas or Spark data frame. 

 

An MLTable file is a YAML-based file that serves as a blueprint for data loading. Within the MLTable file, you can define several aspects of the loading process:

  • The storage location(s) of the data, which can be local, in the cloud, or on a public http(s) server. Globbing patterns with wildcard characters (*) let you select sets of files in cloud storage, a convenient way to handle multiple files within a specified location.
  • Read transformations, such as the file format (delimited text, Parquet, Delta, JSON), delimiters, headers, and more, so the data is read correctly based on its format.
  • Subsets of data to load, including filtering rows, keeping or dropping specific columns, and taking random samples, so you load only the data you need for your specific scenario.
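The same blueprint can also be built programmatically with the mltable Python package; in this sketch the glob pattern, column names, and sample probability are hypothetical:

```python
import mltable

# Glob over CSV files in cloud storage (hypothetical container and path).
paths = [{"pattern": "wasbs://data@<storage-account>.blob.core.windows.net/green/*.csv"}]

# Read transformation: delimited text where every file shares the same header row.
tbl = mltable.from_delimited_files(
    paths=paths,
    delimiter=",",
    header="all_files_same_headers",
)

# Subset the data: keep two columns and take a 10% random sample.
tbl = tbl.keep_columns(["tripDistance", "fareAmount"])
tbl = tbl.take_random_sample(probability=0.10, seed=7)

# Materialize into memory as a Pandas DataFrame.
df = tbl.to_pandas_dataframe()

# Optionally persist the blueprint as an MLTable YAML file for reuse.
tbl.save("./my_mltable")
```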

 

Github Repo 

https://github.com/Azure/azureml-examples/tree/main/sdk/python/using-mltable 

 

 

Reading from a Delta Table 

A Delta table is a core component of the Delta Lake open-source data framework. Typically employed in data lakes, Delta tables facilitate data ingestion through streaming or large batch processes. Delta Lake is an open-source storage layer that enhances the reliability of data lakes by introducing a transactional layer atop cloud storage systems such as AWS S3, Azure Storage, and GCS. This enables features like ACID (atomicity, consistency, isolation, durability) transactions, data versioning, and rollback capabilities, streamlining the handling of both batch and streaming data in a unified manner. Delta tables, built on this storage layer, offer a table abstraction that simplifies working with extensive structured data using SQL and the DataFrame API.

 

AML MLTable also supports the Delta format for reading data and converting it to a Pandas DataFrame.
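A minimal sketch of that pattern with the mltable package might look like this; the ADLS Gen2 path and version number are placeholders:

```python
import mltable

# Folder containing a Delta table in ADLS Gen2 (placeholder URI).
delta_uri = "abfss://<filesystem>@<storage-account>.dfs.core.windows.net/delta/my_table"

# Load the table; version_as_of (or timestamp_as_of) pins a specific Delta version.
tbl = mltable.from_delta_lake(delta_table_uri=delta_uri, version_as_of=1)

# Convert to a Pandas DataFrame for local analysis.
df = tbl.to_pandas_dataframe()
print(df.head())
```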

GitHub Repo 

https://github.com/Azure/azureml-examples/blob/main/sdk/python/using-mltable/delta-lake-example/delt... 

 

Azure Machine Learning Managed Feature Store 

 

Managed feature store in Azure Machine Learning provides a centralized repository that enables data scientists and machine learning professionals to independently develop, productionize, and share features with other business units within your organization. Features serve as the input data for your model. With a feature set specification, the system handles serving, securing, and monitoring of the features, freeing you from the overhead of setting up and managing the underlying feature engineering pipelines.

 

Feature store allows you to search for and reuse features created by your team, avoiding redundant work and delivering consistent predictions. New derived features created with transformations can address feature engineering requirements in an agile, dynamic way, and they can be shared across multiple workspaces. The system operates and manages the feature engineering pipelines required for transformation and materialization, freeing your team from the operational aspects.
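Provisioning a managed feature store is itself a short operation with the Python SDK; the following is a rough sketch in which the names, region, and subscription details are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import FeatureStore

# Client scoped to the subscription and resource group (placeholders).
fs_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
)

# A managed feature store is provisioned as a specialized workspace kind.
feature_store = FeatureStore(name="my-featurestore", location="eastus")

poller = fs_client.feature_stores.begin_create(feature_store)
print(poller.result().name)
```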

 

https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/announcing-managed-feature-store-in-...  

 

Conclusion: This blog has covered several mechanisms for accessing data across a variety of sources. Each method offers benefits such as convenience, security, or cost savings, which should be weighed against the requirements of your situation. In each case, a GitHub resource has been provided as a practical example to help you learn about data management in the cloud.

 

Principal author:

  • Shep Sheppard | Senior Customer Engineer, FastTrack for ISV and Startups

Other contributors:

  • Yoav Dobrin | Principal Customer Engineer, FastTrack for ISV and Startups
  • Jones Jebaraj | Senior Customer Engineer, FastTrack for ISV and Startups
  • Olga Molocenco-Ciureanu | Customer Engineer, FastTrack for ISV and Startups