Azure Machine Learning and loading external data sets into your experiment
First published on MSDN on Mar 02, 2017

Amy Nicholson and I attended an event at a university yesterday. Amy's presentation was on the use of Azure ML Studio and how the university could effectively use Azure Machine Learning Studio within its machine learning teaching, learning and research.

One of the questions we received at the end of the session was how to get large datasets from a local computer into Azure ML.

The size limit for uploading local datasets directly to Azure ML is 1.98 GB.

To overcome this limitation and upload larger files, up to 10 GB, the recommended approach is the following two steps:

  1. Stage the data to Microsoft Azure Blob Storage using the AzCopy command-line utility
  2. Use the Reader module to import the data from Blob storage into ML Studio

Note that for large files, bringing in datasets can take a long time to complete: 10 minutes per GB of data or more.

Step 1: Stage Data to Blob Storage using AzCopy

First, install the AzCopy command-line utility on your local computer. Then open a Command Prompt and use AzCopy to upload your file from a local folder to Blob storage, replacing the storage account, key, container and file name placeholders with your own values:

cd "C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy"

.\AzCopy.exe /Source:C:\LocalFolder /Dest:https://mystorage.blob.core.windows.net/mycontainer /DestKey:MyStorageAccountKey /Pattern:myfile.csv

Note: To optimize performance for the next step, use South Central US as the region for your storage account. South Central US is the same region that the Azure ML service uses.
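
If you prefer to script the upload instead of using AzCopy, a rough Python equivalent using the azure-storage-blob package is sketched below. The account name, key, container and file names are the same placeholders as in the AzCopy command above, so substitute your own values.

from azure.storage.blob import BlobServiceClient

# Placeholder values mirroring the AzCopy example above; replace with your own.
service = BlobServiceClient(account_url="https://mystorage.blob.core.windows.net",
                            credential="MyStorageAccountKey")
blob = service.get_blob_client(container="mycontainer", blob="myfile.csv")

# Stream the local file to Blob storage; the SDK chunks large uploads automatically.
with open(r"C:\LocalFolder\myfile.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)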

Step 2: Use Reader Module to Import Data from Blob to ML Studio

Create a new blank experiment in Azure ML Studio. Drag the Reader module onto the experiment canvas and configure its parameters to read the data from the blob created in Step 1:

  • Data source: AzureBlobStorage
  • Authentication type: Account
  • Account name: <mystorage>
  • Account key: <MyStorageAccountKey>
  • Path to container, directory or blob: <mycontainer>/<myfile.csv>

Run the experiment. Once the experiment has finished, right-click the output port of the Reader module and select “Save as Dataset”. Note that the Reader module re-reads the data from Blob storage every time the experiment is run, whereas saving the dataset creates a static copy that is available from the “Saved Datasets” list in ML Studio.
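
If you want to inspect or preprocess the saved data with code inside the experiment, you can connect the dataset to an Execute Python Script module. Below is a minimal sketch of the entry point that module expects; the checks inside are purely illustrative.

import pandas as pd

# Azure ML Studio (classic) passes connected datasets into this function as
# pandas DataFrames and expects a sequence containing a DataFrame in return.
def azureml_main(dataframe1=None, dataframe2=None):
    # Illustrative sanity checks on the imported CSV: size and a preview of rows.
    print(dataframe1.shape)
    print(dataframe1.head())
    return dataframe1,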

If you're interested in learning more about Azure Machine Learning:

Here is a short introduction to Preprocessing Data in Azure Machine Learning Studio

Here is a short video on Predictive Modeling with Azure ML Studio

For more videos on Azure ML see https://channel9.msdn.com

Resources

This repository contains a student-focused getting-started guide for Azure Machine Learning, covering Azure Machine Learning Studio, the Gallery and Notebooks. It takes you end-to-end through building and deploying a model using the cloud service on Azure.

https://github.com/amykatenicho/AzureMLStudentsPython

The Azure Workshop is a series of hands-on coding labs to help computer science faculty and students quickly learn how to deploy solutions to the Azure cloud across common scenarios like Web Dev, App Dev, Internet of Things, and Data Science with Machine Learning, using cross-platform technologies. Labs can be completed on a Windows device or through VMs on Mac or Linux. The format is typically a one-day instructor-led session; however, groups may opt to customize it into 2-hour or 4-hour lengths. Your feedback is welcome in improving these labs.

https://github.com/MSFTImagine/computerscience/tree/master/Workshop

Data science in 5 steps with Microsoft Azure Machine Learning

https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/12/04/data-science-in-5-steps-with-...

A set of Machine Learning Resources

https://blogs.msdn.microsoft.com/uk_faculty_connection/?s=Machine+Learning
