Figure 1: Demonstration of a deep learning model making sense of thousands of images by identifying their underlying categorical structure. Each dot represents the location of a sample image in the model's semantic representation of the dataset, known as the embedding space. t-SNE was used to create this 2D projection, which shows the model's representation of the underlying categorical structure across images (CIFAR-10 benchmark dataset). The color of the dot represents image class membership (10 classes: 'plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'), and dot opacity represents the model's confidence in the image class. Dots without color represent samples that the model hasn't learned yet.
Organizations often sit on a treasure trove of unstructured data such as video, speech, and audio, which is difficult to analyze and distill into actionable insights. In this blog post series, we will cover how to combine #azureml and #azuresqldatabase to empower individuals who would like to derive value from large-scale unstructured datasets.
In this introductory post, we provide an overview of potential use cases, challenges, and solutions involving the use of large-scale, unstructured datasets. Our data solutions are based on active learning, a class of machine learning algorithms that support faster training by actively querying the user for labels of highly informative examples. Active learning is helpful when manual data labeling is expensive. The machine 'learner' selects only the most informative examples for labeling, reducing the total number of labeled examples needed to learn a solution or concept.
In future posts, we will discuss the steps required to enable active learning at scale via implementation of PyTorch dataset classes, which load unstructured data (e.g., images) from Azure Blob Storage, read annotations from an Azure SQL database, and write model predictions to the same database. The resulting Azure SQL database supports quantitative analytics and insights that are unobtainable from unstructured data (see Figure 1).
Here are some example applications of the approach that we will introduce in the blog post series:
Conservation and Environmental Studies. As part of a co-innovation project with the Jane Goodall Institute (JGI), MediaValet, and the University of Oxford, we used an active learning architecture to help researchers explore animal behavior. JGI digitized and uploaded many decades of videos of chimpanzees in the wild to enable primate researchers to use this data for quantitative scientific analyses. Building on groundbreaking work at the University of Oxford [1], we developed a no-code active learning solution for training state-of-the-art computer vision models. This solution helps researchers at JGI index and understand their unstructured data assets and join unstructured data with other structured data sources, enabling new statistical analyses and scientific inquiries.
Organizations often face very significant challenges when trying to unlock insights from unstructured data. Key obstacles include:
To address the challenge that data annotation is expensive, especially when data has to be annotated by subject matter experts, we developed an active learning solution. Active learning is a machine learning technique that tries to minimize required labeling efforts by strategically selecting samples for annotation that are expected to benefit the model the most.
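To make the selection step concrete, here is a minimal sketch of one common query strategy, uncertainty sampling by prediction entropy. This post does not specify which strategy our solution uses, so treat the strategy, the function names (`entropy`, `select_for_labeling`), and the example probabilities as illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the IDs of the `budget` samples the model is least certain about.

    `predictions` maps a sample ID to the model's class probabilities.
    Higher entropy means a flatter distribution, i.e. more uncertainty.
    """
    ranked = sorted(
        predictions,
        key=lambda sample_id: entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical softmax outputs for four unlabeled images (3 classes).
predictions = {
    "img_001": [0.98, 0.01, 0.01],  # confident prediction
    "img_002": [0.34, 0.33, 0.33],  # nearly uniform: most uncertain
    "img_003": [0.70, 0.20, 0.10],
    "img_004": [0.50, 0.49, 0.01],
}

queue = select_for_labeling(predictions, budget=2)
print(queue)
```

Only the samples in `queue` would be sent to annotators in this round; the model is then retrained on the enlarged labeled set and the ranking is repeated.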
Dataset size can also be a challenge for organizations that explore an active learning solution. A common approach to training deep learning models is to store annotations in JSON or CSV files and load them into host memory at the beginning of training. This approach becomes impractical once the annotations no longer fit comfortably in memory.
While there are several workarounds for more advanced use cases, we decided to use Azure Blob Storage and Azure SQL Database for this project, which immediately alleviated all concerns around increasing dataset size. Azure Blob Storage (general-purpose v2, GPv2) supports containers of up to 5 PiB (about 5,630 terabytes), and Azure SQL can easily handle trillions of rows of annotation data, with a maximum database size of 524,272 terabytes. A clustered index on the Azure SQL tables allows very rapid reads from and writes to those tables.
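The difference from loading every annotation into host memory can be sketched with a dataset class that fetches one annotation per access. This is a simplified illustration rather than our implementation: it uses Python's built-in `sqlite3` as a stand-in for Azure SQL, implements only the map-style protocol (`__len__` and `__getitem__`) that PyTorch datasets follow without importing PyTorch, and the table and column names (`annotations`, `sample_id`, `blob_path`, `label`) are hypothetical.

```python
import sqlite3


class SqlAnnotationDataset:
    """Map-style dataset that looks up a single annotation row per access."""

    def __init__(self, conn):
        self.conn = conn

    def __len__(self):
        return self.conn.execute("SELECT COUNT(*) FROM annotations").fetchone()[0]

    def __getitem__(self, idx):
        # Keyed lookup: only the requested row is read, not the whole table.
        row = self.conn.execute(
            "SELECT blob_path, label FROM annotations WHERE sample_id = ?",
            (idx,),
        ).fetchone()
        if row is None:
            raise IndexError(idx)
        blob_path, label = row
        # A real implementation would download `blob_path` from Blob Storage
        # and decode the image here; this sketch returns the path instead.
        return blob_path, label


# In-memory SQLite database standing in for the Azure SQL annotation table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations ("
    "sample_id INTEGER PRIMARY KEY, blob_path TEXT NOT NULL, label TEXT)"
)
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?)",
    [
        (0, "images/0.png", "cat"),
        (1, "images/1.png", "dog"),
        (2, "images/2.png", None),  # not yet annotated
    ],
)

dataset = SqlAnnotationDataset(conn)
print(len(dataset), dataset[1])
```

Because each access is a lookup on the primary key, the memory footprint of the training process stays roughly constant no matter how many rows the annotation table holds.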
Azure SQL DB provides several advantages for a project of this scale, including:
Once the trained deep learning model has been applied to index the dataset, the results can also be stored in the SQL database, allowing data scientists and analysts to calculate descriptive statistics, perform inferential statistics, or join the table with other tables.
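As a rough illustration of this kind of downstream analysis, the sketch below stores a few model predictions in a table, computes per-class descriptive statistics, and joins the predictions with a second table. It again uses `sqlite3` as a stand-in for Azure SQL, and the schema (`predictions`, `recordings`, `site`) and all values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE predictions (
        sample_id INTEGER PRIMARY KEY,
        predicted_label TEXT NOT NULL,
        confidence REAL NOT NULL
    );
    CREATE TABLE recordings (
        sample_id INTEGER PRIMARY KEY,
        site TEXT NOT NULL
    );
    """
)

# Hypothetical batch-inference results written back to the database.
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?)",
    [(1, "cat", 0.9), (2, "cat", 0.7), (3, "dog", 0.6), (4, "dog", 0.8)],
)
# Hypothetical structured metadata about where each sample was recorded.
conn.executemany(
    "INSERT INTO recordings VALUES (?, ?)",
    [(1, "north"), (2, "south"), (3, "north"), (4, "north")],
)

# Descriptive statistics: sample count and mean confidence per class.
per_class = conn.execute(
    """
    SELECT predicted_label, COUNT(*), AVG(confidence)
    FROM predictions
    GROUP BY predicted_label
    ORDER BY predicted_label
    """
).fetchall()

# Join with structured data: predicted class counts per recording site.
per_site = conn.execute(
    """
    SELECT r.site, p.predicted_label, COUNT(*)
    FROM predictions AS p
    JOIN recordings AS r ON r.sample_id = p.sample_id
    GROUP BY r.site, p.predicted_label
    ORDER BY r.site, p.predicted_label
    """
).fetchall()

print(per_class)
print(per_site)
```

Once predictions live in a relational table, questions such as "how often does each class occur at each site?" reduce to ordinary SQL queries over the index the model produced.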
Another challenge is that the individuals who annotate the data may have no experience with software engineering or data science. We will therefore discuss how to build a no-code solution. To this end, we orchestrate a set of Azure ML Pipelines that are triggered automatically in response to well-defined events. These pipelines automate data ingestion, model training and re-training, monitoring for model and data drift, batch inference, and active learning.
Azure ML Data Labeling. Data Labeling in Azure Machine Learning offers a powerful web interface within Azure ML Studio that allows users to create, manage, and monitor labeling projects. To increase productivity and decrease costs for a given project, users can take advantage of the ML-assisted labeling feature, which uses Azure ML Automated ML computer vision models under the hood. However, in contrast to the approach described here, Azure ML Data Labeling does not support active learning.
Azure Custom Vision service is a mature and convenient managed service that allows customers to label data and to train and deploy computer vision models. In contrast to the approach discussed here, the focus is on developing a state-of-the-art vision model rather than understanding and indexing very large amounts of unstructured data. Like the Azure ML Data Labeling tool above, it does not have support for active learning.
Video Indexer is a powerful managed service for indexing large assets of video data. It currently offers limited options for customizing models to understand the subject domain of custom datasets and does not allow users to apply the generated index for secondary analysis using straightforward, built-in features.
This blog post is the first in a series focusing on combining Azure SQL Database and Azure ML to index and understand very large repositories of unstructured data. Future blog posts will describe each of the topics touched upon above in more detail, including:
We also welcome requests in the comment section for other topics that you would like us to cover in these future blog posts.