Figure 1: Demonstration of a deep learning model making sense of thousands of images by identifying their underlying categorical structure. Each dot represents the location of a sample image in the model's semantic representation of the dataset, known as the embedding space. t-SNE was used to create this 2D projection, which shows the model's representation of the underlying categorical structure across images (CIFAR-10 benchmark dataset). The color of the dot represents image class membership (10 classes: 'plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'), and dot opacity represents the model's confidence in the image class. Dots without color represent samples that the model hasn't learned yet.
Introduction
Organizations often sit on a treasure chest of unstructured data like video, speech and audio, which are difficult to analyze and distill into actionable insights. In this blog post series, we will cover how to combine #azureml and #azuresqldatabase to empower individuals who would like to derive value from large-scale unstructured datasets.
In this introductory post, we provide an overview of potential use cases, challenges, and solutions involving the use of large-scale, unstructured datasets. Our data solutions are based on active learning, a class of machine learning algorithms that support faster training by actively querying the user for labels of highly informative examples. Active learning is helpful when manual data labeling is expensive. The machine 'learner' selects only the most informative examples for labeling, reducing the total number of labeled examples needed to learn a solution or concept.
In future posts, we will discuss the steps required to enable active learning at scale via implementation of PyTorch dataset classes, which load unstructured data (e.g., images) from Azure Blob Storage, read annotations from an Azure SQL database, and write model predictions to the same database. The resulting Azure SQL database supports quantitative analytics and insights that are unobtainable from unstructured data (see Figure 1).
Example Applications
Here are some example applications of the approach that we will introduce in the blog post series:
- Workplace Safety. Supervisors suspect that some worker behaviors lead to accidents. They have a very large repository of video footage and records of workplace accidents and would like to investigate whether specific behaviors have historically preceded accidents.
- Road Safety. Public employees suspect that a particular type of traffic intersection is associated with an increased number of traffic accidents. Employees have historical GIS data on traffic accidents and footage of traffic cameras. They train a model to categorize intersections via an active learning approach and join that data with GIS data on traffic accidents to test their hypothesis.
- Manufacturing. A manufacturer suspects that a particular kind of manufacturing defect leads to warranty claims later. The manufacturer has a large dataset of images from manufacturing pipelines. Investigators train a model to recognize the anomaly and join the data with warranty claims to investigate. Based on their findings, they can start a product recall to avoid costly warranty claims.
- Predictive Maintenance. Acoustic sensors on manufacturing machines are designed to provide a signal that is predictive of outages and other equipment failures. Operators would like to know whether it is possible to join unstructured acoustic data from these sensors with maintenance records to perform predictive maintenance.
- Conservation and Environmental Studies. As part of a co-innovation project with the Jane Goodall Institute (JGI), MediaValet, and the University of Oxford, we used an active learning architecture to help researchers explore animal behavior. JGI digitized and uploaded many decades of videos of chimpanzees in the wild to enable primate researchers to use this data for quantitative scientific analyses. Building on groundbreaking work at the University of Oxford [1], we developed a no-code active learning solution for training state-of-the-art computer vision models. This solution helps researchers at JGI index and understand their unstructured data assets and join unstructured data with other structured data sources, enabling new statistical analyses and scientific inquiries.
Challenges
Organizations often face very significant challenges when trying to unlock insights from unstructured data. Key obstacles include:
- Expense: Data annotation is expensive. Sometimes the task of annotating data can be crowd-sourced, but there are many situations where data can only be annotated by subject matter experts who have other important tasks on their plate.
- Data volume: Organizations may have collected unstructured data for an extended period of time, and could potentially sit on millions of images, videos, or audio recordings.
- Expertise: Organizations may not have the required expertise in software engineering or data science to analyze unstructured data at scale.
Minimizing data labeling costs with active learning
To address the challenge that data annotation is expensive, especially when data has to be annotated by subject matter experts, we developed an active learning solution. Active learning is a machine learning technique that tries to minimize required labeling efforts by strategically selecting samples for annotation that are expected to benefit the model the most.
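To make the selection step concrete, here is a minimal sketch of one common query strategy, uncertainty sampling by predictive entropy. This post does not state which strategy our solution uses, so treat the strategy itself as an assumption. The function names, sample ids, and probabilities are hypothetical and for illustration only.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` most uncertain unlabeled samples.

    `predictions` maps a sample id to the model's class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda sample_id: predictive_entropy(predictions[sample_id]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled images (three classes).
unlabeled = {
    "img_001": [0.98, 0.01, 0.01],  # confident prediction
    "img_002": [0.34, 0.33, 0.33],  # nearly uniform, very uncertain
    "img_003": [0.70, 0.20, 0.10],
    "img_004": [0.50, 0.49, 0.01],
}

queue = select_for_annotation(unlabeled, budget=2)
print(queue)  # ['img_002', 'img_003']
```

Only the samples in `queue` would be sent to the subject matter expert. The confident prediction for `img_001` is left alone, which is where the labeling savings come from.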
Azure SQL Server and Database enable active learning at scale
Dataset size can also be a challenge for organizations that explore an active learning solution. A common approach to training deep learning models is to store annotations in JSON or CSV files and to load all of them into host memory at the beginning of training. This works for small datasets but becomes impractical once the annotations no longer fit comfortably in the memory of the training host.
While there are several workarounds for more advanced use cases, we decided to use Azure Blob Storage and SQL DB for this project, which immediately alleviated all concerns around increasing dataset size. Azure Blob Storage (Azure general-purpose v2 (GPv2)) supports containers of a size of up to 5 PiB (5,630 terabytes), and Azure SQL can easily handle trillions of rows of annotation data, with a maximum database size of 524,272 terabytes. Using a clustered index on the Azure SQL tables allows for very rapid reading and writing to SQL tables.
Azure SQL DB provides several advantages for a project of this scale, including:
- Storage: Memory limitations on the training host machines used for model training and inference are no longer an issue because there is no requirement to load the annotations for the entire dataset in memory.
- Speed: This approach scales well as a dataset grows, because indexed Azure SQL tables support rapid reads and writes even when they hold a very large number of annotations.
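The storage point above can be sketched as a map-style dataset that fetches one annotation row per index instead of loading every annotation up front. This is a rough illustration under stated assumptions, not the implementation from the upcoming posts: Python's built-in `sqlite3` stands in for Azure SQL so the example runs anywhere, and the `annotations` table, its columns, and the class name `SqlAnnotationDataset` are hypothetical.

```python
import sqlite3


class SqlAnnotationDataset:
    """Map-style dataset that fetches one annotation row per index.

    It implements __len__ and __getitem__, the protocol PyTorch expects of a
    map-style dataset, without importing torch so the sketch stays runnable.
    """

    def __init__(self, connection):
        self.connection = connection

    def __len__(self):
        (count,) = self.connection.execute(
            "SELECT COUNT(*) FROM annotations"
        ).fetchone()
        return count

    def __getitem__(self, index):
        row = self.connection.execute(
            "SELECT blob_path, label FROM annotations WHERE sample_id = ?",
            (index,),
        ).fetchone()
        if row is None:
            raise IndexError(index)
        blob_path, label = row
        # A real implementation would download and decode the image stored
        # at blob_path here; this sketch returns the path instead.
        return blob_path, label


# sqlite3 is an assumption standing in for Azure SQL.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE annotations ("
    " sample_id INTEGER PRIMARY KEY,"  # key lookups avoid a full table scan
    " blob_path TEXT NOT NULL,"
    " label TEXT)"
)
connection.executemany(
    "INSERT INTO annotations (sample_id, blob_path, label) VALUES (?, ?, ?)",
    [
        (0, "images/0001.png", "cat"),
        (1, "images/0002.png", "dog"),
        (2, "images/0003.png", None),  # not yet annotated
    ],
)

dataset = SqlAnnotationDataset(connection)
print(len(dataset), dataset[1])  # 3 ('images/0002.png', 'dog')
```

Because each `__getitem__` call issues a single keyed query, host memory use stays flat no matter how many rows the table holds. The trade-off is one database round trip per sample, which is why the index on `sample_id` matters.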
Once the trained deep learning model has been applied to index the dataset, the results can also be stored in the SQL database, allowing data scientists and analysts to calculate descriptive statistics, perform inferential statistics, or join the table with other tables.
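As an illustration of that last step, the sketch below writes model predictions to a table and joins them with a second, structured table, loosely following the manufacturing example from earlier in this post. It again uses `sqlite3` as a stand-in for Azure SQL, and all table names, column names, and values are made up for the example.

```python
import sqlite3

# sqlite3 is an assumption standing in for Azure SQL.
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE predictions (
        unit_id INTEGER PRIMARY KEY,
        predicted_label TEXT NOT NULL,
        confidence REAL NOT NULL
    );
    CREATE TABLE warranty_claims (
        claim_id INTEGER PRIMARY KEY,
        unit_id INTEGER NOT NULL
    );
    """
)

# Hypothetical batch-inference output, one row per inspected unit.
db.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?)",
    [
        (1, "defect", 0.91),
        (2, "ok", 0.97),
        (3, "defect", 0.88),
        (4, "ok", 0.99),
        (5, "ok", 0.95),
    ],
)

# Hypothetical structured data from another system.
db.executemany(
    "INSERT INTO warranty_claims VALUES (?, ?)",
    [(100, 1), (101, 3), (102, 4)],
)

# Count inspected units and units with at least one claim per predicted label.
rows = db.execute(
    """
    SELECT p.predicted_label,
           COUNT(DISTINCT p.unit_id) AS units,
           COUNT(DISTINCT c.unit_id) AS units_with_claim
    FROM predictions AS p
    LEFT JOIN warranty_claims AS c ON c.unit_id = p.unit_id
    GROUP BY p.predicted_label
    ORDER BY p.predicted_label
    """
).fetchall()

for label, units, units_with_claim in rows:
    print(label, units, units_with_claim)
```

In this toy data, both units predicted as `defect` later had a warranty claim, compared with one of the three units predicted as `ok`. With real data, the same query shape would feed a proper statistical test rather than a visual comparison.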
Azure ML enables the automation of model training and monitoring
Another challenge can be that the individuals who annotate the data may have no experience with software engineering and data science. We will therefore discuss how to build a no-code solution. To this end, we orchestrate a set of Azure ML Pipelines that run automatically in response to well-defined events. These pipelines automate data ingestion, model training and re-training, monitoring for model and data drift, batch inference, and active learning.
Related Tools and Services
Azure ML Data Labeling. Data Labeling in Azure Machine Learning offers a powerful web interface within Azure ML Studio that allows users to create, manage, and monitor labeling projects. To increase productivity and decrease costs for a given project, users can take advantage of the ML-assisted labeling feature, which uses Azure ML Automated ML computer vision models under the hood. However, in contrast to the approach described here, Azure ML Data Labeling does not support active learning.
Azure Custom Vision service is a mature and convenient managed service that allows customers to label data and to train and deploy computer vision models. In contrast to the approach discussed here, the focus is on developing a state-of-the-art vision model rather than understanding and indexing very large amounts of unstructured data. Like the Azure ML Data Labeling tool above, it does not have support for active learning.
Video Indexer is a powerful managed service for indexing large assets of video data. It currently offers limited options for customizing models to understand the subject domain of custom datasets and does not allow users to apply the generated index for secondary analysis using straightforward, built-in features.
Conclusion
This blog post is the first in a series focusing on combining Azure SQL Database and Azure ML to index and understand very large repositories of unstructured data. Future blog posts will describe each of the topics touched upon above in more detail, including:
- Writing a PyTorch Dataset class for SQL
- Implementing Active Learning at scale with SQL DB and Azure ML
- Optimizing SQL tables and queries to increase training and inference speed
- Ensuring AI fairness
- Gaining scientific insights after all unstructured data has been indexed
We also welcome requests in the comment section for other topics that you would like us to cover in these future blog posts.
References
- [1] Schofield, D., Nagrani, A., Zisserman, A., Hayashi, M., Matsuzawa, T., Biro, D., & Carvalho, S. (2019). Chimpanzee face recognition from videos in the wild using deep learning. Science Advances, 5(9), eaaw0736.