(Nearly) Everything you need to know about computer vision in one repo
Published Dec 17 2019 11:12 AM

This post was co-authored by @JS Tan, @Patrick Buehler, @Anupam Sharma and @Jun Ki Min


In recent years, we've seen extraordinary growth in Computer Vision, with applications in image understanding, search, mapping, semi-autonomous and autonomous vehicles, and many more.

The ability of models to understand actions in a video, a task that was unthinkable just a few years ago, is now something that we can achieve with relatively high accuracy and in near real-time.



 Action Recognition



However, the field is not particularly welcoming for newcomers. Without prior experience or guidance, building an accurate classifier can easily take weeks. Unless you're ready to spend a long time learning computer vision, it's extremely hard to master the basics, let alone begin to explore some of the cutting-edge technologies in the field. Even for computer vision experts, building a quick Proof of Concept (POC) can be nontrivial and can easily take many days to put together.


At Microsoft, we have been working for many years on diverse Computer Vision solutions for our customers, and we have collected what we learned in our new public Microsoft repository: https://github.com/microsoft/ComputerVision-recipes


The goal of this repository is to provide examples and best practice guidelines for building computer vision systems on Azure, and to share them with the open-source community. More specifically, we wanted a repository that helps us deliver solutions rapidly to the community and to the customers we work with, and that helps us onboard new team members who may have expertise in data science but not specifically in computer vision. From mastering some of the most common scenarios in the field, like image classification, object detection, and image similarity, to exploring cutting-edge scenarios like activity recognition and crowd counting, this repo will guide you through building models, fine-tuning them, and using them in real-world scenarios.


We're kicking off our repo with 5 scenarios: 






Image Classification is a way to learn and predict the category of a given image. (Ex: Is the picture of a ‘dog’ or a ‘cat’?) 



Image Similarity is a way to compute a similarity score given a pair of images. Given an image, it allows you to identify the most similar images in a dataset. (Ex: Which of the following images of animals is most similar to this picture of a dog?)
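
One common way to obtain such a score is to compare feature vectors (embeddings) extracted from each image, for example with cosine similarity. The snippet below is a minimal sketch of that idea in plain Python, not the repository's implementation; the tiny embedding values, the file names, and the function names `cosine_similarity` and `most_similar` are all made up for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query: Sequence[float],
                 candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Rank candidate image names from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query, candidates[name]),
        reverse=True,
    )


# Hypothetical 3-dimensional embeddings; real models produce far longer vectors.
query_dog = [0.9, 0.1, 0.2]
dataset = {
    "dog_2.jpg": [0.8, 0.2, 0.1],
    "cat_1.jpg": [0.1, 0.9, 0.3],
    "horse_1.jpg": [0.4, 0.3, 0.8],
}
ranking = most_similar(query_dog, dataset)
print(ranking)
```

In practice the embeddings would come from a trained network rather than being written by hand; only the ranking step is shown here.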



Object Detection is a supervised machine learning technique that allows you to detect where in a given image an object of interest is located. (Ex: Where in the image are there animals?)
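
Detectors typically report each object's location as a bounding box, and a standard way to check how well a predicted box matches an annotated one is intersection over union (IoU). The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples; that layout and the function name `iou` are assumptions for illustration, not the repository's API.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes (0.0 if disjoint)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (10.0, 10.0, 50.0, 50.0)
annotated = (30.0, 30.0, 70.0, 70.0)
print(f"IoU = {iou(predicted, annotated):.3f}")
```

A prediction is usually counted as correct when its IoU with an annotated box exceeds a chosen threshold; 0.5 is a common choice.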

Action Recognition 


Action Recognition is used to identify in video footage what actions are performed and at what respective start/end times. (Ex: When is there someone drinking in the video?) 
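
To illustrate what "start/end times" means in practice, the sketch below turns a sequence of per-frame labels into timed segments. It assumes a model that emits one label (or `None` for "no action") per frame, which is a simplification; the repository's models may operate on clips instead, and `frames_to_segments` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def frames_to_segments(labels: List[Optional[str]],
                       fps: float) -> List[Tuple[str, float, float]]:
    """Group consecutive identical frame labels into (action, start_s, end_s)."""
    segments: List[Tuple[str, float, float]] = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] is not None:
                segments.append((labels[start], start / fps, i / fps))
            start = i
    return segments


# Hypothetical predictions for 8 frames of a video sampled at 2 frames/second.
frame_labels = [None, None, "drinking", "drinking", "drinking",
                None, "walking", "walking"]
segments = frames_to_segments(frame_labels, fps=2.0)
for action, start_s, end_s in segments:
    print(f"{action}: {start_s:.1f}s to {end_s:.1f}s")
```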

Crowd Counting 


Crowd Counting is a use case that leverages supervised machine learning techniques to count the number of people in an image – this applies to both low-crowd-density (e.g. fewer than 50 people) and high-crowd-density (e.g. thousands of people) scenes. (Ex: How many pedestrians are in this image of a street?)


Rather than creating implementations from scratch, we draw from popular state-of-the-art libraries (e.g. fast.ai and torchvision), and we build additional utilities around loading image data, optimizing models, and evaluating models. In addition, we aim to answer frequently asked questions, explain the deep learning intuitions, and highlight common pitfalls.


Whether you are an expert in computer vision or just getting your feet wet, we believe this repository offers something for you. For beginners, this repo will guide you through building a state-of-the-art model and help you develop an intuition for the craft. For experts, this repository can quickly get you to a strong baseline model that is easy to extend using custom Python/PyTorch code. The repository also aims to provide support with 1) the full data science process, and 2) the tooling to succeed on Azure.


We hope that these examples and utilities will make it easier and faster for developers to create custom vision applications. 


The Data Science Process 

The Computer Vision Recipes GitHub repository shows you how to approach the five key steps of the data science process and provides utilities to enrich each of the steps: 


  1. Data preparation - Prepare and load your data. 
  2. Modeling - Build models using deep learning algorithms. 
  3. Evaluating - Evaluate your model. Depending on the metric you’re interested in optimizing, you may want to explore different methods of evaluation. 
  4. Model selection and optimization - Tune and optimize hyperparameters to get the highest-performing model. Because Computer Vision models are often computationally costly, we show you how to seamlessly scale your parameter tuning into Azure. 
  5. Operationalizing - Operationalize models in a production environment on Azure by deploying them onto Kubernetes. 
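
As a rough illustration of step 4, hyperparameter tuning in its simplest form is an exhaustive search over a grid of candidate values. The sketch below is generic Python and does not use the repository's or Azure's tuning tools; `grid_search` and `toy_validation_score` are hypothetical names, and the toy scoring function merely stands in for "train a model and return its validation accuracy".

```python
import itertools
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def grid_search(score_fn: Callable[..., float],
                grid: Dict[str, Iterable[Any]]
                ) -> Tuple[Optional[Dict[str, Any]], float]:
    """Try every combination in the grid and return the best-scoring one."""
    names = list(grid)
    best_params: Optional[Dict[str, Any]] = None
    best_score = float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(learning_rate: float, batch_size: int) -> float:
    """Stand-in for a real train-and-validate run; peaks at lr=0.001, bs=16."""
    return 1.0 - abs(learning_rate - 0.001) * 100 - abs(batch_size - 16) / 100


grid = {"learning_rate": [0.0001, 0.001, 0.01], "batch_size": [8, 16, 32]}
best_params, best_score = grid_search(toy_validation_score, grid)
print(best_params, best_score)
```

Each combination requires a full training run, so the cost grows multiplicatively with the number of values per parameter. That is why running the search on cloud compute becomes attractive.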


Inside the computer vision recipes repo, we have added many utilities to support common tasks such as loading datasets in the format expected by different algorithms, splitting training/test data, and evaluating model outputs.
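
To give a sense of what a utility of this kind does, here is a minimal sketch of a reproducible train/test split over a list of image paths. It is a generic illustration rather than the repository's own helper; the function name `split_train_test`, its parameters, and the file names are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(items: Sequence[T],
                     train_ratio: float = 0.75,
                     seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle items with a fixed seed and split them into train and test lists."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible and leaves
    # the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_ratio))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names standing in for a real image folder.
image_paths = [f"img_{i:03d}.jpg" for i in range(20)]
train_paths, test_paths = split_train_test(image_paths, train_ratio=0.75, seed=42)
print(len(train_paths), len(test_paths))
```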


Azure Machine Learning  

This computer vision repository also has deep integration with the Azure Machine Learning service to complement your work locally. We provide code examples on how you can optionally and easily scale your training into the cloud, and how you can deploy your models for production workloads.  


Azure Cognitive Services 

Note that for certain computer vision problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist which do not require any custom coding or machine learning expertise.  


  • Vision Services are a set of pre-trained REST APIs which can be called for image tagging, OCR, video analytics, and more. These APIs work out of the box and require minimal expertise in machine learning but have limited customization capabilities. See the various demos available to get a feel for the functionality (e.g. Computer Vision). 
  • Custom Vision is a SaaS service to train and deploy a model as a REST API given a user-provided training set. All steps including image upload, annotation, and model deployment can be performed using either the UI or a Python SDK. Training image classification or object detection models can be achieved with minimal machine learning expertise. Custom Vision offers more flexibility than the pre-trained Cognitive Services APIs but requires users to bring and annotate their own data. 


Before using the Computer Vision repository, we strongly recommend evaluating whether these services can sufficiently solve your problem.


Scenario Example: Object Detection 

To give you a sense of how you can use our repo to build a state-of-the-art (SOTA) model, here is a preview of how simple it is to create an Object Detection model. Of course, you can go much deeper and add custom PyTorch code, but getting started is as simple as this:


1. Load your data 

The first step is to load your data – we help you do this with a simple object that automatically parses your data and the annotations: 




# The loader parses the images and their annotations found under the given path
from utils_cv.detection.data import DetectionLoader 
data = DetectionLoader("path/to/data") 





2. Train/fine-tune your model 

Then we create a 'learner' object that helps you manage and train your model. By default, it will use torchvision's Faster R-CNN model. But you can easily switch it out. 




# The learner wraps the model (torchvision's Faster R-CNN by default) together with the data
from utils_cv.detection.model import DetectionLearner 
detector = DetectionLearner(data) 





3. Evaluate 

Finally, let's evaluate our model using the built-in helper functions. We can look at the precision and recall curves to get a sense of how our model is performing.




from utils_cv.detection.plot import plot_pr_curves 
# Named eval_result to avoid shadowing Python's built-in eval()
eval_result = detector.evaluate() 
plot_pr_curves(eval_result) 
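
For readers new to these curves, the sketch below shows how the points on a precision-recall curve can be computed from scored detections. It is a simplified, standalone illustration and not what `detector.evaluate()` does internally; it assumes each detection has already been marked as a true or false positive (for example by an overlap threshold against the annotations), and `precision_recall_points` is a hypothetical helper.

```python
from typing import List, Tuple


def precision_recall_points(detections: List[Tuple[float, bool]],
                            num_ground_truth: int
                            ) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each detection, highest confidence first.

    detections: (confidence, is_true_positive) pairs.
    num_ground_truth: total number of annotated objects.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    points: List[Tuple[float, float]] = []
    true_positives = 0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
        precision = true_positives / rank             # correct among reported
        recall = true_positives / num_ground_truth    # found among annotated
        points.append((precision, recall))
    return points


# Hypothetical detections for an image set containing 4 annotated objects.
detections = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.30, False)]
points = precision_recall_points(detections, num_ground_truth=4)
for precision, recall in points:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Lowering the confidence threshold moves along the curve: recall can only rise or stay flat, while precision typically falls as more false positives are admitted.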





As we continue to build out our repository, we will be looking for new computer vision scenarios to unlock. Feel free to reach out to cvbp@microsoft.com or post an issue if you wish to see us cover a scenario.

Last update: Mar 03 2020 09:53 AM