How would you leverage technology capable of generating natural language image descriptions that are, in many cases, just as good or better than what a human could produce? What if that capability is just one cloud API call away? Would you create live scene captions for people who are blind or low vision to better understand the world around them, like Seeing AI?
With Azure Cognitive Services, you can now take advantage of state-of-the-art image captioning that has achieved human parity on captioning benchmarks thanks to advancements in the underlying AI model. Below are some examples showing how the improved model is more accurate than the old one:
Improved model: A trolley on a city street
Old model: a view of a city street
Improved model: A person using a microscope
Old model: A person sitting at a table using a laptop
Now, let us take a closer look at the technology and how to easily harness its power for your users.
Behind the Scenes of the Technology
The novel object captioning at scale (nocaps) challenge evaluates AI models on their ability to generate captions for images containing objects that are not present in their training data. Microsoft’s Azure AI team pioneered the Visual Vocabulary (VIVO) pre-training technique, which led to an industry first: surpassing human performance on the nocaps benchmark. Before we learn more about this innovation, we should first understand Vision and Language Pre-training (VLP). It is a cross-modality (across vision and language) learning technique that uses large-scale image/sentence pairs to train machine learning models capable of generating natural language captions for images. However, because visual concepts are learned from image/sentence pairs, which are costly to obtain, it is difficult to train a broadly useful model with wide visual concept coverage. This is where VIVO pre-training comes in. It improves and extends VLP by learning rich visual concepts from easier-to-obtain image/word pairs (instead of sentences), building a large-scale visual vocabulary. While natural language sentence generation is still trained with limited visual concepts, the resulting image caption is cleverly enriched with new objects from the large-scale visual vocabulary.
Figure 1: VIVO pre-training uses paired image-tag data to learn a rich visual vocabulary where image region features and tags of the same object are aligned. Fine-tuning is conducted on paired image-sentence data that only cover a limited number of objects (in blue). During inference, our model can generalize to describe novel objects (in yellow) that are learnt during VIVO pre-training.
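To make the alignment idea concrete, here is a purely illustrative toy sketch, not the actual VIVO implementation: it fakes a "visual vocabulary" in which each tag embedding sits close to the region feature of the object it names, so a simple similarity lookup recovers the right tag for a region. All vectors, names, and the alignment itself are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are pooled region features for two detected objects.
region_features = {
    "accordion_region": rng.normal(size=4),
    "umbrella_region": rng.normal(size=4),
}

# A toy "visual vocabulary": tag embeddings that are aligned with the
# matching region features (here we fake alignment with small noise).
tag_embeddings = {
    "accordion": region_features["accordion_region"] + 0.05 * rng.normal(size=4),
    "umbrella": region_features["umbrella_region"] + 0.05 * rng.normal(size=4),
}

def best_tag(region_vec):
    # Cosine similarity picks the tag whose embedding best aligns
    # with the given image region feature.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(tag_embeddings, key=lambda t: cos(region_vec, tag_embeddings[t]))

print(best_tag(region_features["accordion_region"]))  # prints "accordion"
```

In the real system the alignment is learned from large-scale image/tag data during pre-training, and caption generation is fine-tuned separately on image/sentence pairs; the lookup above only illustrates why a region for a novel object can still be named correctly at inference time.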
Please see this MSR blog post to learn more about VIVO pre-training.
Try the Service in Your App
Imagine you would like to generate alternative text descriptions for images your users upload to your app. The Azure Computer Vision service, with its much-improved “describe image” (image captioning) capability, can help. Let us take it for a spin.
We will be using the Python client library to invoke the service in this blog post. Try these links if you prefer a different language or want to invoke the REST API directly.
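If you have not installed the client library yet, you can get it from PyPI (this is the published package name for the Azure Computer Vision client library for Python):

```shell
pip install azure-cognitiveservices-vision-computervision
```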
Optionally, replace the value of remote_image_url with the URL of a different image for which to generate a caption.
Also optionally, set useRemoteImage to False and set local_image_path to the path of a local image for which to generate a caption.
Save the code as a file with a .py extension. For example, describe-image.py.
Open a command prompt window.
At the prompt, use the python command to run the sample. For example, python describe-image.py.
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

# Best practice is to read this key from secure storage;
# for this example we'll embed it in the code.
subscription_key = "<your subscription key here>"
endpoint = "<your endpoint here>"

# Create the Computer Vision client
computervision_client = ComputerVisionClient(
    endpoint, CognitiveServicesCredentials(subscription_key))

# Set to False if you want to use a local image instead
useRemoteImage = True

if useRemoteImage:
    # Get caption for a remote image, change to your own image URL as appropriate
    remote_image_url = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-sample-data-files/master/ComputerVision/Images/house.jpg"
    description_results = computervision_client.describe_image(remote_image_url)
else:
    # Get caption for a local image, change to your own local image path as appropriate
    local_image_path = "<replace with local image path>"
    with open(local_image_path, "rb") as image:
        description_results = computervision_client.describe_image_in_stream(image)

# Get the first caption (description) from the response
if len(description_results.captions) == 0:
    image_caption = "No description detected."
else:
    image_caption = description_results.captions[0].text

print("Description of image:", image_caption)