Apps can now narrate what they see in the world as well as people do
Published Oct 14 2020 08:00 AM

How would you leverage technology capable of generating natural language image descriptions that are, in many cases, as good as or better than what a human could produce? What if that capability were just one cloud API call away? Would you create live scene captions to help people who are blind or have low vision better understand the world around them, like Seeing AI?


With Azure Cognitive Services, you can now take advantage of state-of-the-art image captioning that has achieved human parity on captioning benchmarks thanks to advancements in the underlying AI model. Below are some examples showing how the improved model is more accurate than the old one:



[Images: press2.png, press8.png]

Improved model: A trolley on a city street

Old model: a view of a city street

Improved model: A person using a microscope

Old model: A person sitting at a table using a laptop


Now, let us take a closer look at the technology and how to easily harness its power for your users.


Behind the Scenes of the Technology

The novel object captioning at scale (nocaps) challenge evaluates how well AI models generate captions for images that contain objects not present in their training data. Microsoft’s Azure AI team pioneered the Visual Vocabulary (VIVO) pre-training technique, which led to an industry first: surpassing human performance on the nocaps benchmark.

Before we look at this innovation, it helps to understand Vision and Language Pre-training (VLP). VLP is a cross-modality (vision and language) learning technique that uses large-scale image/sentence pairs to train machine learning models capable of generating natural language captions for images. However, because visual concepts are learned from image/sentence pairs, which are costly to obtain, it is difficult to train a broadly useful model with wide visual concept coverage.

This is where VIVO pre-training comes in. It improves and extends VLP so that rich visual concepts can be learned from easier-to-obtain image/word pairs (instead of image/sentence pairs) to build a large-scale visual vocabulary. While natural language sentence generation is still trained on data that covers a limited set of visual concepts, the resulting captions are enriched with new objects from the large-scale visual vocabulary.



Figure 1: VIVO pre-training uses paired image-tag data to learn a rich visual vocabulary where image region features and tags of the same object are aligned. Fine-tuning is conducted on paired image-sentence data that only cover a limited number of objects (in blue). During inference, our model can generalize to describe novel objects (in yellow) that are learnt during VIVO pre-training.


Please see this MSR blog post to learn more about VIVO pre-training.


Try the Service in Your App

Imagine you would like to generate alternative text descriptions for images your users upload to your app. The Azure Computer Vision service, with its much-improved “describe image” (image captioning) capability, can help. Let us take it for a spin.


We will use the Python client library to invoke the service in this blog post. Try these links if you prefer a different language or want to invoke the REST API directly.
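For readers who would rather call the REST API directly, here is a minimal sketch of what such a request might look like, using only the Python standard library. It assumes the v3.1 route `/vision/v3.1/describe`, the `maxCandidates` query parameter, and the `Ocp-Apim-Subscription-Key` header; check the REST reference for the version your resource supports. The helper names `build_describe_request` and `first_caption` are hypothetical and are not part of any SDK.

```python
import json
import urllib.parse
import urllib.request


def build_describe_request(endpoint, subscription_key, image_url,
                           max_candidates=1, api_version="v3.1"):
    """Build (but do not send) a POST request for the describe operation.

    The route and API version are assumptions for illustration.
    """
    query = urllib.parse.urlencode({"maxCandidates": max_candidates})
    url = f"{endpoint.rstrip('/')}/vision/{api_version}/describe?{query}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def first_caption(payload):
    """Return the text of the first caption in a parsed response, or None."""
    captions = payload.get("description", {}).get("captions", [])
    return captions[0]["text"] if captions else None


# Sending the request needs a real key and endpoint, so it is left commented out:
# request = build_describe_request("<your endpoint here>",
#                                  "<your subscription key here>",
#                                  "<replace with remote image URL>")
# with urllib.request.urlopen(request) as response:
#     print(first_caption(json.load(response)))
```

Building the request separately from sending it keeps the sketch easy to inspect and test without network access.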



Prerequisites

  • Python
  • An Azure subscription - create one for free
  • Once you have your Azure subscription, create a Computer Vision resource:
    • Subscription: Pick the subscription you would like to use. If you just created a new Azure subscription, it should be an option in the dropdown menu.
    • Resource group: Pick an existing one or create a new one.
    • Region: Pick the region you would like your resource to be in.
    • Name: Give your resource a unique name.
    • Pricing tier: You can use the free pricing tier (F0) to try the service, and upgrade later to a paid tier for production.
    • Then click “Review + create” to review your choices, and click “Create” to deploy the resource.


  • Once your resource is deployed, click “Go to resource.”


  • Click on “Keys and Endpoint” to get your subscription key and endpoint. You will need these for the code sample below.
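Once you have the key and endpoint, one option is to keep them out of your source code by reading them from environment variables. The sketch below shows one way to do that. The variable names `COMPUTER_VISION_SUBSCRIPTION_KEY` and `COMPUTER_VISION_ENDPOINT` and the helper `load_vision_settings` are hypothetical, chosen only for illustration.

```python
import os


def load_vision_settings(env=os.environ):
    """Return (subscription_key, endpoint) read from environment variables.

    The variable names are assumptions for this sketch; use whatever
    naming your deployment already follows.
    """
    names = ("COMPUTER_VISION_SUBSCRIPTION_KEY", "COMPUTER_VISION_ENDPOINT")
    values = [env.get(name, "").strip() for name in names]
    missing = [name for name, value in zip(names, values) if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values[0], values[1]


# Example usage (requires both variables to be set):
# subscription_key, endpoint = load_vision_settings()
```

Failing fast with a clear message makes a missing or misspelled variable easier to spot than an authentication error from the service.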



Install the client

You can install the client library with:

pip install --upgrade azure-cognitiveservices-vision-computervision


Create and run the sample

  1. Copy the following code into a text editor.
  2. Optionally, replace the value of remote_image_url with the URL of a different image to caption.
  3. Optionally, set useRemoteImage to False and set local_image_path to the path of a local image to caption instead.
  4. Save the code as a file with a .py extension. For example,
  5. Open a command prompt window.
  6. At the prompt, use the python command to run the sample. For example, python



import sys

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

# Best practice is to read this key from secure storage.
# For this example, we'll embed it in the code.
subscription_key = "<your subscription key here>"
endpoint = "<your endpoint here>"

# Create the computer vision client
computervision_client = ComputerVisionClient(
    endpoint, CognitiveServicesCredentials(subscription_key))

# Set to False if you want to use local image instead
useRemoteImage = True

if useRemoteImage:
    # Get caption for a remote image, change to your own image URL as appropriate
    remote_image_url = "<replace with remote image URL>"
    description_results = computervision_client.describe_image(
        remote_image_url)
else:
    # Get caption for a local image, change to your own local image path as appropriate
    local_image_path = "<replace with local image path>"
    with open(local_image_path, "rb") as image:
        description_results = computervision_client.describe_image_in_stream(
            image)

# Get the first caption (description) from the response
if len(description_results.captions) == 0:
    image_caption = "No description detected."
else:
    image_caption = description_results.captions[0].text

print("Description of image:", image_caption)
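For the alternative-text scenario described earlier, you may want to use a caption only when the service is reasonably confident in it. The sketch below assumes each caption object exposes `text` and `confidence` attributes, as the captions in the sample response do. The helper `pick_alt_text` and the 0.5 threshold are hypothetical choices for illustration, not recommendations from the service.

```python
from types import SimpleNamespace


def pick_alt_text(captions, min_confidence=0.5,
                  fallback="No description detected."):
    """Return the highest-confidence caption text, or a fallback string.

    The threshold is an assumed value; tune it for your own content.
    """
    if not captions:
        return fallback
    best = max(captions, key=lambda caption: caption.confidence)
    if best.confidence < min_confidence:
        return fallback
    text = best.text.strip()
    # Capitalize the first letter so the alt text reads as a sentence start.
    return text[:1].upper() + text[1:]


# Stand-in objects that mimic the shape of the captions in the response.
sample_captions = [
    SimpleNamespace(text="a view of a city street", confidence=0.42),
    SimpleNamespace(text="a trolley on a city street", confidence=0.87),
]
print(pick_alt_text(sample_captions))  # prints "A trolley on a city street"
```

With the sample above, you could call `pick_alt_text(description_results.captions)` in place of taking the first caption unconditionally.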




What’s that? Microsoft AI system describes images as well as people do

Learn more about other Computer Vision capabilities.

Last update: Feb 15 2022 09:49 AM