Announcing general availability of Azure AI Vision Image Analysis 4.0 API

Microsoft

Nov 15, 2023

We are thrilled to announce the general availability of Azure AI Image Analysis 4.0. This cutting-edge solution offers a single API endpoint, empowering users to extract comprehensive insights from images. Key features of Azure AI Image Analysis 4.0 include:

Optical Character Recognition
Multimodal image embeddings
Image Captioning
Dense Captions
Tagging
People Detection (without identification of individuals)
Smart Crops
Object Detection
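
As a rough illustration of the single-endpoint idea, the sketch below assembles one request that asks for several of these features at once. It only builds the request and sends nothing. The helper name `build_analyze_request` is hypothetical, the endpoint and key are placeholders, and the route, `api-version` value, header name, and feature identifiers are assumptions recalled from the public REST documentation that should be checked before use. Multimodal embeddings are left out because they appear to be served by separate vectorize operations rather than by the `features` parameter.

```python
from urllib.parse import urlencode

# Assumed identifiers for the `features` query parameter (verify against the docs).
DEFAULT_FEATURES = ("read", "caption", "denseCaptions", "tags",
                    "people", "smartCrops", "objects")


def build_analyze_request(endpoint, key, image_url,
                          features=DEFAULT_FEATURES,
                          api_version="2023-10-01"):
    """Return (url, headers, body) for one analyze call; nothing is sent."""
    # Keep commas unescaped so the feature list stays readable in the URL.
    query = urlencode(
        {"api-version": api_version, "features": ",".join(features)},
        safe=",",
    )
    url = f"{endpoint.rstrip('/')}/computervision/imageanalysis:analyze?{query}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # assumed header name for key auth
        "Content-Type": "application/json",
    }
    body = {"url": image_url}  # the image is passed by URL in this sketch
    return url, headers, body


if __name__ == "__main__":
    # Placeholder resource name, key, and image URL.
    url, headers, body = build_analyze_request(
        "https://<your-resource>.cognitiveservices.azure.com",
        "<your-key>",
        "https://example.com/photo.jpg",
    )
    print(url)
```

The returned tuple can be handed to any HTTP client as a POST with a JSON body; keeping request construction separate from sending makes the sketch easy to inspect without a live resource.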

What sets Azure AI Vision Image Analysis 4.0 apart is its use of Microsoft’s Florence foundation model. Rooted in a unified paradigm, Florence integrates spatial, temporal, and multimodal dimensions within computer vision using a single pre-trained model and network architecture. By leveraging large-scale image and video datasets, our approach establishes a versatile foundation for a multitude of computer vision tasks, enhancing AI performance and cost-effectiveness, and it also enables zero- and few-shot customization for downstream applications.

Reddit will be using Vision Services to generate captions for hundreds of millions of images on its platform. Tiffany Ong, Reddit Product Manager of Consumer Product, said:

“With Microsoft’s Vision technology, we are making it easier for users to discover and understand our content. The newly created image captions make Reddit more accessible for everyone and give redditors more opportunities to explore our images, engage in conversations, and ultimately build connections and a sense of community.”

Empowering Billions of Customers to Achieve More

Microsoft’s Florence foundation model, a cost-efficient large multimodal generative model, made its debut in November 2021. Since its introduction, it has been adopted by some of the world's most prominent and widely used applications, serving billions of customers.

In 2022, LinkedIn took a significant step toward inclusivity by announcing auto captioning. This feature empowers LinkedIn members to easily edit and support their content, ensuring that every LinkedIn member has equal access to opportunities for communication and networking. Other Microsoft product lines, including Microsoft 365 Word, PowerPoint, Outlook, and Excel, have adopted the same capability.

In September 2023, Microsoft introduced the latest Windows 11 Paint App and Windows 11 Photo App, which incorporate advanced “Background Removal” and “Background Blur” features. These features harness the state-of-the-art capabilities of the Florence foundation model to provide users with a streamlined and efficient image editing experience.

Moreover, in October 2023, the Microsoft OneDrive team unveiled the next generation of OneDrive. As part of the event, they announced the limited preview of natural language search within the consumer photos experience, powered by the Image Analysis 4.0 API's multimodal embeddings combined with other Azure AI technology such as Face API.

"We’re introducing natural language search in your photos experience. Just type what you are looking for and OneDrive will find it. This search goes far beyond object recognition, you can ask it to find specific places, settings, objects, and people all in one search. For example, if you are looking for a particular photo you can search for “camping in the fall, in the mountains with Caroline” and you’d see just the images you are looking for." - Jason Moore, VP of OneDrive Product

The contributions of the Florence foundation model and Microsoft's ongoing commitment to innovation continue to drive progress and enhance the user experience across a wide spectrum of applications and services.

Fostering Trust with Robust Visual Insights

Azure AI Image Analysis 4.0 is at the forefront of image analysis technology and has raised the bar with its industry-leading object-grounded captions (dense captions). This feature not only generates descriptive captions for images but also supports validation of those captions by outlining, with bounding boxes, the objects that each caption describes.

A group of people posing on a rocky beach
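
To make the idea of grounded captions concrete, here is a minimal sketch that pairs each caption with its bounding box. The response layout (`denseCaptionsResult.values[]` with `text`, `confidence`, and a `boundingBox` of `x`, `y`, `w`, `h`) is an assumption about the service's JSON that should be checked against the API reference. The helper `grounded_captions` is hypothetical, and the sample payload is hand-written for illustration, not real service output.

```python
def grounded_captions(result, min_confidence=0.5):
    """Return (caption, (x, y, w, h)) pairs at or above a confidence threshold."""
    # Assumed response layout; missing keys yield an empty list.
    values = result.get("denseCaptionsResult", {}).get("values", [])
    pairs = []
    for item in values:
        if item.get("confidence", 0.0) < min_confidence:
            continue  # skip low-confidence regions
        box = item["boundingBox"]
        pairs.append((item["text"], (box["x"], box["y"], box["w"], box["h"])))
    return pairs


# Hand-written illustrative payload (numbers are made up).
sample = {
    "denseCaptionsResult": {
        "values": [
            {"text": "a group of people posing on a rocky beach",
             "confidence": 0.91,
             "boundingBox": {"x": 0, "y": 0, "w": 1200, "h": 800}},
            {"text": "a person wearing a hat",
             "confidence": 0.72,
             "boundingBox": {"x": 310, "y": 220, "w": 140, "h": 360}},
            {"text": "a blurry rock",
             "confidence": 0.31,
             "boundingBox": {"x": 900, "y": 600, "w": 120, "h": 90}},
        ]
    }
}

if __name__ == "__main__":
    for caption, box in grounded_captions(sample):
        print(caption, box)
```

Each returned box can be drawn over the image so a reader can check that the caption matches the region it claims to describe.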

In conjunction with this general availability announcement, we are thrilled to introduce the public preview of object grounding for GPT-4 Turbo with Vision, a close collaboration between OpenAI and the Azure AI team. This combination of advanced AI technologies produces natural language responses to image prompts, with a unique focus on grounded objects. It represents a significant leap forward in the capabilities of large multimodal generative models.

"The big difference is that now these large models have been trained to recognize so many different things and the language capabilities we can get very rich, vivid descriptions of images and the world around us. What this will enable us to share together. I imagine this future where AI and humans and work together where the AI understands us as individuals and fills in the gaps for each and every one of us." - Saqib Shaikh, creator of Seeing AI

Accelerating democratization of AI with Azure efficiency

Running on Azure’s purpose-built, AI-optimized infrastructure, the Image Analysis 4.0 API achieves ultra-low latency and is suitable for very large-scale image vectorization, video indexing, and retrieval. It also acts as the foundation for the newly announced public preview of Azure AI Vision Video Retrieval Service and the Azure OpenAI Service video prompt for GPT-4 Turbo with Vision. The API’s multimodal embedding efficiently compresses visual semantic insights into compact vectors, which helps your application run large multimodal generative models on your media content.
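
One common way to use such compact vectors for retrieval is to rank image vectors by cosine similarity to a text query vector that lives in the same embedding space. The sketch below shows that general technique only; it is not a description of how OneDrive or the Video Retrieval Service is implemented. The function names are hypothetical, and the three-dimensional vectors are toy values standing in for the much longer vectors a real embedding model would return.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(query_vec, image_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in image_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors; real embeddings would come from the service.
image_vecs = {
    "camping.jpg": [0.9, 0.1, 0.0],
    "office.jpg": [0.0, 0.2, 0.9],
    "beach.jpg": [0.5, 0.8, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for an embedded text query

if __name__ == "__main__":
    for name, score in rank_images(query_vec, image_vecs):
        print(f"{name}: {score:.3f}")
```

For large collections, the linear scan here would typically be replaced by a vector index, but the scoring idea stays the same.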