Video Retrieval: GPT-4 Turbo with Vision Integrates with Azure to Redefine Video Understanding
Published Nov 15 2023 08:00 AM

Microsoft is thrilled to unveil the preview of Azure AI Vision Video Retrieval. This innovative feature revolutionizes video search, enabling the exploration of thousands of hours of video content through advanced multi-modal vector indexing of vision and speech. Video Retrieval also integrates seamlessly with Azure OpenAI GPT-4 Turbo with Vision, giving customers the ability to craft solutions that can both perceive and interpret video content, opening up novel possibilities and use cases. For developers, it simplifies incorporating video input into their applications, skipping complex video processing and indexing code. This is the power of Azure OpenAI Service and Azure AI Services working together.


Video Retrieval Enables Video Prompts


More video content is uploaded every 30 days than the major U.S. television networks created in the last 30 years. Azure OpenAI Service GPT-4 Turbo with Vision represents a massive leap in image and video understanding capabilities. The Video Retrieval integration lets any developer use video as an input directly within their app or service. The need for intricate video processing tasks—such as decoding, pre-processing, and selecting frames—is eliminated, streamlining the development workflow and opening the capabilities of the latest OpenAI models to a broader audience.


With Azure OpenAI, leveraging video is as easy as clicking the “Upload Video” button in the Azure AI Studio Playground. The Azure OpenAI API will soon support the Video Retrieval enhancement as well.



Get Grounded Answers Using Video Retrieval


Video Retrieval enables GPT-4 Turbo with Vision to answer video prompts using a curated set of images from the video as grounding data. This means that when you ask specific questions about scenes, objects, or events in a video, the system provides more accurate answers without sending all the frames to the large multimodal model (LMM). This saves time and money, especially for long videos that might otherwise exceed the 128k token context window of GPT-4 Turbo with Vision. Azure AI Vision Video Retrieval provides the relevant frames to the AI model so that it can generate an accurate answer.


For example, auto insurance companies need to generate accident reports from videos. They can use a video prompt like “describe the vehicle damage”. Video Retrieval will give the AI model the frames that show vehicle damage clearly, reducing the time to generate a report.
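For illustration, here is a minimal Python sketch of what such a video-prompt request body might look like. The deployment name, index name, and exact field names (the `enhancements`/`dataSources` shape comes from the preview API) are assumptions and may change before general availability; consult the current Azure OpenAI reference before relying on them.

```python
import json

# Illustrative names only; substitute your own resource and deployment.
AOAI_ENDPOINT = "https://my-aoai-resource.openai.azure.com"  # assumption
DEPLOYMENT = "gpt-4-vision"                                  # assumption

def build_video_prompt_request(prompt: str, index_name: str) -> dict:
    """Build a chat-completions body asking the model to answer a question
    grounded in frames retrieved from a Video Retrieval index."""
    return {
        # Preview-era enhancement/data-source shape; verify against current docs.
        "enhancements": {"video": {"enabled": True}},
        "dataSources": [{
            "type": "AzureComputerVisionVideoIndex",
            "parameters": {"videoIndexName": index_name},
        }],
        "messages": [
            {"role": "system",
             "content": "You write concise vehicle-damage reports from video evidence."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": 300,
    }

body = build_video_prompt_request("describe the vehicle damage", "claims-index")
print(json.dumps(body, indent=2))
```

The key idea is that the app sends only the prompt and the name of a pre-built video index; frame selection happens service-side rather than in application code.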


This technique is inspired by the Retrieval-Augmented Generation (RAG) pattern that is popular for text documents. RAG is an approach that combines the power of large language models (LLMs) with external knowledge sources to generate more informed and contextually relevant responses. Video Retrieval extends this idea to video content, allowing the AI model to access relevant frames from a video based on the query. This way, the AI model can use visual information to generate a better answer.



Azure AI Studio Playground shows the video prompt "what's the odometer read". The response successfully retrieves and reads the odometer.


Video Retrieval: find what you need in a video

Azure AI Vision Video Retrieval enables state-of-the-art natural language video search across thousands of hours of footage within a video index. This is achieved through a vector-based search that operates on both vision and speech data modalities, embodying a truly multimodal approach. By utilizing Azure AI Vision multimodal vector embeddings—a numerical representation that captures the essence of text or imagery—Azure AI Video Retrieval can find specific content at a granular level.
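To make the idea of multimodal vector embeddings concrete, here is a hedged Python sketch that builds (but does not send) a request to embed a text query into that shared vector space. The endpoint path, api-version, and header name are assumptions based on the Image Analysis 4.0 preview; check the current documentation before use.

```python
import json
from urllib import request

VISION_ENDPOINT = "https://my-vision-resource.cognitiveservices.azure.com"  # assumption
VISION_KEY = "<your-key>"  # placeholder, never hard-code real keys

def vectorize_text_request(text: str) -> request.Request:
    """Build a POST request that asks the multimodal embedding API to turn a
    text query into a vector comparable with image-frame vectors."""
    url = (f"{VISION_ENDPOINT}/computervision/retrieval:vectorizeText"
           f"?api-version=2023-02-01-preview")  # preview version, assumption
    return request.Request(
        url,
        data=json.dumps({"text": text}).encode(),
        headers={"Ocp-Apim-Subscription-Key": VISION_KEY,
                 "Content-Type": "application/json"},
        method="POST",
    )

req = vectorize_text_request("red car with a dented bumper")
print(req.get_method(), req.full_url)
```

Because text and frames land in the same vector space, the same request pattern lets a short phrase be compared directly against indexed video content.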


Video Retrieval enhances search precision through a three-step process:


  1. Vectorization: Initially, the system vectorizes both the content and the search queries. It encodes key data from selected video frames and speech transcripts into vector embeddings, translating these elements into a numerical vector format. Similarly, when a user submits a search query, it is also converted into a vector within the same dimensional space. Because embeddings for similar concepts align in this space, the query can be matched effectively against the video.
  2. Measurement: During this stage, the system conducts a comprehensive vector similarity analysis. It measures the alignment between the vector of the user's query and the vectors representing the video data. The result is the set of matches that most closely correspond to the user’s query across both the spoken words and the visual elements of the video, according to the embedding models.
  3. Retrieval: The last step is the retrieval of relevant content. The system identifies and extracts the specific video segments that match the query, based on the similarity analysis. These identified segments are then presented to the user.
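The three steps above can be sketched with a toy example. The segment embeddings below are made up for illustration; in the actual service, frame and transcript vectors come from the Azure AI Vision multimodal embedding models.

```python
import math

def cosine(a, b):
    """Cosine similarity: how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Step 1 (vectorization), stood in for by hand-made embeddings.
segments = {
    "00:12 front bumper close-up": [0.9, 0.1, 0.0],
    "01:05 driver reads odometer": [0.1, 0.8, 0.3],
    "02:40 highway driving":       [0.0, 0.2, 0.9],
}
query_vec = [0.85, 0.15, 0.05]  # pretend embedding of "show the bumper damage"

# Step 2 (measurement): score every segment against the query.
scored = sorted(segments.items(),
                key=lambda kv: cosine(query_vec, kv[1]),
                reverse=True)

# Step 3 (retrieval): surface the best-matching segment.
print(scored[0][0])  # -> "00:12 front bumper close-up"
```

In production the comparison runs over thousands of hours of indexed footage, but the principle is the same: nearest vectors win.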


As outlined in the transparency note, users should follow the guidelines for responsible use of this AI. It is important to recognize that these retrieval matches are approximate and should be tested for your specific scenario: both false positives and false negatives can cause errors in the system.



Graph shows the three-step process for Video Retrieval


Video Retrieval is now in public preview. You can try it out today in the Video Retrieval demo on Vision Studio, and explore the APIs with the "Do video retrieval using vectorization - Image Analysis 4.0" tutorial on Microsoft Learn.
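As a rough sketch of the preview REST surface, the snippet below builds the three requests involved (create an index, ingest a video, query by text) without sending them. The resource name, api-version, and body field names are assumptions based on the preview and may change; the Microsoft Learn tutorial is the authoritative reference.

```python
# All names below are illustrative assumptions, not a definitive contract.
VISION_ENDPOINT = "https://my-vision-resource.cognitiveservices.azure.com"  # assumption
API_VERSION = "2023-05-01-preview"                                          # assumption

def create_index_request(index: str) -> tuple[str, str, dict]:
    """Create an index that extracts both vision and speech features."""
    url = (f"{VISION_ENDPOINT}/computervision/retrieval/indexes/{index}"
           f"?api-version={API_VERSION}")
    body = {"features": [{"name": "vision"}, {"name": "speech"}]}
    return "PUT", url, body

def ingest_request(index: str, ingestion: str, video_url: str) -> tuple[str, str, dict]:
    """Add one video to the index; field names are assumptions."""
    url = (f"{VISION_ENDPOINT}/computervision/retrieval/indexes/{index}"
           f"/ingestions/{ingestion}?api-version={API_VERSION}")
    body = {"videos": [{"mode": "add", "name": "claim-001",
                        "documentUrl": video_url}]}
    return "PUT", url, body

def query_request(index: str, text: str, top: int = 5) -> tuple[str, str, dict]:
    """Natural-language search over the indexed footage."""
    url = (f"{VISION_ENDPOINT}/computervision/retrieval/indexes/{index}"
           f":queryByText?api-version={API_VERSION}")
    body = {"queryText": text, "top": top}
    return "POST", url, body

method, url, body = query_request("claims-index", "vehicle damage")
print(method, url)
```

Separating request construction from sending, as here, also makes the calls easy to unit-test before wiring in authentication and an HTTP client.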



Video Retrieval and Summary demo enables you to upload your video to try out Video Retrieval


Customer Spotlight: WPP Satalia


Satalia is the AI hub for WPP, one of the world's largest communications services groups, known primarily for its work in advertising and public relations. Satalia's collaboration with Microsoft leverages GPT-4 Turbo with Vision and Azure AI Vision to creatively transform content analysis and optimization. These technologies enable the deep evaluation and optimization of video content, such as advertisements and social media posts, offering profound insights into content effectiveness and audience engagement.


The detailed summaries of video created by GPT-4 Turbo with Vision with Video Retrieval enable Satalia's AI tool to predict the impact of video content and suggest improvements, aligning with audience expectations and platform specifics. This fusion of AI and human creativity ensures that content is not only visually appealing but also resonates emotionally.


We have been experimenting with a wide range of image-to-text and video-to-text tools over the past two years to equip our AI solutions with the capability to analyse and produce more effective creative assets through decoding video in ways never thought possible.

I can safely say that GPT-4 Turbo with Vision is by far the best tool that we have worked with, as it offers perfect perception of both visual content and context.

Daniel Hulme, CEO of Satalia, a WPP Company


Satalia uses GPT-4 Turbo with Vision and Azure AI Vision to create detailed summaries of advertisements enabling content optimization.


The upcoming introduction of video prompts for GPT-4 Turbo with Vision, enabled by the Azure AI Vision Video Retrieval service, represents our ongoing commitment to deliver cutting-edge AI and empower developers to do more with AI. We are excited to see how our customers leverage this new functionality to advance their businesses and drive innovation.


This integration represents just the beginning of our journey to refine and expand these video understanding capabilities. Over the coming months, we plan to continue evolving and enhancing this technology. We look forward to enabling your enterprise to take advantage of these capabilities and are excited to see what you create and hear your feedback.


Get started with Azure OpenAI Service and Azure AI Vision today.

Last update: Nov 15 2023 10:56 AM