Your Video Insights, Promptly Extracted: Azure AI Video Indexer's Preview of Prompt-Ready API
Published Jan 15 2024 01:22 AM
Microsoft

Have you ever watched an online course and wished you could ask questions about the entire course, or get a comprehensive summary of a video? Both are now possible with Azure AI Video Indexer and large language models (LLMs), each powering the other.

LLMs are powerful language models that can capture the essence of text, answer natural language questions, and much more. In Azure AI Video Indexer, we understand videos. Video content is more than just words, and a single shot can contain a wealth of insights that are critical to understanding it. Coupling these two tools can lead to great results in video understanding and in downstream natural language tasks.

Our new API extracts all the multi-modality insights of a video and processes them into a prompt-ready format that can be easily used with LLMs.

This API can be used on already-indexed videos in Azure AI Video Indexer, so there is no need to index videos again to create the prompt-ready format of the videos.

 

Prompt Content for Video Understanding

Azure AI Video Indexer has a new algorithm that translates its multi-modality understanding of a video into an LLM prompt-ready format. The output captures the important details and insights in the video and can be used as-is to create LLM prompts for various tasks, including video summarization and search.

 


Figure 1 – Our algorithm flow. We start by capturing the important details and insights in the video using Azure AI Video Indexer (A, B), then split the video and its insights into coherent chapters based on visual context (C), and further split the sections based on other insights, given the LLM prompt’s limitations (D).

 

Our new algorithm is based on advanced AI models developed in Azure AI Video Indexer. It effectively integrates all three modalities – visual, audio and text – based on the main insights from Azure AI Video Indexer, processes them and transforms them into an LLM’s prompt-ready format. The method consists of the following steps:

  • Extracting multi-modal insights: As shown in Figure 1, steps A and B create the insights of the video and allow for a full video understanding. However, passing all of the video’s insights and its transcript to an LLM as a single prompt is problematic for two reasons. First, the result can exceed the prompt size limit. Second, it is simply too much information: to get good results, we need to give the LLM the main insights, clearly separated. Therefore, we extract the essence of each insight. For example, we eliminate small OCR text, filter visual labels, and more.
  • Insights’ “tags”: To give the LLM more context that eases its understanding of the video and helps it combine all the insights, we create “tags” that tell the LLM what role each insight plays within the content. These tags include labels such as [OCR], [Transcript], [Audio effects] and more. An example of a section’s content with several “tags” is shown in Figure 2.
  • Chaptering into sections: We split the video and its insights into coherent chapters that fit both the essence of the video and the prompt size. We start with scene segmentation, which is based on visual content (Figure 1C). We then use the other modalities, such as audio and text, to divide the scenes further into smaller sections that work within the limitations of LLMs. Each section fits within a prompt and contains the content of the video at that time, including the transcript, audio events (such as clapping or a dog barking), and visual content (objects in the video, celebrities, and more). Each part of the video is consolidated, and the matching insights are used to create each section (Figure 1D). We set the length of the sections so that they are not too long to be used as prompts and not too short to carry effective, meaningful content.
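The tagging and section-splitting steps above can be illustrated with a minimal sketch. It is not the Video Indexer algorithm, whose details are not given here: the `Insight` type, the `split_scene` helper, the greedy packing strategy and the word-count token estimate are all assumptions made for illustration, and only the upper size limit is handled (the lower limit on section length is left out).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Insight:
    """Hypothetical container for one insight of a scene."""
    tag: str      # role of the insight, e.g. "Transcript", "OCR", "Audio effects"
    start: float  # start time in seconds
    text: str     # textual form of the insight


def render(insight: Insight) -> str:
    # Prefix the text with its tag so an LLM can tell the modalities apart.
    return f"[{insight.tag}] {insight.text}"


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: count whitespace-separated words.
    return len(text.split())


def split_scene(insights: List[Insight], max_tokens: int) -> List[str]:
    """Greedily pack time-ordered, tagged insights into prompt-sized sections.

    A single insight larger than max_tokens still gets its own section;
    a real implementation would need to trim or split it.
    """
    sections: List[List[str]] = []
    current: List[str] = []
    size = 0
    for insight in sorted(insights, key=lambda i: i.start):
        line = render(insight)
        cost = estimate_tokens(line)
        # Close the current section when the next line would overflow it.
        if current and size + cost > max_tokens:
            sections.append(current)
            current, size = [], 0
        current.append(line)
        size += cost
    if current:
        sections.append(current)
    return ["\n".join(lines) for lines in sections]


if __name__ == "__main__":
    scene = [
        Insight("Transcript", 0.0, "welcome to the storage demo"),
        Insight("OCR", 1.0, "kubectl apply"),
        Insight("Audio effects", 5.0, "clapping"),
    ]
    for section in split_scene(scene, max_tokens=8):
        print(section)
        print("---")
```

Keeping the tag inline with each line means a section can be pasted into a prompt without any extra schema, which matches the "used as-is" goal described above.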

The final output is shown in Figure 3, and we call it the Prompt Content JSON file.

Overall, our method combines the video’s modalities to provide an effective approach to analyzing video content with LLMs. We show examples of two use cases: video-to-text summarization and searching within the video content. To demonstrate what our prompt content can do when combined with an LLM, we use videos from the AKS Edge Essentials tutorial series.

 


Figure 2 – An example of a section’s content, created with our new algorithm and shown in the output JSON file from our new API (Figure 3).

 


Figure 3 – An example of the output JSON file from our new API.

 


Figure 4 – The flow from video to downstream tasks, starting with a video, extracting its insights with Azure AI Video Indexer and creating prompt content with our new algorithm. The prompt content can be used with any LLM, such as GPT, for tasks such as summarization and question-answering.

 

Video to Text Summarization

Video is more than just words. Today’s methods for video summarization rely mostly on the audio part of videos (the transcript). Our approach enriches this information with visual and audio data. Combining these three modalities gives a better understanding of the video’s content, and with it the potential for a summary that better captures the essence of the video. We use a simple iterative algorithm that lets us summarize long videos with rich content. First, we use our Prompt Content for chaptering and for creating prompt-ready sections of the video, where each section includes its matching textual, audio and visual insights (as explained above). Then, at each iteration, we summarize the current section together with the cumulative summary of the previous sections, which is given as additional information in the prompt. We can also control the summary “flavor”, for example an informative summary or a teaser, by changing only the prompt used for the final section’s summary.
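The iterative scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the prompts or code used for the examples in this post: `summarize_video`, the `llm` callable (any function that maps a prompt string to a completion string) and the prompt wording are all hypothetical.

```python
from typing import Callable, Sequence


def summarize_video(
    sections: Sequence[str],
    llm: Callable[[str], str],
    final_style: str = "informative",
) -> str:
    """Fold prompt-ready sections into one summary, one section at a time."""
    summary = ""
    for index, section in enumerate(sections):
        is_last = index == len(sections) - 1
        instruction = (
            "Summarize the video so far by combining the previous summary "
            "with the current section."
        )
        # Only the final step carries the requested flavor (informative, teaser, ...).
        if is_last:
            instruction += f" Write the final result as a {final_style} summary."
        prompt = (
            f"{instruction}\n\n"
            f"Previous summary:\n{summary or '(none)'}\n\n"
            f"Current section:\n{section}"
        )
        # The cumulative summary replaces itself at every iteration.
        summary = llm(prompt)
    return summary


if __name__ == "__main__":
    # Stand-in LLM that just reports how much text it was given.
    def echo_llm(prompt: str) -> str:
        return f"summary of {len(prompt)} characters"

    print(summarize_video(["[Transcript] intro", "[OCR] kubectl apply"], echo_llm, "teaser"))
```

Because each call sees only one section plus the running summary, the prompt size stays bounded regardless of the video's length.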

Summarization demo

 


Figure 5 – The video summarization flow. We start by using our new API to create the prompt content, which is divided into sections. Then, we use a simple iterative method for summarization using an LLM.

 

For the summarization example, we will use a video that describes the storage architecture of AKS Edge Essentials. It also demonstrates how to create storage classes with local path and NFS, how to create a PVC, and how to test them on Linux.

 

Informative summary example – made with our prompt content and ChatGPT

In this video from AKS Edge Essentials, Lior and a guest demonstrate how to use local path and NFS storage classes with AKS Edge Essentials. The guest demonstrates how to create a Persistent Volume Claim (PVC) using the kubectl apply command and tests it with the kubectl exec command. He also shows how to use the Invoke-AksEdgeNodeCommand feature to interact with the node’s operating system and create a local-path-pvc. They explain the use case for using the NFS storage provisioner for multi-node clusters and how to mount the PVC using the deployment YAML file to store data. In conclusion, they explain how to use the volume-test command to view the PVC and demonstrate how to use it for application development.

 

Video is more than just words, and in this video in particular, an important part is the live demo showing useful commands. The highlighted sentences came from the visual insights only.

 

Teaser summary example – made with our prompt content and ChatGPT

Check out the latest AKS Edge Essentials video with Lior to learn how to deploy an NFS storage class and use it to store data. Follow along as they demonstrate how to set up persistent storage for applications using persistent volumes, persistent volume claims, and storage classes. Don’t miss this informative and educational video!

 

Searching within a Video Archive with Natural Language

Video archives can be huge and hard to search. We aim to simplify video archive search and enable question answering by combining our Prompt Content file with a document archive search technique. This technique typically involves an embedding model, a vector store, a search engine, and a generative NLP model. For a video archive, instead of searching “documents”, we search through sections of the videos, created using our Prompt Content algorithm. Here is how we do it:

  • Section creation: We create sections from the videos using the Prompt Content algorithm, based on the videos’ multi-modality insights. These sections serve as the searched “documents”.
  • Embedding model: For each section, we create an embedding using OpenAI’s text-embedding-ada-002 model.
  • Storage: Azure AI Search stores the sections’ content and their corresponding embeddings (vectors).
  • Search method: Given a query, we use the same ada-002 model to create its embedding, and then retrieve the N most relevant sections by cosine similarity between the vectors.
  • Answer generation: Finally, we pass the top N sections to OpenAI’s GPT-3.5 Turbo model to generate a natural language answer.

This approach simplifies searching within a video archive and finding answers by leveraging well-established techniques from document search and LLMs.
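The search and answer-generation steps can be sketched in a few lines. In the flow described here, Azure AI Search stores the vectors and runs the similarity search; the in-memory version below is only a stand-in to show the logic. The function names, the toy two-dimensional vectors and the prompt wording are assumptions for illustration, and real embeddings would come from the embedding model.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_sections(
    query_vector: Vector,
    indexed_sections: Sequence[Tuple[str, Vector]],
    n: int,
) -> List[str]:
    """Return the text of the n sections most similar to the query."""
    ranked = sorted(
        indexed_sections,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:n]]


def build_answer_prompt(question: str, sections: Sequence[str]) -> str:
    """Number the retrieved sections so the LLM can cite them in its answer."""
    sources = "\n\n".join(
        f"[{number}] {text}" for number, text in enumerate(sections, start=1)
    )
    return (
        "Answer the question using only the numbered sources below, "
        "and cite the sources you used, for example [1].\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # Toy vectors standing in for real section embeddings.
    index = [
        ("[Transcript] storage classes", [1.0, 0.0]),
        ("[Transcript] network architecture", [0.0, 1.0]),
        ("[OCR] kubectl apply pvc", [0.9, 0.1]),
    ]
    best = top_n_sections([1.0, 0.0], index, n=2)
    print(build_answer_prompt("How do I create a PVC?", best))
```

Numbering the sources in the prompt is one simple way to get answers with citations like the ones in the examples that follow.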

Question-answering demo

 


Figure 6 – The search flow, using Azure AI Video Indexer to create the prompt content, Azure AI Search for the search process and LLMs for embeddings and generating a textual answer based on the top search results.

 

Question-answering examples - AKS Edge Essentials video archive

Question: What is the PowerShell command to create a JSON template for AKS edge deployment?

Answer: The command is 'New-AksEdgeConfig -DeploymentType ScalableCluster -NodeType LinuxAndWindows' [1].

Citations: 1. AKS Edge Essentials scheme and Azure Arc (0:03:30)

 

Question: Please summarize the difference between single and multi-machine cluster?

Answer: Single machine cluster uses internal virtual switch network architecture and all the IP addresses are defined [1][2]. Scalable cluster uses external virtual switch network architecture and all the IP addresses need to be set up as part of the deployment [1][2]. When using single machine cluster, NAT is used, whereas scalable cluster connects directly to a physical network [2]. The DNS and proxy settings are the same for both [2]. Multi-machine cluster is used for industrial scenarios where one device can be used as a backup if another device goes down [3].

Citations: 1. AKS Edge Essentials network architecture (0:08:31) 2. AKS Edge Essentials network architecture (0:10:30) 3. AKS Edge Essentials Multi Node (0:05:40)

 

Video Summarization and Question-Answering Demonstrations

Check out our demonstrations for the downstream tasks using our new algorithm and API:

See also: Azure AI Search documentation | Microsoft Learn


Last update: Jan 18 2024 02:18 AM