Introduction
In today's digital age, video content has become a powerful medium for communication and storytelling. Whether it's for marketing, education, or entertainment purposes, videos can captivate and engage audiences in ways that traditional text-based content often cannot. However, creating compelling videos from scratch can be a time-consuming and resource-intensive process.
Fortunately, with the advancements in artificial intelligence and the availability of cloud-based services like Azure Open AI and Cognitive Services, it is now possible to automate and streamline the process of converting text into videos. These cutting-edge technologies provide developers and content creators with powerful tools and APIs that leverage natural language processing and computer vision to transform plain text into visually appealing and professional-looking videos.
This document serves as a comprehensive guide and a starting point for developers who are eager to explore the exciting realm of Azure Open AI and Cognitive Services for text-to-video conversion. While this guide presents a basic implementation, its purpose is to inspire and motivate developers to delve deeper into the possibilities offered by these powerful technologies.
Whether you are a developer looking to integrate text-to-video functionality into your applications or a content creator seeking to automate the video production process, this guide will provide you with the insights and resources you need to get started. So let's dive in and discover the exciting world of text-to-video conversion using Azure Open AI and Cognitive Services!
Prerequisites
The next sections walk through the architecture and its implementation in Python. If you are new to these technologies, don't worry; go through these prerequisite links to get started:
- Azure Open AI: Get started with Azure OpenAI Service - Training | Microsoft Learn
- Azure Cognitive Services:
- Python coding: There are many courses available online; one option is Learn Python - Free Interactive Python Tutorial
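You will also need the Python packages used in the later sections installed in your environment: openai, azure-ai-textanalytics, azure-cognitiveservices-speech, moviepy, and requests (all installable with pip).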
Architecture
Figure 1: Architecture
The following architecture outlines a generic flow for converting text content into video files:
The steps are explained below:
- Initially, an application (implemented in Python, but applicable to any programming language) accepts textual content as input from the user.
- The application utilizes the Azure Open AI Python SDK to invoke the summarization functionality, which generates a summarized text.
- This summarization is stored in memory for further use.
- The summarized content serves as input for the Azure Cognitive Services, specifically for generating key phrases and an audio file.
- The key phrases are extracted and stored in the application's memory for later use.
- Simultaneously, the audio file is stored and persisted on the compute server. Alternatively, it can be stored in any preferred persistent storage solution.
- The key phrases are then used as input for the Azure Open AI API, generating meaningful DALL·E prompts.
- These DALL·E prompts are stored in memory for subsequent utilization.
- The DALL·E prompts serve as input for another Azure Open AI API call, generating images that will be used in the final video.
- The generated images are stored on the compute server or any chosen persistent storage medium.
- To create the final video, a custom Python application is employed. This application combines the previously generated audio and images, resulting in the creation of the video file. The final video is initially stored on the compute server but can be subsequently pushed to any desired storage layer for further consumption.
By following this process, textual content can be effectively transformed into a video file, providing enhanced accessibility and visual representation of the original information.
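To make this flow concrete before diving into the code, here is a minimal end-to-end sketch in Python. The helper names (summarize_text, get_key_phrases, make_dalle_prompts, generate_images, synthesize_audio) are placeholders for the pieces implemented in the following sections; only create_video corresponds to a function defined later in this guide.
content = input("Please enter the content: ")
summary = summarize_text(content)                   # Azure Open AI summarization
key_phrases = get_key_phrases(summary)              # Azure Cognitive Services key phrase extraction
dalle_prompts = make_dalle_prompts(key_phrases)     # Azure Open AI prompt generation
image_files = generate_images(dalle_prompts)        # DALL-E image generation and download
audio_file = synthesize_audio(summary)              # Azure Speech text-to-speech
create_video(image_files, audio_file, "video.mp4")  # MoviePy stitching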
Text Summarization through Azure Open AI
The code provided is an example of text summarization using OpenAI's GPT-3.5 language model. Here's a breakdown of the code:
- Importing necessary libraries and setting up OpenAI API:
import os
import openai
openai.api_type = "azure"
openai.api_base = "https://<Your_Resource_Name>.openai.azure.com/"
openai.api_version = "2022-12-01"
openai.api_key = "<Your API Key>"
This section imports the required libraries and sets up the OpenAI API credentials. You would need to replace <Your_Resource_Name> with your actual resource name and <Your API Key> with your OpenAI API key.
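Hard-coding credentials is fine for a quick experiment, but since os is already imported, a safer pattern is to read them from environment variables. A minimal sketch, assuming you have set variables named AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY (the names are just a suggestion):
# Read the Azure Open AI endpoint and key from environment variables instead of hard-coding them
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
openai.api_key = os.environ["AZURE_OPENAI_KEY"]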
- Setting the number of sentences for the summary:
num_of_sentences = 5
This line defines the number of sentences that the summary should consist of. You can change this value according to your requirements.
- Obtaining user input:
content = input("Please enter the content: ")
This line prompts the user to enter the content they want to summarize and stores it in the content variable.
- Creating the prompt for summarization:
prompt = 'Provide a summary of the text below that captures its main idea in ' + str(num_of_sentences) + ' sentences. \n' + content
This line constructs the prompt by combining the predefined sentence with the user's input content.
- Generating the summary using OpenAI's Completion API:
response_summ = openai.Completion.create(
    engine="text-davinci",
    prompt=prompt,
    temperature=0.3,
    max_tokens=250,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    best_of=1,
    stop=None)
This code sends a request to the OpenAI API for generating the summary. It uses the openai.Completion.create() method with the following parameters:
-> engine: Specifies the model deployment to use. Here it references a "text-davinci" deployment, a powerful and versatile completion model; replace it with the name of your own Azure Open AI deployment if it differs.
-> prompt: The prompt for the model to generate a summary based on.
-> temperature: Controls the randomness of the generated output. Lower values (e.g., 0.3) make the output more focused and deterministic.
-> max_tokens: Specifies the maximum number of tokens the response can have. Tokens are chunks of text, and this value limits the length of the generated summary.
-> top_p: Controls the diversity of the output. A higher value (e.g., 1) allows more diverse responses by considering a larger set of possibilities.
-> frequency_penalty and presence_penalty: These parameters control the preference of the model for repeating or including certain phrases. Here, they are set to 0, indicating no preference.
-> best_of: Specifies the number of independent tries the model will make and return the best result.
-> stop: Specifies a string or list of strings at which to stop the generated summary.
- Printing the generated summary:
print(response_summ.choices[0].text)
This line prints the generated summary by accessing the text property of the first choice in the response. The summary will be displayed in the console.
Key Phrase Extraction using Azure Cognitive Service
The code provided demonstrates key phrase extraction using Microsoft Azure's Text Analytics service. Here's an explanation of the code:
- Setting up the required credentials and endpoint:
key = "<Your_cognitive_service_key>"
endpoint = "https://<Your_cognitive_service>.cognitiveservices.azure.com/"
These lines define the cognitive service key and endpoint for the Text Analytics service. You need to replace <Your_cognitive_service_key> with your actual cognitive service key and <Your_cognitive_service> with the name of your cognitive service.
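As with the Azure Open AI credentials above, consider reading this key and endpoint from environment variables or a secret store rather than hard-coding them in the script.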
- Importing necessary libraries:
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential
These lines import the required libraries from the Azure SDK.
- Authenticating the client:
def authenticate_client():
    ta_credential = AzureKeyCredential(key)
    text_analytics_client = TextAnalyticsClient(
        endpoint=endpoint,
        credential=ta_credential)
    return text_analytics_client

client = authenticate_client()
This code defines the authenticate_client() function that creates an instance of the TextAnalyticsClient using the provided key and endpoint. The client variable stores the authenticated client.
- Defining the key phrase extraction example:
def key_phrase_extraction_example(client):
    try:
        phrase_list, phrases = [], ''
        documents = [response_summ.choices[0].text]
        response_kp = client.extract_key_phrases(documents=documents)[0]
        if not response_kp.is_error:
            print("\tKey Phrases:")
            for phrase in response_kp.key_phrases:
                print("\t\t", phrase)
                phrase_list.append(phrase)
                phrases = phrases + "\n" + phrase
        else:
            print(response_kp.id, response_kp.error)
    except Exception as err:
        print("Encountered exception. {}".format(err))
    return phrase_list, phrases
This code defines the key_phrase_extraction_example() function that takes the authenticated client as input. It performs key phrase extraction on a given document (in this case, response_summ.choices[0].text) using the client.extract_key_phrases() method. The extracted phrases are stored in the phrase_list and phrases variables. If there is an error, it is printed.
- Executing the key phrase extraction example:
phrase_list, phrases = key_phrase_extraction_example(client)
This line calls the key_phrase_extraction_example() function with the authenticated client as an argument. The extracted key phrases are stored in the phrase_list and phrases variables, which can be used for further processing or display.
Overall, the code sets up the Azure Text Analytics client, authenticates it, and demonstrates key phrase extraction on a given text using the client.
Create DALL-E prompts for image generation using Azure Open AI
The code provided focuses on generating images based on the extracted phrases. Here's an explanation of the code:
- Creating a prompt for image generation:
prompt = '''Provide an image idea for each of the following phrases: ''' + phrases
This line creates a prompt by combining a predefined instruction with the phrases string obtained from the key phrase extraction. The prompt serves as input for generating the image ideas.
- Generating image ideas using OpenAI's text completion API:
response_phrase = openai.Completion.create(
    engine="text-davinci",
    prompt=prompt,
    temperature=0.3,
    max_tokens=3000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    best_of=1,
    stop=None)
This code uses OpenAI's text completion API to generate image ideas based on the provided prompt. The generated ideas are stored in response_phrase.choices[0].text.
- Extracting image phrases from the generated response:
image_phrases = response_phrase.choices[0].text.split("\n")[1:]
This code splits the generated response by newlines, drops the first line (typically empty), and stores the remaining lines (image phrases) in the image_phrases variable.
- Processing image phrases:
im_ph = []
for image_phrase in image_phrases:
    #print(image_phrase)
    if(len(image_phrase) > 0):
        im_ph.append(image_phrase.split(":")[1])
This code processes each image phrase by splitting it based on the colon (":") character and appending the second part to the im_ph list. This step is done to extract the actual image idea from each phrase.
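Note that this assumes every non-empty line contains a colon (for example, a line of the form "1: <image idea>"); if the model returns a line without one, image_phrase.split(":")[1] raises an IndexError. A slightly more defensive version of the same loop might look like this:
im_ph = []
for image_phrase in image_phrases:
    # Keep only lines that contain a colon and take everything after the first one
    if ":" in image_phrase:
        im_ph.append(image_phrase.split(":", 1)[1].strip())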
- Setting up the necessary variables:
import requests
import time
import os
api_base = 'https://<Your_Resource_Name>.openai.azure.com/'
api_key = "<Your_API_KEY>"
api_version = '2022-08-03-preview'
url = "{}dalle/text-to-image?api-version={}".format(api_base, api_version)
headers= { "api-key": api_key, "Content-Type": "application/json" }
These lines define the API base URL, API key, API version, and the endpoint URL for the DALL-E model's text-to-image generation.
- Generating images using DALL-E:
images = []
for phrase in im_ph:
    body = {
        "caption": phrase,
        "resolution": "1024x1024"
    }
    submission = requests.post(url, headers=headers, json=body)
    print(submission)
    operation_location = submission.headers['Operation-Location']
    retry_after = submission.headers['Retry-after']
    status = ""
    #while (status != "Succeeded"):
    time.sleep(int(retry_after))
    response = requests.get(operation_location, headers=headers)
    status = response.json()['status']
    print(status)
    if status == "Succeeded":
        image_url = response.json()['result']['contentUrl']
        images.append(image_url)
This code performs the image generation using the DALL-E model. It sends a POST request to the DALL-E text-to-image endpoint with each image phrase as the caption and the desired resolution. The API response contains the location of the operation and the estimated time to wait. The code waits for that duration and then retrieves the operation result with a GET request. If the status of the operation is "Succeeded", the generated image URL is extracted and added to the images list; note that the status is checked only once here, so a request that is still running is skipped (see the polling sketch below).
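A more robust variant polls the operation until it succeeds or a retry limit is hit. The sketch below reuses the same url and headers defined above and introduces a hypothetical max_polls parameter; it is only a sketch, not a replacement for the loop above.
def generate_image(caption, max_polls=10):
    # Submit the text-to-image request
    body = {"caption": caption, "resolution": "1024x1024"}
    submission = requests.post(url, headers=headers, json=body)
    operation_location = submission.headers["Operation-Location"]
    retry_after = int(submission.headers.get("Retry-after", 5))
    # Poll until the operation succeeds or the retry limit is reached
    for _ in range(max_polls):
        time.sleep(retry_after)
        response = requests.get(operation_location, headers=headers)
        if response.json().get("status") == "Succeeded":
            return response.json()["result"]["contentUrl"]
    # Give up after max_polls attempts
    return None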
- Downloading the generated images:
import urllib.request
counter = 0
image_list = []
for url in images:
    counter += 1
    filename = "file" + str(counter) + ".jpg"
    urllib.request.urlretrieve(url, filename)
    image_list.append(filename)
print("Downloading done.....")
This code downloads the generated images by iterating over the list of image URLs. Each image is downloaded using urllib.request.urlretrieve and saved with a unique filename. The filenames are stored in the image_list list.
This code integrates the DALL-E model with the extracted phrases to generate and download the corresponding images.
Create Audio File using Azure Speech Service
The code provided demonstrates how to create audio files for the text summarization output using the Azure Cognitive Services Speech SDK. Here's an explanation of the code:
- Import the package
import azure.cognitiveservices.speech as speechsdk
To use the Azure Cognitive Services Speech SDK, you need to import the speechsdk module from the azure.cognitiveservices.speech package.
- Setting up the necessary variables:
speech_key, service_region = "<Your speech key>", "<location>"
These variables represent your Azure Cognitive Services Speech API key and the region where your service is hosted.
- Creating the SpeechConfig object:
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
The speech_config object is created using the Speech API key and service region. It provides the necessary configuration for the speech synthesizer.
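Optionally, you can also select a specific neural voice on the same configuration object before synthesizing. The voice name below is only an example; swap it for any voice available in your region:
# Optional: choose a specific neural voice for the narration
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"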
- Defining the text_to_speech function:
def text_to_speech(text, filename):
    # Write the synthesized speech to the given file
    audio_config = speechsdk.audio.AudioOutputConfig(filename=filename)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    result = speech_synthesizer.speak_text_async(text).get()
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print(f"Audio saved to {filename}")
    else:
        print(f"Error: {result.cancellation_details.error_details}")
This function takes the text and filename as input. It creates an audio_config object with the specified filename to store the synthesized audio. Then, a speech_synthesizer object is created using the provided speech configuration and audio configuration. The speak_text_async method is called to synthesize the input text into audio. The result is then checked, and if the audio synthesis is completed successfully, it prints a success message along with the filename. Otherwise, it prints an error message with the details of the error.
- Generating audio for the text summarization output:
text = response_summ.choices[0].text
filename = "audio.wav"
text_to_speech(text, filename)
This code retrieves the text from the response_summ object, which contains the summarized text. It then specifies the filename for the audio file. The text_to_speech function is called with the text and filename to generate the audio file.
In summary, this code uses the Azure Cognitive Services Speech SDK to convert the summarized text into audio by utilizing the speech synthesis capabilities provided by the Azure Speech service.
Stitch the Audio File and the Images to Create the Video
The code provided is for creating a video by combining a sequence of images with an audio file. Here's a breakdown of the code:
# Stitch the audio file and the images together
from moviepy.editor import *
print("Creating the video.....")
def create_video(images, audio, output):
    clips = [ImageClip(m).resize(height=1024).set_duration(2) for m in images]
    concat_clip = concatenate_videoclips(clips, method="compose")
    audio_clip = AudioFileClip(audio)
    final_clip = concat_clip.set_audio(audio_clip)
    final_clip.write_videofile(output, fps=24)
images = image_list
audio = filename
output = "video.mp4"
create_video(images, audio, output)
print("Video created.....")
- The from moviepy.editor import * statement imports the necessary functions and classes from the MoviePy library, which is used for video editing and manipulation.
- The create_video function is defined to generate the final video. It takes three parameters: images, audio, and output.
- Inside the create_video function, a list comprehension is used to create a sequence of video clips (clips) from the provided images list. Each image is converted to a video clip using ImageClip(m), where m is the path to the image file. The resize function is used to set the height of each clip to 1024 pixels, and set_duration sets the duration of each clip to 2 seconds.
- The concatenate_videoclips function is used to concatenate the video clips in clips into a single clip (concat_clip). The method="compose" argument specifies that the clips should be composited together.
- The AudioFileClip class is used to load the audio file (audio) and create an audio clip (audio_clip).
- The audio clip is then set to the concatenated video clip using set_audio, creating the final clip (final_clip).
- The write_videofile function is called on final_clip to save the video to the specified output file (output). The fps=24 argument sets the frame rate of the video to 24 frames per second.
- The images, audio, and output variables are assigned with the appropriate values (image_list, filename, and "video.mp4", respectively).
- The create_video function is called with the provided arguments to generate the video.
- Finally, a message is printed indicating that the video creation process is complete.
Note: This code assumes that you have the MoviePy library installed. If not, you can install it using pip install moviepy.
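With a fixed duration of 2 seconds per image, the slideshow may end before (or after) the narration does. If you would rather have the images span the full length of the audio, a small variation of the same function, sketched below, derives the per-image duration from the audio clip:
def create_video_matched(images, audio, output):
    # Load the narration first so its length can drive the image durations
    audio_clip = AudioFileClip(audio)
    per_image = audio_clip.duration / len(images)
    clips = [ImageClip(m).resize(height=1024).set_duration(per_image) for m in images]
    final_clip = concatenate_videoclips(clips, method="compose").set_audio(audio_clip)
    final_clip.write_videofile(output, fps=24)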
Conclusion
In conclusion, the code provided offers a good starting point for a text summarization and multimedia generation solution. It combines various technologies and APIs to perform text summarization, key phrase extraction, image generation, audio synthesis, and video creation.
The text summarization process involves providing text content and using OpenAI's language model to generate a summary that captures the main idea. The summary is then used for key phrase extraction with Azure Cognitive Services. These key phrases serve as input for generating image ideas using OpenAI's text completion capabilities.
Once the image phrases are obtained, they are used to request images from the DALL-E model. The images are downloaded and stored locally for further use. Additionally, the summarized text is converted into audio using Azure Cognitive Services' Text-to-Speech functionality, and the audio file is saved.
Finally, the images and audio are stitched together using the MoviePy library to create a video. The images are resized, and a composited video clip is generated by concatenating the image clips. The audio file is added to the video clip, resulting in the final video.
It's important to note that this solution is not perfect and may require further customization and fine-tuning based on specific requirements. Additionally, it relies on external services and APIs, which may have limitations or dependencies. However, it provides a solid foundation for implementing a text summarization and multimedia generation pipeline.
By leveraging the power of natural language processing, image generation, audio synthesis, and video editing, this solution demonstrates the potential to automate the creation of engaging multimedia content from text. Further enhancements and integrations can be explored to improve the accuracy and quality of the generated summaries, images, audio, and videos.
Updated Jun 01, 2023 | Version 1.0 | Sabyasachi-Samaddar