Transforming Text to Video: Azure OpenAI, Cognitive Services and Semantic Kernel (Python)
Published Aug 21 2023

Introduction

 

Welcome back to the second part of our journey into the world of Azure and OpenAI! In the first part, we explored how to transform text into video using Azure’s powerful AI capabilities. This time, we’re taking a step further by orchestrating our application flow with Semantic Kernel.

Semantic Kernel is an open-source SDK that lets us combine large language model prompts with conventional code and orchestrate them as reusable plugins. By using Semantic Kernel, we can create more sophisticated workflows and generate more meaningful results from our text-to-video transformation process.

In this part of the series, we will focus on how Semantic Kernel can enhance our application and provide a smoother, more efficient workflow. We’ll dive deep into its features, explore its benefits, and show you how it can revolutionize your text-to-video transformation process.

One key change in our approach this time is skipping the entity recognition step from the last blog. After careful consideration, we’ve found that it’s not necessary for our current workflow. This decision allows us to streamline our process and focus on what truly matters - creating high-quality video content from text.

So, buckle up and get ready for an exciting journey into the world of Semantic Kernel and Azure! Let’s dive in.

 

If you want to read the first post in this series, you can find it here.

Prerequisite

The following sections explain the flow and the implementation in Python using Semantic Kernel orchestration. If you are new to these technologies, don’t worry; go through these prerequisite links to get started:

  1. Azure OpenAI: Get started with Azure OpenAI Service - Training | Microsoft Learn
    • We are going to use Azure OpenAI completion models, not chat models, so please use the Azure OpenAI GPT-3.5 Turbo (0301) model for this.
  2. Azure Cognitive Services:
  3. Python coding: You will find multiple courses on the internet; for example, you can refer to Learn Python - Free Interactive Python Tutorial
  4. Semantic Kernel Orchestrations: Orchestrate your AI with Semantic Kernel | Microsoft Learn

Creating the Environment File

 

Now we will create our environment file; you can name it .env. Fill in the details in the following format:

 

 

 

AZURE_OPENAI_DEPLOYMENT_NAME="<AzureOpenAI Deployment Name>"
AZURE_OPENAI_ENDPOINT="<Azure OpenAI Endpoint>"
AZURE_OPENAI_API_KEY="<Azure Open AI Key>"
SPEECH_KEY="<Azure Cognitive Service Key>"
SPEECH_REGION="<Azure Cognitive Service Region>"
DALLE_API_BASE="<Azure DALL-E2 API base url>"
DALLE_API_KEY="<Azure DALL-E2 API key>"
DALLE_API_VERSION="<Azure DALL-E2 API Version>"

 

 

 

If you are using Azure OpenAI and DALL-E in the same resource, AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY will be the same as DALLE_API_BASE and DALLE_API_KEY respectively.

 

Explanation of the variables:

 

The file above is a configuration file written in the dotenv format. It contains several environment variables that are used to configure the various APIs and services.

The first three environment variables are related to the Azure OpenAI service. The AZURE_OPENAI_DEPLOYMENT_NAME variable specifies the name of the deployment that hosts the GPT-3.5 Turbo completion model. The AZURE_OPENAI_ENDPOINT variable specifies the endpoint URL for the Azure OpenAI resource. Finally, the AZURE_OPENAI_API_KEY variable contains the API key that is used to authenticate requests to it.

The next two environment variables are related to the Azure Speech Services API. The SPEECH_KEY variable contains the API key that is used to authenticate requests to the Speech Services API. The SPEECH_REGION variable specifies the region where the Speech Services API is hosted.

The last three environment variables are related to the DALL-E API, which is another OpenAI API. The DALLE_API_BASE variable specifies the base URL for the DALL-E API. The DALLE_API_KEY variable contains the API key that is used to authenticate requests to the DALL-E API. Finally, the DALLE_API_VERSION variable specifies the version of the DALL-E API that is being used.

Overall, this configuration file is used to store sensitive information and configuration settings that are used by the application. By using environment variables, the application can easily access this information without having to hardcode it into the source code.

To improve the readability of this file, it can be helpful to group related environment variables together and add comments explaining what each variable is used for. Additionally, a tool like dotenv-linter can help ensure that the file is formatted correctly and free of syntax errors.
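As a quick illustration, here is a minimal sketch of how these values are loaded and read at runtime, assuming the .env file sits next to your script and the python-dotenv package is installed; the orchestrator later in this post does exactly this:

import os
from dotenv import load_dotenv

# Load the key/value pairs from the .env file into the process environment
load_dotenv()

speech_key = os.getenv("SPEECH_KEY")
speech_region = os.getenv("SPEECH_REGION")
dalle_api_base = os.getenv("DALLE_API_BASE")

# A value of None means the variable is missing or misspelled in the .env file
print(speech_key is not None, speech_region, dalle_api_base)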

 

Creating the Folder Structure

 

Now I will suggest a folder structure for this application. It is not mandatory, but the file paths used later in the plugins assume it.

I have created four folders for this application (a small sketch for creating them from code follows the list):

  • Audio: Folder to store my audio generated by the Azure Speech Service
  • Images: Folder to store my images generated by the DALL-E API calls
  • myPlugins: Folder to keep all my plugins for this application.
  • Video: Folder to store my final Video from this application.
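If you prefer to create these folders from code rather than by hand, a small sketch like the one below does it; the folder names are the ones used throughout this post.

import os

# Create the folders used by the plugins, if they do not already exist
for folder in ["Audio", "Images", "myPlugins", "Video"]:
    os.makedirs(folder, exist_ok=True)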

 

Creating the Plugins

 

Semantic Kernel plugins serve as modular functions meticulously crafted by developers to enhance the capabilities of AI systems. These versatile units encapsulate AI functionalities, forming the fundamental components of the Semantic Kernel's architecture. They seamlessly interact with plugins within ChatGPT, Bing, and Microsoft 365, fostering a harmonious ecosystem of AI innovation.

These plugins are distinguished by two distinct function types:

  • Semantic Functions: Prompt templates combined with context variables. They take text in and return text from the model, and in this post they are defined by a config.json and an skprompt.txt file.
  • Native Functions: Regular code (Python methods in this post) decorated so that the kernel can call them. They handle work that prompts are not suited for, such as calling APIs, reading and writing files, and other deterministic operations.

Moreover, the Semantic Kernel boasts an array of pre-designed plugins tailored to diverse programming languages. These foundational plugins, commonly referred to as Core plugins, exemplify the Semantic Kernel's comprehensive utility. Some of these Core plugins, currently accessible within the Semantic Kernel framework, encompass:

  • ConversationSummarySkill: Summarizing dialogues succinctly.
  • FileIOSkill: Facilitating filesystem interactions for reading and writing.
  • HttpSkill: Enabling seamless API calls.
  • MathSkill: Empowering mathematical computations.
  • TextMemorySkill: Storing and retrieving text from memory.
  • TextSkill: Ensuring deterministic manipulation of text strings.
  • TimeSkill: Gaining temporal insights, including time of day and related details.
  • WaitSkill: Temporarily suspending execution for defined intervals.

These core plugins can be imported into your project and chained together. For instance, the TextSkill functions can be composed to trim whitespace from a string, convert it to uppercase, and then to lowercase; a small sketch of this chaining follows.
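Here is a sketch of that chaining using the TextSkill core plugin. Treat the module path and the function names (trim, uppercase, lowercase) as assumptions; they depend on your semantic-kernel version, so check the package you have installed.

import semantic_kernel as sk
from semantic_kernel.core_skills import TextSkill

kernel = sk.Kernel()

# Register the core TextSkill under the name "text"
text_functions = kernel.import_skill(TextSkill(), "text")

# Chain a few deterministic text operations
trimmed = text_functions["trim"]("   Hello Semantic Kernel   ").result
upper = text_functions["uppercase"](trimmed).result
print(text_functions["lowercase"](upper).result)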

 

In this application we will create all our plugins from scratch. They will be a combination of semantic and native functions. The table below provides more details about the plugins.

 

Plugin          | Function           | Type     | Description
summarizePlugin | NA                 | Semantic | Generates the summary of the content
audioPlugin     | create_audio_file  | Native   | Generates the audio file
promptPlugin    | NA                 | Semantic | Generates the DALL-E image prompts
imagePlugin     | create_image_files | Native   | Generates the image files
videoPlugin     | create_video_file  | Native   | Generates the video file

 

Summarize Plugin

 

For the Summarize Plugin, create a new folder named summarizePlugin within the myPlugins folder.

Create two files inside it:

  1. config.json
  2. skprompt.txt

In config.json add the following content:

 

 

 

{
    "schema": 1,
    "description": "Summarize the content",
    "type": "completion",
    "completion": {
      "max_tokens": 1000,
      "temperature": 0,
      "top_p": 0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0
    },
    "input": {
      "parameters": [
        {
          "name": "input",
          "description": "Summarize the content",
          "defaultValue": ""
        }
      ]
    }
  }

 

 

 

 

   

In skprompt.txt:

 

 

 

{{$input}}
Summarize the content in less than 100 words.
Summary: 
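If you want to try this plugin on its own before wiring up the full orchestrator, a minimal sketch looks like the following; it assumes the folder layout above, the .env file from earlier, and an Azure OpenAI completion deployment, and it uses the same calls that text_to_video.py makes later in this post.

import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import AzureTextCompletion

kernel = sk.Kernel()

# Read the AZURE_OPENAI_* settings from the .env file and register the completion service
deployment, api_key, endpoint = sk.azure_openai_settings_from_dot_env()
kernel.add_text_completion_service("dv", AzureTextCompletion(deployment, endpoint, api_key))

# Load every semantic function found under ./myPlugins and pick the summarizer
plugins = kernel.import_semantic_skill_from_directory(".", "myPlugins")
summary = plugins["summarizePlugin"]("Semantic Kernel is an SDK that combines prompts with native code ...")
print(summary.result)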

 

 

 

 

 

Audio Plugin

 

For the Audio Plugin, create a new folder named audioPlugin within the myPlugins folder.

Create a new file inside it named audioPlugin.py.

 

 

 

 

import azure.cognitiveservices.speech as speechsdk
from semantic_kernel.skill_definition import (
    sk_function,
    sk_function_context_parameter,
)
from semantic_kernel.orchestration.sk_context import SKContext


class AudioPlugin:
    @sk_function(
        description="Creates an audio file with the given content",
        name="create_audio_file",
        input_description="The content to be converted to audio",
    )
    @sk_function_context_parameter(
        name="content",
        description="The content to be converted to audio",
    )
    @sk_function_context_parameter(
        name="speech_key",
        description="Azure Speech service key",
    )
    @sk_function_context_parameter(
        name="speech_region",
        description="Azure Speech service region",
    )
    def create_audio_file(self, context: SKContext):
        # Configure the speech service from the context variables
        speech_config = speechsdk.SpeechConfig(subscription=context["speech_key"], region=context["speech_region"])
        content = context["content"]
        filename = "Audio/audio.mp4"
        audio_config = speechsdk.AudioConfig(filename=filename)
        speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
        # Synthesize the text and block until the result is available
        result = speech_synthesizer.speak_text_async(content).get()
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print(f"Audio saved to {filename}")
        elif result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = result.cancellation_details
            print(f"Error: {cancellation_details.error_details}")
        print("Audio file created.....")
        

 

 

 

 

 

Explanation of the code:

 

The code above defines a Python class called AudioPlugin that contains a single method called create_audio_file. This method is decorated with several sk_function_context_parameter decorators that define the input parameters for the function.

The create_audio_file method takes a single input parameter called context, which is an instance of the SKContext class. This class is defined in another module and provides a way to pass context information between different parts of the application.

Within the create_audio_file method, the first thing that happens is that a SpeechConfig object is created using the speech_key and speech_region parameters from the context object. This object is used to configure the speech synthesis service that will be used to convert text to audio.

Next, the content parameter from the context object is retrieved and stored in a variable. This content will be converted to audio in the next step.

After that, a filename is specified for the output audio file. This file will be saved to the Audio directory with the name audio.mp4.

An AudioConfig object is then created using the filename specified in the previous step. This object is used to configure the audio output settings for the speech synthesis service.

Finally, a SpeechSynthesizer object is created using the SpeechConfig and AudioConfig objects. This object is used to synthesize the audio from the input text. The speak_text_async method is called on the SpeechSynthesizer object with the content parameter as input. The get method is then called on the result of this method to wait for the audio synthesis to complete.

If the audio synthesis is successful, the resulting audio file is saved to the specified filename and a message is printed to the console indicating that the audio file was created. If there is an error during the audio synthesis process, an error message is printed to the console instead.

Overall, this code defines a class that can be used to create audio files from text using the Azure Speech Services API. By using the SKContext class to pass in the necessary configuration information, this class can be easily integrated into other parts of the application.
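To exercise the plugin in isolation, a minimal sketch could look like the one below; it assumes the .env file and the Audio folder described earlier, and the orchestrator at the end of this post performs the same steps.

import os
import semantic_kernel as sk
from dotenv import load_dotenv
from myPlugins.audioPlugin.audioPlugin import AudioPlugin

load_dotenv()
kernel = sk.Kernel()
context = kernel.create_new_context()

# Populate the context variables that create_audio_file expects
context["content"] = "Hello from the audio plugin."
context["speech_key"] = os.getenv("SPEECH_KEY")
context["speech_region"] = os.getenv("SPEECH_REGION")

# Register the native plugin and invoke its function
audio_plugin = kernel.import_skill(AudioPlugin(), "audio_plugin")
audio_plugin["create_audio_file"].invoke(context=context)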

 

 

 

Prompt Plugin

 

For the Prompt Plugin, create a new folder named promptPlugin within the myPlugins folder.

Create two files inside it:

  1. config.json
  2. skprompt.txt

In config.json add the following content:

 

 

 

{
    "schema": 1,
    "description": "Create Dalle prompt ideas",
    "type": "completion",
    "completion": {
      "max_tokens": 1000,
      "temperature": 0.9,
      "top_p": 0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0
    },
    "input": {
      "parameters": [
        {
          "name": "input",
          "description": "Create Dalle prompt ideas",
          "defaultValue": ""
        }
      ]
    }
  }

 

 

 

 

 

In skprompt.txt:

 

 

 

Create 10 DALL-E image prompt ideas for the below content
{{$input}}

Prompts: 
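The completion model typically returns the ten ideas as a numbered block of text, and the orchestrator later splits that block into a Python list before handing it to the image plugin. A simplified sketch of that parsing step is below; the sample text is made up for illustration.

# Example of the kind of text the promptPlugin returns (made-up sample)
raw_output = """1. A watercolor painting of a sunrise over a mountain lake
2. A futuristic city skyline at dusk, digital art
3. A cozy reading nook with warm lamplight"""

# One prompt per line, ignoring empty lines and surrounding whitespace
image_prompts = [line.strip() for line in raw_output.split("\n") if line.strip()]
print(image_prompts)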

 

 

 

 

 

Image Plugin

 

For the Image Plugin, create a new folder named imagePlugin within the myPlugins folder.

Create a new file inside it named imagePlugin.py.

 

 

 

 

import requests
import time
import urllib.request
from semantic_kernel.skill_definition import (
    sk_function,
    sk_function_context_parameter,
)
from semantic_kernel.orchestration.sk_context import SKContext


class ImagePlugin:
    @sk_function(
        description="Creates images with the given prompts",
        name="create_image_files",
        input_description="The content to be converted to images",
    )
    @sk_function_context_parameter(
        name="prompts",
        description="The list of DALL-E prompts to be converted to images",
    )
    @sk_function_context_parameter(
        name="api_base",
        description="DALL-E API base URL",
    )
    @sk_function_context_parameter(
        name="api_key",
        description="DALL-E API key",
    )
    @sk_function_context_parameter(
        name="api_version",
        description="DALL-E API version",
    )
    def create_image_files(self, context: SKContext):
        api_base = context["api_base"]
        api_key = context["api_key"]
        api_version = context["api_version"]
        url = "{}dalle/text-to-image?api-version={}".format(api_base, api_version)
        headers = {"api-key": api_key, "Content-Type": "application/json"}
        counter = 0
        for phrase in context["prompts"]:
            print("Image for: ", phrase)
            body = {
                "caption": phrase,
                "resolution": "1024x1024"
            }
            # Submit the generation request; the service responds with a polling URL
            submission = requests.post(url, headers=headers, json=body)
            operation_location = submission.headers['Operation-Location']
            retry_after = submission.headers['Retry-after']
            # Poll the operation until it succeeds (or fails)
            status = ""
            while status != "Succeeded":
                time.sleep(int(retry_after))
                response = requests.get(operation_location, headers=headers)
                status = response.json()['status']
                if status == "Failed":
                    break
            if status == "Succeeded":
                counter += 1
                image_url = response.json()['result']['contentUrl']
                filename = "Images/file" + str(counter) + ".jpg"
                # Download the generated image to the Images folder
                urllib.request.urlretrieve(image_url, filename)
                

 

 

 

 

Explanation of the code:

 

The code above defines a Python class called ImagePlugin that contains a single method called create_image_files. This method is decorated with several sk_function_context_parameter decorators that define the input parameters for the function.

The create_image_files method takes a single input parameter called context, which is an instance of the SKContext class. This class is defined in another module and provides a way to pass context information between different parts of the application.

Within the create_image_files method, several variables are initialized using the values from the context object. These variables include the api_base, api_key, and api_version parameters, which are used to configure the DALL-E API that will be used to generate the images.

Next, a loop is started that iterates over each prompt in the prompts parameter of the context object. For each prompt, a POST request is sent to the DALL-E API with the prompt as the caption and a resolution of 1024x1024. The response from the API includes an Operation-Location header that contains the URL for the status of the image generation operation.

The code then waits for the number of seconds specified in the Retry-after header of the response before sending a GET request to the Operation-Location URL to check the status of the image generation operation. This process is repeated until the status of the operation is "Succeeded".

Once the image generation operation is successful, the resulting image is downloaded from the contentUrl specified in the response and saved to a file with a filename that includes a counter to ensure that each image has a unique filename.

Overall, this code defines a class that can be used to generate images from text using the DALL-E API. By using the SKContext class to pass in the necessary configuration information, this class can be easily integrated into other parts of the application.
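As with the audio plugin, you can exercise this plugin on its own with a sketch like the one below; the prompt list is a made-up example and the DALLE_* values come from the .env file described earlier.

import os
import semantic_kernel as sk
from dotenv import load_dotenv
from myPlugins.imagePlugin.imagePlugin import ImagePlugin

load_dotenv()
kernel = sk.Kernel()
context = kernel.create_new_context()

# Context variables expected by create_image_files
context["prompts"] = ["A watercolor sunrise over a mountain lake", "A cozy reading nook with warm lamplight"]
context["api_base"] = os.getenv("DALLE_API_BASE")
context["api_key"] = os.getenv("DALLE_API_KEY")
context["api_version"] = os.getenv("DALLE_API_VERSION")

image_plugin = kernel.import_skill(ImagePlugin(), "image_plugin")
image_plugin["create_image_files"].invoke(context=context)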

 

Video Plugin

For the Video Plugin, create a new folder named videoPlugin within the myPlugins folder.

Create a new file inside it named videoPlugin.py.

 

 

 

 

from moviepy.editor import *
from semantic_kernel.skill_definition import (
    sk_function    
)
from semantic_kernel.orchestration.sk_context import SKContext


class VideoPlugin:
    @sk_function(
        description="Creates a video file from the generated images and audio",
        name="create_video_file",
        input_description="The images and audio to be combined into a video",
    )
    def create_video_file(self, context: SKContext):
        # Collect the ten images generated by the image plugin
        images = []
        for i in range(1, 11):
            images.append("Images/file" + str(i) + ".jpg")
        audio = "Audio/audio.mp4"
        output = "Video/video.mp4"
        self.create_video(images, audio, output)
        print("Video created.....")

    def create_video(self, images, audio, output):
        # Show each image for 3 seconds, stitch the clips together, and add the narration
        clips = [ImageClip(m).resize(height=1024).set_duration(3) for m in images]
        concat_clip = concatenate_videoclips(clips, method="compose")
        audio_clip = AudioFileClip(audio)
        final_clip = concat_clip.set_audio(audio_clip)
        final_clip.write_videofile(output, fps=20)

 

 

 

 

Explanation of the code:

 

The code above defines a Python class called VideoPlugin that contains a single method called create_video_file. This method is decorated with a sk_function decorator that defines the function as a skill that can be used in the Semantic Kernel.

The create_video_file method takes a single input parameter called context, which is an instance of the SKContext class. This class is defined in another module and provides a way to pass context information between different parts of the application.

Within the create_video_file method, several variables are initialized with the filenames of the images, audio, and output video. The images variable is a list of filenames for the images that will be used to create the video. The audio variable is the filename of the audio file that will be used as the soundtrack for the video. The output variable is the filename of the output video file.

The create_video method is then called with the images, audio, and output variables as input. This method uses the moviepy library to create a video from the input images and audio.

First, a list of ImageClip objects is created from the input images. Each clip is resized to a height of 1024 pixels and set to a duration of 3 seconds.

Next, the concatenate_videoclips function is used to concatenate the clips into a single video clip. The set_audio method is then called on the concatenated clip with an AudioFileClip object created from the input audio file as input.

Finally, the write_videofile method is called on the resulting clip with the output filename and a frame rate of 20 frames per second as input.

If the video creation is successful, a message is printed to the console indicating that the video was created.

Overall, this code defines a class that can be used to create videos from images and audio using the moviepy library. By using the SKContext class to pass in the necessary configuration information, this class can be easily integrated into other parts of the application.
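One possible tweak, sketched below rather than built into the plugin, is to derive each image's duration from the audio length so that the slideshow and the narration end together. The file paths and the image count are assumptions that simply mirror the folders used in this post.

from moviepy.editor import ImageClip, AudioFileClip, concatenate_videoclips

audio_clip = AudioFileClip("Audio/audio.mp4")
image_files = ["Images/file" + str(i) + ".jpg" for i in range(1, 11)]
per_image = audio_clip.duration / len(image_files)  # seconds per image

# Show each image for an equal share of the narration, then stitch and add the audio
clips = [ImageClip(path).resize(height=1024).set_duration(per_image) for path in image_files]
final_clip = concatenate_videoclips(clips, method="compose").set_audio(audio_clip)
final_clip.write_videofile("Video/video_synced.mp4", fps=20)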

 

Final Orchestrator

Now create the final orchestrator: create a file named text_to_video.py and paste the content below.

 

 

 

 

import os
import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import AzureTextCompletion
from myPlugins.audioPlugin.audioPlugin import AudioPlugin
from myPlugins.imagePlugin.imagePlugin import ImagePlugin
from myPlugins.videoPlugin.videoPlugin import VideoPlugin
from dotenv import load_dotenv
import time


# Semantic functions are used to call the semantic skills
# 1. Summarize the input text
# 2. Create image prompts from the summary
def semanticFunctions(kernel, skills_directory, skill_name,input):
    functions = kernel.import_semantic_skill_from_directory(skills_directory, "myPlugins")
    summarizeFunction = functions[skill_name]
    return summarizeFunction(input)
    
# Native functions are used to call the native skills
# 1. Create audio from the summary
# 2. Create images from the image prompts
# 3. Create video from the images
def nativeFunctions(kernel, context, plugin_class,skill_name, function_name):
    native_plugin = kernel.import_skill(plugin_class, skill_name)
    function = native_plugin[function_name]    
    function.invoke(context=context) 

def main():
    
    #Load environment variables from .env file
    load_dotenv()

    # Create a new kernel
    kernel = sk.Kernel()
    context = kernel.create_new_context()

    # Configure AI service used by the kernel
    deployment, api_key, endpoint = sk.azure_openai_settings_from_dot_env()

    # Add the AI service to the kernel
    kernel.add_text_completion_service("dv", AzureTextCompletion(deployment, endpoint, api_key))

    # Getting user input
    user_input = input("Enter your content:")

    # Generating summary
    skills_directory = "."
    print("Generating the summary............... ")
    start = time.time()
    result_sum = semanticFunctions(kernel, skills_directory,"summarizePlugin",user_input).result.split('\n')[0]
    print("Time taken(secs): ", time.time() - start)
    

    # Generating audio
    print("Creating audio.................")    
    context["content"] = result_sum
    context["speech_key"] = os.getenv("SPEECH_KEY")
    context["speech_region"] = os.getenv("SPEECH_REGION")
    start = time.time()
    nativeFunctions(kernel, context, AudioPlugin(),"audio_plugin","create_audio_file")
    print("Time taken(secs): ", time.time() - start)

    # Generating image prompts
    print("Creating Dall-e prompts.................")
    start = time.time()
    image_prompts = semanticFunctions(kernel,skills_directory,"promptPlugin",result_sum).result.split('\n\n')[0].split("<")[0].split('\n')
    print("Time taken(secs): ", time.time() - start)

    # Generating images
    print("Creating images.................")
    context["prompts"] = image_prompts
    context["api_base"] = os.getenv("DALLE_API_BASE")
    context["api_key"] = os.getenv("DALLE_API_KEY")
    context["api_version"] = os.getenv("DALLE_API_VERSION")
    start = time.time()
    nativeFunctions(kernel, context, ImagePlugin(),"image_plugin","create_image_files")
    print("Time taken(secs): ", time.time() - start)
    
    # Generating video
    print("Creating video.................")
    start = time.time()
    nativeFunctions(kernel, context, VideoPlugin(),"video_plugin","create_video_file")
    print("Time taken(secs): ", time.time() - start)
  


if __name__ == "__main__":
    start = time.time()
    main()
    print("Time taken Overall(mins): ", (time.time() - start)/60)

 

 

 

 

Explanation of the code:

 

The code above is a Python script that generates a video from a user's input text using several plugins. The script uses the Semantic Kernel to call semantic and native skills that perform various tasks such as summarizing the input text, generating audio, creating image prompts, and creating a video from the images.

The script begins by importing several modules, including the semantic_kernel module, which provides the functionality for the Semantic Kernel, and several custom plugins that are used to generate audio, images, and videos. The dotenv module is also imported to load environment variables from a .env file.

Next, a new instance of the Semantic Kernel is created, and a new context is created within the kernel. The script then configures an AI service to be used by the kernel using the azure_openai_settings_from_dot_env function, which reads the necessary settings from the .env file.

The user is then prompted to enter their input text, which is used to generate a summary of the text using a semantic skill called summarizePlugin. The semanticFunctions function is used to call this skill and return the summary.

The script then generates audio from the summary using a native skill called audio_plugin. The nativeFunctions function is used to call this skill and pass in the necessary context information, including the summary text and the API key and region for the speech service.

Next, the script generates image prompts from the summary using another semantic skill called promptPlugin. The semanticFunctions function is used to call this skill and return the image prompts.

The script then generates images from the image prompts using a native skill called image_plugin. The nativeFunctions function is used to call this skill and pass in the necessary context information, including the image prompts and the API base, key, and version for the DALL-E API.

Finally, the script generates a video from the images using another native skill called video_plugin. The nativeFunctions function is used to call this skill; the image and audio files are read from the Images and Audio folders, and the finished video is written to the Video folder.

The script also includes several print statements that provide information about the progress of the script, including the time taken to complete each step. The total time taken to complete the script is also printed at the end.

Overall, this script demonstrates how the Semantic Kernel can be used to call semantic and native skills to perform complex tasks such as generating a video from text. By using plugins that are specifically designed for each task, the script is able to generate high-quality audio, images, and video from the input text.

 

Conclusion

 

In this installment of our exploration into the convergence of Azure and OpenAI, we delved deeper into the world of possibilities with Semantic Kernel. Having ventured previously into transforming text into video using Azure's formidable AI capabilities, we've now embarked on a more intricate journey by orchestrating our application flow with the prowess of Semantic Kernel.

Semantic Kernel, a potent tool, empowers us to comprehend and manipulate textual meanings with a subtler touch. Through its deployment, we're able to fashion more intricate workflows and derive richer outcomes from the process of converting text to video.

Throughout this segment, we focused on the augmentation Semantic Kernel brings to our application, paving the way for a smoother, more efficient workflow. We've explored its features in depth, unveiled its advantages, and showcased how it has the potential to revolutionize the transformation of text into captivating video content.

A pivotal shift in our approach this time was the omission of the entity recognition step from our previous blog. This decision, carefully weighed, allowed us to streamline our process and channel our efforts into what truly matters - crafting impeccable video content from text.

It has been a thrilling voyage into the realm of Semantic Kernel and Azure. As you work through it, remember that you can always refer to our previous post for more insights.

For those uninitiated with the technologies involved, fret not. The prerequisites outlined earlier in this post provide the necessary foundation to embark on this journey.

As we forge ahead, remember that the synergy of Azure and OpenAI, combined with the modular marvels of Semantic Kernel, holds the potential to reshape how we perceive and harness the power of AI. So, buckle up and join us as we dive deeper into this exciting realm of innovation and transformation.
