Integrating Azure AI Speech Services into Unity for a Seamless Speech-to-Text Experience


Oct 03, 2024

Have you ever wanted to build a virtual assistant for your game, or let the player converse with an NPC just as they would in the real world? Integrating Artificial Intelligence into your game opens up many possibilities, and Azure AI Services provides a variety of tools that you can use within the Unity game engine. In this blog, I’ll guide you through using Azure AI Speech in Unity by building a simple speech-to-text feature.

Prerequisites:

Unity 2020.3 or later.
A subscription key for the Azure Speech service. (You can try Azure for free)
A working microphone and access to it.

Creating an Azure AI Speech resource:

Go to Azure AI Speech Studio and sign in with your Microsoft Account.
Select Settings then Create a resource. Configure it with the following settings:
- Name of new resource: Enter a unique name.
- Subscription: Your Azure subscription.
- Region: Select a supported region.
- Pricing tier: Free F0 (if available, otherwise select Standard S0).
- Resource group: Select or create a resource group with a unique name.
Select Create resource. Wait until the resource has been created and then select Use resource. The Get started with Speech page is displayed.

You can explore Speech Studio and try out its other tools, but for this tutorial we’ll be using Real-time speech to text.

Check out Microsoft Learn - Azure AI Speech to explore more.

Copy the Region and Resource Key

Click the Settings icon from the top right corner.
Copy the “region” and “resource key”.

Setting up the Unity Project

Create a new 3D Project in Unity.
In the Hierarchy panel, right-click, go to UI and select “Text - TextMeshPro”.
In the Hierarchy panel, right-click, go to UI and select “Button - TextMeshPro”.
Position these two components in the scene view.
The “Text” will be used to display your recognised speech.
The “Button” will be used to record your speech.

Importing the Speech SDK to our Unity Project

Download the Speech SDK from here.
After downloading, import the Speech SDK by selecting Assets > Import Package > Custom Package (or just double-click the downloaded package while the project is open in Unity).
Ensure that all files are selected and click Import.

Creating the script

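In the Assets folder, right-click and create a C# script named “HelloWorld.cs”. Unity expects the file name to match the name of the MonoBehaviour class it contains, and the class in the code below is called HelloWorld.
Copy and paste the code below: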

using UnityEngine;
using UnityEngine.UI;
using Microsoft.CognitiveServices.Speech;
using TMPro;
#if PLATFORM_ANDROID
using UnityEngine.Android;
#endif
#if PLATFORM_IOS
using UnityEngine.iOS;
using System.Collections;
#endif

public class HelloWorld : MonoBehaviour
{
    // Hook up the two properties below with a Text and Button object in your UI.
    public TextMeshProUGUI outputText;

    public Button startRecoButton;

    private object threadLocker = new object();
    private bool waitingForReco;
    private string message;

    private bool micPermissionGranted = false;

#if PLATFORM_ANDROID || PLATFORM_IOS
    // Required to manifest microphone permission, cf.
    // https://docs.unity3d.com/Manual/android-manifest.html
    private Microphone mic;
#endif

    public async void ButtonClick()
    {
        // Creates an instance of a speech config with specified subscription key and service region.
        // Replace the placeholders with your own resource key and service region (e.g., "westus").
        var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

        // Make sure to dispose the recognizer after use!
        using (var recognizer = new SpeechRecognizer(config))
        {
            lock (threadLocker)
            {
                waitingForReco = true;
            }

            // Starts speech recognition, and returns after a single utterance is recognized. The end of a
            // single utterance is determined by listening for silence at the end or until a maximum of 15
            // seconds of audio is processed.  The task returns the recognition text as result.
            // Note: Since RecognizeOnceAsync() returns only a single utterance, it is suitable only for single
            // shot recognition like command or query.
            // For long-running multi-utterance recognition, use StartContinuousRecognitionAsync() instead.
            var result = await recognizer.RecognizeOnceAsync().ConfigureAwait(false);

            // Checks result.
            string newMessage = string.Empty;
            if (result.Reason == ResultReason.RecognizedSpeech)
            {
                newMessage = result.Text;
            }
            else if (result.Reason == ResultReason.NoMatch)
            {
                newMessage = "NOMATCH: Speech could not be recognized.";
            }
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = CancellationDetails.FromResult(result);
                newMessage = $"CANCELED: Reason={cancellation.Reason} ErrorDetails={cancellation.ErrorDetails}";
            }

            lock (threadLocker)
            {
                message = newMessage;
                waitingForReco = false;
            }
        }
    }

    void Start()
    {
        if (outputText == null)
        {
            UnityEngine.Debug.LogError("outputText property is null! Assign a UI Text element to it.");
        }
        else if (startRecoButton == null)
        {
            message = "startRecoButton property is null! Assign a UI Button to it.";
            UnityEngine.Debug.LogError(message);
        }
        else
        {
            // Continue with normal initialization, Text and Button objects are present.
#if PLATFORM_ANDROID
            // Request to use the microphone, cf.
            // https://docs.unity3d.com/Manual/android-RequestingPermissions.html
            message = "Waiting for mic permission";
            if (!Permission.HasUserAuthorizedPermission(Permission.Microphone))
            {
                Permission.RequestUserPermission(Permission.Microphone);
            }
#elif PLATFORM_IOS
            if (!Application.HasUserAuthorization(UserAuthorization.Microphone))
            {
                Application.RequestUserAuthorization(UserAuthorization.Microphone);
            }
#else
            micPermissionGranted = true;
            message = "Click button to recognize speech";
#endif
            startRecoButton.onClick.AddListener(ButtonClick);
        }
    }

    void Update()
    {
#if PLATFORM_ANDROID
        if (!micPermissionGranted && Permission.HasUserAuthorizedPermission(Permission.Microphone))
        {
            micPermissionGranted = true;
            message = "Click button to recognize speech";
        }
#elif PLATFORM_IOS
        if (!micPermissionGranted && Application.HasUserAuthorization(UserAuthorization.Microphone))
        {
            micPermissionGranted = true;
            message = "Click button to recognize speech";
        }
#endif

        lock (threadLocker)
        {
            if (startRecoButton != null)
            {
                startRecoButton.interactable = !waitingForReco && micPermissionGranted;
            }
            if (outputText != null)
            {
                outputText.text = message;
            }
        }
    }
}

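Enter the resource key and region you copied as the two string arguments of SpeechConfig.FromSubscription on line 36 of the script: the key first, then the region (for example, "westus").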

Save the script and go back to Unity.

Attach the script to the Canvas GameObject.

In the Inspector, drag the Text object you created into the script’s “Output Text” field and the Button object into its “Start Reco Button” field.
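If you would rather not paste the key directly into the script, one option is to read it from environment variables. The helper below is only a minimal sketch: the class name SpeechCredentials and the variable names SPEECH_KEY and SPEECH_REGION are assumptions made for illustration, not something the Speech SDK requires. Environment variables are mainly practical in the Unity Editor and in desktop builds.

```csharp
using System;

// Hypothetical helper (not part of the Speech SDK): reads the Speech resource
// key and region from environment variables instead of hard-coding them.
public static class SpeechCredentials
{
    // Assumed variable names; any names work as long as they match your setup.
    public const string KeyVariable = "SPEECH_KEY";
    public const string RegionVariable = "SPEECH_REGION";

    // Returns true only when both values are present and non-empty.
    // The lookup is passed in so the logic can be exercised without
    // touching the real environment.
    public static bool TryLoad(Func<string, string> lookup, out string key, out string region)
    {
        key = lookup(KeyVariable);
        region = lookup(RegionVariable);
        return !string.IsNullOrWhiteSpace(key) && !string.IsNullOrWhiteSpace(region);
    }

    // Convenience overload that reads from the process environment.
    public static bool TryLoad(out string key, out string region)
    {
        return TryLoad(Environment.GetEnvironmentVariable, out key, out region);
    }
}
```

Inside ButtonClick you could then call SpeechCredentials.TryLoad(out var key, out var region) and pass the two values to SpeechConfig.FromSubscription(key, region), showing an error message in the Text object when it returns false.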

Testing

Now click the Play button in Unity to enter Play mode.
In Play mode, click the button and speak into your microphone.
You should see your speech converted into text.

In this way, you can integrate Azure AI Speech into Unity and use it for speech to text. You can modify the script further to suit your own needs and turn your ideas into reality.
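One possible modification is continuous recognition, which the comments in the script mention as the alternative to RecognizeOnceAsync() for longer, multi-utterance input. The component below is a rough sketch of that idea, not a drop-in replacement: the class name ContinuousSpeech, the "en-US" language and the key and region placeholders are assumptions, and the microphone permission handling for Android and iOS is left out for brevity.

```csharp
using System.Collections.Concurrent;
using Microsoft.CognitiveServices.Speech;
using TMPro;
using UnityEngine;

// Hypothetical component: save it as ContinuousSpeech.cs, attach it to a
// GameObject and assign outputText in the Inspector.
public class ContinuousSpeech : MonoBehaviour
{
    public TextMeshProUGUI outputText;

    private SpeechRecognizer recognizer;

    // Recognizer events are raised on a background thread, so results are
    // queued here and applied to the UI from Update() on the main thread.
    private readonly ConcurrentQueue<string> phrases = new ConcurrentQueue<string>();

    async void Start()
    {
        // Placeholders: replace with your own resource key and region.
        var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
        config.SpeechRecognitionLanguage = "en-US"; // assumed language

        recognizer = new SpeechRecognizer(config);

        // Raised once for every utterance that has been fully recognized.
        recognizer.Recognized += (sender, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                phrases.Enqueue(e.Result.Text);
            }
        };

        // Raised on errors such as a wrong key or a lost connection.
        recognizer.Canceled += (sender, e) =>
        {
            phrases.Enqueue($"CANCELED: Reason={e.Reason} ErrorDetails={e.ErrorDetails}");
        };

        await recognizer.StartContinuousRecognitionAsync();
    }

    void Update()
    {
        // Show the most recent phrase; earlier ones in the queue are overwritten.
        while (phrases.TryDequeue(out string phrase))
        {
            if (outputText != null)
            {
                outputText.text = phrase;
            }
        }
    }

    async void OnDestroy()
    {
        if (recognizer != null)
        {
            // Stop listening before releasing the recognizer.
            await recognizer.StopContinuousRecognitionAsync();
            recognizer.Dispose();
        }
    }
}
```

Unlike the button-driven script, this version starts listening as soon as the scene starts and keeps updating the text until the object is destroyed, which is closer to what a conversational NPC would need.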

Learn more about the different use cases of Azure AI Speech and Text to Speech to understand its full potential. Here are some resources: