What’s Missing and How I Built a Practical Workaround (Arabic-English scenario)
Introduction
- Gap: Azure Speech Containers (disconnected or not) do not currently support dynamic multi-language switching inside a single STT session. The cloud service does (via auto language detection in the recognizer), but the offline containers require a fixed language per session.
- Impact: Any audio that mixes languages (e.g., English ↔ Arabic) cannot be transcribed with one continuous container recognizer that “auto-switches”.
- Workaround we implemented:
  - Run Language Identification (LID) in a container to detect language segments in real time.
  - Maintain one active cloud STT recognizer at a time, streaming the appropriate audio portion to it.
  - On each LID language switch, gracefully stop the current recognizer, start a new one with the new language, and continue streaming.
  - Merge results and write out a single, ordered transcript.
This gives you near-live behavior (partial + final hypotheses) and high accuracy, while keeping the LID fully local and only using the cloud for transcription.
Why dynamic switching is tricky with disconnected containers
The behavior users want is: “listen once, detect languages on the fly, and switch STT models automatically without restarting the session.” Today:
- Cloud STT supports auto-detecting multiple languages and will continuously decode while switching (within constraints).
- STT containers expect one language per recognizer session. There is no supported stateful “switch language mid-stream” in a single container recognizer. Practically, you would need to close the recognizer and recreate it with a new language argument—an operation that is too slow and brittle for seamless live transcription.
Consequence: If your use case requires dynamic, mid-utterance language switching and you must run disconnected, the containers alone won’t deliver the seamless experience.
Design goals
- Low latency: Start producing text as soon as speech is detected, not after the entire file is analyzed.
- Stable switching: Don’t flap between languages; switch only when the LID signal is clear.
- Good accuracy: Use the best decoder per language.
- Operationally simple: One script, minimal external dependencies, sane logging.
The architecture
┌──────────────────────────────────────┐
│ Your App                             │
│ (Controller + StreamingTranscriber)  │
└───────────────────┬──────────────────┘
                    │
                    │  LID events (lang + time spans)
                    ▼
┌──────────────────────────────────────┐
│ LID Container                        │
│ host=ws://localhost:5003             │
│ (no resource path)                   │
└──────────────────────────────────────┘
     language=en-US                   language=ar-SA
           │                                 │
           │ PCM (push)                      │ PCM (push)
           ▼                                 ▼
┌──────────────────────────┐   ┌──────────────────────────┐
│ STT Container (EN)       │   │ STT Container (AR)       │
│ host=ws://localhost:5004 │   │ host=ws://localhost:5005 │
│ recognizer.lang=en-US    │   │ recognizer.lang=ar-SA    │
└──────────────────────────┘   └──────────────────────────┘
           ▲                                 ▲
           │ transcript (partials + finals)  │ transcript
           └────────────────┬────────────────┘
                            ▼
               merged → final JSON + text
Key idea: Only one active recognizer at a time. The controller streams a moving “budget” of audio frames up to the next language boundary. When LID says the language just changed, we stop, save the final text, then start a new recognizer for the next language.
What happens at runtime
- LID streams over the WAV and emits language spans (e.g., en-US 0–5.5s, ar-SA 5.5–8.5s, …).
- Your controller starts exactly one active StreamingTranscriber at a time.
- While the language remains the same, it keeps pushing audio to the matching STT container.
- On language change, it stops the current recognizer (saving a small tail overlap to avoid clipping), aggregates the transcript, and starts the next recognizer pointing at the other container (or the same container with a changed language).
- Results are merged in order and written as JSON + a readable transcript.
Controller/Timing Graph
Time ─────────────────────────────────────────────────────────────────►
LID [en-US: 0.0 ─ 5.5] [ar-SA: 5.5 ─ 8.5] [en-US: 8.5 ─ 11.5] [ar-SA: …]
STT(en) ┌──────── stream to EN container ────────┐ ┌─── stream ──┐
│ partials: "~ hello …" finals: "…" │ │ partials… │
└──────────── stop (tail +200ms) ────────┘ └─ stop … ────┘
STT(ar) ┌── stream to AR container ───┐
│ partials: "~ لديه …" finals │
└───────── stop (tail) ───────┘
- Tail overlap (≈200ms) on stop helps avoid chopping the last phoneme.
- Chunk size (e.g., 40ms) lets partials feel responsive.
- Switching is instantaneous from the app perspective: it’s just stopping one stream and starting the other.
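To make those numbers concrete, here is a minimal sketch (assuming the 16 kHz, 16-bit mono PCM format the writer thread produces) of how chunk and tail durations translate into byte counts:

```python
# Minimal sketch: turning chunk/tail durations into PCM byte counts.
# Assumes 16 kHz, 16-bit (2-byte), mono PCM, as pushed by the writer thread.
SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2   # PCM16
CHANNELS = 1

def bytes_for_ms(ms: int) -> int:
    """Number of PCM bytes covering `ms` milliseconds of audio."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * ms // 1000

CHUNK_BYTES = bytes_for_ms(40)    # 40 ms chunk  -> 1,280 bytes per push
TAIL_BYTES  = bytes_for_ms(200)   # 200 ms tail  -> 6,400 extra bytes at each stop
```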
The core building blocks (what the code does)
1) LIDStream — Continuous language identification (container)
- Connects to your LID container using SpeechConfig(host=ws://host:port) (note: host only, no resource path).
- Emits events of the form (language, start_hns, end_hns) while scanning the file.
- We log every detection and pass it to the controller.
Why container here?
You wanted LID to run locally/offline. The container does this well and yields time-aligned segments.
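A minimal sketch of that wiring, assuming the Python Speech SDK; the port, WAV filename, and language list are illustrative values:

```python
import azure.cognitiveservices.speech as speechsdk

# Container LID: host only, no resource path.
lid_config = speechsdk.SpeechConfig(host="ws://localhost:5003")

# Candidate languages for identification (illustrative list).
auto_detect = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
    languages=["en-US", "ar-SA"]
)
audio_config = speechsdk.audio.AudioConfig(filename="mixed_en_ar.wav")

# SourceLanguageRecognizer emits `recognized` events carrying the detected
# language plus offset/duration (in 100-ns ticks) for each span.
lid_recognizer = speechsdk.SourceLanguageRecognizer(
    speech_config=lid_config,
    auto_detect_source_language_config=auto_detect,
    audio_config=audio_config,
)
```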
2) StreamingTranscriber — A single, active streaming session
- When started, it spawns a writer thread that:
- Reads WAV frames from the file, at real-time-ish chunk size (default 40 ms)
- Converts to PCM16 mono 16 kHz (via audioop, OK on Python 3.12)
- Pushes to the recognizer input stream
- Keeps an internal “latest end” budget that the controller can extend as LID emits more time for the current language.
- On stop, we add a small tail overlap (e.g., 200 ms) to avoid truncating word endings at boundaries.
- Logs partials (~) and finals (✔) in real time. On stop, returns the final combined text for that language span.
Why a moving budget?
LID detections arrive continuously. We push audio up to the most recent end-time for the current language, keeping latency low. When the language flips, we stop exactly at the boundary.
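A condensed sketch of the writer-thread/budget idea; the class shape and attribute names are illustrative rather than the exact implementation, and the recognizer wiring is shown separately in the AzureCloudSTT step below:

```python
import threading, time, wave
import azure.cognitiveservices.speech as speechsdk

class StreamingTranscriber:
    """Sketch: push WAV frames to one recognizer, up to a moving end-time budget."""

    def __init__(self, wav_path, language, start_hns, chunk_ms=40, tail_ms=200):
        self.wav_path, self.language = wav_path, language
        self.start_hns = start_hns
        self.end_hns = start_hns              # budget: latest LID end for this language
        self.tail_hns = tail_ms * 10_000      # tail overlap; 1 ms == 10,000 HNS ticks
        self.chunk_ms = chunk_ms
        self.push_stream = speechsdk.audio.PushAudioInputStream()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._writer, daemon=True)

    def start(self):
        self._thread.start()

    def extend(self, new_end_hns):
        # Called by the controller on every LID event for the same language.
        self.end_hns = max(self.end_hns, new_end_hns)

    def stop(self):
        self._stop.set()
        self._thread.join()
        self.push_stream.close()
        # The real implementation also stops its recognizer and returns the
        # combined final text for this language span.

    def _writer(self):
        with wave.open(self.wav_path, "rb") as wf:
            rate = wf.getframerate()
            wf.setpos(int(self.start_hns / 10_000_000 * rate))   # seek to segment start
            frames_per_chunk = rate * self.chunk_ms // 1000
            while not self._stop.is_set():
                pos_hns = int(wf.tell() / rate * 10_000_000)
                if pos_hns >= self.end_hns + self.tail_hns:
                    time.sleep(self.chunk_ms / 1000)   # budget exhausted: wait for LID
                    continue
                data = wf.readframes(frames_per_chunk)
                if not data:
                    break
                self.push_stream.write(data)           # assumes PCM16 mono 16 kHz frames
                time.sleep(self.chunk_ms / 1000)       # real-time-ish pacing
```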
3) Controller — The finite-state machine that orchestrates switching
Logic in plain English:
- First LID event starts the first transcriber for that language at that start time.
- Subsequent LID events extend the same language’s end time (we keep streaming).
- When LID reports a new language, we:
- Stop the current transcriber → persist its final text.
- Record the elapsed segment (start, end, language).
- Start a new transcriber for the new language at the new start.
- At the end of the file or LID session, we stop whatever is active and finalize.
This yields:
- A sequence of recognized blocks (language spans) with clean, ordered text.
- A merged LID segments.json for auditing.
- A final transcript string that concatenates recognized spans, e.g.:
[en-US] … [ar-SA] … [en-US] …
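In code, the controller reduces to a small state machine keyed on the current language. This is a sketch: on_lid_event is the callback fed by the LID recognizer, and start_transcriber is a hypothetical helper that builds and starts a StreamingTranscriber like the one sketched earlier:

```python
results = []       # finalized spans: language, start_hns, end_hns, transcript
current = None     # the single active StreamingTranscriber (or None)

def on_lid_event(language, start_hns, end_hns):
    global current
    if current is None:
        current = start_transcriber(language, start_hns)   # first detection
    elif language == current.language:
        current.extend(end_hns)                            # same language: extend budget
    else:
        _finalize_current()                                # language changed: stop & persist
        current = start_transcriber(language, start_hns)   # then switch

def _finalize_current():
    global current
    text = current.stop()          # real stop() returns the combined final text
    results.append({
        "language": current.language,
        "start_hns": current.start_hns,
        "end_hns": current.end_hns,
        "transcript": text,
    })
    current = None

def finish():
    # End of file / LID session: stop whatever is still active.
    if current is not None:
        _finalize_current()
```

The key invariant is that current is the only live recognizer at any moment, which is the "one active recognizer at a time" rule from the architecture above.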
The sequence on the timeline
Time ───────────────────────────────────────────────────────────────────►
Lang en-US ar-SA en-US ar-SA en-US
LID [---------detected------][--][------detected------][---][---…]
STT └─── start en-US ────┐ │ └─ stop ─┐ start en-US… …
└─ switch to ar-SA ─┘ └─ switch to ar-SA ─┘
Tail overlap at each stop prevents word clipping at boundaries.
Step-by-step walkthrough of the (working) code
The filenames/classes below match the working implementation you ran:
LIDStream, AzureCloudSTT, StreamingTranscriber, and the top-level process_audio_file.
- Initialize logging and env
- Load .env (if present) without overwriting real env.
- Verbose logs write to console and fixed_module_test.txt.
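A sketch of that setup, assuming python-dotenv is available; override=False keeps variables already set in the real environment intact, and the log file name matches the one above:

```python
import logging
from dotenv import load_dotenv

# Load .env if present, but never overwrite variables already set in the environment.
load_dotenv(override=False)

# Verbose logs go to both the console and fixed_module_test.txt.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler("fixed_module_test.txt", encoding="utf-8"),
    ],
)
```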
- Spin up LIDStream
- SpeechConfig(host=ws://localhost:5003) (no resource path).
- AutoDetectSourceLanguageConfig(languages=[…]).
- Subscribe to recognized to receive offset/duration JSON → push (lang, start, end) to the controller.
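The recognized handler itself stays small. A sketch, reusing the lid_recognizer from the earlier LID snippet; push_to_controller is a placeholder for however you hand events to the controller:

```python
def on_recognized(evt):
    result = evt.result
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        return
    detected = speechsdk.AutoDetectSourceLanguageResult(result).language
    start_hns = result.offset                    # 100-ns ticks from stream start
    end_hns = result.offset + result.duration
    logging.info("LID %s: %d -> %d hns", detected, start_hns, end_hns)
    push_to_controller(detected, start_hns, end_hns)

lid_recognizer.recognized.connect(on_recognized)
lid_recognizer.start_continuous_recognition()
```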
- Create AzureCloudSTT
- From env: SPEECH_KEY + SPEECH_REGION or SPEECH_ENDPOINT.
- Sanity checks: if the endpoint lacks a /speech/… path but a region is available, prefer region mode.
- Returns a configured SpeechRecognizer per language when needed.
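A sketch of that factory logic; the function name is illustrative, and the region-over-endpoint preference mirrors the sanity check above:

```python
import os
import azure.cognitiveservices.speech as speechsdk

def make_cloud_recognizer(language, push_stream):
    """Sketch: one cloud SpeechRecognizer for a given language over a push stream."""
    key = os.environ["SPEECH_KEY"]
    region = os.environ.get("SPEECH_REGION")
    endpoint = os.environ.get("SPEECH_ENDPOINT")

    if region:  # prefer region mode when available (more stable)
        speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    else:
        speech_config = speechsdk.SpeechConfig(subscription=key, endpoint=endpoint)

    speech_config.speech_recognition_language = language
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
    return speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
```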
- On first LID event
- Start StreamingTranscriber with (language, start_hns).
- It starts a push stream and a writer thread, pushing PCM at chunk_ms cadence.
- On repeated events of the same language
- Extend the “latest end” budget in the transcriber. It keeps pushing.
- On language change
- Stop the active transcriber, collect finals (combined text), append to the results.
- Persist a segment record with (language, start_hns, end_hns).
- Start a new transcriber for the new language at the new start.
- On completion
- Stop any remaining transcriber, finalize its text.
- Merge all raw LID events into a compact segments array and write --segments.
- Assemble the full transcript string and write --output JSON:
{ "audio_file": "...", "segment_count": N, "segments": [ { "Language": "en-US", "StartTimeHns": ..., "DurationHns": ... }, ... ], "recognized_results": [ { "language": "en-US", "start_hns": ..., "end_hns": ..., "transcript": "..." }, ... ], "full_transcript": "[en-US] ... [ar-SA] ... ..." }
- Optional incremental saves
- With --flush-after-stop, we write the current state of the transcript after every language stop, so long files produce usable output while they run.
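A sketch of that incremental flush, assuming the results list shape from the controller sketch above:

```python
import json

def flush_partial(results, output_path):
    """Sketch: dump whatever has been finalized so far after each language stop."""
    snapshot = {
        "segment_count": len(results),
        "recognized_results": results,
        "full_transcript": " ".join(
            f"[{r['language']}] {r['transcript']}" for r in results
        ),
    }
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f, ensure_ascii=False, indent=2)
```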
Practical tuning knobs
- --chunk-ms (default: 40): Smaller chunks → snappier partials but more CPU.
- --tail-overlap-ms (default: 200): Helps prevent cutting mid-phoneme at switches.
- --min-segment-sec: If you want to ignore ultra-short LID blips at the controller layer, raise this.
- Env: prefer REGION mode (stable) unless you specifically need a custom ENDPOINT.
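Those knobs map onto a plain argparse block. In this sketch the --chunk-ms and --tail-overlap-ms defaults come from the article; the remaining defaults and help strings are illustrative:

```python
import argparse

parser = argparse.ArgumentParser(description="LID-driven bilingual transcription")
parser.add_argument("audio_file", help="Input WAV file")
parser.add_argument("--output", help="Path for the final JSON transcript")
parser.add_argument("--segments", help="Path for the merged LID segments JSON")
parser.add_argument("--chunk-ms", type=int, default=40,
                    help="Audio chunk size pushed to the recognizer (ms)")
parser.add_argument("--tail-overlap-ms", type=int, default=200,
                    help="Extra audio streamed after a language stop (ms)")
parser.add_argument("--min-segment-sec", type=float, default=0.0,
                    help="Ignore LID segments shorter than this (seconds)")
parser.add_argument("--flush-after-stop", action="store_true",
                    help="Write the transcript-so-far after every language stop")
args = parser.parse_args()
```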
Troubleshooting checklist
- Cloud auth: ensure SPEECH_KEY + SPEECH_REGION (or valid SPEECH_ENDPOINT). No ws:// for cloud.
- Container LID URL: must be host=ws://host:port (no /speech/... path).
- SPXERR_SWITCH_MODE_NOT_ALLOWED: don’t “switch” a recognizer’s language; stop and create a new recognizer.
- No text in finals: verify the source WAV sample rate/channels; our writer converts to PCM16 mono 16 kHz before push.
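For reference, a rough sketch of such a conversion helper (the function name is illustrative; audioop ships with Python through 3.12, see the resampling note further down):

```python
import audioop

def to_pcm16_mono_16k(frames, sample_width, channels, rate, state=None):
    """Sketch: normalize WAV frames to PCM16 mono 16 kHz before pushing to the stream."""
    if channels == 2:
        frames = audioop.tomono(frames, sample_width, 0.5, 0.5)   # average both channels
    if sample_width != 2:
        frames = audioop.lin2lin(frames, sample_width, 2)         # convert to 16-bit samples
    if rate != 16000:
        frames, state = audioop.ratecv(frames, 2, 1, rate, 16000, state)  # resample
    return frames, state
```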
What this approach doesn’t do (yet)
- True, single-session auto-switch in a disconnected STT container: not supported today.
- Perfect diarization across languages: we track languages and times, not speakers.
- Live microphone input: the same push-stream concept works for mic frames, but this article focuses on WAV.
Where to take it next
- Backpressure & queueing: add a bounded queue between LID and the controller if you switch to live sources.
- Mic/live input: attach an input capture thread feeding the same push-stream; the rest of the logic remains unchanged.
- Resampling without audioop: future-proof for Python 3.13+ (where audioop is removed) using ffmpeg or resampy.
Access the SAMPLE code:
Container Language Identification
Final thoughts
If you need dynamic bilingual transcription with an offline LID requirement, this hybrid design is a practical, production-friendly solution today:
- Keep LID on-prem with the container,
- Use cloud STT for robust decoding and supported switching behavior,
- Orchestrate clean start/stop transitions at language boundaries.
It’s simple, fast, and—most importantly—it works.