The Street Sweeper Bot: How Semantic Kernel Turns Telegram Chats into Actionable City Triage

vikaspandey

Microsoft

Nov 19, 2025

The promise of AI isn't just about efficiency; it's about empowerment. Imagine a world where reporting a pile of illegally dumped garbage on the side of the road is as simple as snapping a photo or recording a short video. This isn't a futuristic dream—it's the reality we can build today with intelligent, multi-modal bots powered by Microsoft Semantic Kernel and the Azure ecosystem. In this post, I'll walk you through building a CivicBot—a smart assistant designed to streamline citizen reporting via Telegram. This bot doesn't just "listen"; it understands intent, processes images and audio, and autonomously manages the lifecycle of a support ticket.

Project Objectives & Impact

The core objectives of this CivicBot project are:

Simplify Reporting: Make it incredibly easy for citizens to report issues via a familiar platform like Telegram.
Enhance Accessibility: Support diverse inputs (text, audio, images) to cater to all users.
Automate Triage & Logging: Automatically categorize issues, extract details, and log them into a structured database.
Enforce Guardrails: Use AI to strictly filter out irrelevant or inappropriate submissions, focusing on valid civic issues.

How it helps: By allowing users to simply send a photo of garbage on the road, the system drastically reduces the friction of civic engagement and ensures city services receive accurate, timestamped, and location-contextualized evidence, leading to faster resolution times.

Prerequisites: Getting Started

1. Getting Your Telegram Bot Token

The Telegram Bot API requires a unique token to authenticate your application. You obtain this token by interacting with Telegram's official bot creation tool, BotFather.

Here are the steps to get your required TelegramBotToken:

Find BotFather: Open Telegram and search for the verified user @BotFather.
Start the Conversation: Send the command /start to BotFather.
Create a New Bot: Send the command /newbot.
Name Your Bot: BotFather will ask for a display name (e.g., "Civic Reporting Bot"). Enter your desired name.
Set a Username: BotFather will then ask for a unique username. This must end with the word "bot" (e.g., CivicDemoBot or civic_demo_bot). Enter a unique username.
Receive the Token: Upon success, BotFather replies with a message containing "Use this token to access the HTTP API:" followed by the token string.
Store the Token: Copy this token. This is your TelegramBotToken that you will configure in your application settings (e.g., appsettings.json or Azure App Service configuration).
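
Once the token is stored, a quick sanity check at startup can catch copy-paste mistakes early. The sketch below is illustrative only: it assumes the token is exposed through an environment variable named TelegramBotToken, and the TelegramToken helper and its loose format check are hypothetical, not part of the original project. It also builds the URL for the Bot API's getMe method, which returns basic information about the bot when the token is valid.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TelegramToken
{
    // Loose shape check: numeric bot id, a colon, then a secret made of
    // letters, digits, '_' or '-'. This is a heuristic, not an official rule.
    private static readonly Regex Shape = new Regex(@"^\d+:[A-Za-z0-9_-]{20,}$");

    public static bool LooksValid(string token) =>
        !string.IsNullOrWhiteSpace(token) && Shape.IsMatch(token.Trim());

    // Bot API methods are called at https://api.telegram.org/bot<token>/<method>.
    public static Uri GetMeUri(string token)
    {
        if (!LooksValid(token))
            throw new ArgumentException("Value does not look like a bot token.", nameof(token));
        return new Uri($"https://api.telegram.org/bot{token.Trim()}/getMe");
    }

    // Assumption: the token is supplied via an environment variable
    // named TelegramBotToken (App Service settings surface this way).
    public static string LoadFromEnvironment()
    {
        var token = Environment.GetEnvironmentVariable("TelegramBotToken");
        if (!LooksValid(token))
            throw new InvalidOperationException("TelegramBotToken is missing or malformed.");
        return token.Trim();
    }
}
```

Sending a GET request to the URI returned by GetMeUri is a simple way to confirm that the token is accepted before wiring up the rest of the bot.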

2. Downloading and Configuring FFmpeg

Since Telegram voice notes use the OGG/Opus audio format, we use the open-source tool FFmpeg to convert them into the WAV format expected by Azure AI Speech-to-Text.

Download FFmpeg: Navigate to the official FFmpeg site and download the latest stable build for Windows.
Extract the Executable: The download is typically a ZIP file. Extract all of its contents.
Create Folder: In your project's root directory, create a new folder named ffmpeg.
Place Executable: Place the ffmpeg.exe file and all associated files directly into the newly created ffmpeg folder.
Configure appsettings.json: Add the following key-value pair to your appsettings.json file. This is how the C# code locates the executable during runtime:
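
At startup, the configured value might be resolved along these lines. This is a sketch under stated assumptions: the setting name FfmpegPath, the default value ffmpeg/ffmpeg.exe, and the FfmpegLocator helper are all hypothetical, and the key used by the original project may differ. It also assumes the ffmpeg folder is copied to the build output directory.

```csharp
using System;
using System.IO;

public static class FfmpegLocator
{
    // Hypothetical default that mirrors the folder layout described above.
    public const string DefaultRelativePath = "ffmpeg/ffmpeg.exe";

    // configuredPath: value read from appsettings.json (key name assumed,
    // e.g. "FfmpegPath"). baseDirectory: normally AppContext.BaseDirectory.
    public static string Resolve(string configuredPath, string baseDirectory)
    {
        var path = string.IsNullOrWhiteSpace(configuredPath)
            ? DefaultRelativePath
            : configuredPath.Trim();

        // Absolute paths are used as-is; relative ones are anchored
        // at the application's base directory.
        return Path.IsPathRooted(path)
            ? path
            : Path.GetFullPath(Path.Combine(baseDirectory, path));
    }

    // Fails fast if the executable is missing, instead of failing
    // later when the first voice note arrives.
    public static string ResolveOrThrow(string configuredPath)
    {
        var fullPath = Resolve(configuredPath, AppContext.BaseDirectory);
        if (!File.Exists(fullPath))
            throw new FileNotFoundException("FFmpeg executable not found.", fullPath);
        return fullPath;
    }
}
```

Calling ResolveOrThrow once during startup surfaces a misplaced executable immediately rather than during a user conversation.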

Architecture Overview

The architecture is a symphony of Azure services, orchestrated by a .NET Core application hosted on Azure App Service.

Technical Deep Dive: The Semantic Kernel Components

The core intelligence is driven by the Semantic Kernel Orchestrator, which coordinates five distinct plugins (components):

Component Name (in Code)	Simplified Role	Core Functionality
SessionManagementPlugin	Orchestrator (Decision Logic)	Uses LLM reasoning to determine the Next Best Action: CREATE_TICKET, COLLECT_MORE_INFO, REQUEST_CONFIRMATION, etc.
AIAnalysisPlugin	Image / Audio (Intelligence)	Contains Semantic Functions (prompts) for content classification, guardrails enforcement, and analysis via the AI Foundry Vision Endpoint.
MediaProcessingPlugin	Audio / Image (Handling)	Manages the file lifecycle, including preparation and delegation of files for analysis.
CosmosDbPlugin	Ticketing (Persistence)	Saves the final ComplaintTicket object and conversation history to Azure Cosmos DB.
BlobStoragePlugin	Ticketing (Evidence Storage)	Uploads raw media files to Azure Blob Storage, returning a secure URL for the ticket record.
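
The decision step in the first row can be pictured as mapping the model's text reply onto a closed set of actions. The sketch below assumes the model is prompted to answer with a single action label; the NextBestAction enum, the parser, and the fallback policy are illustrative and not taken from the project's code. Only the three labels named in the table are covered.

```csharp
using System;

public enum NextBestAction
{
    CollectMoreInfo,      // COLLECT_MORE_INFO
    RequestConfirmation,  // REQUEST_CONFIRMATION
    CreateTicket          // CREATE_TICKET
}

public static class NextBestActionParser
{
    // Maps a raw model reply such as " create_ticket.\n" to an action.
    // Unknown or empty replies fall back to CollectMoreInfo so that a
    // malformed reply never creates a ticket by accident.
    public static NextBestAction Parse(string modelReply)
    {
        var label = (modelReply ?? string.Empty)
            .Trim()                 // surrounding whitespace
            .Trim('.', '"', '`')    // stray punctuation or quoting
            .ToUpperInvariant();

        switch (label)
        {
            case "CREATE_TICKET":
                return NextBestAction.CreateTicket;
            case "REQUEST_CONFIRMATION":
                return NextBestAction.RequestConfirmation;
            default:
                return NextBestAction.CollectMoreInfo;
        }
    }
}
```

Falling back to the least consequential action is a deliberate choice here: asking one extra question costs little, while logging a ticket the user never confirmed does not.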

Multi-Modal Pre-Processing: The FFmpeg Step

A critical detail for handling Telegram audio (OGG/Opus format) is the conversion step required by Azure AI Speech-to-Text. We integrate FFmpeg directly into our application's audio-processing logic, ensuring that only compatible streams are sent to the AI service.
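
A fuller sketch of this conversion step might look like the following. It reuses the FFmpeg flags from the handler snippet in this post (16 kHz, mono, WAV); the OggToWavConverter class, its method names, and the error handling are assumptions for illustration rather than the project's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class OggToWavConverter
{
    // -y overwrites the output, -ar 16000 resamples to 16 kHz,
    // -ac 1 downmixes to mono, -f wav forces the WAV container.
    public static IReadOnlyList<string> BuildArguments(string inputPath, string outputPath) =>
        new[] { "-y", "-i", inputPath, "-ar", "16000", "-ac", "1", "-f", "wav", outputPath };

    public static void Convert(string ffmpegPath, string inputPath, string outputPath)
    {
        var startInfo = new ProcessStartInfo(ffmpegPath)
        {
            UseShellExecute = false,
            CreateNoWindow = true,
            RedirectStandardError = true
        };

        // ArgumentList escapes each argument, so paths with spaces are safe.
        foreach (var arg in BuildArguments(inputPath, outputPath))
            startInfo.ArgumentList.Add(arg);

        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException("Could not start FFmpeg.");

        // Drain stderr before waiting so a full pipe cannot block FFmpeg.
        string errors = process.StandardError.ReadToEnd();
        process.WaitForExit();

        if (process.ExitCode != 0)
            throw new InvalidOperationException(
                $"FFmpeg exited with code {process.ExitCode}: {errors}");
    }
}
```

Keeping the argument list in its own method makes the flags easy to check in a unit test without having FFmpeg installed.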

// Snippet from TelegramBotUpdateHandler.cs demonstrating the FFmpeg conversion step
using (var proc = new System.Diagnostics.Process())
{
    // ... setup for FFmpeg path
    // Convert input (OGG) to 16kHz mono WAV for Azure Speech
    proc.StartInfo.Arguments = $"-y -i \"{tempIn}\" -ar 16000 -ac 1 -f wav \"{tempWav}\"";
    // Execute conversion...
}
// The converted stream is then passed to the Azure AI Speech SDK.

Conversation Screenshots

1. User sending a photo of the garbage, followed by the bot's confirmation prompt.

In this scenario, the bot processes the image and replies with a confirmation prompt before proceeding.

Console logs while chatting in this scenario

Cosmos DB document saved after this conversation

{
  "id": "6672d3d5-48c6-42d7-90c4-05bda042a4d4",
  "userId": 801,
  "chatId": 801,
  "userFullName": "K T",
  "mobileNumber": null,
  "textSummary": "Location: Street 16, Sector 19, Gurugram \nIssue: Garbage \nSeverity: High; Location: Street 16, Sector 19, Gurugram \nIssue: Garbage removal \nSeverity: High",
  "attachments": [
    {
      "fileType": "Image",
      "blobUrl": "https://abc.blob.core.windows.net/complaints/fc3bb274-5566-47a8-b56c-ae2577eed040.jpeg",
      "caption": "Please help removing this garbage from Street 16, Sector 19, Gurugram",
      "Transcript": null
    }
  ],
  "conversations": [
    {
      "role": "user",
      "content": "/start",
      "timestamp": "2025-11-18T18:45:27.5063715Z",
      "messageType": "text"
    },
    {
      "role": "assistant",
      "content": "Hello there! I'm CivicBot, your friendly sanitation assistant here to help with any garbage, waste, drains, or sewage concerns you may have. What can I assist you with today?",
      "timestamp": "2025-11-18T18:45:31.0887884Z",
      "messageType": "text"
    },
    {
      "role": "user",
      "content": "Please help removing this garbage from Street 16, Sector 19, Gurugram",
      "timestamp": "2025-11-18T18:46:08.1713574Z",
      "messageType": "photo"
    },
    {
      "role": "assistant",
      "content": "I've noted your request for garbage removal on Street 16, Sector 19, Gurugram, marked as a high-severity issue. Ready to log this?",
      "timestamp": "2025-11-18T18:46:27.8353554Z",
      "messageType": "text"
    },
    {
      "role": "user",
      "content": "Yes",
      "timestamp": "2025-11-18T18:46:45.242703Z",
      "messageType": "text"
    }
  ],
  "status": "New",
  "timestamp": "2025-11-18T18:45:27.5038736Z",
  "_rid": "UAwVAKY1fTQyAAAAAAAAAA==",
  "_self": "dbs/UAwVAA==/colls/UAwVAKY1fTQ=/docs/UAwVAKY1fTQyAAAAAAAAAA==/",
  "_etag": "\"0301a222-0000-0800-0000-691cbf1b0000\"",
  "_attachments": "attachments/",
  "_ts": 1763491611
}
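
The stored document maps naturally onto a small set of C# types. The sketch below infers property names from the JSON shown above; the post names a ComplaintTicket object, but these class definitions and the TicketJson helper are assumptions, and the project's real model may differ. Cosmos DB system properties such as _rid and _etag are simply ignored during deserialization.

```csharp
#nullable disable
using System;
using System.Collections.Generic;
using System.Text.Json;

public sealed class Attachment
{
    public string FileType { get; set; }
    public string BlobUrl { get; set; }
    public string Caption { get; set; }
    public string Transcript { get; set; }
}

public sealed class ConversationEntry
{
    public string Role { get; set; }
    public string Content { get; set; }
    public DateTimeOffset Timestamp { get; set; }
    public string MessageType { get; set; }
}

public sealed class ComplaintTicket
{
    public string Id { get; set; }
    public long UserId { get; set; }   // long: Telegram ids can exceed 32 bits
    public long ChatId { get; set; }
    public string UserFullName { get; set; }
    public string MobileNumber { get; set; }
    public string TextSummary { get; set; }
    public List<Attachment> Attachments { get; set; } = new List<Attachment>();
    public List<ConversationEntry> Conversations { get; set; } = new List<ConversationEntry>();
    public string Status { get; set; }
    public DateTimeOffset Timestamp { get; set; }
}

public static class TicketJson
{
    // Case-insensitive matching covers both the camelCase keys and the
    // PascalCase "Transcript" key seen in the stored document.
    private static readonly JsonSerializerOptions Options =
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

    public static ComplaintTicket Parse(string json) =>
        JsonSerializer.Deserialize<ComplaintTicket>(json, Options);
}
```

Typed access of this kind makes it straightforward to query, for example, every ticket whose Status is still "New".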

2. User sending a voice note (audio), followed by the bot's transcription and analysis.

In this scenario, the user replies with a voice note, and the bot still treats it as context for processing the ticket.

3. User sending an off-topic photo unrelated to civic issues, followed by the bot's redirection.

In this scenario, the bot gracefully steers the conversation back to civic issues after analysing the image via Azure AI Vision.

The typical flow is:

User Action: The citizen sends a photo or audio recording of the problem.
Bot Processing: The media is processed (FFmpeg for audio, AI Foundry Vision Endpoint for images) and analyzed.
Bot Response: The Orchestrator asks for confirmation: "I see a large pile of garbage blocking the road. Should I log this immediately?"
Resolution: The user confirms, and the Ticketing component logs the ticket to Azure Cosmos DB, returning a tracking ID.

Future Enhancements: The Path to Agentic Excellence

The Semantic Kernel architecture is robust, and it provides an excellent platform for future evolution. Our roadmap includes two critical technical upgrades:

1. Advanced Video Processing with Azure Video Indexer

Our current solution uses placeholder functions for processing Video and VideoNote files. A critical upgrade involves integrating Azure Video Indexer. This shift will enable:

Object and Scene Detection: Identifying key visual evidence like "garbage," "piles," or "blocked street" directly within video frames.
Keyframe Extraction: Automatically selecting the most relevant visual moments to serve as high-confidence evidence attached to the final ticket.
Actionable Insights: Moving beyond simple transcription to provide deep visual and auditory analysis of the recorded evidence.

2. Migrating to the Microsoft Agent Framework

While Semantic Kernel efficiently manages our single-agent orchestrator, scaling to a highly sophisticated system would benefit from the Microsoft Agent Framework (MAF). The MAF is designed for multi-agent workflows, allowing us to:

Implement Complex Routing: Move from a single router to a framework that manages conversational flow across multiple, specialized agents (e.g., a Triage Agent hands off to a Confirmation Agent).
Parallelized Agents: Deploy dedicated agents for parallel tasks, such as a "Historical Lookup Agent" or an "Address Verification Agent," all coordinated by a central MAF supervisor.

Conclusion

By combining Semantic Kernel with specialized services like the AI Foundry Vision Endpoint and Azure AI Speech, we moved beyond simple chatbots to a true Agentic Workflow. The bot doesn't just chat; it sees the garbage, converts the audio, understands the urgency, and acts to file the report. This architecture is scalable, secure, and, most importantly, it lowers the barrier for citizens to keep their cities clean.