In standard web development, a 'success' is a 200 OK status. In Generative AI, a 200 OK can still contain a hallucination, a refusal to answer, or a gibberish output. Transparency is not a luxury in the Generative AI application lifecycle, it is a requirement. As we discussed in our previous post, tracing is the mechanism that grants this transparency. We looked at the high-level benefits: cost control, performance monitoring, and the ability to iterate with confidence.
However, applying these concepts to the stochastic nature of LLMs requires a specific approach. How do you trace a chain of reasoning? How do you pinpoint a hallucination in a complex retrieval-augmented generation (RAG) flow? In this article, we move from theory to practice. We will implement tracing for Language Model calls, providing the granular visibility required to take an AI application from a cool demo to an enterprise-grade product.
Core Concepts:
Tracing is essentially like giving a single user request a GPS tracker as it travels through all the different services and functions that make up an application. It lets the owners see exactly what happened, where, and how long it took.
- Traces (The Complete View):
- It’s a complete, end-to-end journey of a single operation or request. It records the entire path—from when a user clicks a button to when the final result is returned. It’s a collection of related operations.
- In the image (Highlighted in red): The top element, “chat_interaction 1.94s”, represents the entire Trace. It's the full context of the user asking "Hi" and getting a response, which took 1.94 seconds.
- Spans:
- Each trace is composed of multiple spans. A span represents a single, discrete unit of work within the system, such as a function call, a database query (e.g., to a vector store in a RAG system), an API request to an LLM, or a data transformation step. Spans are the building blocks of a trace. Each span captures its own start time, end time, and duration. By looking at nested spans, you can see the call hierarchy—which function called which other function.
- In the image (Highlighted in green):
- chat_interaction 1.94s is the Root Span.
- llm_completion 1.94s is a child span, showing the time spent specifically on the Large Language Model part of the interaction.
- Chat qwen2-1.5b-instruct-generic-cpu:gpu3 is a more granular span inside the llm_completion, showing the actual model execution. This nesting shows the sequence of operations.
- Attributes / Metadata (The Details and Context):
- Attributes are just key-value pairs that add important details and context to a span or trace. These can be anything useful, like the user ID, the function's input parameters, the return value, an error message, or the name of the model being used. They help in understanding why a step took a certain amount of time.
- In the image (Highlighted in blue): The right-hand panel shows the Metadata (Attributes) for the chat_interaction span. Key attributes shown are:
- "user_query": "Hi" (The input that started the trace).
- "model": "qwen2-1.5-b-instruct-generic-cpu3" (The specific AI model used).
- "start_time" and "end_time" (Which define the span's duration).
- Semantic Conventions:
- These are rules for naming attributes and operations so that everyone uses the same "language."
- If one tool calls the attribute for a database query db.query and another calls it database_operation, they can't easily talk to each other. Semantic Conventions standardize this (e.g., using db.statement for the query text) so tools built on OpenTelemetry can work with data from any system.
- Trace Exporters (Sending the Data Out):
- An Exporter is a piece of code that takes the collected trace data and sends it to a backend system where it can be stored, viewed, and analyzed (like a central dashboard).
- It enables users to view the trace data outside of the application that created it. Azure AI, for example, uses an exporter to send trace data to Azure Monitor.
Confused? Let's simplify this with an analogy: tracing is like ordering food at a restaurant. Imagine placing a single order at a restaurant (the request). The process of preparing and delivering that order is the Trace.
| Tracing Concept | Restaurant Analogy | Explanation |
| --- | --- | --- |
| Trace | Your Complete Food Order | The entire journey from the moment you say "I would like a burger and fries" until the food is placed in front of you. This is the whole user experience. |
| Span | A Single Kitchen Task | The individual steps needed to fulfill the order, each with its own start and end time, e.g., "Grill the patty" (5 min), "Toast the bun" (2 min), "Assemble the burger" (3 min). These tasks are often nested (like how llm_completion contains the Chat model span). For example, "Prepare the Burger" is a span that contains the child spans "Grill the Patty" and "Toast the Bun". |
| Attributes | The Details on the Ticket | Key-value pairs that give context to a specific task (Span) or to the whole Trace. Attached to the "Grill the patty" span: patty_type: "Veg" (the function parameter) and cook_level: "Medium Rare" (a custom annotation). Attached to the whole Trace: customer_id: 405 (the identifier of the user who made the request). |
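Before building the full application, here is a minimal, self-contained sketch of these concepts using the OpenTelemetry Python SDK. It uses a console exporter purely for illustration, and the span and attribute names come from the restaurant analogy, invented for this sketch rather than taken from any real system.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
# Print finished spans to the terminal so we can see the trace structure immediately
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
# The root span is the Trace: the complete food order
with tracer.start_as_current_span("place_order") as order_span:
    order_span.set_attribute("customer_id", 405)  # attribute attached to the whole trace
    # Child spans are the individual kitchen tasks, each with its own start and end time
    with tracer.start_as_current_span("grill_patty") as grill_span:
        grill_span.set_attribute("patty_type", "Veg")
        grill_span.set_attribute("cook_level", "Medium Rare")
    with tracer.start_as_current_span("toast_bun"):
        pass  # this span only records its own duration
Running this prints each finished span to the terminal, with the nested spans sharing the same trace ID as the root span.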
Let’s now implement tracing and use the AI Toolkit to view the details of each step. Assume we are building a simple chat application (like the one in the code) that lets users chat with a powerful LLM. It seems fast, but sometimes users report lag, or we see high costs. When a user types a query and hits Enter, what happens in that 1.5-second delay?
- Is the delay in the frontend?
- Is it network latency to the model server?
- Is the model itself taking too long?
OpenTelemetry:
OpenTelemetry (OTel) is an open-source observability framework that has become the vendor-agnostic standard for generating and collecting telemetry data across distributed systems. It unifies the three main "pillars" of observability:
- Traces
- Metrics
- Logs
Before OTel, if you wanted to monitor an application with a specific commercial tool, you had to use that vendor's proprietary libraries or agents. This led to vendor lock-in and problems such as:
- Switching monitoring tools meant re-instrumenting the entire application's code.
- Mixing and matching services that used different vendor agents resulted in confusing data silos.
OpenTelemetry solves this by providing a standard set of APIs, SDKs, and a common data format (OTLP - OpenTelemetry Protocol) that is entirely independent of the backend we choose.
So, in simple terms:
- We instrument code once using the OTel standard (as we will do with the Python SDK in this tutorial).
- The application now generates standardized telemetry data.
- We use the OpenTelemetry Collector (or a simple exporter) to stream that data to any compatible backend system (e.g., the VS Code AI Toolkit), as the short sketch below illustrates.
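As a quick illustration of that backend independence, the only thing that changes between targets is the exporter; the instrumented application code stays identical. This is a minimal sketch, assuming the AI Toolkit's default local endpoint; USE_CONSOLE_EXPORTER is a toggle made up for this example, not an official OTel variable.
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
provider = TracerProvider()
# Swapping backends only means swapping the exporter; spans, attributes, and events stay the same.
if os.getenv("USE_CONSOLE_EXPORTER", "false").lower() == "true":
    exporter = ConsoleSpanExporter()  # print spans to the terminal while developing
else:
    exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")  # AI Toolkit collector
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)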
Setting up Our Tracer (The Observer) with the AI Toolkit
To move our chat application from a black box to a fully observable service, we need a standardized mechanism to capture telemetry data. That mechanism is OpenTelemetry (OTel), the universal standard for generating telemetry, and our viewing window is the AI Toolkit for Visual Studio Code.
Our goal is to configure our simple Python chat app to speak OTel's language and send its data somewhere useful; that "somewhere" is the AI Toolkit. The configuration code below is the heart of our setup, establishing the pipeline that collects trace and log data and ensures it is sent to the AI Toolkit for analysis.
The Local Collector and Exporter
- The Local Collector (AI Toolkit's Hidden Feature)
The AI Toolkit for VS Code provides a critical service: it hosts a local OTLP-compatible server on http://localhost:4318. By clicking "Start Collector" in its Tracing WebView, we get an instant, dedicated place to send telemetry data. This eliminates the headache of setting up a separate Jaeger or other collector just to debug locally.
- The Exporters (Sending the Data Out)
The code uses the OTLPSpanExporter and OTLPLogExporter to send the raw data to that local server. The following is a sample:
# 1. Define the Application's Identity (The Resource)
resource = Resource(attributes={
    "service.name": "streamlit-chat-app"
})
# 2. Configure the Tracing Pipeline
provider = TracerProvider(resource=resource)
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
processor = BatchSpanProcessor(otlp_exporter)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
# 3. Configure Auto-Instrumentation
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor
OpenAIInstrumentor().instrument()
| Component | Simple Explanation | AI Toolkit Integration |
| --- | --- | --- |
| Resource | It names our application: streamlit-chat-app. | This name is displayed in the AI Toolkit's Tracing view as the source, allowing us to filter traces easily. |
| TracerProvider | The factory that creates all tracing objects (spans). | It manages the data that the Toolkit will ultimately visualize. |
| OTLPSpanExporter | The data shipper. It formats all trace data using the standard OTLP (OpenTelemetry Protocol). | It's configured to send data to the Toolkit's built-in collector endpoint: http://localhost:4318/v1/traces. |
| BatchSpanProcessor | The efficiency manager. It buffers spans and sends them in batches. | This ensures our app runs fast by minimizing the performance hit of sending data. |
| OpenAIInstrumentor | Auto-instrumentation magic. It automatically wraps all client.chat.completions.create calls. | This gives us rich, built-in spans showing crucial GenAI attributes like token usage and API latency without writing a single manual span. |
The AI Toolkit for Visual Studio Code is the key component that transforms this raw configuration into a powerful debugging environment.
- Built-in OTLP Collector: The Toolkit eliminates the need to install and manage a separate OpenTelemetry collector (like Jaeger or the OTel Collector service). It provides a ready-made receiver listening on the standard local endpoint (http://localhost:4318), accepting all data immediately.
- Visualization Engine: Once our application sends data, the Toolkit processes the raw OTLP stream and renders it as an interactive, intuitive waterfall diagram.
- Actionable Debugging: We gain immediate benefits:
- Visualize the Hierarchy: See the parent-child relationship between our custom spans (chat_interaction) and the auto-instrumented LLM spans.
- Analyze Attributes: Click any span, and the Toolkit instantly displays the rich metadata, letting us confirm the user_query and model version for that specific request.
With these foundational aspects in place, it's time to get hands-on and move on to the exciting coding phase!
First, install the libraries:
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-http opentelemetry-instrumentation-openai-v2 streamlit openai python-dotenv
To begin the coding phase and set up the foundation for tracing and our chat application, we first need to import the necessary Python libraries.
import os
from dotenv import load_dotenv
# --- OpenTelemetry Core and SDK Imports ---
from opentelemetry import trace, _events
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk._logs import LoggerProvider
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk._events import EventLoggerProvider
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter
Next, initialize the OpenTelemetry tracing and logging providers and configure them to send data to a local collector, while also setting up automatic tracing for OpenAI-compatible API calls. The following code is the complete foundational setup for OpenTelemetry in our Generative AI application: it automatically traces OpenAI-compatible calls and exports all telemetry data to a local collector such as the AI Toolkit.
# Load environment variables from a .env file (e.g., API keys)
load_dotenv()
# This environment variable tells the GenAI instrumentor to capture the full text of user prompts and model responses within the trace data.
os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "true"
# --- 1. Define Service Identity ---
# Resource gives the service a unique name which is essential for filtering and viewing the data in the AI Toolkit.
resource = Resource(attributes={
"service.name": "streamlit-chat-app"
})
# --- 2. Configure Tracing Pipeline (Spans) ---
# TracerProvider is the central factory for all tracing operations.
provider = TracerProvider(resource=resource)
# OTLPSpanExporter packages and sends trace data using the OTLP protocol.
# The endpoint points to the AI Toolkit's local collector for traces.
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
# BatchSpanProcessor efficiently groups multiple spans together before exporting them.
processor = BatchSpanProcessor(otlp_exporter)
provider.add_span_processor(processor)
# Sets this configuration as the global tracer provider for the application.
trace.set_tracer_provider(provider)
# --- 3. Configure Logging/Event Pipeline ---
# LoggerProvider manages logs and custom events, using the same service identity.
logger_provider = LoggerProvider(resource=resource)
# Configures a processor to batch and export logs via OTLP to the collector's log endpoint.
logger_provider.add_log_record_processor(
BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://localhost:4318/v1/logs"))
)
# Configures the event logging system for recording time-stamped events within spans.
_events.set_event_logger_provider(EventLoggerProvider(logger_provider))
# --- 4. Configure Auto-Instrumentation ---
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor
# This crucial line automatically hooks into the OpenAI library, generating detailed
# spans for all API calls (latency, token usage, etc.) without writing manual tracing code.
OpenAIInstrumentor().instrument()
print("Tracing configured")
The next part of the code contains the application logic for our chat interface, implementing the Streamlit front end and, crucially, using the OpenTelemetry tracer to wrap our business logic and GenAI calls in meaningful traces.
This code powers the Streamlit chat application. After connecting to the local LLM via the OpenAI client, it listens for user input.
import streamlit as st
from openai import OpenAI
# Get the tracer instance configured in the previous step (__name__ helps identify the source module)
tracer = trace.get_tracer(__name__)
# Initialize the OpenAI client connection
client = OpenAI(
# Base URL points to the local server hosting the model (e.g., using Ollama or another local setup)
base_url="http://127.0.0.1:53491/v1/",
# API key is often ignored by local servers but required for the client constructor
api_key="xyz"
)
# --- Streamlit UI Setup ---
st.title("Chat with GenAI Model")
# Creates an input box at the bottom of the chat interface
query = st.chat_input("Enter query:")
# Check if the user has submitted a query
if query:
# --- Custom Tracing: Root Span ---
# Creates the top-level span (the 'chat_interaction') for the entire operation.
# The 'with' statement ensures the span automatically closes when the block exits.
with tracer.start_as_current_span("chat_interaction") as span:
# Adds contextual data (attributes) to the root span, making it searchable in the AI Toolkit.
span.set_attribute("user_query", query)
span.set_attribute("model", "qwen2.5-1.5b-instruct-generic-cpu:3")
# Display the user's query in the chat history
with st.chat_message("user"):
st.write(query)
# --- Custom Tracing: Child Span (LLM Computation) ---
# Creates a nested span specifically to isolate and measure the duration of the core LLM processing time.
with tracer.start_as_current_span("llm_completion") as llm_span:
# The client.chat.completions.create() call itself will generate a third,
# *automatic* OTel span nested inside this 'llm_completion' span.
chat_completion = client.chat.completions.create(
messages=[
{"role": "system", "content": "You are a helpful assistant and provides structured answers."},
{"role": "user", "content": query}
],
model="qwen2.5-0.5b-instruct-cuda-gpu:3",
)
response_text = chat_completion.choices[0].message.content
# Add attributes to the LLM span after the response is processed
llm_span.set_attribute("response_length", len(response_text))
# Records a specific time point (event) within the span's lifecycle
llm_span.add_event("Response generated")
# Display the model's response in the chat history
with st.chat_message("assistant"):
st.write(response_text)
# Records an event at the end of the entire user interaction
span.add_event("Chat interaction complete")
print(" Check AI Toolkit --> Tracing for telemetry data")
When a query is received, it executes the critical tracing steps:
- Root Span (chat_interaction): A top-level span is created to measure the total time of the user's request. Key attributes like user_query and the intended model are attached here for context.
- Child Span (llm_completion): A nested span measures the duration of the server-side LLM call.
- Automatic Span: The client.chat.completions.create function (nested inside the child span) automatically generates a third, highly detailed span (with token usage, latency, etc.) thanks to the OpenAI Instrumentor.
Finally, the code extracts and displays the model's response_text, completing the user experience and the trace.
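Putting that together, the resulting trace should look roughly like the hierarchy sketched below in the Toolkit's waterfall view. The names are illustrative: the innermost span's name depends on the instrumentation version and the model being served.
# chat_interaction            <- custom root span: total time of the user's request
# └── llm_completion          <- custom child span: server-side LLM call
#     └── chat <model name>   <- auto-instrumented span: token usage, latency, etc.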
We have everything we need to implement this! Now for the exciting part: testing the application. But first, two quick checks before running the code:
- Make sure the AI Toolkit is launched and the model is accessible via the endpoint specified in the code. (For Foundry Local users, ensure the backend models are actively serving.)
- Open the Tracing tab in the Toolkit and hit Start Collector so that we can successfully capture the telemetry data.
To execute the code, first save the file as "app.py" (or any file name ending in .py), then open the terminal in VS Code and make sure the libraries are installed and the virtual environment is set up.
Now run the following command,
streamlit run app.py  # or <filename>.py if you used a different name
Upon execution, the Streamlit application will automatically open in the default web browser.
When a question is asked, the app works like any other basic Generative AI application powered by edge/on-premise/local models, providing a suitable answer to the query. But there is now a significant difference in the backend: unlike before, where it was a black box, the application now records the entire trace of the call. Let’s switch to the VS Code AI Toolkit to visualize this.
The terminal shows a log, and the Tracing window displays the trace of the application, including the queried question. Upon clicking the 'chat_interaction' name on the trace, a detailed window appears with details like Span and Metadata.
Now we can evaluate each span and understand the crucial parameters involved in the complete lifecycle of each query.
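For orientation, the metadata on the auto-instrumented LLM span roughly follows the OpenTelemetry GenAI semantic conventions. The snippet below is an illustrative sketch only; attribute names and values depend on the instrumentation version and the model being served, so treat it as an example rather than an exact dump from the application above.
# Illustrative example only, not an actual export from the application above.
llm_span_attributes = {
    "gen_ai.system": "openai",                               # which GenAI client produced the call
    "gen_ai.request.model": "qwen2.5-0.5b-instruct-cuda-gpu:3",
    "gen_ai.usage.input_tokens": 27,                          # prompt tokens (illustrative value)
    "gen_ai.usage.output_tokens": 142,                        # completion tokens (illustrative value)
}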
In this tutorial we implemented a version with custom spans. The code will be available on AI_Toolkit_Samples. There will be other versions using the Azure Inference SDK, including an advanced application.
We’ve successfully transformed our opaque chat application into an observable service using the OpenTelemetry standard and the visualization power of the AI Toolkit!
The Value of Tracing
By implementing custom spans like chat_interaction and embedding contextual attributes (e.g., user_query), we achieved granular insight into our Generative AI performance. The trace waterfall diagram immediately tells us where time is spent. From our scenario we now know that if the llm_completion span is the bottleneck, we should optimize the model, not the UI.
This tracing foundation is the launchpad for advanced observability. The data we collect now enables cost optimization (by tracking tokens per span), sophisticated A/B testing (by filtering on the model attribute), and the creation of reliable, proactive performance alerts.
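To make the cost-optimization idea concrete, the llm_completion span in the application above could be extended to record token usage straight from the API response. This is a small sketch under two assumptions: the local server returns standard OpenAI-style usage data, and the gen_ai.* attribute names follow the current GenAI semantic conventions (app.total_tokens is a custom name invented here).
# A sketch, not part of the original app: place inside the llm_completion span, after the API call returns.
usage = chat_completion.usage
if usage is not None:
    llm_span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
    llm_span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
    # A custom, application-specific attribute that feeds per-query cost dashboards.
    llm_span.set_attribute("app.total_tokens", usage.total_tokens)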
Tracing is a definitive roadmap for building faster, more efficient, and fully transparent GenAI applications. In future posts we will explore how to implement this in more complex applications, especially those involving RAG and agentic frameworks.