In the domain of AI development, comprehending the inner workings of applications presents a considerable challenge. The AI Toolkit addresses this by offering robust tracing capabilities, which are designed to assist in monitoring and analyzing the performance of AI applications. By facilitating the tracing of an application's execution—including its interactions with generative AI models—the toolkit provides a means to acquire critical insights into its behavior and pinpoint performance issues. This capability is pivotal for developers aiming to go beyond the "black box" and construct more dependable, high-performing AI systems.
Technology is evolving at an incredible pace, and now the world is in the midst of another revolution: Generative AI. This isn't just for tech experts anymore; it's a field that has truly democratized AI, putting powerful capabilities into the hands of everyone. From creating art to drafting emails, people and businesses are adopting these applications faster than ever.
But as with any powerful technology, there are challenges. One might have experienced a frustrating delay when a chatbot is "thinking" or a smart assistant takes too long to respond. That's latency, and it's a common problem with applications that rely on large language models. The constant stream of API calls and high token usage can make these tools sluggish and expensive to run. So, how can we ensure these amazing tools are optimised for both speed and cost? The answer lies in a technique called Tracing. In this blog, we will dive into what tracing is, how it works, and how it can help build more efficient and reliable AI applications.
Tracing is the process of monitoring and logging the step-by-step execution of an LLM's workflow, including inputs, intermediate steps (like model calls and tool usage), and final outputs. It provides developers with a granular view into the LLM's decision-making process, enabling them to debug issues, optimize performance and costs, and understand how a particular response was generated. Tracing captures details like token usage, latency, model parameters, retrieved documents, and function calls within a detailed timeline of the LLM's request flow.
It's like a flight recorder for the code, offering a clear view into the inner workings of an application, from the user's initial request to the final generated response. For any developer or organization building with LLMs, understanding and implementing tracing is no longer a luxury—it's a necessity.
There are many reasons why tracing is important, but in this blog let's discuss a few key aspects that highlight why tracing is absolutely essential for building robust, efficient, and cost-effective LLM applications:
- Debugging broken chains: Modern LLM applications often involve complex pipelines with multiple components, such as prompt templates, retrieval systems, and custom tools. Tracing helps pinpoint the exact point of failure in these pipelines, whether it's a bad prompt or a tool error.
- Performance optimization: Tracing reveals slow LLM calls, inefficient token usage, and excessive retrieval times, allowing developers to identify and address performance bottlenecks.
- Cost management: By providing a breakdown of token usage for each model call, tracing helps identify the most expensive parts of an LLM application, enabling optimization for cost savings (a rough cost sketch follows this list).
- Understanding model behavior: Tracing provides insights into the parameters, prompts, and tool descriptions used by the LLM, offering a clear understanding of how the model arrived at its response.
- Auditing and evaluation: For sensitive applications, tracing offers a detailed audit trail of how requests were processed, which is crucial for ensuring compliance and for systematically evaluating and enhancing the quality of the LLM application.
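To illustrate how traced token counts translate into money, here is a minimal back-of-the-envelope sketch. The per-token prices are purely hypothetical placeholders; substitute your provider's actual rates.

# Hypothetical prices per 1K tokens; substitute your provider's real rates.
PRICE_PER_1K_INPUT = 0.005
PRICE_PER_1K_OUTPUT = 0.015

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single traced model call from its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Example: token counts read from one trace span (about $0.011 here).
print(f"${call_cost(1200, 350):.4f}")

Summing this estimate over every span in a trace quickly shows which step of the pipeline dominates the bill.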
What Does Tracing Record?
A tracing system captures key data points that are vital for analysis:
- Model Calls: The exact prompts, parameters, and outputs of every interaction with the LLM.
- Token Usage: The precise number of input and output tokens, offering a direct link to the cost and efficiency of each LLM invocation.
- Tool Usage: A record of which tools were called, their inputs, outputs, and their descriptions. This is especially critical for applications using function-calling.
- Retrieved Documents: In a Retrieval Augmented Generation (RAG) process, tracing logs the documents that were retrieved, including their relevance scores and the order in which they were processed.
- Runtime Exceptions: Any critical errors, such as API rate-limiting issues or connection problems, are logged, providing immediate alerts to potential failures.
Tracing in AI Toolkit
Tracing features within AI Toolkit help users monitor and analyze the performance of their AI applications. The toolkit enables the tracing of an application's execution, including its interactions with generative AI models, to provide insights into behavior and performance.
At the heart of AI Toolkit's tracing is a local HTTP and gRPC server. This server acts as a collector, listening for trace data generated by the application. What makes this particularly powerful is its compatibility with the OpenTelemetry Protocol (OTLP). OTLP has become the industry standard for collecting telemetry data, ensuring that AI Toolkit can integrate with a wide range of development environments and SDKs.
The magic happens under the hood with OpenTelemetry. Many modern language model SDKs and frameworks either directly support OTLP or have community-driven libraries that enable its use. This means an application can be instrumented to send trace spans—rich data points that represent a single operation—to the AI Toolkit's collector.
Crucially, AI Toolkit supports all frameworks that not only use OTLP but also follow the semantic conventions for generative AI systems. These conventions standardize the way trace data is structured for AI applications, ensuring that information such as the model name, token counts, and latency is captured consistently across different models and providers. This consistency is vital for providing a unified and meaningful visualization of the application's performance.
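As a simplified illustration, the snippet below emits a hand-made span carrying a few attributes in the style of the OpenTelemetry generative AI semantic conventions. The attribute names (such as gen_ai.request.model and gen_ai.usage.input_tokens) come from that still-evolving specification, and the tracer name and values are made up for illustration; in practice, instrumentation libraries emit these attributes for you.

from opentelemetry import trace

tracer = trace.get_tracer("genai-semconv-demo")  # hypothetical tracer name

# A hand-made span with attributes in the style of the GenAI semantic conventions.
with tracer.start_as_current_span("chat example-model") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "example-model")  # placeholder
    span.set_attribute("gen_ai.usage.input_tokens", 128)         # placeholder
    span.set_attribute("gen_ai.usage.output_tokens", 256)        # placeholder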
Tracing: Visualization

Due to its reliance on OTLP, AI Toolkit is highly compatible. Many common AI SDKs have been tested and are supported. This broad compatibility means that whether one is using a major framework or a more niche library, as long as it supports OTLP and the necessary semantic conventions, one can leverage the full power of AI Toolkit tracing.
Tracing: Compatibility

By visualizing the collected instrumentation data, we can:
- Pinpoint Performance Bottlenecks: See exactly which part of the AI pipeline is slow, whether it's the model inference or a pre-processing step (a small sketch follows this list).
- Debug Unexpected Behaviour: Trace the sequence of events that led to a specific or undesirable output from a generative model.
- Gain Deeper Insights: Understand token usage, latency, and other key metrics for each interaction with AI models.
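As a rough sketch of how this looks in practice, the snippet below wraps a hypothetical pre-processing step in its own span so that its duration shows up in the trace view alongside the model calls. The tracer and span names are placeholders, and it assumes a tracer provider has already been configured, for example with the instrumentation code shown later in this post.

import time
from opentelemetry import trace

tracer = trace.get_tracer("demo-pipeline")  # hypothetical tracer name

# Wrapping a step in its own span makes its duration visible in the trace timeline.
with tracer.start_as_current_span("preprocess-documents"):
    time.sleep(0.2)  # stand-in for real document cleaning or chunking work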
Let's now see tracing in action using the AI Toolkit.
If you are unfamiliar with the AI Toolkit, you are encouraged to go through the provided link.
- Launch the AI Toolkit extension and find the Tracing option under the “Agent and Workflow Tools” section. Click on it to open the Tracing window.
- Click on the “Start Collector” button to start the local OTLP trace collector server.

Tracing: AI Toolkit
- Tracing is enabled by first instrumenting the code with a provided snippet, as detailed in the Set up instrumentation section for various languages and SDKs. The following is sample code for the OpenAI SDK:
from opentelemetry import trace, _events
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk._logs import LoggerProvider
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk._events import EventLoggerProvider
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor
import os
# Capture prompt and response content in the trace events
os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "true"
# Set up resource
resource = Resource(attributes={
    "service.name": "opentelemetry-instrumentation-openai"
})
# Set up tracer provider
trace.set_tracer_provider(TracerProvider(resource=resource))
# Configure OTLP exporter
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4318/v1/traces"
)
# Add span processor
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)
# Set up logger provider
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://localhost:4318/v1/logs"))
)
_events.set_event_logger_provider(EventLoggerProvider(logger_provider))
# Enable OpenAI instrumentation
OpenAIInstrumentor().instrument()
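With the instrumentation above in place, an ordinary OpenAI call is traced automatically. The short sketch below assumes the openai package is installed and OPENAI_API_KEY is set in the environment; the model name is just a placeholder.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# This chat completion is traced by OpenAIInstrumentor; its span
# (model, token usage, latency) is exported to the local collector.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Give me one tip for writing better prompts."}],
)
print(response.choices[0].message.content)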
Tracing: Server
- Upon successful start of the collector, we can see two endpoints, one for gRPC and one for HTTP. These endpoints are used in the instrumentation code as the exporter targets (a gRPC variant is sketched after this list).
- Trace data is generated by running the application.
- To view new trace data, one must select the Refresh button within the tracing WebView.
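For completeness, here is a minimal sketch of pointing the span exporter at the gRPC endpoint instead of HTTP. It assumes the collector exposes the standard OTLP gRPC port (4317), so check the actual endpoint displayed in the Tracing window.

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Assumes the collector's gRPC endpoint is the OTLP default; verify it in the Tracing window.
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)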
In the webview preview, some key data is displayed, such as:
- Status
- Name of the model
- Input
- Output
- Start time
- Duration
- Total Tokens
Upon clicking on an entry in the preview, one can dive deeper and see the request data, the request type, and the details of the output in the “Trace View” tab.
Tracing: Trace view

Building reliable and efficient LLM applications is becoming increasingly vital. As we've explored, tracing offers a powerful solution, moving us beyond guesswork and providing the deep visibility required to understand and optimize our applications. The AI Toolkit for Visual Studio Code provides a practical way to implement these tracing capabilities, a step that's especially critical for the complex, multi-step operations of agentic applications. The insights gained from tracing are key to ensuring Generative AI is not just powerful but also performant, reliable, and cost-effective. We will be diving into these practical implementations in upcoming blogs.