Choosing the Right Tool: A Comparative analysis of the Assistants API & Chat Completions API

Microsoft

May 15, 2024

Intro

In the evolving landscape of artificial intelligence (AI), the rate of innovation is producing many new technologies and frameworks to aid in the development of AI solutions. OpenAI is a leader in this space and provides different building blocks. Among its array of offerings, the Assistants API and the Chat Completions API, could be used as the foundation for building your AI solutions.

As developers seek to integrate AI-driven conversational interfaces into their applications, understanding the nuances between these two APIs becomes paramount. While both serve the overarching goal of facilitating human-computer interaction, they do so through different mechanisms, each tailored to specific use cases and requirements.

In this post, we delve into a comparative analysis of the Assistants API and the Chat Completions API, exploring their features, functionalities, and optimal scenarios for deployment. Whether you're embarking on a complex AI project necessitating intricate context management or aiming for streamlined interactions in simpler applications, this exploration aims to equip you with the insights needed to make informed decisions regarding API selection and implementation.

High Level Overview of Both APIs

Assistants API

The Assistants API is a powerful tool available on Azure OpenAI that enables developers to create sophisticated AI assistants within their applications. Key features include:

Instructions: Developers can provide specific instructions to tailor the personality and capabilities of the assistant.
Tools: Assistants can leverage various tools, including those hosted by OpenAI (such as Code Interpreter and Knowledge Retrieval) or custom-built tools hosted externally.
Threads: Assistants can access persistent threads, allowing them to maintain context across multiple interactions. Threads store messages and automatically handle content truncation to fit within the model's context window limit.
Files: Assistants have access to files in different formats, either during their creation or within conversation threads with users.
Advanced Features: The Assistant API offers advanced features such as conversation threading, code execution, and data retrieval, making it suitable for applications requiring detailed context management and prolonged conversations.
Independence: Each assistant can initiate and manage multiple independent message threads, enhancing its multitasking capabilities.
Limitations: Notably, the Assistants API does not offer model controls on things like top_p and temperature, which may affect the variability and creativity of responses.

Overall, the Assistants API streamlines conversation history management, provides access to OpenAI-hosted tools, and supports improved function-calling for third-party tools. It is designed to empower developers in building robust AI assistants capable of performing a wide range of tasks within their applications.

Chat Completions API

The Chat Completions API, another offering available on Azure OpenAI, serves a different purpose compared to the Assistant API. Key characteristics of the Chat Completions API include:

Response Generation: The Chat Completions API generates responses for a given dialog based on the provided message history. It requires input in a specific format corresponding to the conversation context.
Agility: It is more suitable for agile and direct responses, making it ideal for scenarios where quick, straightforward interactions are preferred.
Efficiency: The Chat Completions API is lightweight and efficient, making it suitable for simple AI applications where resource consumption is a concern.

In essence, the Chat Completions API provides a streamlined solution for generating responses in dialog-based interactions. While it may lack the advanced features and context management capabilities of the Assistant API, it excels in scenarios where simplicity, efficiency, agility, and customization are paramount.

Evaluation Criteria

To evaluate between both APIs, we will examine the following factors:

Initial Setup Complexity – Effort required to set up and start using the API
Capabilities – Functionalities offered within the API
Customizability – How customizable the use of the API is
Scalability – Performance at Scale
Cost – Cost of using the API
HA/DR – Ability to avoid/recover from failures

Initial Setup Complexity

Initial setup complexity refers to the effort required to set up and start using the APIs. To use the Chat Completions API, you need to instantiate a “client”, pass it the proper parameters, then use that client to infer against the specified GPT-family of models. The prompts are simply parameters, and the model responses can be parsed out of the complete JSON API response. Below is a sample code snippet on what the API call would look like for the Chat Completions API.

import os
from openai import AzureOpenAI

client = AzureOpenAI(
  api_key = os.getenv("AZURE_OPENAI_API_KEY"),  
  api_version = "2024-02-01",
  azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
)

response = client.chat.completions.create(
    model="gpt-35-turbo", # model = "deployment_name".
    messages=[
        {"role": "system", "content": "Assistant is a large language model trained by OpenAI."},
        {"role": "user", "content": "Who were the founders of Microsoft?"}
    ]
)

#print(response)
print(response.model_dump_json(indent=2))
print(response.choices[0].message.content)

The initial set up for the Assistants API requires more logic than the Chat Completions API. This is because the Assistants API introduces the concept of a “thread.” A thread is a conversation session between an Assistant and a user. Threads are persisted within the Assistant object and store messages from the user. These threads are automatically truncated to fit within the model’s context window. Threads need to be run to get a model response. Because a “thread run” is an async process, the run status needs to be polled for a “completed” response. Once the thread run is completed, you can list the contents of the thread to retrieve the actual model response. Below is an example of what this would look like.

import os
import time
import json
from openai import AzureOpenAI
    
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
    api_version="2024-02-15-preview",
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    )

# Create an assistant
assistant = client.beta.assistants.create(
    name="Math Assist",
    instructions="You are an AI assistant that can write code to help answer math questions.",
    tools=[{"type": "code_interpreter"}],
    model="gpt-4-1106-preview" #You must replace this value with the deployment name for your model.
)

# Create a thread
thread = client.beta.threads.create()

# Add a user question to the thread
message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="I need to solve the equation `3x + 11 = 14`. Can you help me?"
)

# Run the thread
run = client.beta.threads.runs.create(
  thread_id=thread.id,
  assistant_id=assistant.id,
)

# Retrieve the status of the run
run = client.beta.threads.runs.retrieve(
  thread_id=thread.id,
  run_id=run.id
)

status = run.status

# Wait till the assistant has responded
while status not in ["completed", "cancelled", "expired", "failed"]:
    time.sleep(5)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id,run_id=run.id)
    status = run.status

messages = client.beta.threads.messages.list(
  thread_id=thread.id
)

print(messages.model_dump_json(indent=2))

On the surface the Assistants API does look significantly more complicated than the Chat Completions API, however, if you account for what the Assistants API offers “out of the box” it could actually be less complex overall. For example, I mentioned earlier that “threads” are automatically truncated and persisted within an Assistant. To emulate this same functionality with Chat Completions API, one would need to set up some sort of structure to encapsulate the prompts and responses. This can be done as simply as using a list data structure or as complicated as using a relational database. After the prompts and responses are accounted for, then logic would have to be written to account for truncation when necessary. The same applies for the remaining built-in features that the Assistants API has out of the box.

Built-in Capabilities & Customization

Speaking of these built-in features, the Assistants API offers significantly more out of the box tools and functionality compared to the Chat Completions API. The Assistants API offers data retrieval, built-in orchestration, and a code execution environment. With the Chat Completions API, developers could offer the same set of functionalities, however, it would need to be developed into the application. Popular open-source tools such as Semantic Kernel, LangChain, and Open Interpreter, have made it possible to build data retrieval, LLM orchestration, and code execution into your AI applications. For more information on these tools, feel free to follow the links above.

One key feature present in both APIs is “function calling”. Function calling allows your LLM to interact with predefined functions (or tools) that can be used to interact with external APIs or systems. For example, a user can define a function to “get_weather”. This function would take a location parameter, call out to a weather retrieval API, and return the response. Both APIs require the developers to define and describe this function to the LLM using the standardized function definition format. The reasoning capability of the LLM is then used to decide when to call that function.

An example of this interaction would be a user chatting with an AI application and asking, “what's the weather?”. The LLM powering that AI application, knowing it has access to a pre-defined function to “get_weather,” can “call” that function, retrieve its response, and pass that back to the user. This process conceptually works the same for both the Assistant and Chat Completions APIs. I use the word “call” loosely because the main difference in function calling between the two APIs is that the Chat Completions API will never actually execute the pre-defined function for you. Because it does not have access to an execution environment, all the Chat Completions API can do does is return the function that should be called and generate the formatted function call. The logic to execute, parse, and re-submit that function response to the LLM for further action would need to be developed within the application. The Assistants API, on the other hand, works similarly where it still “decides” which function is appropriate to call but, in some cases, can execute the function itself because it has access to an execution environment, the code interpreter. The key difference is execution vs suggestion.

Scalability

In terms of scalability, both APIs are suitable for production workloads. The key differentiator is that the Assistants API is more of a closed system while the Chat Completions API allows for more customization. The Assistants API is designed to manage multiple concurrent conversations efficiently, thanks to its advanced features like persistent threads and context management. This makes it highly scalable for complex, multi-user applications that require maintaining state or context across interactions. While the Chat Completions API lacks built-in context management, its design is inherently scalable and is only limited by the latency of the model.

Cost

The cost difference between the two APIs is nominal. Both APIs can leverage any of the recent GPT-family of models. It is recommended, however, to use the Assistants API with GPT-4-Turbo as you are likely to get better performance. The Assistants API also charges for the code interpreter. This is an hourly charge and continues to accrue if you continue to send tasks to the code interpreter. For more information on pricing please see our Pricing Page.

HA/DR

When considering the high availability and disaster recovery of these APIs, it is important to note that both services are built on robust infrastructure which include redundancies and failover mechanisms to maintain service continuity. When referring to HA/DR in this section, I am referring to the HA/DR of the application using these APIs, not the APIs themselves. With the Assistants API, even though threads are persisted, they are tied to a specific Assistant object. This Assistant object is bounded to a deployed instance of an AOAI service. This means that to fail an Assistant over to another region and maintain the history, you would need to first create another instance of the service, then another Assistant object, then export all the messages from the original thread and externally persist them, somewhere like a database. From there, these messages need to be inserted back into the new Assistant thread. With the chat completions API, because it is inherently stateless, you would have already had to persist the prompts/responses outside of the API. This allows for a significantly simpler failover process as it would be as simple as routing requests to a different endpoint.

It is also worth noting that not all solutions require fault tolerance where zero message loss is the goal. HA/DR implementations need to weigh in the cost of message persistence. Your recovery point and recovery time objective should be considered to understand what is acceptable.

Conclusion

Now that we have explored the nuances between these two API, we can see that while both are great tools, one is more capability dense but also more of a closed box in terms of customization ability. The Assistant API is a robust API for creating sophisticated AI solutions. With access to built-in tools like code interpreter, the Assistants API simplifies deterministic tasks like data analysis or code development. The Chat Completions API, on the other hand, offers agility and efficiency catering to scenarios where simplicity and complete control of the architecture is important. When evaluating between these two API, developers most consider the factors discussed above. Ultimately, the choice between them hinges on the specific requirement and objectives of the desired solution. By carefully evaluating the comparative analysis we presented here, developers can make informed decisions regarding their API selection and implementation.

Published May 15, 2024

Version 1.0

azure ai services

WinnieNwanne

Microsoft

Joined December 18, 2023

View Profile

Microsoft Foundry Blog

Follow this blog board to get notified when there's new activity