Azure OpenAI Service is at the forefront of technological innovation, offering REST API access to OpenAI's suite of revolutionary language models, including GPT-4, GPT-35-Turbo, and the Embeddings model series.
Enhancing Throughput for Scale
As enterprises seek to deploy OpenAI's powerful language models across various business units, they often require granular control over configuration and performance metrics. To address this need, Azure OpenAI Service is introducing dedicated throughput, a feature that provides a dedicated connection to OpenAI models with guaranteed performance levels. Throughput is quantified in terms of tokens per second (tokens/sec), allowing organizations to precisely measure and optimize the performance for both prompts and completions. The model of provisioned throughput provides enhanced management and adaptability for varying workloads, guaranteeing system readiness for spikes in demand. This capability also ensures a uniform user experience and steady performance for applications that require real-time responses.
Resource Sharing and Chargeback Mechanisms
Large organizations frequently provision a singular instance of Azure OpenAI Service that is shared across multiple internal departments. This shared use necessitates an efficient mechanism for allocating costs to each business unit or consumer, based on the number of tokens consumed. This article delves into how chargeback is calculated for each business unit based on their token usage.
Leveraging Azure API Management Policies for Token Tracking
Azure API Management Policies offer a powerful solution for monitoring and logging the token consumption for each internal application. The process can be summarized in the following steps:
** Sample Code: Refer to this GitHub repository to get a step-by-step instruction on how to build the solution outlined below : private-openai-with-apim-for-chargeback
1. Client Applications Authorizes to API Management
To make sure only legitimate clients can call the Azure OpenAI APIs, each client must first authenticate against Azure Active Directory and call APIM endpoint. In this scenario, the API Management service acts on behalf of the backend API, and the calling application requests access to the API Management instance. The scope of the access token is between the calling application and the API Management gateway. In API Management, configure a policy (validate-jwt or validate-azure-ad-token) to validate the token before the gateway passes the request to the backend.
2. APIM redirects the request to OpenAI service via private endpoint.
Upon successful verification of the token, Azure API Management (APIM) routes the request to Azure OpenAI service to fetch response for completions endpoint, which also includes prompt and completion token counts.
3. Capture and log API response to Event Hub
Leveraging the log-to-eventhub policy to capture outgoing responses for logging or analytics purposes. To use this policy, a logger needs to be configured in the API Management:
# API Management service-specific details
$apimServiceName = "apim-hello-world"
$resourceGroupName = "myResourceGroup"
# Create logger
$context = New-AzApiManagementContext -ResourceGroupName $resourceGroupName -ServiceName $apimServiceName
New-AzApiManagementLogger -Context $context -LoggerId "OpenAiChargeBackLogger" -Name "ApimEventHub" -ConnectionString "Endpoint=sb://<EventHubsNamespace>.servicebus.windows.net/;SharedAccessKeyName=<KeyName>;SharedAccessKey=<key>" -Description "Event hub logger with connection string"
Within outbound policies section, pull specific data from the body of the response and send this information to the previously configured EventHub instance. This is not just a simple logging exercise; it is an entry point into a whole ecosystem of real-time analytics and monitoring capabilities:
<outbound>
<choose>
<when condition="@(context.Response.StatusCode == 200)">
<log-to-eventhub logger-id="TokenUsageLogger">@{
var responseBody = context.Response.Body?.As<JObject>(true);
return new JObject(
new JProperty("Timestamp", DateTime.UtcNow.ToString()),
new JProperty("ApiOperation", responseBody["object"].ToString()),
new JProperty("AppKey", context.Request.Headers.GetValueOrDefault("Ocp-Apim-Subscription-Key",string.Empty)),
new JProperty("PromptTokens", responseBody["usage"]["prompt_tokens"].ToString()),
new JProperty("CompletionTokens", responseBody["usage"]["completion_tokens"].ToString()),
new JProperty("TotalTokens", responseBody["usage"]["total_tokens"].ToString())
).ToString();
}</log-to-eventhub>
</when>
</choose>
<base />
</outbound>
EventHub serves as a powerful fulcrum, offering seamless integration with a wide array of Azure and Microsoft services. For example, the logged data can be directly streamed to Azure Stream Analytics for real-time analytics or to Power BI for real-time dashboards With Azure Event Grid, the same data can also be used to trigger workflows or automate tasks based on specific conditions met in the incoming responses.
Moreover, the architecture is extensible to non-Microsoft services as well. Event Hubs can interact smoothly with external platforms like Apache Spark, allowing you to perform data transformations or feed machine learning models.
4: Data Processing with Azure Functions
An Azure Function is invoked when data is sent to the EventHub instance, allowing for bespoke data processing in line with your organization’s unique requirements. For instance, this could range from dispatching the data to Azure Monitor, streaming it to Power BI dashboards, or even sending detailed consumption reports via Azure Communication Service.
[Function("TokenUsageFunction")]
public async Task Run([EventHubTrigger("%EventHubName%", Connection = "EventHubConnection")] string[] openAiTokenResponse)
{
//Eventhub Messages arrive as an array
foreach (var tokenData in openAiTokenResponse)
{
try
{
_logger.LogInformation($"Azure OpenAI Tokens Data Received: {tokenData}");
var OpenAiToken = JsonSerializer.Deserialize<OpenAiToken>(tokenData);
if (OpenAiToken == null)
{
_logger.LogError($"Invalid OpenAi Api Token Response Received. Skipping.");
continue;
}
_telemetryClient.TrackEvent("Azure OpenAI Tokens", OpenAiToken.ToDictionary());
}
catch (Exception e)
{
_logger.LogError($"Error occured when processing TokenData: {tokenData}", e.Message);
}
}
}
In the example above, Azure function processes the tokens response data in Event Hub and sends them to Application Insights telemetry, and a basic Dashboard is configured in Azure, displaying the token consumption for each client application. This information can conveniently be used to compute chargeback costs.
A sample query used in dashboard above that fetches tokens consumed by a specific client:
customEvents
| where name contains "Azure OpenAI Tokens"
| extend tokenData = parse_json(customDimensions)
| where tokenData.AppKey contains "your-client-key"
| project
Timestamp = tokenData.Timestamp,
Stream = tokenData.Stream,
ApiOperation = tokenData.ApiOperation,
PromptTokens = tokenData.PromptTokens,
CompletionTokens = tokenData.CompletionTokens,
TotalTokens = tokenData.TotalTokens
Azure OpenAI Landing Zone reference architecture
A crucial detail to ensure the effectiveness of this approach is to secure the Azure OpenAI service by implementing Private Endpoints and using Managed Identities for App Service to authorize access to Azure AI services. This will limit access so that only the App Service can communicate with the Azure OpenAI service. Failing to do this would render the solution ineffective, as individuals could bypass the APIM/App Service and directly access the OpenAI Service if they get hold of the access key for OpenAI. Refer to Azure OpenAI Landing Zone reference architecture to build a secure and scalable AI environment.
Additional Considerations
- If the client application is external, consider using an Application Gateway in front of the Azure APIM
- If "streaming" is set to true, tokens count is not returned in response. In that that case libraries like tiktoken (Python), orgpt-3-encoder(javascript) for most GPT-3 models can be used to programmatically calculate tokens count for the user prompt and completion response. A useful guideline to remember is that in typical English text, one token is approximately equal to around 4 characters. This equates to about three-quarters of a word, meaning that 100 tokens are roughly equivalent to 75 words. (P.S. Microsoft does not endorse or guarantee any third-party libraries.)
- A subscription key or a custom header like app-key can also be used to uniquely identify the client as appId in OAuth token is not very intuitive.
- Rate-limiting can be implemented for incoming requests using OAuth tokens or Subscription Keys, adding another layer of security and resource management.
- The solution can also be extended to redirect different clients to different Azure OpenAI instances. For example., some clients utilize an Azure OpenAI instance with default quotas, whereas premium clients get to consume Azure Open AI instance with dedicated throughput.
Conclusion
Azure OpenAI Service stands as an indispensable tool for organizations seeking to harness the immense power of language models. With the feature of provisioned throughput, clients can define their usage limits in throughput units and freely allocate these to the OpenAI model of their choice. However, the financial commitment can be significant and is dependent on factors like the chosen model's type, size, and utilization. An effective chargeback system offers several advantages, such as heightened accountability, transparent costing, and judicious use of resources within the organization.