Advice on how to deploy an API (scalable real-time inference using GPU, no timeout)


Hi,

 

I am hoping to run my Python code, which uses large AI models, over large videos. Can someone please advise me on the best Azure technology to use?

 

Currently I host my FastAPI Python application on an Azure Web App (CPU), and I am facing challenges since requests must complete within 230 seconds. We have a microservices architecture and can easily deploy different APIs (i.e. run long-running tasks on different infrastructure).

 

However, the one task that takes extremely long can't really be batched or broken down effectively, meaning Durable Functions are not an option. In addition, the code must run on a GPU.

 

Naturally, this makes me think AKS could be an option. However, I see online that you can use an NVIDIA Triton Inference Server via Azure ML for real-time inference.

 

The requirements are:

- Users can upload a video (1080p, up to 1 hour long) from our website.

- A GPU processes the video, extracting information.

- Processing may take longer than 230 seconds; there shouldn't be a timeout for long requests.

- Scalable: handle multiple requests from multiple users. Right now a 1-hour video can be processed in about 3 minutes at most with a large model; I'm guessing this could extend depending on location, internet speed, and the number of requests being sent.
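Requirements like these usually map onto the asynchronous request-reply pattern: the upload endpoint returns a job ID immediately, the long GPU job runs in the background, and the client polls a status endpoint, so no single HTTP request ever has to outlive the 230-second limit. Below is a minimal, stdlib-only sketch of that shape; `process_video`, the in-memory `jobs` dict, and the function names are all hypothetical stand-ins (in a real deployment the job state would live in durable storage and the worker would run on GPU infrastructure):

```python
import threading
import time
import uuid

# Hypothetical in-memory job store; in production this would be
# Blob Storage / a queue plus a status table, not a process-local dict.
jobs = {}

def process_video(job_id, video_path):
    # Stand-in for the long-running GPU inference step.
    time.sleep(0.1)  # represents minutes of real processing
    jobs[job_id]["result"] = f"metadata for {video_path}"
    jobs[job_id]["status"] = "completed"

def submit(video_path):
    """Accept the upload and return a job ID immediately
    (in a real API this would be an HTTP 202 Accepted response)."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "result": None}
    threading.Thread(target=process_video, args=(job_id, video_path)).start()
    return job_id

def get_status(job_id):
    """The status endpoint the client polls until the job completes."""
    return jobs[job_id]["status"]
```

The key design point is that `submit` never blocks on the GPU work, so request latency stays constant no matter how long the video is.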

 

Sometimes I am surprised that companies can open up something like ChatGPT to millions of users making requests at once. How are they building these real-time inference systems?

 

Thanks,

 

 

1 Reply
I believe you may be misunderstanding the capabilities of Durable Functions. Activity, orchestrator, and entity functions are subject to the same timeouts as all Azure Functions. https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale#timeout shows these timeouts, and as you can see there, a Premium or Dedicated plan allows an unlimited execution period. The trigger starts function execution, and the function needs to return within the timeout duration.

However, a single HTTP request, regardless of the function app's timeout setting, has 230 seconds (the maximum amount of time an HTTP-triggered function can take) to respond. Beyond that, you need to consider the Durable Functions async pattern. You should be able to leverage that for long processing times without an issue, but your client needs to make async status calls. https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=csha...
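In the async pattern, the starter function returns a 202 with status URLs, and the client polls the `statusQueryGetUri` endpoint until the orchestration reaches a terminal runtime status. A minimal sketch of that client-side polling loop, assuming `check_status` is any callable that returns the current runtime status string (in practice it would be an HTTP GET against the status URL):

```python
import time

def poll_until_done(check_status, interval=2.0, timeout=600.0):
    """Poll an orchestration's status until it reaches a terminal state.

    check_status: hypothetical callable returning the runtime status
    string ("Running", "Completed", "Failed", "Terminated", ...), e.g.
    a wrapper around a GET request to statusQueryGetUri.
    """
    terminal = {"Completed", "Failed", "Terminated"}
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = check_status()
        if status in terminal:
            return status
        time.sleep(interval)  # back off between polls
    raise TimeoutError("orchestration did not finish within the timeout")
```

Each individual poll is a short HTTP call, so the 230-second per-request limit is never hit even when the overall job takes much longer.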

AKS does have intrinsic GPU support, but you will need to investigate the VMSS SKUs available for use there in comparison to your planned Triton workloads. Your mileage may vary with the new isolated worker process support for Functions as well. See https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-dotnet-isolated-ov...