Vaibhav Pandey is a Senior Cloud Solution Architect at Microsoft focused on building production‑ready Azure AI solutions with Azure Machine Learning and Microsoft Foundry, spanning agentic and multi‑agent architectures, RAG pipelines, and BYOM extensibility with strong security and governance controls.
Modern AI-powered applications running on Azure increasingly require flexibility in model choice. While managed model catalogs accelerate time to value, real-world enterprise applications often need to:
- Host open‑source or fine‑tuned models
- Deploy domain‑specific or regulated models inside a tenant boundary
- Maintain tight control over runtime environments and versions
- Integrate AI inference into existing application architectures
This is where Bring Your Own Model (BYOM) becomes a core architectural capability, not just an AI feature.
In this post, we’ll walk through a production-ready BYOM pattern for Azure applications, using:
- Azure Machine Learning as the model lifecycle and inference platform
- Azure-hosted applications (and optionally Microsoft Foundry) as the orchestration layer
The focus is on building scalable, governable AI-powered apps on Azure, not platform lock‑in.
We use SmolLM‑135M as a reference model. The same pattern applies to any open‑source or proprietary model.
Reference Architecture: Azure BYOM for AI Applications
At a high level, the responsibilities are clearly separated:
| Azure Layer | Responsibility |
|---|---|
| Azure Application Layer | API, app logic, orchestration, agent logic |
| Azure Machine Learning | Model registration, environments, scalable inference |
| Azure Identity & Networking | Authentication, RBAC, private endpoints |
Key principle:
Applications orchestrate. Azure ML executes the model.
This keeps AI workloads modular, auditable, and production-safe.
BYOM Workflow Overview
- Provision Azure Machine Learning
- Create Azure ML compute
- Author code in an Azure ML notebook
- Download and package the model
- Register the model
- Define a reproducible inference environment
- Implement scoring logic
- Deploy a managed online endpoint
- Use the endpoint from Microsoft Foundry
Step 1: Provision Azure Machine Learning
An Azure ML workspace is the governance boundary for BYOM:
- Model versioning and lineage
- Environment definitions
- Secure endpoint hosting
- Auditability
Choose region carefully for latency, data residency, and networking.
Step 2: Create Azure ML Compute (Compute Instance)
Create a Compute Instance in Azure ML Studio.
Why this matters:
- Managed Jupyter environment
- Identity integrated (no secrets in notebooks)
- Ideal for model packaging and testing
Practical tips:
- Enable auto‑shutdown for cost control
- A CPU SKU is sufficient for most development workflows
Step 3: Create an Azure ML Notebook
- Open Azure ML Studio → Notebooks
- Create a new Python notebook
- Select the Python SDK v2 kernel
This notebook will handle the entire BYOM lifecycle.
Step 4: Connect to the Azure ML Workspace
# Import Azure ML SDK client
from azure.ai.ml import MLClient
# Import identity library for secure authentication
from azure.identity import DefaultAzureCredential
# Define workspace details
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace_name = "<WORKSPACE_NAME>"
# Create MLClient using Microsoft Entra ID
# No keys or secrets are embedded in code
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id,
    resource_group,
    workspace_name,
)
The code above uses enterprise identity and aligns with zero‑trust practices.
Step 5: Download and Package Model Artifacts
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
# Hugging Face model identifier
model_id = "HuggingFaceTB/SmolLM-135M"
# Local directory where model artifacts will be stored
model_dir = "smollm_135m"
os.makedirs(model_dir, exist_ok=True)
# Download model weights
model = AutoModelForCausalLM.from_pretrained(model_id)
# Download tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Save artifacts locally
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)
🔹 Open‑source or proprietary models follow the same packaging pattern
🔹 Azure ML treats all registered models identically
Step 6: Register the Model in Azure ML
Register the packaged artifacts as a custom model asset. Registering the model:
- Enables version tracking
- Supports rolling upgrades
- Integrates with CI/CD pipelines
This is the foundation for repeatable inference deployments.
from azure.ai.ml.entities import Model
# Create a model asset in Azure ML
registered_model = Model(
    path=model_dir,
    name="SmolLM-135M",
    description="BYOM model for Microsoft Foundry extensibility",
    type="custom_model",
)
# Register (or update) the model
ml_client.models.create_or_update(registered_model)
Step 7: Define a Reproducible Inference Environment
name: dev-hf-base
channels:
  - conda-forge
dependencies:
  - python=3.12
  - numpy=2.3.1
  - pip=25.1.1
  - scipy=1.16.1
  - pip:
      - azureml-inference-server-http==1.4.1
      - inference-schema[numpy-support]
      - accelerate==1.10.0
      - einops==0.8.1
      - torch==2.4.0  # torch >= 2.2 is required for Python 3.12
      - transformers==4.55.2
⚠️ Environment management is the hardest part of BYOM
✅ Treat environment changes like code changes
BYOM Inference Patterns
The same model can expose multiple behaviors.
Pattern 1: Text Generation Endpoint
This is the most common pattern for AI-powered applications:
- REST-based text generation
- Stateless inference
- Horizontal scaling through Azure ML managed endpoints
Ideal for:
- Copilots
- Chat APIs
- Summarization or content generation services
Scoring Script (score.py)
import os
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
def init():
    """
    Called once when the container starts.
    Loads the model and tokenizer into memory.
    """
    global model, tokenizer
    # Azure ML mounts the registered model under AZUREML_MODEL_DIR;
    # the artifacts live in the folder name used at registration time
    model_dir = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "smollm_135m")
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    model.eval()
def run(raw_data):
    """
    Called for each inference request.
    Expects JSON input with a 'prompt' field.
    """
    data = json.loads(raw_data)
    prompt = data.get("prompt", "")
    # Tokenize input text
    inputs = tokenizer(prompt, return_tensors="pt")
    # Generate text without tracking gradients
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=100)
    # Decode output tokens into text
    response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"response": response_text}
Example Request
{
  "prompt": "Summarize the BYOM pattern in one sentence."
}
Example Response
{
  "response": "Bring Your Own Model (BYOM) allows organizations to extend Microsoft Foundry with custom models hosted on Azure Machine Learning while maintaining enterprise governance and scalability."
}
Pattern 2: Predictive / Token Rank Analysis
The same model can expose non-generative behaviors, such as:
- Token likelihood analysis
- Ranking or scoring
- Model introspection services
This enables AI-backed analytics capabilities, not just chat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
class PredictiveAnalysisModel:
    """
    Computes the rank of each token based on the model's
    next-token probability distribution.
    """
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.model.eval()

    def analyze(self, text):
        tokens = self.tokenizer.tokenize(text)
        token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        # Start with the BOS token as context
        input_sequence = [self.tokenizer.bos_token_id, *token_ids]
        results = []
        for i in range(len(token_ids)):
            context = input_sequence[: i + 1]
            model_input = torch.tensor([context])
            with torch.no_grad():
                outputs = self.model(model_input)
            logits = outputs.logits[0, -1]
            sorted_indices = torch.argsort(logits, descending=True)
            actual_token = token_ids[i]
            rank = (sorted_indices == actual_token).nonzero(as_tuple=True)[0].item()
            results.append({
                "token": tokens[i],
                "rank": rank
            })
        return results

    @classmethod
    def from_disk(cls, model_path):
        model = AutoModelForCausalLM.from_pretrained(model_path)
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        return cls(model, tokenizer)
Scoring Script (score.py)
import os
import json
from predictive_analysis import PredictiveAnalysisModel

def init():
    """
    Loads the predictive analysis model from disk.
    """
    global model
    # Azure ML mounts the registered model under AZUREML_MODEL_DIR;
    # the artifacts live in the folder name used at registration time
    model_dir = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "smollm_135m")
    model = PredictiveAnalysisModel.from_disk(model_dir)

def run(raw_data):
    """
    Expects JSON input with a 'text' field and returns token ranks.
    """
    data = json.loads(raw_data)
    text = data.get("text", "")
    return {
        "token_ranks": model.analyze(text)
    }
Example Request
{
  "text": "This is a test."
}
Example Response
{
  "token_ranks": [
    { "token": "This", "rank": 518 },
    { "token": " is", "rank": 2 },
    { "token": " a", "rank": 0 },
    { "token": " test", "rank": 33 },
    { "token": ".", "rank": 77 }
  ]
}
Consuming the BYOM Endpoint from Azure Applications
Azure ML endpoints are external inference services consumed by apps.
Option A: Application-Controlled Invocation
- App calls Azure ML endpoint directly
- IAM, networking, and retries controlled by the app
- Recommended for most production systems
import requests
import os
AML_ENDPOINT = os.environ["AML_ENDPOINT"]
AML_KEY = os.environ["AML_KEY"]
headers = {
    "Authorization": f"Bearer {AML_KEY}",
    "Content-Type": "application/json"
}
payload = {
    "prompt": "Summarize BYOM in one sentence."
}
response = requests.post(AML_ENDPOINT, json=payload, headers=headers)
print(response.json())
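Since retries are the application's responsibility in this pattern, the call above can be hardened with exponential backoff. This is a hedged sketch; the `is_transient` helper and retry parameters are illustrative conventions, not part of any Azure SDK:

```python
import os
import time

import requests

def is_transient(status: int) -> bool:
    """Treat throttling (429) and server-side errors (5xx) as retryable."""
    return status == 429 or 500 <= status < 600

def call_endpoint(payload: dict, max_retries: int = 3, backoff_seconds: float = 2.0) -> dict:
    """Call the Azure ML endpoint with exponential-backoff retries."""
    headers = {
        "Authorization": f"Bearer {os.environ['AML_KEY']}",
        "Content-Type": "application/json",
    }
    for attempt in range(max_retries):
        try:
            response = requests.post(
                os.environ["AML_ENDPOINT"], json=payload, headers=headers, timeout=30
            )
            if is_transient(response.status_code):
                raise requests.HTTPError(f"transient status {response.status_code}")
            response.raise_for_status()
            return response.json()
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == max_retries - 1:
                raise  # exhausted retries; surface the error to the caller
            time.sleep(backoff_seconds * (2 ** attempt))
```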
Option B: Tool-Based Invocation
- Expose the ML endpoint as an OpenAPI interface
- Allow higher-level orchestration layers (such as agents) to invoke it dynamically
Both patterns integrate cleanly with Azure App Services, Container Apps, Functions, and Kubernetes-based apps.
Operational Considerations
- Dependency management is ongoing work
- Model upgrades require redeployment
- Private networking must be planned early
- Use managed Foundry models where possible
- Use BYOM when business or regulatory needs require it
Security and Governance by Default
BYOM on Azure ML integrates natively with Azure platform controls:
- Entra ID & managed identity
- RBAC-based permissions
- Private networking and VNET isolation
- Centralized logging and diagnostics
This makes BYOM suitable for regulated industries and production‑critical AI workloads.
When Should You Use BYOM?
BYOM is the right choice when:
- You need model choice independence
- You want to deploy open‑source or proprietary LLMs
- You require enterprise‑grade controls
- You are building AI APIs, agents, or copilots at scale
For experimentation, higher‑level tooling may be faster. For production, BYOM provides the control and durability enterprises require.
Conclusion
Azure applications increasingly depend on AI, but models should not dictate architecture.
With Azure Machine Learning as the execution layer and Azure Apps as the orchestration layer, organizations can:
- Combine managed and custom models
- Enforce security and compliance
- Scale AI workloads reliably
- Avoid platform and vendor lock-in
Bring Your Own Model (BYOM) is no longer a niche requirement. It is a foundational pattern for enterprise AI platforms.
Azure Machine Learning enables BYOM across open‑source models, fine‑tuned variants, and proprietary LLMs, allowing organizations to innovate without being locked into a single model provider.
You build the application.
Azure delivers the platform.
You own the model.
That is the essence of BYOM on Azure.