Apps on Azure Blog

Bring Your Own Model (BYOM) for Azure AI Applications using Azure Machine Learning

vaibhavpandey
Apr 02, 2026

Vaibhav Pandey is a Senior Cloud Solution Architect at Microsoft focused on building production‑ready Azure AI solutions with Azure Machine Learning and Microsoft Foundry, spanning agentic and multi‑agent architectures, RAG pipelines, and BYOM extensibility with strong security and governance controls.

Modern AI-powered applications running on Azure increasingly require flexibility in model choice. While managed model catalogs accelerate time to value, real-world enterprise applications often need to:

  • Host open‑source or fine‑tuned models
  • Deploy domain‑specific or regulated models inside a tenant boundary
  • Maintain tight control over runtime environments and versions
  • Integrate AI inference into existing application architectures

This is where Bring Your Own Model (BYOM) becomes a core architectural capability, not just an AI feature.

In this post, we’ll walk through a production-ready BYOM pattern for Azure applications, using:

  • Azure Machine Learning as the model lifecycle and inference platform
  • Azure-hosted applications (and optionally Microsoft Foundry) as the orchestration layer

The focus is on building scalable, governable AI-powered apps on Azure, not platform lock‑in.

We use SmolLM‑135M as a reference model. The same pattern applies to any open‑source or proprietary model.

Reference Architecture: Azure BYOM for AI Applications

At a high level, the responsibilities are clearly separated:

Azure Layer                  | Responsibility
Azure Application Layer      | API, app logic, orchestration, agent logic
Azure Machine Learning       | Model registration, environments, scalable inference
Azure Identity & Networking  | Authentication, RBAC, private endpoints

 

Key principle:
Applications orchestrate. Azure ML executes the model.

This keeps AI workloads modular, auditable, and production-safe.

BYOM Workflow Overview

  1. Provision Azure Machine Learning
  2. Create Azure ML compute
  3. Author code in an Azure ML notebook
  4. Connect to the Azure ML workspace
  5. Download and package the model
  6. Register the model
  7. Define a reproducible inference environment
  8. Implement scoring logic
  9. Deploy a managed online endpoint
  10. Use the endpoint from Microsoft Foundry

Step 1: Provision Azure Machine Learning

An Azure ML workspace is the governance boundary for BYOM:

  • Model versioning and lineage
  • Environment definitions
  • Secure endpoint hosting
  • Auditability

Choose the workspace region carefully to balance latency, data residency, and networking requirements.

Step 2: Create Azure ML Compute (Compute Instance)

Create a Compute Instance in Azure ML Studio.

Why this matters:

  • Managed Jupyter environment
  • Identity integrated (no secrets in notebooks)
  • Ideal for model packaging and testing

  • Enable auto‑shutdown for cost control
  • CPU is sufficient for most development workflows

Step 3: Create an Azure ML Notebook

  • Open Azure ML Studio → Notebooks
  • Create a new Python notebook
  • Select the Python SDK v2 kernel

This notebook will handle the entire BYOM lifecycle.

Step 4: Connect to the Azure ML Workspace

# Import Azure ML SDK client
from azure.ai.ml import MLClient

# Import identity library for secure authentication
from azure.identity import DefaultAzureCredential

# Define workspace details
subscription_id = "<SUBSCRIPTION_ID>"
resource_group  = "<RESOURCE_GROUP>"
workspace_name  = "<WORKSPACE_NAME>"

# Create MLClient using Microsoft Entra ID
# No keys or secrets are embedded in code
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id,
    resource_group,
    workspace_name
)

The code above uses enterprise identity and aligns with zero‑trust practices.

Step 5: Download and Package Model Artifacts

from transformers import AutoModelForCausalLM, AutoTokenizer
import os

# Hugging Face model identifier
model_id = "HuggingFaceTB/SmolLM-135M"

# Local directory where model artifacts will be stored
model_dir = "smollm_135m"
os.makedirs(model_dir, exist_ok=True)

# Download model weights
model = AutoModelForCausalLM.from_pretrained(model_id)

# Download tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save artifacts locally
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)

🔹 Open‑source or proprietary models follow the same packaging pattern
🔹 Azure ML treats all registered models identically

Step 6: Register the Model in Azure ML

Register the packaged artifacts as a custom model asset. Registration:

  • Enables version tracking
  • Supports rolling upgrades
  • Integrates with CI/CD pipelines

This is the foundation for repeatable inference deployments.

from azure.ai.ml.entities import Model

# Create a model asset in Azure ML
registered_model = Model(
    path=model_dir,
    name="SmolLM-135M",
    description="BYOM model for Microsoft Foundry extensibility",
    type="custom_model"
)

# Register (or update) the model
ml_client.models.create_or_update(registered_model)
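Each create_or_update call produces a new model version, so downstream automation can resolve the latest one at deploy time instead of hard-coding it. A small helper sketch (the `latest_model_version` function is illustrative; `ml_client` is the client from Step 4):

```python
def latest_model_version(ml_client, name: str = "SmolLM-135M") -> str:
    """Return the highest registered version of a model asset,
    e.g. to pin 'azureml:SmolLM-135M:<version>' in a CI/CD deployment step."""
    versions = [int(m.version) for m in ml_client.models.list(name=name)]
    if not versions:
        raise LookupError(f"No registered versions found for model '{name}'")
    return str(max(versions))
```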

Step 7: Define a Reproducible Inference Environment

name: dev-hf-base
channels:
  - conda-forge
dependencies:
  - python=3.12
  - numpy=2.3.1
  - pip=25.1.1
  - scipy=1.16.1
  - pip:
      - azureml-inference-server-http==1.4.1
      - inference-schema[numpy-support]
      - accelerate==1.10.0
      - einops==0.8.1
      - torch==2.4.0
      - transformers==4.55.2

⚠️ Environment management is the hardest part of BYOM
✅ Treat environment changes like code changes
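Treating environment changes like code changes means committing the spec to the repository and creating the Azure ML environment from that file rather than hand-editing it in the Studio. A sketch, using an abbreviated copy of the spec above (the paths and base image are assumptions):

```python
from pathlib import Path

# Keep the pinned conda spec under version control next to the scoring code
spec_path = Path("environments/dev-hf-base.yml")
spec_path.parent.mkdir(parents=True, exist_ok=True)
spec_path.write_text("""\
name: dev-hf-base
channels:
  - conda-forge
dependencies:
  - python=3.12
  - pip=25.1.1
  - pip:
      - azureml-inference-server-http==1.4.1
      - torch==2.4.0
      - transformers==4.55.2
""")

# Register it as a reusable Azure ML environment (ml_client from Step 4):
# from azure.ai.ml.entities import Environment
# ml_client.environments.create_or_update(Environment(
#     name="dev-hf-base",
#     conda_file=str(spec_path),
#     image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04",  # assumed base image
# ))
```

Because the file is version-controlled, a dependency bump goes through the same review and rollback path as any other code change.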

BYOM Inference Patterns

The same model can expose multiple behaviors.

Pattern 1: Text Generation Endpoint

This is the most common pattern for AI-powered applications:

  • REST-based text generation
  • Stateless inference
  • Horizontal scaling through Azure ML managed endpoints

Ideal for:

  • Copilots
  • Chat APIs
  • Summarization or content generation services
Scoring Script (score.py)
import os
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def init():
    """
    Called once when the container starts.
    Loads the model and tokenizer into memory.
    """
    global model, tokenizer

    # Azure ML injects model path at runtime
    model_dir = os.getenv("AZUREML_MODEL_DIR")

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    model.eval()

def run(raw_data):
    """
    Called for each inference request.
    Expects JSON input with a 'prompt' field.
    """
    data = json.loads(raw_data)
    prompt = data.get("prompt", "")

    # Tokenize input text
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text without tracking gradients
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=100)

    # Decode output tokens into text
    response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return {"response": response_text}
Example Request
{
  "prompt": "Summarize the BYOM pattern in one sentence."
}
Example Response
{
  "response": "Bring Your Own Model (BYOM) allows organizations to extend Microsoft Foundry with custom models hosted on Azure Machine Learning while maintaining enterprise governance and scalability."
}

Pattern 2: Predictive / Token Rank Analysis

The same model can expose non-generative behaviors, such as:

  • Token likelihood analysis
  • Ranking or scoring
  • Model introspection services

This enables AI-backed analytics capabilities, not just chat.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class PredictiveAnalysisModel:
    """
    Computes the rank of each token based on the model's
    next-token probability distribution.
    """

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.model.eval()

    def analyze(self, text):
        tokens = self.tokenizer.tokenize(text)
        token_ids = self.tokenizer.convert_tokens_to_ids(tokens)

        # Start with BOS token
        input_sequence = [self.tokenizer.bos_token_id, *token_ids]
        results = []

        for i in range(len(token_ids)):
            context = input_sequence[: i + 1]
            model_input = torch.tensor([context])

            with torch.no_grad():
                outputs = self.model(model_input)

            logits = outputs.logits[0, -1]
            sorted_indices = torch.argsort(logits, descending=True)

            actual_token = token_ids[i]
            rank = (sorted_indices == actual_token).nonzero(as_tuple=True)[0].item()

            results.append({
                "token": tokens[i],
                "rank": rank
            })

        return results

    @classmethod
    def from_disk(cls, model_path):
        model = AutoModelForCausalLM.from_pretrained(model_path)
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        return cls(model, tokenizer)
Scoring Script (score.py)
import os
import json
from predictive_analysis import PredictiveAnalysisModel

def init():
    """
    Loads the predictive analysis model from disk.
    """
    global model
    model_dir = os.getenv("AZUREML_MODEL_DIR")
    model = PredictiveAnalysisModel.from_disk(model_dir)

def run(raw_data):
    """
    Parses the JSON request body and returns per-token ranks.
    """
    data = json.loads(raw_data)
    text = data.get("text", "")
    return {
        "token_ranks": model.analyze(text)
    }
Example Request
{
  "text": "This is a test."
}
Example Response
{
  "token_ranks": [
    { "token": "This", "rank": 518 },
    { "token": " is", "rank": 2 },
    { "token": " a", "rank": 0 },
    { "token": " test", "rank": 33 },
    { "token": ".", "rank": 77 }
  ]
}
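The rank values above are easier to interpret with a toy example. Here is the same ranking logic on a hand-made logits vector in plain Python (no model required; the numbers are invented for illustration):

```python
def token_rank(logits, actual_token_id):
    """Rank of actual_token_id when the vocabulary is sorted by descending logit.
    Rank 0 means the model's top next-token prediction matched the actual token."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order.index(actual_token_id)

# Toy vocabulary of 5 token ids with next-token logits:
logits = [0.1, 2.5, -1.0, 0.7, 1.3]

print(token_rank(logits, 1))  # 0: id 1 has the highest logit
print(token_rank(logits, 3))  # 2: ids 1 and 4 both score higher than id 3
```

Low ranks mean the text is highly predictable to the model; consistently high ranks can flag unusual or out-of-distribution input.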

Consuming the BYOM Endpoint from Azure Applications

Azure ML endpoints are external inference services consumed by apps.

Option A: Application-Controlled Invocation
  • App calls Azure ML endpoint directly
  • IAM, networking, and retries controlled by the app
  • Recommended for most production systems
import requests
import os

AML_ENDPOINT = os.environ["AML_ENDPOINT"]
AML_KEY = os.environ["AML_KEY"]

headers = {
    "Authorization": f"Bearer {AML_KEY}",
    "Content-Type": "application/json"
}

payload = {
    "prompt": "Summarize BYOM in one sentence."
}

response = requests.post(AML_ENDPOINT, json=payload, headers=headers, timeout=30)
response.raise_for_status()
print(response.json())

 

Option B: Tool-Based Invocation

  • Expose the ML endpoint as an OpenAPI interface
  • Allow higher-level orchestration layers (such as agents) to invoke it dynamically

Both patterns integrate cleanly with Azure App Services, Container Apps, Functions, and Kubernetes-based apps.
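For Option B, one common approach is to wrap the endpoint call in a plain typed function that an orchestration layer can register as a tool and describe via OpenAPI. A sketch using only the standard library (the function name and environment variables mirror Option A and are illustrative):

```python
import json
import os
import urllib.request

def generate_text(prompt: str) -> str:
    """Call the BYOM endpoint; the typed signature doubles as the tool schema."""
    req = urllib.request.Request(
        os.environ["AML_ENDPOINT"],
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['AML_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["response"]
```

Keeping the wrapper dependency-free makes it easy to embed in Functions or Container Apps; an agent framework then only needs the function's name, docstring, and signature to expose it as a callable tool.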

Operational Considerations

  • Dependency management is ongoing work
  • Model upgrades require redeployment
  • Private networking must be planned early
  • Use managed Foundry models where possible
  • Use BYOM when business or regulatory needs require it

Security and Governance by Default

BYOM on Azure ML integrates natively with Azure platform controls:

  • Entra ID & managed identity
  • RBAC-based permissions
  • Private networking and VNET isolation
  • Centralized logging and diagnostics

This makes BYOM suitable for regulated industries and production‑critical AI workloads.

When Should You Use BYOM?

BYOM is the right choice when:

  • You need model choice independence
  • You want to deploy open‑source or proprietary LLMs
  • You require enterprise‑grade controls
  • You are building AI APIs, agents, or copilots at scale

For experimentation, higher‑level tooling may be faster. For production, BYOM provides the control and durability enterprises require.

Conclusion

Azure applications increasingly depend on AI, but models should not dictate architecture.

With Azure Machine Learning as the execution layer and Azure Apps as the orchestration layer, organizations can:

  • Combine managed and custom models
  • Enforce security and compliance
  • Scale AI workloads reliably
  • Avoid platform and vendor lock-in

Bring Your Own Model (BYOM) is no longer a niche requirement. It is a foundational pattern for enterprise AI platforms.

Azure Machine Learning enables BYOM across open‑source models, fine‑tuned variants, and proprietary LLMs, allowing organizations to innovate without being locked into a single model provider.

You build the application.
Azure delivers the platform.
You own the model.

That is the essence of BYOM on Azure.

Published Apr 02, 2026
Version 1.0