ani_ms_emea
Oct 25, 2025

BYOPI - Design your own custom private AI Search indexer with no code ADF (SQLServer on VM example)

Executive Summary

Building a fully private search indexing solution that uses Azure Data Factory (ADF) to sync SQL Server data from a private VM to Azure AI Search is achievable, but it comes with notable complexities and limitations.

This blog shares my journey, discoveries, and honest assessment of the BYOPI (Build Your Own Private Indexer) architecture.

 

Architectural flow:

 

Table of Contents

  1. Overall Setup
  2. How ADF works in this approach with Azure AI Search
  3. Challenges - discovered
  4. Pros and Cons: An Honest Assessment
  5. Conclusion and Recommendations

 

1. Overall Setup:

Phase 1: Resource Group & Network Setup: create a resource group and VNet (virtual network) in any region of your choice.
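
If you prefer scripting over the portal, here is a minimal Az PowerShell sketch (assumes the Az module is installed and you are signed in with Connect-AzAccount; names, region, and address ranges are illustrative - the 10.0.2.0/24 endpoint subnet lines up with the private IPs shown in Phase 11):

# Resource group and virtual network for the whole BYOPI setup
$rg       = "rg-byopi"
$location = "eastus"

New-AzResourceGroup -Name $rg -Location $location

# One subnet for the SQL VM / SHIR, one dedicated subnet for private endpoints
$vmSubnet = New-AzVirtualNetworkSubnetConfig -Name "snet-vm" -AddressPrefix "10.0.1.0/24"
$peSubnet = New-AzVirtualNetworkSubnetConfig -Name "snet-endpoints" -AddressPrefix "10.0.2.0/24"

New-AzVirtualNetwork -Name "vnet-byopi" -ResourceGroupName $rg -Location $location `
    -AddressPrefix "10.0.0.0/16" -Subnet $vmSubnet, $peSubnet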

Phase 2: Deploy SQL Server VM:

Phase 3: Create Azure Services - ADF (Azure Data Factory), Azure AI Search, and AKV (Azure Key Vault) - from the portal or your preferred deployment method.
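
The three services can also be created with Az PowerShell; a hedged sketch using the names referenced throughout this post (assumes the Az.DataFactory, Az.Search, and Az.KeyVault modules):

# Data Factory, AI Search (Basic tier) and Key Vault in the same resource group
$rg       = "rg-byopi"
$location = "eastus"

New-AzDataFactoryV2 -ResourceGroupName $rg -Name "adf-byopi-private" -Location $location

New-AzSearchService -ResourceGroupName $rg -Name "search-byopi" -Location $location `
    -Sku "Basic" -PartitionCount 1 -ReplicaCount 1

New-AzKeyVault -ResourceGroupName $rg -VaultName "kv-byopi" -Location $location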

 

Phase 4: Create Private Endpoints for all the services in their dedicated subnets:
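
A hedged sketch for one of the endpoints (Key Vault) is below; repeat it with the appropriate -GroupId per service - "vault" for Key Vault, "searchService" for AI Search, and "dataFactory" plus "portal" for Data Factory / ADF Studio. Names are the ones used in this post, and private DNS zone setup is still required:

# Private endpoint for Key Vault in the dedicated endpoint subnet
$rg       = "rg-byopi"
$vnet     = Get-AzVirtualNetwork -Name "vnet-byopi" -ResourceGroupName $rg
$peSubnet = $vnet.Subnets | Where-Object Name -eq "snet-endpoints"
$kv       = Get-AzKeyVault -VaultName "kv-byopi" -ResourceGroupName $rg

$conn = New-AzPrivateLinkServiceConnection -Name "plsc-kv-byopi" `
    -PrivateLinkServiceId $kv.ResourceId -GroupId "vault"

New-AzPrivateEndpoint -Name "pe-kv-byopi" -ResourceGroupName $rg -Location "eastus" `
    -Subnet $peSubnet -PrivateLinkServiceConnection $conn

# Link the matching privatelink.* private DNS zones to the VNet so the service
# hostnames resolve to private IPs (verified with nslookup in Phase 11)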

 

Phase 5: Configure SQL Server on the VM: connect to the VM via Bastion and set up the database, tables, and stored procedure:

 

Sample schema and seed data used below:

CREATE DATABASE BYOPI_DB;
GO

USE BYOPI_DB;
GO

CREATE TABLE Products (
    ProductId INT IDENTITY(1,1) PRIMARY KEY,
    ProductName NVARCHAR(200) NOT NULL,
    Description NVARCHAR(MAX),
    Category NVARCHAR(100),
    Price DECIMAL(10,2),
    InStock BIT DEFAULT 1,
    Tags NVARCHAR(500),
    IsDeleted BIT DEFAULT 0,
    CreatedDate DATETIME DEFAULT GETDATE(),
    ModifiedDate DATETIME DEFAULT GETDATE()
);

CREATE TABLE WatermarkTable (
    TableName NVARCHAR(100) PRIMARY KEY,
    WatermarkValue DATETIME
);

INSERT INTO WatermarkTable VALUES ('Products', '2024-01-01');

CREATE PROCEDURE sp_update_watermark 
    @TableName NVARCHAR(100),
    @NewWatermark DATETIME
AS
BEGIN
    UPDATE WatermarkTable 
    SET WatermarkValue = @NewWatermark 
    WHERE TableName = @TableName;
END;

INSERT INTO Products (ProductName, Description, Category, Price, Tags) VALUES
('Laptop Pro', 'High-end laptop', 'Electronics', 1299.99, 'laptop,computer'),
('Office Desk', 'Adjustable desk', 'Furniture', 599.99, 'desk,office'),
('Wireless Mouse', 'Bluetooth mouse', 'Electronics', 29.99, 'mouse,wireless');

 

 

 

 

Phase 6: Install Self-Hosted Integration Runtime

Create SHIR in ADF:
  1. Go to ADF resource in Azure Portal
  2. Click "Open Azure Data Factory Studio"
    • Note: You need to access from a VM in the same VNet or via VPN since ADF is private
  3. In ADF Studio, click Manage (toolbox icon)
  4. Select Integration runtimes → "+ New"
  5. Select "Azure, Self-Hosted" → "Self-Hosted"
  6. Name: SHIR-BYOPI (or a name of your choice)
  7. Click "Create"
  8. Copy Key1 (save it)
Install SHIR on VM
  1. In the VM (via Bastion):
  2. Open browser, go to: https://www.microsoft.com/download/details.aspx?id=39717
  3. Download and install Integration Runtime
  4. During setup:
    • Launch Configuration Manager
    • Paste the Key1 copied earlier (or register from the command line as sketched below)
    • Click "Register"
    • Wait for "Connected" status
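
If you prefer registering the runtime from the command line rather than the Configuration Manager UI, the SHIR install ships a dmgcmd.exe utility; a hedged sketch (the version folder in the path varies with the installed SHIR build):

# Run on the VM after installing the Integration Runtime MSI
$dmgcmd  = "C:\Program Files\Microsoft Integration Runtime\5.0\Shared\dmgcmd.exe"
$authKey = "<Key1 copied from ADF Studio>"

& $dmgcmd -RegisterNewNode $authKey   # registers this node against SHIR-BYOPI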


Phase 7: Create the search index with the PowerShell script below (save it as search_index.ps1):

 

# Creates or updates the products-index via the Azure AI Search REST API
$searchService = "search-byopi"
$apiKey = "YOUR-ADMIN-KEY"    # admin key of the search service

$headers = @{
    'api-key' = $apiKey
    'Content-Type' = 'application/json'
}

$index = @{
    name = "products-index"
    fields = @(
        @{name="id"; type="Edm.String"; key=$true}
        @{name="productName"; type="Edm.String"; searchable=$true}
        @{name="description"; type="Edm.String"; searchable=$true}
        @{name="category"; type="Edm.String"; filterable=$true; facetable=$true}
        @{name="price"; type="Edm.Double"; filterable=$true}
        @{name="inStock"; type="Edm.Boolean"; filterable=$true}
        @{name="tags"; type="Collection(Edm.String)"; searchable=$true}
    )
} | ConvertTo-Json -Depth 10

# PUT creates the index if it does not exist, or updates its definition if it does
Invoke-RestMethod `
    -Uri "https://$searchService.search.windows.net/indexes/products-index?api-version=2020-06-30" `
    -Method PUT `
    -Headers $headers `
    -Body $index

 

 

 

 

Phase 8: Configure AKV & ADF Components - Link AKV and ADF for secrets

Create Key Vault Secrets

  1. Navigate to kv-byopi (created AKV resource) in Portal
  2. Go to "Access policies"
  3. Click "+ Create"
  4. Select permissions: Get, List for secrets
  5. Select principal: adf-byopi-private
  6. Create
  7. Go to "Secrets" → "+ Generate/Import" (these steps can also be scripted; see the sketch below):
    • Name: sql-password, Value: <>
    • Name: search-api-key, Value: Your search key
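
A minimal Az PowerShell sketch of the same steps (assumes the data factory's system-assigned managed identity and the Az.KeyVault / Az.DataFactory modules):

# Grant the data factory's managed identity read access to secrets
$adf = Get-AzDataFactoryV2 -ResourceGroupName "rg-byopi" -Name "adf-byopi-private"

Set-AzKeyVaultAccessPolicy -VaultName "kv-byopi" -ObjectId $adf.Identity.PrincipalId `
    -PermissionsToSecrets Get, List

# Create the two secrets referenced by the linked services
Set-AzKeyVaultSecret -VaultName "kv-byopi" -Name "sql-password" `
    -SecretValue (Read-Host "SQL password" -AsSecureString)
Set-AzKeyVaultSecret -VaultName "kv-byopi" -Name "search-api-key" `
    -SecretValue (Read-Host "Search admin key" -AsSecureString)
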
Create Linked Services in ADF

Access ADF Studio from the VM (since it's private). A scripted JSON alternative for the SQL Server linked service is sketched after this list.

  1. Key Vault Linked Service:
    • Manage → Linked services → "+ New"
    • Search "Azure Key Vault"
    • Configure:
      • Name: LS_KeyVault
      • Azure Key Vault: kv-byopi
      • Integration runtime: AutoResolveIntegrationRuntime
    • Test connection → Create
  2. SQL Server Linked Service:
    • "+ New" → "SQL Server"
    • Configure:
      • Name: LS_SqlServer
      • Connect via: SHIR-BYOPI
      • Server name: localhost
      • Database: BYOPI_DB
      • Authentication: SQL Authentication
      • User: sqladmin
      • Password: Select from Key Vault → LS_KeyVault → sql-password
    • Test → Create
  3. Azure Search Linked Service:
    • "+ New" → "Azure Search"
    • Configure:
      • Name: LS_AzureSearch
      • URL: https://search-byopi.search.windows.net
      • Connect via: SHIR-BYOPI (important - use the SHIR)
      • API Key: From Key Vault → LS_KeyVault → search-api-key
    • Test → Create
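
If you prefer source-controlled definitions, the same linked services can be deployed as JSON from PowerShell. A hedged sketch for the SQL Server linked service (property names follow the public ADF linked-service schema; adjust names and connection details to your environment):

# Deploy LS_SqlServer from a JSON definition via the SHIR
$definition = @'
{
    "properties": {
        "type": "SqlServer",
        "connectVia": { "referenceName": "SHIR-BYOPI", "type": "IntegrationRuntimeReference" },
        "typeProperties": {
            "connectionString": "Data Source=localhost;Initial Catalog=BYOPI_DB;User ID=sqladmin;",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": { "referenceName": "LS_KeyVault", "type": "LinkedServiceReference" },
                "secretName": "sql-password"
            }
        }
    }
}
'@
$definition | Set-Content "LS_SqlServer.json"

Set-AzDataFactoryV2LinkedService -ResourceGroupName "rg-byopi" -DataFactoryName "adf-byopi-private" `
    -Name "LS_SqlServer" -DefinitionFile "LS_SqlServer.json" -Force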

 

 

Phase 9: Create ADF Datasets and Pipeline

Create Datasets

 

SQL Products Dataset:

  1. Author → Datasets → "+" → "New dataset"
  2. Select "SQL Server" → Continue
  3. Select "Table" → Continue
  4. Properties:
    • Name: DS_SQL_Products
    • Linked service: LS_SqlServer
    • Table: select Products
  5. Click OK

Watermark Dataset:

  1. Repeat the steps above with:
    • Name: DS_SQL_Watermark
    • Table: WatermarkTable

Search Dataset:

  1. "+" → "Azure Search"
  2. Properties:
    • Name: DS_Search_Index
    • Linked service: LS_AzureSearch
    • Index name: products-index

Create Pipeline
  1. Author → Pipelines → "+" → "Pipeline"
  2. Name: PL_BYOPI_Private
  3. From Activities → General, drag "Lookup" activity
  4. Configure Lookup 1:
    • Name: LookupOldWatermark
    • Settings:
      • Source dataset: DS_SQL_Watermark
      • Query: below


SELECT WatermarkValue FROM WatermarkTable WHERE TableName='Products'

      • First row only: ✓

  5. Add another Lookup:
  • Name: LookupNewWatermark
  • Query: below


SELECT MAX(ModifiedDate) as NewWatermark FROM Products

  6. Add a Copy Data activity:
    • Name: CopyToSearchIndex
    • Source:
      • Dataset: DS_SQL_Products
      • Query:


SELECT
    CAST(ProductId AS NVARCHAR(50)) as id,
    ProductName as productName,
    Description as description,
    Category as category,
    Price as price,
    InStock as inStock,
    Tags as tags,
    CASE WHEN IsDeleted = 1 THEN 'delete' ELSE 'upload' END as [@search.action]
FROM Products
WHERE ModifiedDate > '@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}'
  AND ModifiedDate <= '@{activity('LookupNewWatermark').output.firstRow.NewWatermark}'

  • Sink:
    • Dataset: DS_Search_Index
    • Write behavior: Merge
    • Batch size: 1000
  7. Add a Stored Procedure activity:
    • Name: UpdateWatermark
    • SQL Account: LS_SqlServer
    • Stored procedure: sp_update_watermark
    • Parameters:
      • TableName: Products
      • NewWatermark: @{activity('LookupNewWatermark').output.firstRow.NewWatermark}
  8. Connect the activities with success (green) conditions

Phase 10: Test and Schedule
Test Pipeline
  1. Click "Debug" in pipeline
  2. Monitor in Output panel
  3. Check for green checkmarks
Create Trigger
  1. In pipeline, click "Add trigger" → "New/Edit"
  2. Click "+ New"
  3. Configure:
    • Name: TR_Hourly
    • Type: Schedule
    • Recurrence: Every 1 Hour
  4. OK → Publish All (a scripted alternative using Az PowerShell is sketched below)
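
The trigger can also be deployed and started with Az PowerShell; a hedged sketch (the definition follows the public ScheduleTrigger schema; the start time is illustrative):

# TR_Hourly: run PL_BYOPI_Private every hour
$trigger = @'
{
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": { "frequency": "Hour", "interval": 1, "startTime": "2025-11-01T00:00:00Z", "timeZone": "UTC" }
        },
        "pipelines": [
            { "pipelineReference": { "type": "PipelineReference", "referenceName": "PL_BYOPI_Private" } }
        ]
    }
}
'@
$trigger | Set-Content "TR_Hourly.json"

Set-AzDataFactoryV2Trigger -ResourceGroupName "rg-byopi" -DataFactoryName "adf-byopi-private" `
    -Name "TR_Hourly" -DefinitionFile "TR_Hourly.json" -Force
Start-AzDataFactoryV2Trigger -ResourceGroupName "rg-byopi" -DataFactoryName "adf-byopi-private" `
    -Name "TR_Hourly" -Force
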
Monitor
  1. Go to Monitor tab
  2. View Pipeline runs
  3. Check Trigger runs

 

 

 

Your pipeline should look like this:

 

 

Phase 11: Validation & Testing

Verify Private Connectivity
From the VM, run PowerShell:
# Test DNS resolution (should return private IPs)
nslookup adf-byopi-private.datafactory.azure.net
# Should show private IP like : 10.0.2.x

nslookup search-byopi.search.windows.net
# Should show private IP like : 10.0.2.x

nslookup kv-byopi.vault.azure.net
# Should show private IP like : 10.0.2.x

# Test Search
$headers = @{ 'api-key' = 'YOUR-KEY' }
Invoke-RestMethod -Uri "https://search-byopi.search.windows.net/indexes/products-index/docs?`$count=true&api-version=2020-06-30" -Headers $headers

Test data sync (add a few records) and verify in the search index:

 

-- Add test record
INSERT INTO Products (ProductName, Description, Category, Price, Tags)
VALUES ('Test Product Private', 'Testing private pipeline', 'Test', 199.99, 'test,private');
-- Trigger the pipeline manually or wait for the schedule
-- Then verify in the Search index

 

2. How ADF works in this approach with Azure AI Search:

Azure AI Search exposes a REST API for indexing (also referred to as uploading documents). When the ADF sink uploads data to AI Search, it is actually making HTTP POST requests, for example:

POST https://search-byopi.search.windows.net/indexes/products-index/docs/index?api-version=2020-06-30
Content-Type: application/json
api-key: YOUR-ADMIN-KEY

{
  "value": [
    {
      "@search.action": "upload",
      "id": "1",
      "productName": "Laptop",
      "price": 999.99
    },
    {
      "@search.action": "delete",
      "id": "2"
    }
  ]
}

 

Note that the delete action here is driven by a soft-delete flag (IsDeleted) in the source table, not by hard deletes of the rows.

 

pipeline query:

SELECT 
    CAST(ProductId AS NVARCHAR(50)) as id,  -- Renamed to match index field
    ProductName as productName,              -- Renamed to match index field
    Description as description,
    Category as category,
    Price as price,
    InStock as inStock,
    Tags as tags,
    CASE 
        WHEN IsDeleted = 1 THEN 'delete'
        ELSE 'upload'
    END as [@search.action]                  -- Special field with @ prefix
FROM Products
WHERE ModifiedDate > '2024-01-01'

Returns this result set:

id  | productName    | description      | category    | price  | inStock | tags           | @search.action
----|----------------|------------------|-------------|--------|---------|----------------|---------------
1   | Laptop Pro     | High-end laptop  | Electronics | 1299   | 1       | laptop,computer| upload
2   | Office Chair   | Ergonomic chair  | Furniture   | 399    | 1       | chair,office   | upload
3   | Deleted Item   | Old product      | Archive     | 0      | 0       | old            | delete

The @search.action Field - The Magic Control

This special field tells Azure AI Search what to do with each document:

| @search.action | What It Does | When to Use | What Happens If Document... |
|----------------|--------------|-------------|------------------------------|
| upload | Insert OR Update | Most common - upsert operation | Exists: Updates it. Doesn't exist: Creates it |
| merge | Update only | When you know it exists | Exists: Updates specified fields. Doesn't exist: ERROR |
| mergeOrUpload | Update OR Insert | Safe update | Exists: Updates fields. Doesn't exist: Creates it |
| delete | Remove from index | To remove documents | Exists: Deletes it. Doesn't exist: Ignores (no error) |

 

ADF automatically converts the SQL result set to the JSON format required by Azure Search:

 

{
  "value": [
    {
      "@search.action": "upload",
      "id": "1",
      "productName": "Laptop Pro",
      "description": "High-end laptop",
      "category": "Electronics",
      "price": 1299.00,
      "inStock": true,
      "tags": "laptop,computer"
    },
    {
      "@search.action": "upload",
      "id": "2",
      "productName": "Office Chair",
      "description": "Ergonomic chair",
      "category": "Furniture",
      "price": 399.00,
      "inStock": true,
      "tags": "chair,office"
    },
    {
      "@search.action": "delete",
      "id": "3"  // For delete, only ID is needed
    }
  ]
}

ADF doesn't send all records at once. It batches them based on writeBatchSize, and each batch is a separate HTTP POST to Azure Search.
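
To make that concrete, here is a hedged PowerShell sketch of what the sink effectively does - split the rowset into chunks of writeBatchSize documents and POST each chunk to the index's docs/index endpoint. It is illustrative only; ADF performs this internally, and $documents stands in for the rows returned by the source query, already shaped like the JSON above:

$searchService = "search-byopi"
$apiKey        = "YOUR-ADMIN-KEY"
$batchSize     = 1000   # corresponds to the sink's writeBatchSize
$uri     = "https://$searchService.search.windows.net/indexes/products-index/docs/index?api-version=2020-06-30"
$headers = @{ 'api-key' = $apiKey; 'Content-Type' = 'application/json' }

for ($i = 0; $i -lt $documents.Count; $i += $batchSize) {
    $end   = [Math]::Min($i + $batchSize, $documents.Count) - 1
    $batch = @($documents[$i..$end])                  # one batch of up to 1000 docs
    $body  = @{ value = $batch } | ConvertTo-Json -Depth 10
    Invoke-RestMethod -Uri $uri -Method POST -Headers $headers -Body $body
}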

 

How ADF detects new changes and runs batches:

The watermark is updated after each successful ADF run, so the next run only picks up changes made since the last successful sync, as below:

 

Handling different scenarios:

 

Scenario 1: No Changes Between Runs:

Run at 10:00 AM:
- Old Watermark: 09:45:00
- New Watermark: 10:00:00
- Query: WHERE ModifiedDate > '09:45' AND <= '10:00'
- Result: 0 rows
- Action: Still update watermark to 10:00
- Why: Prevents reprocessing if changes come later

 

Scenario 2: Bulk Insert Happens:

Someone inserts 5000 records at 10:05 AM

Run at 10:15 AM:
- Old Watermark: 10:00:00
- New Watermark: 10:15:00
- Query: WHERE ModifiedDate > '10:00' AND <= '10:15'
- Result: 5000 rows
- Action: Process all 5000, update watermark to 10:15

 

Scenario 3: Pipeline Fails

Run at 10:30 AM:
- Old Watermark: 10:15:00 (unchanged from last success)
- Pipeline fails during Copy activity
- Watermark NOT updated (still 10:15:00)

Next Run at 10:45 AM:
- Old Watermark: 10:15:00 (still the last successful)
- New Watermark: 10:45:00
- Query: WHERE ModifiedDate > '10:15' AND <= '10:45'
- Result: Gets ALL changes from 10:15 to 10:45 (30 minutes of data)
- No data loss!

 

Note:

There is still room for improvement by refining this logic to handle more advanced scenarios. However, I have not examined the logic in depth, as the goal here is to review how the overall setup functions, identify its limitations, and compare it with the indexing solutions available in AI Search.

 

3. Challenges discovered:

When I set out to build a private search indexer for SQL Server data residing on an Azure VM with no public IP, the solution seemed straightforward: use Azure Data Factory to orchestrate the data movement to Azure AI Search.

The materials made it sound simple. The reality? It's possible, but the devil is in the details.

 

What We Needed:
✅ SQL Server on private VM (no public IP)
✅ Azure AI Search with private endpoint
✅ No data over public internet
✅ Support for full CRUD operations
✅ Near real-time synchronization
✅ No-code/low-code solution

 

Reality Check:
⚠️ DELETE operations not natively supported in ADF sink
⚠️ Complex networking requirements
⚠️ Higher costs than expected
⚠️ Significant setup complexity
✅ But it IS possible with workarounds

Components Required
  • Azure VM: ~$150/month (D4s_v3)
  • Self-Hosted Integration Runtime: Free (runs on VM)
  • Private Endpoints: ~$30/month (approx 3 endpoints)
  • Azure Data Factory: ~$15-60/month (depends on frequency)
  • Azure AI Search: ~$75/month (Basic tier)
  • Total: ~$270-315/month

 

The DELETE Challenge:

Despite Azure AI Search REST API fully supporting delete operations via @search.action, ADF's native Azure Search sink does NOT support delete operations. This isn't clearly documented and catches many architects off guard.

-- This SQL query with delete action
SELECT 
    ProductId as id,
    CASE WHEN IsDeleted = 1 THEN 'delete' ELSE 'upload' END as [@search.action]
FROM Products

-- Will NOT delete documents in Azure Search when using Copy activity
-- The @search.action = 'delete' is ignored by ADF sink!

 

Nevertheless, there is a workaround using the Web Activity approach or by calling the REST API from the ADF side to perform the delete operation.

{
  "name": "DeleteViaREST",
  "type": "Web",
  "typeProperties": {
    "url": "https://search.windows.net/indexes/index/docs/index",
    "method": "POST",
    "body": {
      "value": [
        {"@search.action": "delete", "id": "123"}
      ]
    }
  }
}
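
The same workaround can also be scripted outside ADF: read the soft-deleted keys from SQL Server and post explicit delete actions to the index. A hedged sketch (assumes the SqlServer PowerShell module for Invoke-Sqlcmd and the names used earlier in this post):

# Collect the keys of soft-deleted rows
$deleted = Invoke-Sqlcmd -ServerInstance "localhost" -Database "BYOPI_DB" `
    -Username "sqladmin" -Password "<sql-password>" `
    -Query "SELECT CAST(ProductId AS NVARCHAR(50)) AS id FROM Products WHERE IsDeleted = 1"

if ($deleted) {
    # Push delete actions that the Copy activity sink would otherwise ignore
    $actions = @($deleted | ForEach-Object { @{ '@search.action' = 'delete'; id = $_.id } })
    $body    = @{ value = $actions } | ConvertTo-Json -Depth 10
    $headers = @{ 'api-key' = 'YOUR-ADMIN-KEY'; 'Content-Type' = 'application/json' }
    Invoke-RestMethod -Method POST -Headers $headers -Body $body `
        -Uri "https://search-byopi.search.windows.net/indexes/products-index/docs/index?api-version=2020-06-30"
}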

 

Development Challenges
  1. No Direct Portal Access: With ADF private, you need one of the following:
    • Jump box in the same VNet
    • VPN connection
    • Bastion for access
  2. Testing Complexity:
    • Can't use Postman from local machine
    • Need to test from within VNet
    • Debugging requires multiple tools

 

 

4. Pros and Cons: An Honest Assessment:

Pros:
  1. Security: Complete network isolation
  2. Compliance: Meets strict requirements
  3. No-Code: Mostly configuration-based
  4. Scalability: Can handle large datasets
  5. Monitoring: Built-in ADF monitoring
  6. Managed Service: Microsoft handles updates
Cons:
  1. DELETE Complexity: Not natively supported
  2. Cost: Higher than expected
  3. Setup Complexity: Many moving parts
  4. Debugging: Difficult with private endpoints

 

Hidden Gotchas:


- SHIR requires Windows VM (Linux in preview)
- Private endpoint DNS propagation delays
- ADF Studio timeout with private endpoints
- SHIR auto-update can break pipelines

 

 

5. Conclusion and Recommendations:

When to Use BYOPI:

✅ Good Fit:

  • Strict security requirements
  • Need indexing for an otherwise unsupported scenario, for example SQL Server residing on a private VM
  • Budget > $500/month 
  • Team familiar with Azure networking
  • Read-heavy workloads

❌ Poor Fit:

  • Simple search requirements
  • Budget conscious
  • Need real-time updates
  • Heavy DELETE operations
  • Small team without Azure expertise

 

BYOPI works, but it's more complex and expensive than initially expected. The lack of native DELETE support in ADF sink is a significant limitation that requires workarounds.

Key Takeaways
  1. It works but requires significant effort
  2. DELETE (hard) operations need workarounds
  3. Costs will be higher than expected
  4. Complexity is substantial for a "no-code" solution
  5. Alternative solutions might be better for many scenarios

 

Disclaimer:

The sample scripts provided in this article are provided AS IS without warranty of any kind. The author is not responsible for any issues, damages, or problems that may arise from using these scripts. Users should thoroughly test any implementation in their environment before deploying to production. Azure services and APIs may change over time, which could affect the functionality of the provided scripts. Always refer to the latest Azure documentation for the most up-to-date information.

 

Thanks for reading this blog! I hope you've found this approach to creating your own private indexing solution for Azure AI Search (BYOPI) useful 😀

 
