Unlock the full potential of your data for AI/ML applications with this comprehensive guide to deploying OpenMetadata on Azure Kubernetes Service, backed by Azure NetApp Files and NetApp® Instaclustr®. Learn how to set up a robust metadata management solution that enhances data discovery, collaboration, and governance.
Abstract
This article contains a step-by-step guide to deploying OpenMetadata on Azure Kubernetes Service (AKS), using Azure NetApp Files for storage. It also covers the deployment and configuration of PostgreSQL and OpenSearch databases to run externally from the Kubernetes cluster, following OpenMetadata best practices, managed by NetApp® Instaclustr®. This comprehensive tutorial aims to assist Microsoft and NetApp customers in overcoming the challenges of identifying and managing their data for AI/ML purposes. By following this guide, users will achieve a fully functional OpenMetadata instance, enabling efficient data discovery, enhanced collaboration, and robust data governance.
Co-authors:
- Kyle Radder, Azure NetApp Files Technical Marketing Engineer
- Michael Haigh, Senior Technical Marketing Engineer (NetApp)
Introduction
In this age of data-driven decision making, organizations are increasingly leveraging artificial intelligence and machine learning (AI/ML) to gain insights and drive innovation. However, enterprises face the challenge of identifying and managing their diverse data assets spread across various systems and platforms. Without a comprehensive understanding of available data, it’s difficult to harness the full potential of that data for AI/ML initiatives.
OpenMetadata, an open-source metadata management platform, offers a comprehensive solution to these challenges. By providing a centralized metadata catalog, OpenMetadata enables organizations to capture, organize, and discover metadata across various data sources. This ability enhances data visibility and also facilitates better data governance, collaboration, and lineage tracking—key components for successful AI/ML initiatives.
This article guides you through the process of deploying OpenMetadata on Azure Kubernetes Service (AKS), backed by Azure NetApp Files for storage. OpenMetadata’s PostgreSQL and OpenSearch databases run external to the AKS cluster with NetApp Instaclustr for optimal performance and reliability. By following this article, you will have a fully functional OpenMetadata instance that empowers you to discover and manage your data assets, facilitating your AI/ML initiatives.
Whether you're a data engineer, data scientist, or IT administrator, this step-by-step guide will help you overcome the challenges of data discovery and management, enabling you to unlock the potential of your data for AI/ML purposes.
Let's get started building a robust metadata management solution with OpenMetadata on Azure.
Prerequisites
To follow along step by step, make sure that the following items are available:
- An Azure Account with the ability to create Azure Kubernetes Service (AKS) and Azure NetApp Files resources
- A NetApp Instaclustr account with the ability to create databases
- A local workstation with Git, Azure CLI, kubectl, helm, and Terraform installed
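A quick way to confirm that the workstation tooling is in place (exact output will vary by version) is to run each tool's version command:

git --version
az version
kubectl version --client
helm version
terraform version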
Workstation setup
This GitHub repository contains the Terraform code to create the required infrastructure and helm values to deploy OpenMetadata. Run the following command to clone the repository to the local workstation (or alternatively fork the repo if you expect to make changes) and change into the new directory:
git clone https://github.com/MichaelHaigh/netapp-1p-data-catalogs.git
cd netapp-1p-data-catalogs/OpenMetadata/Azure
Install the necessary Terraform providers:
terraform init
Your workstation is now ready to deploy the necessary infrastructure, but let’s first inspect the repository contents.
Repository directory contents
The OpenMetadata/Azure directory contains several Terraform files for deploying the infrastructure and several markup files for deploying the application:
- aks.tf: Terraform code that deploys a basic managed Kubernetes cluster
- anf.tf: Terraform code that deploys Azure NetApp Files, which is used as the storage back end for the Kubernetes cluster
- default.tfvars: Terraform variable values that enable customization of the OpenMetadata deployment with the default Terraform workspace
- dependencies-values.yaml: helm values for OpenMetadata’s dependencies, primarily Apache® Airflow®
- instaclustr.tf: Terraform code that deploys an Instaclustr managed PostgreSQL cluster running on Azure NetApp Files and an OpenSearch cluster
- logs_dags_pvc.yaml: Kubernetes manifest that defines two ReadWriteMany persistent volume claims, which are provided by the default Azure NetApp Files storage class
- main.tf: Terraform code that sets the required provider versions, including credential configuration
- openmetadata-values.yaml: helm values for OpenMetadata, primarily referencing the external database and search configurations
- outputs.tf: Terraform code that defines output variables of the to-be-deployed infrastructure, which are used when deploying the OpenMetadata application
- scripts/:
- opensearch_setup.sh: shell script invoked by instaclustr.tf, which configures the OpenSearch cluster after deployment by creating an OpenMetadata user with necessary privileges
- postgresql_setup.sh: a shell script invoked by instaclustr.tf, which configures the PostgreSQL cluster after deployment by creating Airflow and OpenMetadata users and databases
- storage_class_setup.sh: a shell script invoked by aks.tf, which installs NetApp Trident™ on the AKS cluster and creates an Azure NetApp Files back end and storage class
- variables.tf: Terraform code that defines input variables and correlates with tfvar files
- vnet.tf: Terraform code that deploys Azure virtual networking
If you’re familiar with Terraform or Kubernetes and helm concepts, you can inspect these files to better understand the environment. However, most of the complexity and decision making are abstracted away by the Terraform variables, which are covered next.
Terraform variables file
As mentioned in the previous section, default.tfvars specifies the values that allow customization of each OpenMetadata deployment. This variable file and the default workspace are acceptable to use if you’re deploying a single OpenMetadata environment.
However, if your use case requires multiple OpenMetadata deployments (for example, prod and dr, or eastus and westus), you can create any number of new workspaces and copy the default variable file (be sure to match your workspace name):
terraform workspace new <workspace-name>
cp default.tfvars <workspace-name>.tfvars
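If you later need to move between deployments, the standard Terraform workspace commands apply, for example:

terraform workspace list                     # the active workspace is marked with an asterisk
terraform workspace select <workspace-name>  # switch to a different deployment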
Each deployment can then be configured independently by using its corresponding variable file. Most of the variables can be left as is; however, a handful must be changed for your environment.
Credentials
The credentials section of the variables file has two variables, both of which must be updated for your environment:
# Credentials
sp_creds = "~/.azure/azure-sp-tme-demo2-terraform.json"
ic_creds = "~/.instaclustr/instaclustr-creds.json"
The sp_creds variable points to the local file path of a credential file of a service principal, which has the following format:
{
"subscriptionId": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"appId": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"displayName": "azure-sp-terraform",
"password": "aaaaaaaaaabbbbbbbbbbccccccccccdddddddddd",
"tenant": "aaaaaaaaaabbbbbbbbbbccccccccccdddddd"
}
If you need to create a new service principal, follow the Azure Terraform provider documentation (the contributor role will have all the necessary privileges):
- The subscriptionId field can be gathered from the subscription blade.
- The appId (also known as the client ID), displayName, and tenant (also known as the directory ID) fields can be gathered from the app registration blade (step 1 in the Terraform provider documentation).
- The password field can be gathered (only at the time of creation) from the Certificates & secrets: client secrets page of your app registration (step 2 in the Terraform provider documentation).
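Alternatively, if you prefer the CLI, a minimal sketch like the following creates a Contributor service principal and writes a credentials file in the expected format. The display name and output path are only examples; verify the flags against your Azure CLI version and the Terraform provider documentation.

# Create a Contributor service principal and reshape its output into the sp_creds format (example only)
SUB_ID=$(az account show --query id -o tsv)
az ad sp create-for-rbac --display-name azure-sp-terraform --role Contributor \
  --scopes "/subscriptions/${SUB_ID}" -o json \
  | jq --arg sub "$SUB_ID" '{subscriptionId: $sub, appId: .appId, displayName: .displayName, password: .password, tenant: .tenant}' \
  > ~/.azure/azure-sp-terraform.json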
The ic_creds variable points to the local file path of an Instaclustr API key file with the following format:
{
"username": "johndoe@domain.com",
"api_key": "aaaaaaaaaabbbbbbbbbbccccccccccdd"
}
Both of these values can be gathered from the Gear Icon > Account Settings > API Keys section of the Instaclustr console (detailed instructions can be found here).
Azure settings
The Azure settings section contains two variables:
# Azure Settings
azr_region = "eastus"
creator_tag = "johndoe"
The azr_region variable specifies which region to deploy the non-Instaclustr resources into; it can be changed if desired. The creator_tag variable is used as an Azure tag to help identify and organize resources; it should be updated.
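If you’re unsure of the exact region string that Azure expects (for example, eastus), listing the available locations can help; this assumes the Azure CLI is already logged in:

az account list-locations --output table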
Instaclustr settings
The Instaclustr settings section contains three variables:
# Instaclustr Settings
ic_region = "EAST_US"
ic_provider_account = "riyoa-name" # Only needed for RIYOA, set to "" for Instaclustr's cloud
ic_resource_group = "instaclustreastus"
- ic_region: Should be the same region as specified in azr_region; however, the formatting is slightly different (see the Instaclustr-specific Azure terminology here).
- ic_provider_account: Must be changed to either your provider account name (if you’re running in your own account, known as RIYOA) or an empty string ("") if running in the Instaclustr account.
- ic_resource_group: Must be changed to either the name of the resource group to deploy into (for RIYOA environments) or an empty string ("") if running in the Instaclustr account.
VNet settings
The VNet settings section contains the network and subnet addresses for all infrastructure:
# VNet Settings
om_vnet_cidr = "10.20.0.0/22"
om_vnet_dns_ip = "10.20.3.254" # must be w/in om_vnet_cidr
om_anf_cidr = "10.20.2.0/24" # must be w/in om_vnet_cidr
om_aks_nodepool_cidr = "10.20.0.0/23" # must be w/in om_vnet_cidr
om_aks_services_cidr = "172.16.0.0/16" # must not be w/in om_vnet_cidr
om_aks_services_dns_ip = "172.16.0.10" # must be w/in om_aks_services_cidr
om_aks_pods_cidr = "172.18.0.0/16" # must not be w/in om_vnet_cidr
postgresql_network = "10.0.0.0/16" # must not be w/in om_vnet_cidr
postgresql_storage_net = "10.1.0.0/22" # must not be w/in om_vnet_cidr nor postgresql_network
opensearch_network = "10.2.0.0/16" # must not be w/in om_vnet_cidr nor postgresql_network
These addresses can be left as is or configured for your environment. Be sure to take note of the comments, because certain addresses must or must not be within other address spaces.
AKS cluster settings
The AKS cluster section has four variables:
# AKS Cluster Settings
aks_kubernetes_version = "1.30.7"
aks_trident_version = "24.10.0"
aks_node_count = 2
aks_image_size = "Standard_D4s_v3"
At the time of publishing, these values can be left as is. However, refer to the AKS Kubernetes release calendar to make sure that aks_kubernetes_version is still under support. Depending on the Kubernetes version deployed, Trident may need to be updated, but at the time of publishing, 24.10.0 is current.
The other two variables can be left as the default, or they can be updated according to use-case requirements.
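To confirm which Kubernetes versions AKS currently offers in your region, a quick check like the following can help (eastus is used here as an example):

az aks get-versions --location eastus --output table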
Azure NetApp Files settings
The Azure NetApp Files section has two variables:
# ANF Settings
anf_service_level = "Standard"
anf_pool_size = 2
These variables specify the service level and capacity pool size. They can be left as is or customized for your environment.
PostgreSQL settings
The PostgreSQL section specifies the Instaclustr PostgreSQL cluster settings:
# PostgreSQL Settings
postgresql_sla_tier = "NON_PRODUCTION"
postgresql_version = "16.6.0"
postgresql_replication = "ASYNCHRONOUS"
postgresql_node_size = "PGS-PRD-Standard_E16s_v4-ANF-2048"
postgresql_node_count = 2
These settings can all be left at their existing values. However, for production workloads, NetApp recommends modifying the SLA tier and replication to PRODUCTION and SYNCHRONOUS, respectively. Also note that the selected PostgreSQL node size uses Azure NetApp Files as its storage provider.
OpenSearch settings
The OpenSearch section specifies the Instaclustr OpenSearch cluster settings:
# OpenSearch Settings
opensearch_sla_tier = "NON_PRODUCTION"
opensearch_version = "2.18.0"
opensearch_data_node_count = 3
opensearch_data_node_size = "SRH-PRD-D2s_v5-120-an"
opensearch_dashboard_node_size = "SRH-PRD-D2s_v5-120-an"
opensearch_manager_node_size = "SRH-DM-PRD-D2as_v4-16-an"
These settings can all be left at their existing values. However, for production workloads, NetApp recommends modifying the SLA tier to PRODUCTION.
Authorized networks
The authorized networks section specifies a list of maps, each describing a network that is granted access to the deployed resources:
# Authorized Networks
authorized_networks = [
{
cidr_block = "198.51.100.0/24"
display_name = "company_range"
},
{
cidr_block = "203.0.113.30/32"
display_name = "home_address"
},
]
These values must be updated, at a minimum, to your current workstation’s IP address. (Run curl -4 ifconfig.me to get your IP address.) If you only need to specify a single address or subnet, remove the second map. Or, optionally, you can add any number of maps to specify additional networks, depending on your use case.
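For example, to print your workstation’s public address in the /32 CIDR form that the cidr_block field expects (assuming curl is available):

echo "$(curl -4 -s ifconfig.me)/32"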
Infrastructure deployment
With the variable files properly configured, it’s time to deploy the infrastructure that OpenMetadata runs on:
terraform apply -var-file="$(terraform workspace show).tfvars"
Review the prompt that Terraform displays and enter yes if the plan is acceptable. It should take 15 to 20 minutes for everything to deploy, at which point you’ll see the following output:
Apply complete! Resources: 25 added, 0 changed, 0 destroyed.
Outputs:
az_kubeconfig_cmd = "az aks get-credentials --resource-group openmetadata-default-rg --name openmetadata-default-cluster"
kubeconfig = <sensitive>
openmetadata_airflow_password = <sensitive>
opensearch_openmetadata_password = <sensitive>
opensearch_private_address = "private-search.6c7736c38687463da6f3a0eac1c9fa44.cnodes.io"
postgresql_airflow_password = <sensitive>
postgresql_icpostgresql_password = <sensitive>
postgresql_openmetadata_password = <sensitive>
postgresql_private_address = "10.0.0.5"
These output variables will be used programmatically when the application is deployed, which is up next.
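If you want to inspect any of these values directly, terraform output can print them; sensitive values require the -raw flag, for example:

terraform output                                          # list all outputs (sensitive values are masked)
terraform output -raw postgresql_openmetadata_password    # print a single sensitive value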
Application deployment
Now that the infrastructure has been created and configured, it’s time to deploy OpenMetadata. First, add the official OpenMetadata helm repository:
helm repo add open-metadata https://helm.open-metadata.org/
helm repo update
Due to the storage_class_setup.sh script, which was invoked as part of the Terraform apply, your kubectl configuration context will already be pointed at the deployed AKS cluster:
$ kubectl get storageclasses
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
azure-netapp-files-standard (default) csi.trident.netapp.io Delete Immediate true 14m
azurefile file.csi.azure.com Delete Immediate true 18m
azurefile-csi file.csi.azure.com Delete Immediate true 18m
azurefile-csi-premium file.csi.azure.com Delete Immediate true 18m
azurefile-premium file.csi.azure.com Delete Immediate true 18m
default disk.csi.azure.com Delete WaitForFirstConsumer true 18m
managed disk.csi.azure.com Delete WaitForFirstConsumer true 18m
managed-csi disk.csi.azure.com Delete WaitForFirstConsumer true 18m
managed-csi-premium disk.csi.azure.com Delete WaitForFirstConsumer true 18m
managed-premium disk.csi.azure.com Delete WaitForFirstConsumer true 18m
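If you’re working from a different shell or workstation where the context isn’t set, the az_kubeconfig_cmd output shown earlier provides the command to fetch credentials; for example:

# Run the AKS credentials command captured in the Terraform outputs
eval "$(terraform output -raw az_kubeconfig_cmd)"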
Run the following commands to create the Kubernetes namespace, secrets, and persistent volumes (note that the secrets are created by referencing the Terraform output variables shown earlier):
kubectl create namespace openmetadata
kubectl create secret -n openmetadata generic sql-secrets \
--from-literal=airflow-sql-password=$(terraform output -raw postgresql_airflow_password) \
--from-literal=openmetadata-sql-password=$(terraform output -raw postgresql_openmetadata_password)
kubectl create secret -n openmetadata generic elasticsearch-secrets \
--from-literal=openmetadata-elasticsearch-password=$(terraform output -raw opensearch_openmetadata_password)
kubectl create secret -n openmetadata generic airflow-secrets \
--from-literal=openmetadata-airflow-password=$(terraform output -raw openmetadata_airflow_password)
kubectl apply -f logs_dags_pvc.yaml
OpenMetadata has a few dependencies, primarily Apache Airflow, that must be installed next. As mentioned earlier, dependencies-values.yaml contains the nonsensitive helm values (sensitive values are set through the CLI); optionally modify it for your environment. For more information on all possible helm value options, see the official repository.
helm install -n openmetadata openmetadata-dependencies open-metadata/openmetadata-dependencies \
--set airflow.externalDatabase.host=$(terraform output -raw postgresql_private_address) \
--values dependencies-values.yaml
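While the dependencies come up, you can optionally watch the rollout (press Ctrl-C to stop watching):

kubectl -n openmetadata get pods -w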
After 5 to 10 minutes, the persistent volumes should be bound, and all pods should be in a ready state:
$ kubectl -n openmetadata get pods,pvc
NAME READY STATUS RESTARTS AGE
pod/openmetadata-dependencies-db-migrations-797fd8b77d-cppfx 1/1 Running 0 11m
pod/openmetadata-dependencies-pgbouncer-68d7bbc96f-2mm65 1/1 Running 0 11m
pod/openmetadata-dependencies-scheduler-88c55846-9l7pc 1/1 Running 0 11m
pod/openmetadata-dependencies-sync-users-6484f6896d-2mj57 1/1 Running 0 11m
pod/openmetadata-dependencies-triggerer-75f8584fc9-vvrfx 1/1 Running 0 11m
pod/openmetadata-dependencies-web-6bd7b5698b-sqrwr 1/1 Running 0 11m
pod/opensearch-0 1/1 Running 0 11m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
persistentvolumeclaim/openmetadata-dependencies-dags-pvc Bound pvc-0508e8e4-5633-46cb-b36e-2bbe1eac3315 50Gi RWX azure-netapp-files-standard <unset> 12m
persistentvolumeclaim/openmetadata-dependencies-logs-pvc Bound pvc-bd367907-c36d-40ec-a9c7-a1f251fed1db 50Gi RWX azure-netapp-files-standard <unset> 12m
persistentvolumeclaim/opensearch-opensearch-0 Bound pvc-77d50368-650f-4771-afa7-c343062f4769 50Gi RWO azure-netapp-files-standard <unset> 11m
Finally, the openmetadata-values.yaml file specifies the OpenMetadata helm values. Optionally modify any values based on the official repository and deploy OpenMetadata:
helm install -n openmetadata openmetadata open-metadata/openmetadata \
--set openmetadata.config.database.host=$(terraform output -raw postgresql_private_address) \
--set openmetadata.config.elasticsearch.host=$(terraform output -raw opensearch_private_address) \
--values openmetadata-values.yaml
When the openmetadata pod is in a ready state, deployment of OpenMetadata is complete:
$ kubectl -n openmetadata get pod -l app.kubernetes.io/name=openmetadata
NAME READY STATUS RESTARTS AGE
openmetadata-c4d4f88bc-9l7z8 1/1 Running 0 2m2s
Using OpenMetadata
Now that OpenMetadata has been deployed, you can access it through your favorite browser at the following URL:
echo http://$(kubectl -n openmetadata get service/openmetadata -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):8585
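If the load balancer IP hasn’t been assigned yet, the command prints an empty host; you can watch the service until the external address appears (press Ctrl-C to stop watching):

kubectl -n openmetadata get service openmetadata -w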
Note: For production use cases, you should enable SSL through an AKS Ingress Controller, external Nginx, or directly within the OpenMetadata server, according to your usual business practices.
Log in with the default credentials of admin@open-metadata.org / admin to access the OpenMetadata homepage:
Adding a service
Adding services and ingestions to OpenMetadata enables the extraction of service-specific metadata into the central metadata catalog. This connectivity ensures that the metadata catalog remains up to date and reflects the current state of data assets. There are many supported services (like NetApp® ONTAP®), but as an example we’ll add the Instaclustr PostgreSQL database we just deployed.
In the OpenMetadata UI, click Settings in the lower left corner.
Select the Services tile.
Select the Databases tile.
Click the Add New Service button.
Scroll down, or search for Postgres, select the Postgres Service, and click Next.
Give the service a descriptive name, such as instaclustr-openmetadata-default-postgresql, optionally provide a description, and click Next.
Specify icpostgresql as the Username and leave Basic Auth selected in the dropdown. Copy the output from the following command and paste it into the Password field.
terraform output -raw postgresql_icpostgresql_password
For the Host and Port field, copy and paste the output from this command (the virtual networks of the AKS cluster and Instaclustr resources are peered, enabling internal IP addresses):
terraform output -raw postgresql_private_address && echo -n :5432
Enter postgres as the Database, select Ingest All Databases, and scroll down.
Expand the Postgres Connection Advanced Config section, select require in the SSL Mode dropdown, collapse the Postgres Connection Advanced Config section, and scroll down.
Click Test Connection.
After 10 to 15 seconds, verify that each check and query returns Success. Click OK.
Click Save to finish adding the PostgreSQL service.
Adding an ingestion
Now that OpenMetadata has the credentials to communicate with PostgreSQL, it’s time to create an ingestion, which uses Airflow to perform the actual metadata collection. Click Add Ingestion.
Optionally specify a database for inclusion or exclusion. In this example, the instaclustr database is specified under Excludes, which results in all other databases being included in metadata extraction.
Scroll down, optionally modifying any default fields based on your use case, and click Next.
Metadata ingestion has the option to run on a regular schedule or on demand. A schedule is preferable for most production use cases. To have the ingestion happen quickly for demo purposes, Every Hour at 55 minutes past the hour is selected. Click Add and Deploy.
Click View Service to be redirected to the PostgreSQL service page.
Head back into your terminal to view the Airflow pod that was created for the ingestion.
$ kubectl -n openmetadata get pods -l task_id=ingestion_task
NAME                                                           READY   STATUS    RESTARTS   AGE
09e49cb9-4db8-4a92-bc54-0be48772b6c3-ingestion-task-hbqcoo5c   1/1     Running   0          3m48s
After a few minutes the ingestion is complete:
Navigate to the Databases tab to view the databases that were discovered during ingestion.
Search for one of the databases in the search bar at the top of the page and view the results in the dropdown. This search is powered by the Instaclustr OpenSearch cluster.
Cleanup
If you deployed OpenMetadata only for testing or demo purposes, you can destroy OpenMetadata and its infrastructure with the following commands:
helm -n openmetadata uninstall openmetadata
helm -n openmetadata uninstall openmetadata-dependencies
kubectl delete namespace openmetadata
terraform destroy -var-file="$(terraform workspace show).tfvars" --auto-approve
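To confirm that Terraform no longer tracks any resources once the destroy completes, you can list the (now empty) state:

terraform state list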
Summary
Deploying OpenMetadata on Azure Kubernetes Service (AKS) with Azure NetApp Files offers a robust solution for metadata management and data discovery, essential for AI/ML initiatives. By following this guide, you can establish a scalable and efficient metadata platform, enhancing collaboration and governance.
With OpenMetadata, your data teams gain a centralized view of data assets, improving their ability to identify and use data effectively. Leveraging Azure NetApp Files delivers high performance and reliability, and using external PostgreSQL and OpenSearch databases managed by NetApp Instaclustr optimizes database management.
This deployment helps NetApp customers overcome data silos, improve data quality, and fully leverage their data for AI/ML purposes. We hope that this article has been valuable in setting up your OpenMetadata instance, empowering your organization to make data-driven decisions and accelerate your AI/ML journey.