Working with Data in Azure Cosmos DB for PostgreSQL using Open PSQL Shell.
Published May 30 2024 12:00 AM 7,734 Views
Iron Contributor

kevin_comba_0-1717091614836.png

 

 

Previously we introduced Getting started with Azure Cosmos Database (A Deep Dive) blog which is an end-to-end introduction of Azure Cosmos DB. In this blog we are going to talk about one of the Azure Cosmos Database API known as Azure Cosmos Database for PostgreSQL.

 

Imagine a scenario where you are developing a multi-tenant application, as a developer, you have been working with PostgreSQL for years, you have joined a team that is developing an application that needs to handle data for thousands of tenants, each with isolated data. The team decides to use Azure Cosmos DB for PostgreSQL. 

 

What is covered in this Blog 

 

  • Challenges with traditional PostgreSQL 
  • What is Azure Cosmos DB for PostgreSQL 
  • Citus for PostgreSQL. 
  • Provisioning Azure Cosmos DB for PostgreSQL via azure Portal. 
  • Open PSQL Shell on Azure (new feature on preview) 
  • How to create & query distribute tables in Azure Cosmos DB for PostgreSQL   

In this blog, we shall leverage the azure cosmos DB for PostgreSQL which extends PostgreSQL with more enhanced capabilities. It will be a step-by-step guide on how to leverage this service which we are being provided and managed on Azure. 

 

Challenges with traditional PostgreSQL 

 

  • Scaling Horizontally: As the user base grows, you need to scale your database horizontally to accommodate more tenants. Traditional PostgreSQL doesn’t natively support easy horizontal scaling across multiple nodes. 
  • Global Distribution: You need a database solution that can seamlessly replicate data across different regions for low-latency access, which is normally not the case with PostgreSQL. 
  • High Availability and Failover: Traditional PostgreSQL requires manual setup for replication and failover. 

 

So, what is Azure Cosmos DB for PostgreSQL? 

 

Azure Cosmos DB for PostgreSQL is a managed service for PostgreSQL extended with Citus superpower of distributed tables. Citus enables you to scale your application as requirements grow. This can be done by distributing them into different nodes of distributed tables. 

Azure cosmosDB for postgreSQL empowers you to build a scalable and globally distributed application while leveraging your existing postgreSQL 

 

kevin_comba_2-1716534239227.png

 

 

Features of Azure Cosmos DB for PostgreSQL. 

 

  • Automatic High availability - Azure Cosmos DB for PostgreSQL ensures that your database remains available even in the face of failures. It automatically replicates data across multiple nodes and regions, minimizing downtime. 
  • Backups - Azure Cosmos DB for PostgreSQL automatically backs up your data, allowing you to restore it in case of accidental deletions, corruption, or other issues. 
  • Read-Replicas - Read-replicas allows you to offload read-heavy workloads from the primary database. This improves read performance and ensures that your application can handle high read traffic. 
  • Easy Monitoring - Monitoring your database’s health and performance is essential. Azure Cosmos DB for PostgreSQL provides built-in monitoring tools. 
  • Private Endpoint - By using private endpoints, you can restrict access to authorized networks, preventing exposure to the public internet. 
  • Encryption - Azure Cosmos DB for PostgreSQL encrypts data both at rest and in transit. 

 

Citus for PostgreSQL. 

 

Citus for PostgreSQL is a powerful extension that transforms Postgres into a distributed database, allowing you to achieve high performance at any scale. Each tenant’s data can reside on a separate shard, enabling parallel processing and efficient queries. As new tenants join, you can dynamically add shards without downtime. 

We will have a look for more about Citus later in the blog. To learn more about Citus, visit the following link: citus documentation 

 

Create an Azure Cosmos DB for PostgreSQL cluster in Azure portal 

 

Prerequisites 

 

 

Part A: Basics 

 

Step 1: Create a Cluster. 

 

 

kevin_comba_3-1716534239231.png

 

 

Step 2: Click on create. 

 

kevin_comba_4-1716534239233.png

 

 

Step 3: select Create on the PostgreSQL tile 

 

kevin_comba_5-1716534239236.jpeg

 

 

Step 4:  Fill in the form to provision our cluster 

 

 

kevin_comba_6-1716534239243.png

 

 

 

 Options you have: 

  • Subscription - choose the subscription to use. 
  • Resource Group - create resource group for your resource 
  • Cluster name - will determine the DNS name of your application. 
  • Location – Choose location near you. 
  • PostgreSQL version - choose major option like 15 or the latest you have. 
  • Database name - you can leave it as citus or change to your preferred name. 
  • Admin username - must be citus, do not change it. 
  • Password – choose a password for your access 
  • Scale – this option allows you to specify the amount of computing you need for your resource, feel free to click configure to adjust. 

 

kevin_comba_7-1716534239248.png

 

  

Part B: Networking 

 

Step 5: Select Next to proceed to networking 

 

  • Select Allow public access from Azure services and resources within Azure to this cluster. 

For our demo purposes, adding a firewall rule using Add 0.0.0.0-255.255.255.255, It will make your IP address access any host on the internet and it poses a threat. It is advised to use the rule for temporary purposes like testing a feature and later remove the rule. 

 

 

kevin_comba_8-1716534239251.jpeg

 

 

  • Select Review+ Create to do a validation then select Create to create the cluster. It will take a few minutes to deploy your resource, after it is done, click Go to resource. 

 

kevin_comba_9-1716534239253.jpeg

 

 

Congratulations, you have Created an Azure Cosmos DB for PostgreSQL cluster.     

Connect to a cluster via Open PSQL Shell on Azure 

 

Open PSQL shell (a feature on preview) is a terminal based front-end to PostgreSQL that enables you to type in queries interactively, issue them to PostgreSQL, and see the query results. With this feature, you don't need to copy your connection strings to connect. To be authenticated, you will be required to enter your password. 

In the image below, click on quick start to open the open PSQL shell window, click on Open PSQL shell icon indicated with number (2) at the top. You will open a terminal which will require you to enter the password you provided earlier. 

To check the available databases, write: 

 

 

kevin_comba_10-1716534239257.jpeg

 

 

Create and distribute tables in Azure Cosmos DB for PostgreSQL 

 

 

As earlier stated, I promised to explain more about citus, let us get a better understanding of citus for PostgreSQL. 

Citus for PostgreSQL is an extension that transforms PostgreSQL into a distributed database system. It is designed to horizontally scale out PostgreSQL across multiple servers, enabling it to handle larger datasets and higher query loads than a single PostgreSQL instance could manage alone.  

Citus achieves this by partitioning tables and distributing data across a cluster of PostgreSQL servers. It also provides parallel query execution across these servers, allowing queries to be processed more quickly. 

Citus is particularly useful for applications that require high scalability, such as large-scale analytics, real-time data processing, and multi-tenant applications. 

We shall see how citus will enable us to create distributed tables in PostgreSQL, but first let us create tables for our citus database to see its importance. 

 

Create Tables 

 

  • Run the following queries to create github_users and github_events tables:  

 

CREATE TABLE github_users ( user_id bigint, url text, login text, avatar_url text, gravatar_id text, display_login text );     

CREATE TABLE github_events(event_id bigint,event_type text,event_public boolean,repo_id bigint,payload jsonb,repo jsonb,user_id bigint,org jsonb,created_at timestamp); 

 

 

  • Create indexes  event_type_index and payload_index on github_users and github_events respectively. 

The indexes created are used to allow fast querying in the JSONB. 

 

CREATE INDEX event_type_index ON github_events (event_type); 

CREATE INDEX payload_index ON github_events USING GIN (payload jsonb_path_ops); 

 

 

You should see the following on executing the commands. 

kevin_comba_11-1716534239260.jpeg

 

Distributed tables 

 

create_distributed_table() is the magic function that Azure Cosmos DB for PostgreSQL provides to distribute tables and use resources across multiple machines. The function decomposes tables into shards, which can be spread across nodes for increased storage and compute performance. 

  • Run the following commands to create distributed tables. 

 

SELECT create_distributed_table('github_users', 'user_id'); 

SELECT create_distributed_table('github_events', 'user_id'); 

 

Note that if you do not distribute tables, the worker nodes cannot help running your queries. 

  • Load Data into Distributed Tables 

To load a large amount of data, we are using data on azure blob storage. To achieve this, we first must create the extension in our database: 

 

SELECT * FROM create_extension('azure_storage'); 

 

 

  • Copy data from Blob Storage: Run the following commands to fetch data from CSV files and load into database tables: 

 -- download users and store in table  

 

COPY github_users FROM 'https://pgquickstart.blob.core.windows.net/github/users.csv.gz';  

 

 -- download events and store in table  

 

COPY github_events FROM 'https://pgquickstart.blob.core.windows.net/github/events.csv.gz'; 

 

You should see 264308 and 126245 has been successfully loaded within a short period of time. 

  • Viewing Distributed Tables 

We can review details of our distributed tables, including their sizes. Run the following commands: 

 

SELECT * FROM citus_tables; 

 

 

kevin_comba_12-1716534239263.jpeg

Run queries in Azure Cosmos DB for PostgreSQL 

 

Like any other database, we make use of it by querying to get information. Let us run some queries for our citus database which is powered with distributed tables. 

  • To check count of GitHub users: 
    • count all rows (across shards) 

 

SELECT count(*) FROM github_users; 

 

kevin_comba_13-1716534239265.jpeg

 

 

Azure Cosmos DB for PostgreSQL automatically runs the count on all shards in parallel, and combines the results because data is divided between multiple nodes. 

 

  • To find all events for a single user, run the following command. 

 

SELECT created_at, event_type, repo->>'name' AS repo_name   FROM github_events WHERE user_id = 3861633; 

 

 

kevin_comba_14-1716534239265.png

 

The results came in within a few seconds because of the power of the distributed tables in Azure Cosmos DB for PostgreSQL. 

I do hope you have learnt from the blog and now have an idea of the power of Azure Cosmos DB for PostgreSQL. It is now time to practice and develop applications that make use AI (Artificial Intelligence) capabilities for fast processing of data.  

 

Read more 

 

 

Available SDKs 

 

 

Version history
Last update:
‎May 30 2024 11:01 AM
Updated by: