LangChain integration with Azure Database for PostgreSQL (Part 1)
Published Apr 02 2024 02:10 PM 1,207 Views
Microsoft

A common question in my discussions with architects and data scientists over the past few weeks has been “When will you have LangChain integration for Azure Database for PostgreSQL?” Thankfully, we already have that support today. Any of the examples published by LangChain for OSS PostgreSQL should work with our service as well. At the end of the day, Azure Database for PostgreSQL is running OSS PostgreSQL.

 

This post will be the first in a three-part blog series that will demonstrate how to use LangChain, Azure OpenAI, and Azure Database for PostgreSQL to chunk, embed, and query documents from Azure Blob Storage.

  • Part 1: How to use LangChain to split documents into smaller chunks, generate embeddings for each chunk using Azure OpenAI, and store them in a PostgreSQL database via the pgvector extension. Then, we’ll perform a vector similarity search on the embedded documents.
  • Part 2: How to use LangChain to create a simple chat interface that allows users to ask natural language questions and get relevant answers from the documents embedded, using vector similarity search and language models.
  • Part 3: How to use LangChain to add chat history to the Postgres database and use it to provide more context and personalization for the language models and the chat responses.

 

The blog series will use a text data set generated from publicly available Azure DB for PostgreSQL documentation to walk you through your first text embedding and culminate in a simple customer support chatbot that can answer questions about Azure Database for PostgreSQL. Throughout the series, we’ll publish code samples to make your journey into the integration of LangChain and Azure DB for PostgreSQL as easy as possible.

Setup

Azure Resources

  • Azure Database for PostgreSQL Flexible Server instance running PG 14 or higher.
    • PG 13 is scheduled for EOL by the PostgreSQL Community in November 2025.
    • I’ve used PG16 in this series.
  • Azure OpenAI enabled subscription.
  • Azure OpenAI endpoint.

Local Software

 

Recommendations:

  • Install Python for all users to avoid Windows PATH issues.
    1. If you don’t do this, you’ll likely see an error when attempting to install the LangChain elements within your Python notebook.
  • Use a virtual environment for your Python work in VS Code.
    1. CTRL + SHIFT + P

JoshMSFT_0-1712085859096.png

 

Let’s get started.

Use Python’s package manager to install the necessary packages for this example.

 

 

!pip install langchain
!pip install langchain-openai
!pip install psycopg2
!pip install tiktoken
!pip install azure-storage-blob
!pip install unstructured

 

 

 

Import the LangChain modules and types and create the database connection string to your Azure Database for PostgreSQL instance.

 

import os
import psycopg2

from pgvector.psycopg2 import register_vector
from typing import List, Tuple
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import UnstructuredXMLLoader
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.pgvector import PGVector
from langchain.vectorstores.pgvector import DistanceStrategy

username = '<username>'
password = '<password>'
host = '<database hostname>.postgres.database.azure.com'
port = '5432'
dbname = '<dbname>'

connection_string = f'postgresql+psycopg2://{username}:{password}@{host}:{port}/{dbname}'

 

 

 

If you’re unsure about your Azure Database for PostgreSQL connection details, you can find them in the Azure Portal by clicking on the Azure Database for PostgreSQL resource, and then clicking on “Connect”. Select the database you want to work with, and the page will generate your connection details for you to copy/paste.

 

JoshMSFT_1-1712085859102.png

 

Connect to the database and confirm the vector extension is enabled.

 

dbconn = psycopg2.connect(host=host, user=username, password=password,
port=port, database=dbname , connect_timeout=10)
dbconn.set_session(autocommit=True)

cur = dbconn.cursor()
cur.execute("SELECT oid, extname, extversion FROM pg_extension;")

 

 

The output should look like this:

JoshMSFT_2-1712085859102.png

 

In previous versions of the LangChain constructor of PGVector, you had to manually create and register the extension. The latest version performs that task for you. If you happen to need to manually create and register the pgvector extension for your version of LangChain, you would use the following example: 

 

 

dbconn = psycopg2.connect(host=host, user=username, password=password,
port=port, database=dbname , connect_timeout=10)
dbconn.set_session(autocommit=True)

cur = dbconn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")

register_vector(dbconn)

 

 

Setup an embeddings object using the embedding model deployed in your Azure OpenAI environment.

From the Azure OpenAI resource, click “Click here to view endpoints”

JoshMSFT_3-1712085859106.png

 

JoshMSFT_4-1712085859115.png

 

You can also click on Keys and Endpoint, from the Resource Management section, of the Azure Portal blade for your Azure OpenAI resource.

Copy your Endpoint URL and either one of the available access keys (either KEY 1 or KEY 2).

 

JoshMSFT_5-1712085859119.png

 

 

 

import os
os.environ["AZURE_OPENAI_API_KEY"] = "..."
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<your-endpoint>.openai.azure.com/"

from langchain_openai import AzureOpenAIEmbeddings
embeddings = AzureOpenAIEmbeddings(
    azure_deployment="<your-embeddings-deployment-name>",
    openai_api_version="2023-05-15",
)

 

 

Setup a logging handler to deal with the LangChain logs.

 

import logging
logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s]  %(message)s")
rootLogger = logging.getLogger()

logPath = "./logs"
fileName = "langchain"

fileHandler = logging.FileHandler("{0}/{1}.log".format(logPath, fileName))
fileHandler.setFormatter(logFormatter)
rootLogger.addHandler(fileHandler)

consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
rootLogger.addHandler(consoleHandler)

rootLogger.setLevel(logging.DEBUG)

 

 

Use LangChain to split a text document and then use PGvector to insert the embeddings into PostgreSQL.
Here, we're loading one text file for troubleshooting high CPU utilization from Azure Blob Storage.
blob_name = file name of the file to ingest.

 

 

from langchain_community.document_loaders import AzureBlobStorageFileLoader 
loader = AzureBlobStorageFileLoader(conn_str="DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;EndpointSuffix=core.windows.net", container="<container name>", blob_name="how-to-high-troubleshoot-cpu-utilization.txt")
# use the text loader and splitter to break apart the document into chunks
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

collection_name = "high_cpu"

db = PGVector(
    collection_name= collection_name,
    connection_string=connection_string,
    embedding_function=embeddings,

 

 

If you have multiple documents, collections, or document types, you can create a new vector database object from all previous objects, organized by collection name and modify the distance strategy. LangChain supports:

  • EUCLIDEAN
  • COSINE
  • MAX_INNER_PRODUCT

Cosine is the current default.

 

 

 

db = db.from_documents(
    documents= docs,
    embedding = embeddings,
    collection_name= collection_name,
    distance_strategy = DistanceStrategy.COSINE,
    connection_string=connection_string,
    logger=rootLogger
)

 

Let’s perform a couple simple similarity searches of the now embedded guide for high CPU utilization.

Query 1: Where should we start to check the CPU utilization for our Azure Database for PostgreSQL instance?

 

 

#utilize a simple similarity search
query = "What is a good starting point to check the CPU utilization"

docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

 

Query 1 Results

 

JoshMSFT_7-1712086626241.png

 

Query 2: What query could I run to check high CPU utilization?

 

# utilize a simple similarity search
query = "What query can I run to check high CPU utilization"

docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

 

 

Query 2 Results

 

Score:  0.1695385868182192

```postgresql SELECT pid, usename, datname, query, now() - xact_start as duration FROM pg_stat_activity WHERE pid <> pg_backend_pid() and state IN ('idle in transaction', 'active') ORDER BY duration DESC; ```

### Total number of connections and number connections by state

A large number of connections to the database is also another issue that might lead to increased CPU and memory utilization.

The following query gives information about the number of connections by state:

```postgresql SELECT state, count(*) FROM  pg_stat_activity WHERE pid <> pg_backend_pid() GROUP BY 1 ORDER BY 1; ```

## Resolve high CPU utilization

Use Explain Analyze, PG Bouncer, connection pooling and terminate long running transactions to resolve high CPU utilization.
...
Consider these tools to identify high CPU utilization.

 

 

Conclusion

Throughout this blog, we’ve used LangChain to split documents into smaller chunks, generated embeddings for each chunk using Azure OpenAI, and stored them in our PostgreSQL database as our vector datastore via the pgvector extension hosted in our Azure Database for PostgreSQL. Finally, we performed a couple vector similarity searches on the indexed and embedded document with a natural language query.

 

Python notebook and other Azure AI Samples available here: Azure-Samples/azure-postgres-ai: Azure Database for PostgreSQL - AI Samples (github.com)

Common Issues

Windows error message when using “pip install langchain”

  • Python environment variable is unavailable/undefined.
  • Correct this by,
    •  using "!py -m pip install langchain"
    • updating your PATH
    • using a Python virtual environment in VS Code.

References

Co-Authors
Version history
Last update:
‎Apr 02 2024 09:34 PM
Updated by: