A common question in my discussions with architects and data scientists over the past few weeks has been “When will you have LangChain integration for Azure Database for PostgreSQL?” Thankfully, we already have that support today. Any of the examples published by LangChain for OSS PostgreSQL should work with our service as well. At the end of the day, Azure Database for PostgreSQL is running OSS PostgreSQL.
This post is the first in a three-part blog series demonstrating how to use LangChain, Azure OpenAI, and Azure Database for PostgreSQL to chunk, embed, and query documents from Azure Blob Storage.
The blog series will use a text data set generated from publicly available Azure DB for PostgreSQL documentation to walk you through your first text embedding and culminate in a simple customer support chatbot that can answer questions about Azure Database for PostgreSQL. Throughout the series, we’ll publish code samples to make your journey into the integration of LangChain and Azure DB for PostgreSQL as easy as possible.
Recommendations: install the following Python packages before running the samples.
!pip install langchain
!pip install langchain-openai
!pip install langchain-community
!pip install psycopg2
!pip install pgvector
!pip install tiktoken
!pip install azure-storage-blob
!pip install unstructured
import os
import psycopg2
from pgvector.psycopg2 import register_vector
from typing import List, Tuple
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import UnstructuredXMLLoader
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.pgvector import PGVector
from langchain.vectorstores.pgvector import DistanceStrategy
username = '<username>'
password = '<password>'
host = '<database hostname>.postgres.database.azure.com'
port = '5432'
dbname = '<dbname>'
connection_string = f'postgresql+psycopg2://{username}:{password}@{host}:{port}/{dbname}'
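One caveat worth noting: if your password contains characters that are special in URLs (such as @, :, or /), the SQLAlchemy-style connection string above won't parse correctly. A minimal sketch of one way to handle that, using Python's standard urllib.parse.quote_plus:
from urllib.parse import quote_plus
# URL-encode the password so special characters don't break the connection string
connection_string = f'postgresql+psycopg2://{username}:{quote_plus(password)}@{host}:{port}/{dbname}'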
If you’re unsure about your Azure Database for PostgreSQL connection details, you can find them in the Azure portal: open your Azure Database for PostgreSQL resource and select “Connect”. Choose the database you want to work with, and the page generates the connection details for you to copy and paste.
dbconn = psycopg2.connect(host=host, user=username, password=password,
    port=port, database=dbname, connect_timeout=10)
dbconn.set_session(autocommit=True)
cur = dbconn.cursor()
# list the extensions installed on the server
cur.execute("SELECT oid, extname, extversion FROM pg_extension;")
print(cur.fetchall())
The query returns the extensions currently installed on the server, along with their versions; check whether the vector extension is already present.
In previous versions of LangChain’s PGVector vector store, you had to manually create and register the pgvector extension. The latest version performs that task for you. If you need to manually create and register the extension for your version of LangChain, you can use the following example:
dbconn = psycopg2.connect(host=host, user=username, password=password,
    port=port, database=dbname, connect_timeout=10)
dbconn.set_session(autocommit=True)
cur = dbconn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
register_vector(dbconn)
From the Azure OpenAI resource in the Azure portal, click “Click here to view endpoints”.
You can also select Keys and Endpoint from the Resource Management section of the Azure portal blade for your Azure OpenAI resource.
Copy your endpoint URL and one of the available access keys (KEY 1 or KEY 2).
import os
os.environ["AZURE_OPENAI_API_KEY"] = "..."
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<your-endpoint>.openai.azure.com/"
from langchain_openai import AzureOpenAIEmbeddings
embeddings = AzureOpenAIEmbeddings(
azure_deployment="<your-embeddings-deployment-name>",
openai_api_version="2023-05-15",
)
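As a quick, optional sanity check that your deployment name and endpoint are correct, you can embed a short string and inspect the length of the returned vector (for example, 1536 dimensions for a text-embedding-ada-002 deployment):
# optional: verify the embeddings deployment responds before processing documents
sample_vector = embeddings.embed_query("test sentence")
print(len(sample_vector))  # e.g. 1536 for a text-embedding-ada-002 deployment
Next, set up logging so you can see what LangChain and the PGVector store are doing under the hood: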
import logging
logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s] %(message)s")
rootLogger = logging.getLogger()
logPath = "./logs"
fileName = "langchain"
os.makedirs(logPath, exist_ok=True)  # ensure the log directory exists before attaching the file handler
fileHandler = logging.FileHandler("{0}/{1}.log".format(logPath, fileName))
fileHandler.setFormatter(logFormatter)
rootLogger.addHandler(fileHandler)
consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
rootLogger.addHandler(consoleHandler)
rootLogger.setLevel(logging.DEBUG)
Now use LangChain to split a text document into chunks and then use pgvector to insert the embeddings into PostgreSQL.
Here, we're loading a single text file, a troubleshooting guide for high CPU utilization, from Azure Blob Storage.
The blob_name parameter is the file name of the blob to ingest.
from langchain_community.document_loaders import AzureBlobStorageFileLoader
loader = AzureBlobStorageFileLoader(
    conn_str="DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;EndpointSuffix=core.windows.net",
    container="<container name>",
    blob_name="how-to-high-troubleshoot-cpu-utilization.txt"
)
# use the text loader and splitter to break apart the document into chunks
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
collection_name = "high_cpu"
db = PGVector(
    collection_name=collection_name,
    connection_string=connection_string,
    embedding_function=embeddings,
)
If you have multiple documents, collections, or document types, you can create a new vector store object from the previous ones, organized by collection name, and change the distance strategy. LangChain's PGVector supports the EUCLIDEAN, COSINE, and MAX_INNER_PRODUCT distance strategies.
COSINE is the current default.
db = db.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name=collection_name,
    distance_strategy=DistanceStrategy.COSINE,
    connection_string=connection_string,
    logger=rootLogger
)
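If you want to confirm the chunks landed in the database, the PGVector store writes to the langchain_pg_collection and langchain_pg_embedding tables (table names assume LangChain's default schema). A quick count using the cursor we opened earlier:
cur.execute("SELECT count(*) FROM langchain_pg_embedding;")
print(cur.fetchone())  # number of embedded chunks stored across all collections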
Query 1: Where should we start to check the CPU utilization for our Azure Database for PostgreSQL instance?
# utilize a simple similarity search
query = "What is a good starting point to check the CPU utilization"
docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)
for doc, score in docs_with_score:
print("-" * 80)
print("Score: ", score)
print(doc.page_content)
print("-" * 80)
Query 2: What query could I run to check high CPU utilization?
# utilize a simple similarity search
query = "What query can I run to check high CPU utilization"
docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)
for doc, score in docs_with_score:
print("-" * 80)
print("Score: ", score)
print(doc.page_content)
print("-" * 80)
Score: 0.1695385868182192
```postgresql
SELECT pid, usename, datname, query, now() - xact_start as duration
FROM pg_stat_activity
WHERE pid <> pg_backend_pid() and state IN ('idle in transaction', 'active')
ORDER BY duration DESC;
```
### Total number of connections and number of connections by state
A large number of connections to the database is also another issue that might lead to increased CPU and memory utilization.
The following query gives information about the number of connections by state:
```postgresql
SELECT state, count(*)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY 1
ORDER BY 1;
```
## Resolve high CPU utilization
Use Explain Analyze, PG Bouncer, connection pooling and terminate long running transactions to resolve high CPU utilization.
...
Consider these tools to identify high CPU utilization.
Throughout this blog, we’ve used LangChain to split documents into smaller chunks, generated embeddings for each chunk with Azure OpenAI, and stored them in PostgreSQL as our vector datastore via the pgvector extension in Azure Database for PostgreSQL. Finally, we performed a couple of vector similarity searches over the embedded document using natural language queries.
The Python notebook and other Azure AI samples are available here: Azure-Samples/azure-postgres-ai: Azure Database for PostgreSQL - AI Samples (github.com)