Semantic Kernel vs. LangChain
Readers can explore two different approaches to this tutorial, using either Semantic Kernel or LangChain.
For those interested, here's the link to the Semantic Kernel version of this tutorial: Teach ChatGPT to Answer Questions: Using Azure Cognitive Search & Azure OpenAI (Semantic Kernel.ver)
NOTE: How to keep sensitive data private?
To ensure the privacy of sensitive data, Azure Cognitive Search provides a Personally Identifiable Information (PII) detection skill. This cognitive skill is specifically designed to identify and protect PII in your data. To learn more about Azure Cognitive Search's PII detection skill, read the following article.
Personally Identifiable Information (PII) Detection cognitive skill
- To enable this feature, select 'Extract personally identifiable information'.
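For reference, the same skill can also be defined directly in a skillset. Here is a minimal sketch, expressed as a Python dict you could include in a skillset's `skills` array via the REST API; the input source path and masking settings are illustrative assumptions, so adjust them to your own skillset:
```
# A sketch of a PII Detection skill definition. The source path and masking
# settings are assumptions; adjust them to match your own skillset.
pii_detection_skill = {
    '@odata.type': '#Microsoft.Skills.Text.PIIDetectionSkill',
    'description': 'Detect and mask personally identifiable information',
    'context': '/document',
    'defaultLanguageCode': 'en',
    'minimumPrecision': 0.5,      # ignore detections below this confidence
    'maskingMode': 'replace',     # produce maskedText with PII replaced
    'inputs': [
        {'name': 'text', 'source': '/document/content'}
    ],
    'outputs': [
        {'name': 'piiEntities', 'targetName': 'piiEntities'},
        {'name': 'maskedText', 'targetName': 'maskedText'}
    ]
}
```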
You can change the fields to suit your data. I have attached a document with a description of each field in the index. (Depending on your settings for the index fields, the code you implement may differ from the tutorial.)
In this series, we will implement the feature to answer questions based on PDFs using Azure Cognitive Search and Azure OpenAI. In Series 2, we'll implement this feature in code.
Intent of the Code Design
The primary goal of the code design in this tutorial is to construct the code in a way that is easy to understand, especially for first-time viewers. Each segment of the code is encapsulated as a separate function. This modularization ensures that the main function acts as a clear, coherent guide through the logic of the system.
Example: part of the main function (Semantic Kernel version).
async def main():
…
kernel = await create_kernel(sk)
await create_embeddings(kernel)
await create_vector_store(kernel)
await store_documents(kernel, file_content)
…
Overview of the code
Part 1: Retrieving and Scoring Documents
We'll use Azure Cognitive Search to retrieve documents related to our question, score them for relevance, and keep only the documents above a certain score threshold.
Part 2: Document Embedding and Vector Database Storage
We'll embed the documents we extracted in Part 1. We will then store these embedded documents in a vector database, organized into pages (chunks of up to 5,000 characters each; a minimal chunking sketch follows this overview).
Part 3: Extracting Relevant Content and Implementing a Function to Answer Questions
We will extract the most relevant page from the vector database based on the question.
Then we will implement a function to generate answers from the extracted content.
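As mentioned in Part 2, the documents arrive from Cognitive Search already split into pages. The splitting itself is performed by the skillset in the search service, not by our code; purely to illustrate the idea, a fixed-size chunker equivalent to 5,000-character pages could look like this:
```
# Illustrative only: in this tutorial the page splitting is done by the
# skillset in Azure Cognitive Search, not by our code.
def split_into_pages(text, page_length=5000):
    """Split text into pages of at most page_length characters."""
    return [text[i:i + page_length] for i in range(0, len(text), page_length)]
```
The output field mappings below come from the indexer definition: they map the skillset's enriched output into the index, sending the key phrases to the `keyphrases` field and the split pages to the `pages` field.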
"outputFieldMappings": [
{
"sourceFieldName": "/document/content/pages/*/keyphrases/*",
"targetFieldName": "keyphrases"
},
{
"sourceFieldName": "/document/content/pages/*",
"targetFieldName": "pages"
}
]
mkdir azure-proj
cd azure-proj
mkdir gpt-proj1
cd gpt-proj1
python -m venv .venv
.venv\Scripts\activate.bat
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install openai
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install langchain
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install faiss-cpu
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install tiktoken
# Library imports
from collections import OrderedDict
import requests
# LangChain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
# Azure Search Service settings
SEARCH_SERVICE_NAME = 'your-search-service-name' # 'test-search-service1'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = 'your-API-version' # '2023-07-01-preview'
# Azure Search Service Index settings
SEARCH_SERVICE_INDEX_NAME1 = 'your-search-service-index-name' # 'azureblob-index1'
# Azure Cognitive Search Service Semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'your-semantic-configuration-name' # 'test-configuration'
# Azure OpenAI settings
AZURE_OPENAI_NAME = 'your-openai-name' # 'testopenai1004'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = 'your-API-version' # '2023-08-01-preview'
# Configuration imports
from config import (
SEARCH_SERVICE_ENDPOINT,
SEARCH_SERVICE_KEY,
SEARCH_SERVICE_API_VERSION,
SEARCH_SERVICE_INDEX_NAME1,
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
AZURE_OPENAI_ENDPOINT,
AZURE_OPENAI_KEY,
AZURE_OPENAI_API_VERSION,
)
# Cognitive Search Service header settings
HEADERS = {
'Content-Type': 'application/json',
'api-key': SEARCH_SERVICE_KEY
}
def search_documents(question):
"""Search documents using Azure Cognitive Search"""
# Construct the Azure Cognitive Search service access URL
url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
SEARCH_SERVICE_INDEX_NAME1 + '/docs')
# Create a parameter dictionary
params = {
'api-version': SEARCH_SERVICE_API_VERSION,
'search': question,
'select': '*',
'$top': 3,
'queryLanguage': 'en-us',
'queryType': 'semantic',
'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
'$count': 'true',
'speller': 'lexicon',
'answers': 'extractive|count-3',
'captions': 'extractive|highlight-false'
}
# Make a GET request to the Azure Cognitive Search service and store the response in a variable
resp = requests.get(url, headers=HEADERS, params=params)
# Return the JSON response containing the search results
return resp.json()
def filter_documents(search_results):
"""Filter documents that score above a certain threshold in semantic search"""
file_content = OrderedDict()
for result in search_results['value']:
# The '@search.rerankerScore' range is 0 to 4.00, where a higher score indicates a stronger semantic match.
if result['@search.rerankerScore'] > 1.5:
file_content[result['metadata_storage_path']] = {
'chunks': result['pages'][:10],
'captions': result['@search.captions'][:10],
'score': result['@search.rerankerScore'],
'file_name': result['metadata_storage_name']
}
return file_content
Tips
If you receive an IndexError:
Example
(.venv) PS C:\Users\sms79\azure-proj\gpt-proj1> & c:/Users/sms79/azure-proj/gpt-proj1/.venv/Scripts/python.exe "c:/Users/sms79/azure-proj/gpt-proj1/example(lang_chain).py"
Total Documents Found: 5, Top Documents: 3
Traceback (most recent call last):
File "c:\Users\sms79\azure-proj\gpt-proj1\example(lang_chain).py", line 144, in <module>
main()
File "c:\Users\sms79\azure-proj\gpt-proj1\example(lang_chain).py", line 134, in main
vector_store = store_documents(docs, embeddings)
File "c:\Users\sms79\azure-proj\gpt-proj1\example(lang_chain).py", line 83, in store_documents
return FAISS.from_documents(docs, embeddings)
File "C:\Users\sms79\azure-proj\gpt-proj1\.venv\lib\site-packages\langchain\schema\vectorstore.py", line 438, in from_documents
return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
File "C:\Users\sms79\azure-proj\gpt-proj1\.venv\lib\site-packages\langchain\vectorstores\faiss.py", line 603, in from_texts
return cls.__from(
File "C:\Users\sms79\azure-proj\gpt-proj1\.venv\lib\site-packages\langchain\vectorstores\faiss.py", line 562, in __from
index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: list index out of range
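This IndexError is raised at `faiss.IndexFlatL2(len(embeddings[0]))` when the list of documents handed to FAISS is empty. That typically means no search result scored above the 1.5 `@search.rerankerScore` threshold in `filter_documents`, or the search returned no results at all. Check that your index contains documents and, if necessary, lower the threshold. A simple guard in `main` (a suggested addition, not part of the original code) makes the failure explicit:
```
# Suggested guard, added after building the docs list in main:
if not docs:
    raise SystemExit('No documents passed the score threshold. '
                     'Check your index or lower the threshold in filter_documents.')
```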
def main():
QUESTION = 'Tell me about effective prompting strategies'
# Search for documents with Azure Cognitive Search
search_results = search_documents(QUESTION)
file_content = filter_documents(search_results)
print('Total Documents Found: {}, Top Documents: {}'.format(
search_results['@odata.count'], len(search_results['value'])))
# 'chunks' is the value that corresponds to the Pages field that you set up in the Cognitive Search service.
# Find the number of chunks
docs = []
    for key, value in file_content.items():
        for page in value['chunks']:
            docs.append(Document(page_content=page,
                                 metadata={"source": value["file_name"]}))
print("Number of chunks: ", len(docs))
# execute the main function
if __name__ == "__main__":
main()
Now that Azure Cognitive Search is working well in VS Code, it's time to start using
Azure OpenAI.
In this chapter, we'll create functions related to Azure OpenAI and ultimately create
and run a program in `example.py` that answers a question with Azure OpenAI based on
the search information from Azure Cognitive Search.
1. We will create functions related to Azure OpenAI and LangChain and run them from the main function.
- Add the following functions above the main function.
def create_embeddings():
"""Create an embedding model"""
return OpenAIEmbeddings(
openai_api_type='azure',
openai_api_key=AZURE_OPENAI_KEY,
openai_api_base=AZURE_OPENAI_ENDPOINT,
openai_api_version=AZURE_OPENAI_API_VERSION,
deployment='text-embedding-ada-002',
model='text-embedding-ada-002',
chunk_size=1
)
def store_documents(docs, embeddings):
"""Create vector store and store documents in the vector store"""
return FAISS.from_documents(docs, embeddings)
def answer_with_langchain(vector_store, question):
    """Search the vector store for documents related to the question
    and answer the question with the search results using LangChain"""
# add a chat service
llm = AzureChatOpenAI(
openai_api_key=AZURE_OPENAI_KEY,
openai_api_base=AZURE_OPENAI_ENDPOINT,
openai_api_version=AZURE_OPENAI_API_VERSION,
openai_api_type='azure',
deployment_name='gpt-35-turbo',
temperature=0.0,
max_tokens=500
)
chain = RetrievalQAWithSourcesChain.from_chain_type(
llm=llm,
chain_type='stuff',
retriever=vector_store.as_retriever(),
return_source_documents=True
)
return chain({'question': question})
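Two design choices here are worth noting. In `create_embeddings`, `chunk_size=1` sends texts to the embedding deployment one at a time; some Azure OpenAI API versions accept only a single input per embedding request, so this is the safe setting. In `answer_with_langchain`, `chain_type='stuff'` places all retrieved chunks into a single prompt, which is the simplest approach and works here because only a handful of pages are retrieved.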
2. Add the code below to your main function.
def main():
QUESTION = 'Tell me about effective prompting strategies'
# Search for documents with Azure Cognitive Search
...
# Answer the question using LangChain
embeddings = create_embeddings()
vector_store = store_documents(docs, embeddings)
result = answer_with_langchain(vector_store, QUESTION)
print('Question: ', QUESTION)
print('Answer: ', result['answer'])
print('Reference: ', result['sources'].replace(",","\n"))
# execute the main function
if __name__ == "__main__":
main()
3. Now let's run it and see if it answers your question.
- The result of executing the code.
```
Total Documents Found: 5, Top Documents: 3
Number of chunks: 10
Question: Tell me about effective prompting strategies
Answer: Effective prompting strategies for improving the reliability of GPT-3 include
establishing simple prompts that improve GPT-3's reliability in terms of generalizability,
social biases, calibration, and factuality. These strategies include prompting with
randomly sampled examples from the source domain, using examples sampled from
a balanced demographic distribution and natural language intervention to reduce
social biases, calibrating output probabilities, and updating the LLM's factual
knowledge and reasoning chains. Natural language intervention can also effectively
guide model predictions towards better fairness.
Reference: Prompting GPT-3 To Be Reliable.pdf
```
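For reference, here is the complete code for this chapter. First, the contents of `config.py`: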
# Azure Search Service settings
SEARCH_SERVICE_NAME = 'your-search-service-name' # 'test-search-service1'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = 'your-API-version' # '2023-07-01-preview'
# Azure Search Service Index settings
SEARCH_SERVICE_INDEX_NAME1 = 'your-search-service-index-name' # 'azureblob-index1'
# Azure Cognitive Search Service Semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'your-semantic-configuration-name' # 'test-configuration'
# Azure OpenAI settings
AZURE_OPENAI_NAME = 'your-openai-name' # 'testopenai1004'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = 'your-API-version' # '2023-08-01-preview'
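And the full `example.py`: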
# Library imports
from collections import OrderedDict
import requests
# LangChain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
# Configuration imports
from config import (
SEARCH_SERVICE_ENDPOINT,
SEARCH_SERVICE_KEY,
SEARCH_SERVICE_API_VERSION,
SEARCH_SERVICE_INDEX_NAME1,
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
AZURE_OPENAI_ENDPOINT,
AZURE_OPENAI_KEY,
AZURE_OPENAI_API_VERSION,
)
# Cognitive Search Service header settings
HEADERS = {
'Content-Type': 'application/json',
'api-key': SEARCH_SERVICE_KEY
}
def search_documents(question):
"""Search documents using Azure Cognitive Search"""
# Construct the Azure Cognitive Search service access URL
url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
SEARCH_SERVICE_INDEX_NAME1 + '/docs')
# Create a parameter dictionary
params = {
'api-version': SEARCH_SERVICE_API_VERSION,
'search': question,
'select': '*',
'$top': 3,
'queryLanguage': 'en-us',
'queryType': 'semantic',
'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
'$count': 'true',
'speller': 'lexicon',
'answers': 'extractive|count-3',
'captions': 'extractive|highlight-false'
}
# Make a GET request to the Azure Cognitive Search service and store the response in a variable
resp = requests.get(url, headers=HEADERS, params=params)
# Return the JSON response containing the search results
return resp.json()
def filter_documents(search_results):
"""Filter documents that score above a certain threshold in semantic search"""
file_content = OrderedDict()
for result in search_results['value']:
# The '@search.rerankerScore' range is 0 to 4.00, where a higher score indicates a stronger semantic match.
if result['@search.rerankerScore'] > 1.5:
file_content[result['metadata_storage_path']] = {
'chunks': result['pages'][:10],
'captions': result['@search.captions'][:10],
'score': result['@search.rerankerScore'],
'file_name': result['metadata_storage_name']
}
return file_content
def create_embeddings():
"""Create an embedding model"""
return OpenAIEmbeddings(
openai_api_type='azure',
openai_api_key=AZURE_OPENAI_KEY,
openai_api_base=AZURE_OPENAI_ENDPOINT,
openai_api_version=AZURE_OPENAI_API_VERSION,
deployment='text-embedding-ada-002',
model='text-embedding-ada-002',
chunk_size=1
)
def store_documents(docs, embeddings):
"""Create vector store and store documents in the vector store"""
return FAISS.from_documents(docs, embeddings)
def answer_with_langchain(vector_store, question):
    """Search the vector store for documents related to the question
    and answer the question with the search results using LangChain"""
# add a chat service
llm = AzureChatOpenAI(
openai_api_key=AZURE_OPENAI_KEY,
openai_api_base=AZURE_OPENAI_ENDPOINT,
openai_api_version=AZURE_OPENAI_API_VERSION,
openai_api_type='azure',
deployment_name='gpt-35-turbo',
temperature=0.0,
max_tokens=500
)
chain = RetrievalQAWithSourcesChain.from_chain_type(
llm=llm,
chain_type='stuff',
retriever=vector_store.as_retriever(),
return_source_documents=True
)
return chain({'question': question})
def main():
QUESTION = 'Tell me about effective prompting strategies'
# Search for documents with Azure Cognitive Search
search_results = search_documents(QUESTION)
file_content = filter_documents(search_results)
print('Total Documents Found: {}, Top Documents: {}'.format(
search_results['@odata.count'], len(search_results['value'])))
# 'chunks' is the value that corresponds to the Pages field that you set up in the Cognitive Search service.
# Find the number of chunks
docs = []
    for key, value in file_content.items():
        for page in value['chunks']:
            docs.append(Document(page_content=page,
                                 metadata={"source": value["file_name"]}))
print("Number of chunks: ", len(docs))
    # Answer the question using LangChain
embeddings = create_embeddings()
vector_store = store_documents(docs, embeddings)
result = answer_with_langchain(vector_store, QUESTION)
print('Question: ', QUESTION)
print('Answer: ', result['answer'])
print('Reference: ', result['sources'].replace(",","\n"))
# execute the main function
if __name__ == "__main__":
main()