Teach ChatGPT to Answer Questions Based on PDF content: Using Azure AI Search and Azure OpenAI (Lang Chain.ver)
Semantic Kernel vs. Lang Chain
Readers have the opportunity to explore two different approaches - using either Semantic Kernel or Lang Chain.
For those interested, here's the link to the Semantic Kernel version of this tutorial: Teach ChatGPT to Answer Questions: Using Azure AI Search & Azure OpenAI (Semantic Kernel.ver)
Can't I just copy and paste text from a PDF file to teach ChatGPT?
The purpose of this tutorial is to explain how to efficiently extract and use information from a large number of PDFs. Dealing with a 5-page PDF can be straightforward, but it's a different story when you're dealing with complex documents of 100+ pages. In these situations, integrating Azure AI Search with Azure OpenAI enables fast and accurate information retrieval and processing. In this tutorial, we handle 5 PDFs, but you can scale the same method to more than 10,000 files.
In this two-part series, we will explore how to build an intelligent service using Azure. In Series 1, we'll use Azure AI Search to extract keywords from unstructured data stored in Azure Blob Storage. In Series 2, we'll create a feature that answers questions based on PDF documents using Azure OpenAI. Here is an overview of this tutorial.
This tutorial is related to the following topics
Learning objectives
Prerequisites
Microsoft Cloud Technologies used in this Tutorial
Table of Contents
Series 2: Implement a ChatGPT Service with Azure OpenAI
Series 1: Extract Key Phrases for Search Queries Using Azure AI Search
1. Create a Blob Container
2. Store PDF Documents in Azure Blob Storage
3. Create an AI Search Service
NOTE:
In this tutorial we will use the Basic tier to explore semantic ranker with Azure AI Search. You can expect a cost of approximately $2.50 per 100 program runs with this tier.
If you plan to use the free tier, please note that the code demonstrated in this tutorial may differ from what you'll need.
(Azure AI Search is billed even when you're not using it. If you're just going through the tutorial for practice, I recommend deleting the Azure AI Search service you created once you've finished the tutorial.)
4. Connect to Data from Azure Blob Storage
5. Add Cognitive Skills
How to keep sensitive data private?
To ensure the privacy of sensitive data, Azure AI Search (formerly Azure Cognitive Search) provides a Personally Identifiable Information (PII) detection skill. This cognitive skill is specifically designed to identify and protect PII in your data. To learn more about the PII detection skill, read the following article.
Personally Identifiable Information (PII) Detection cognitive skill
- To enable this feature, select Extract Personally identifiable information.
6. Customize Target Index and Create an Indexer
You can change the fields to suit your data. I have attached a document with a description of each field in the index. (Depending on your settings for the index fields, the code you implement may differ from the tutorial.)
7. Extract Key Phrases for Search Queries Using Azure AI Search
Series 2: Implement a ChatGPT Service with Azure OpenAI
Intent of the Code Design
The primary goal of the code design in this tutorial is to construct the code in a way that is easy to understand, especially for first-time viewers. Each segment of the code is encapsulated as a separate function. This modularization ensures that the main function acts as a clear, coherent guide through the logic of the system.
Ex. Part of the main function. (Semantic Kernel.ver)
async def main():
…
    search_results = await search_documents(QUESTION)
    documents = await filter_documents(search_results)
…
    kernel = await create_kernel(sk)
    await create_embeddings(kernel)
    await create_vector_store(kernel, embeddings)
    await store_documents(kernel, documents)
    related_page = await search_with_vector_store(memory, QUESTION)
    await add_chat_service(kernel)
    answer = await answer_with_sk(kernel, QUESTION, related_page)
…
Overview of the code
Part 1: Retrieving and Scoring Documents
We'll use Azure AI Search to retrieve documents related to our question, score them for relevance to our question, and extract documents with a certain score or higher.
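The filtering idea in Part 1 can be tried without calling the service. The following is a minimal sketch that runs on a hand-written response; the file names and scores are made up for illustration, and the 0.8 threshold is the one used later in this tutorial.

```python
from collections import OrderedDict

# Hypothetical response that mimics the shape of a semantic search result.
# The file names and scores are invented for this sketch.
mock_results = {
    "value": [
        {"metadata_storage_name": "a.pdf", "@search.rerankerScore": 2.4},
        {"metadata_storage_name": "b.pdf", "@search.rerankerScore": 0.5},
        {"metadata_storage_name": "c.pdf", "@search.rerankerScore": 1.1},
    ]
}


def keep_relevant(results, threshold=0.8):
    """Return file name -> score for results that score above the threshold."""
    kept = OrderedDict()
    for item in results["value"]:
        score = item["@search.rerankerScore"]
        if score > threshold:
            kept[item["metadata_storage_name"]] = score
    return kept


relevant = keep_relevant(mock_results)
print(list(relevant))
```

Raising the threshold keeps fewer, more closely matching documents; lowering it keeps more documents at the cost of sending less relevant text to the model.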
Part 2: Document Embedding and Vector Database Storage
We'll embed the documents we extracted in part 1. We will then store these embedded documents in a vector database, organizing them into pages (chunks of 5,000 characters each).
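To get a feel for what "pages of 5,000 characters" means, here is a small sketch of fixed-size chunking. In this tutorial the pages come from the Pages field of the search index, so this helper (`split_into_pages` is a hypothetical name) is only an illustration of the idea and ignores sentence boundaries.

```python
def split_into_pages(text, page_size=5000):
    """Split text into consecutive chunks of at most page_size characters."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return [text[i:i + page_size] for i in range(0, len(text), page_size)]


# A 12,000-character placeholder document.
sample_text = "x" * 12000
pages = split_into_pages(sample_text)
print([len(page) for page in pages])
```

The last page is simply shorter than the others when the text length is not a multiple of the page size.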
Part 3: Extracting Relevant Content and Implementing a Function to Answer Questions
We will extract the most relevant page from the vector database based on the question.
Then we will implement a function to generate answers from the extracted content.
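Picking "the most relevant page" boils down to comparing the question's embedding with each page's embedding. The sketch below uses tiny hand-made 3-dimensional vectors and cosine similarity purely for illustration; real embeddings have far more dimensions, and the vector store used later may rely on a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings"; the page names and numbers are invented for this sketch.
page_vectors = {
    "page about prompting": [0.9, 0.1, 0.0],
    "page about pricing": [0.0, 0.2, 0.9],
    "page about indexing": [0.1, 0.9, 0.1],
}
question_vector = [0.8, 0.2, 0.1]

best_page = max(
    page_vectors,
    key=lambda name: cosine_similarity(page_vectors[name], question_vector),
)
print(best_page)
```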
1. Change your indexer settings to use Azure OpenAI
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/content/pages/*/keyphrases/*",
      "targetFieldName": "keyphrases"
    },
    {
      "sourceFieldName": "/document/content/pages/*",
      "targetFieldName": "pages"
    }
]
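One way to picture these mappings: each `sourceFieldName` is a path into the enriched document, where `*` means "every item in this list", and the matched values are written to the index field named by `targetFieldName`. The sketch below imitates that on a simplified, hand-built tree. It is not how the service is implemented, and the `text` key and the `resolve_path` helper are assumptions made only for this illustration.

```python
def resolve_path(node, path):
    """Collect every value matched by a slash-separated path; '*' matches each list item."""
    current = [node]
    for part in (p for p in path.split("/") if p):
        next_nodes = []
        for item in current:
            if part == "*":
                if isinstance(item, list):
                    next_nodes.extend(item)
            elif isinstance(item, dict) and part in item:
                next_nodes.append(item[part])
        current = next_nodes
    return current


# Simplified stand-in for an enriched document (contents are invented).
enriched = {
    "document": {
        "content": {
            "pages": [
                {"text": "first page text", "keyphrases": ["prompting", "reliability"]},
                {"text": "second page text", "keyphrases": ["calibration"]},
            ]
        }
    }
}

index_document = {
    "keyphrases": resolve_path(enriched, "/document/content/pages/*/keyphrases/*"),
    "pages": [page["text"] for page in resolve_path(enriched, "/document/content/pages/*")],
}
print(index_document)
```

The key phrases of all pages end up flattened into one list, while `pages` keeps one entry per page, which is the shape the Python code later reads from `result['pages']`.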
2. Create an Azure OpenAI resource
3. Set up the project and install the libraries
mkdir azure-proj
cd azure-proj
mkdir gpt-proj
cd gpt-proj
python -m venv .venv
.venv\Scripts\activate.bat
pip install openai==1.14.1
pip install langchain==0.1.11
pip install langchain_openai==0.0.8
pip install faiss-cpu==1.8.0
4. Set up the project in VS Code
# Library imports
from collections import OrderedDict
import requests
# Lang Chain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS
# Azure AI Search service settings
SEARCH_SERVICE_NAME = 'your-search-service-name' # 'teachchatgpt-search'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = 'your-API-version' # '2023-10-01-Preview'
# Azure AI Search service index settings
SEARCH_SERVICE_INDEX_NAME1 = 'your-search-service-index-name' # 'teachchatgpt-index'
# Azure AI Search service semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'your-semantic-configuration-name' # 'teachchatgpt-config'
# Azure OpenAI settings
AZURE_OPENAI_NAME = 'your-openai-name' # 'teachchatgpt-azureopenai'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = 'your-API-version' # '2024-02-15-preview'
5. Search with Azure AI Search
# Configuration imports
from config import (
    SEARCH_SERVICE_ENDPOINT,
    SEARCH_SERVICE_KEY,
    SEARCH_SERVICE_API_VERSION,
    SEARCH_SERVICE_INDEX_NAME1,
    SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
    AZURE_OPENAI_ENDPOINT,
    AZURE_OPENAI_KEY,
    AZURE_OPENAI_API_VERSION,
)
# Azure AI Search service header settings
HEADERS = {
    'Content-Type': 'application/json',
    'api-key': SEARCH_SERVICE_KEY
}
def search_documents(question):
    """Search documents using Azure AI Search."""
    # Construct the Azure AI Search service access URL.
    url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
                SEARCH_SERVICE_INDEX_NAME1 + '/docs')
    
    # Create a parameter dictionary.
    params = {
        'api-version': SEARCH_SERVICE_API_VERSION,
        'search': question,
        'select': '*',
        # '$top': 5 returns the top 5 matching documents.
        '$top': 5,
        'queryLanguage': 'en-us',
        'queryType': 'semantic',
        'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
        '$count': 'true',
        'speller': 'lexicon',
        'answers': 'extractive|count-3',
        'captions': 'extractive|highlight-false'
        }
    # Make a GET request to the Azure AI Search service and store the response in a variable.
    resp = requests.get(url, headers=HEADERS, params=params)
    # Return the JSON response containing the search results.
    search_results = resp.json()
    return search_results
def filter_documents(search_results):
    """Filter documents with a reranker score above a certain threshold."""
    documents = OrderedDict()
    for result in search_results['value']:
        # The '@search.rerankerScore' range is 0 to 4.00, where a higher score indicates a stronger semantic match.
        if result['@search.rerankerScore'] > 0.8:
            documents[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }
    return documents
def main():
    QUESTION = 'Tell me about effective prompting strategies'
    # Search for documents with Azure AI Search
    search_results = search_documents(QUESTION)
    documents = filter_documents(search_results)
    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))
    # 'chunks' is the value that corresponds to the Pages field that you set up in the AI Search service.
    docs = []
    for key, value in documents.items():
        for page in value['chunks']:
            docs.append(Document(page_content = page,
                                metadata={"source": value["file_name"]}))
# execute the main function
if __name__ == "__main__":
    main()
6. Get answers from PDF content using Azure OpenAI and AI Search
Now that Azure AI Search is working well in VS Code, it's time to start using Azure OpenAI.
In this chapter, we'll create functions related to Azure OpenAI, then build and run a program in the example.py file that answers a question with Azure OpenAI based on the search results from Azure AI Search.
1. We will create functions related to Azure OpenAI and Lang Chain and run them from the main function.
- Add the following code to the example.py file to create the `create_embeddings` function that creates an embedding model.
def create_embeddings():
    """Create an embedding model."""
    embeddings = AzureOpenAIEmbeddings(
        openai_api_key = AZURE_OPENAI_KEY,
        azure_endpoint = AZURE_OPENAI_ENDPOINT,
        openai_api_version = AZURE_OPENAI_API_VERSION,
        openai_api_type = 'azure',
        azure_deployment = 'text-embedding-ada-002',
        model = 'text-embedding-ada-002',
        chunk_size=1
    )
    return embeddings
- Add the following code to the example.py file to create the `store_documents` function that stores the extracted documents in a vector database.
def store_documents(docs, embeddings):
    """Create vector store and store documents in the vector store."""
    return FAISS.from_documents(docs, embeddings)
- Add the following code to the example.py file to create the `answer_with_langchain` function that extracts the chunk associated with a question stored in the vector database and answers the question based on it.
def answer_with_langchain(vector_store, question):
    """Search for documents related to your question from the vector store
    and answer question with search result using Lang Chain."""
    # Add a chat service.
    llm = AzureChatOpenAI(
        openai_api_key = AZURE_OPENAI_KEY,
        azure_endpoint = AZURE_OPENAI_ENDPOINT,
        openai_api_version= AZURE_OPENAI_API_VERSION,
        azure_deployment = 'gpt-35-turbo',
        temperature=0.0,
        max_tokens=500
    )
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )
    answer = chain.invoke({'question': question})
    return answer
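The `chain_type='stuff'` setting means that all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below shows that idea in plain Python. The prompt wording, the `build_stuffed_prompt` helper, and the second file name are assumptions for illustration only and are not the template Lang Chain actually uses.

```python
def build_stuffed_prompt(question, documents):
    """Concatenate every retrieved document into one prompt (the 'stuff' strategy)."""
    context_blocks = [
        f"Content: {doc['page_content']}\nSource: {doc['source']}"
        for doc in documents
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below "
        "and list the sources you used.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Invented chunks standing in for what the retriever would return.
retrieved = [
    {"page_content": "Simple prompts can improve reliability.",
     "source": "Prompting GPT-3 To Be Reliable.pdf"},
    {"page_content": "Balanced examples reduce social biases.",
     "source": "Another Paper.pdf"},  # hypothetical file name
]
prompt = build_stuffed_prompt("Tell me about effective prompting strategies", retrieved)
print(prompt)
```

Because everything goes into one prompt, this approach is simple but limited by the model's context window, which is one reason the chunks are filtered and capped before they reach this step.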
2. Modify the `main` function to match this code.
def main():
    QUESTION = 'Tell me about effective prompting strategies'
    # Search for documents with Azure AI Search.
    search_results = search_documents(QUESTION)
    documents = filter_documents(search_results)
    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))
    # 'chunks' is the value that corresponds to the Pages field that you set up in the AI Search service.
    docs = []
    for key,value in documents.items():
        for page in value['chunks']:
            docs.append(Document(page_content = page,
                                metadata={"source": value["file_name"]}))
    # Answer your question using Lang Chain.
    embeddings = create_embeddings()
    vector_store = store_documents(docs, embeddings)
    result = answer_with_langchain(vector_store, QUESTION)
    print('Question: ', QUESTION)
    print('Answer: ', result['answer'])
    print('Reference: ', result['sources'].replace(",","\n"))
3. Now let's run it and see if it answers your question.
- The result of executing the code.
Total Documents Found: 5, Top Documents: 3
Number of chunks: 10
Question: Tell me about effective prompting strategies
Answer: Effective prompting strategies for improving the reliability of GPT-3 include
establishing simple prompts that improve GPT-3's reliability in terms of generalizability,
social biases, calibration, and factuality. These strategies include prompting with
randomly sampled examples from the source domain, using examples sampled from
a balanced demographic distribution and natural language intervention to reduce
social biases, calibrating output probabilities, and updating the LLM's factual
knowledge and reasoning chains. Natural language intervention can also effectively
guide model predictions towards better fairness.
Reference: Prompting GPT-3 To Be Reliable.pdf
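The Reference line is built from a comma-separated string of sources. When several chunks come from the same file, or when the names are separated by a comma and a space, a small clean-up step can make the list easier to read. This is an optional sketch; `parse_sources` is a hypothetical helper, the second file name is invented, and the approach assumes file names do not contain commas.

```python
def parse_sources(sources):
    """Split a comma-separated sources string into unique, trimmed names."""
    names = []
    for raw in sources.split(","):
        name = raw.strip()
        if name and name not in names:
            names.append(name)
    return names


example_sources = (
    "Prompting GPT-3 To Be Reliable.pdf, Another Paper.pdf,"
    "Prompting GPT-3 To Be Reliable.pdf"
)
unique_sources = parse_sources(example_sources)
print("\n".join(unique_sources))
```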
NOTE: Full code for example.py and config.py
# Azure AI Search service settings
SEARCH_SERVICE_NAME = 'teachchatgpt-search' # 'teachchatgpt-search'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = '2023-10-01-Preview' # '2023-10-01-Preview'
# Azure AI Search service index settings
SEARCH_SERVICE_INDEX_NAME1 = 'teachchatgpt-index' # 'teachchatgpt-index'
# Azure AI Search service semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'teachchatgpt-config' # 'teachchatgpt-config'
# Azure OpenAI settings
AZURE_OPENAI_NAME = 'teachchatgpt-azureopenai' # 'teachchatgpt-azureopenai'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = '2024-02-15-preview' # '2024-02-15-preview'
# Library imports
from collections import OrderedDict
import requests
# Lang Chain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS
# Configuration imports
from config import (
    SEARCH_SERVICE_ENDPOINT,
    SEARCH_SERVICE_KEY,
    SEARCH_SERVICE_API_VERSION,
    SEARCH_SERVICE_INDEX_NAME1,
    SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
    AZURE_OPENAI_ENDPOINT,
    AZURE_OPENAI_KEY,
    AZURE_OPENAI_API_VERSION,
)
# Azure AI Search service header settings
HEADERS = {
    'Content-Type': 'application/json',
    'api-key': SEARCH_SERVICE_KEY
}
def search_documents(question):
    """Search documents using Azure AI Search."""
    # Construct the Azure AI Search service access URL.
    url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
                SEARCH_SERVICE_INDEX_NAME1 + '/docs')
    
    # Create a parameter dictionary.
    params = {
        'api-version': SEARCH_SERVICE_API_VERSION,
        'search': question,
        'select': '*',
        # '$top': 5 returns the top 5 matching documents.
        '$top': 5,
        'queryLanguage': 'en-us',
        'queryType': 'semantic',
        'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
        '$count': 'true',
        'speller': 'lexicon',
        'answers': 'extractive|count-3',
        'captions': 'extractive|highlight-false'
        }
    # Make a GET request to the Azure AI Search service and store the response in a variable.
    resp = requests.get(url, headers=HEADERS, params=params)
    # Return the JSON response containing the search results.
    search_results = resp.json()
    return search_results
def filter_documents(search_results):
    """Filter documents with a reranker score above a certain threshold."""
    documents = OrderedDict()
    for result in search_results['value']:
        # The '@search.rerankerScore' range is 0 to 4.00, where a higher score indicates a stronger semantic match.
        if result['@search.rerankerScore'] > 0.8:
            documents[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }
    return documents
def create_embeddings():
    """Create an embedding model."""
    embeddings = AzureOpenAIEmbeddings(
        openai_api_key = AZURE_OPENAI_KEY,
        azure_endpoint = AZURE_OPENAI_ENDPOINT,
        openai_api_version = AZURE_OPENAI_API_VERSION,
        openai_api_type = 'azure',
        azure_deployment = 'text-embedding-ada-002',
        model = 'text-embedding-ada-002',
        chunk_size=1
    )
    return embeddings
def store_documents(docs, embeddings):
    """Create vector store and store documents in the vector store."""
    return FAISS.from_documents(docs, embeddings)
def answer_with_langchain(vector_store, question):
    """Search for documents related to your question from the vector store
    and answer question with search result using Lang Chain."""
    # Add a chat service.
    llm = AzureChatOpenAI(
        openai_api_key = AZURE_OPENAI_KEY,
        azure_endpoint = AZURE_OPENAI_ENDPOINT,
        openai_api_version= AZURE_OPENAI_API_VERSION,
        azure_deployment = 'gpt-35-turbo',
        temperature=0.0,
        max_tokens=500
    )
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )
    answer = chain.invoke({'question': question})
    return answer
def main():
    QUESTION = 'Tell me about effective prompting strategies'
    # Search for documents with Azure AI Search.
    search_results = search_documents(QUESTION)
    documents = filter_documents(search_results)
    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))
    
    # 'chunks' is the value that corresponds to the Pages field that you set up in the AI Search service.
    docs = []
    for key,value in documents.items():
        for page in value['chunks']:
            docs.append(Document(page_content = page,
                                metadata={"source": value["file_name"]}))
    # Answer your question using Lang Chain.
    embeddings = create_embeddings()
    vector_store = store_documents(docs, embeddings)
    result = answer_with_langchain(vector_store, QUESTION)
    print('Question: ', QUESTION)
    print('Answer: ', result['answer'])
    print('Reference: ', result['sources'].replace(",","\n"))
# Execute the main function.
if __name__ == "__main__":
    main()
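If you plan to share or commit this project, you may prefer not to keep keys directly in config.py. A minimal sketch of one alternative is to read them from environment variables. The variable names below mirror the ones in config.py, `read_setting` is a hypothetical helper, and the `setdefault` lines exist only so the sketch runs on its own.

```python
import os


def read_setting(name, default=None):
    """Read a setting from the environment and fail early if it is missing."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demo only: provide placeholder values so this sketch runs without any setup.
# In a real project, set these variables in your shell instead.
os.environ.setdefault("SEARCH_SERVICE_NAME", "your-search-service-name")
os.environ.setdefault("SEARCH_SERVICE_KEY", "your-search-service-key")

SEARCH_SERVICE_NAME = read_setting("SEARCH_SERVICE_NAME")
SEARCH_SERVICE_ENDPOINT = f"https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/"
SEARCH_SERVICE_KEY = read_setting("SEARCH_SERVICE_KEY")

print(SEARCH_SERVICE_ENDPOINT)
```

The rest of the code can keep importing the same names from config.py, so nothing else in example.py needs to change.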