Educator Developer Blog

Teach ChatGPT to Answer Questions: Using Azure AI Search & Azure OpenAI (Lang Chain)

Minseok_Song
Nov 02, 2023

Teach ChatGPT to Answer Questions Based on PDF content: Using Azure AI Search and Azure OpenAI (Lang Chain.ver)

 

 


Semantic Kernel vs. Lang Chain

Readers have the opportunity to explore two different approaches - using either Semantic Kernel or Lang Chain.

For those interested, here's the link to the Semantic Kernel version of this tutorial: Teach ChatGPT to Answer Questions: Using Azure AI Search & Azure OpenAI (Semantic Kernel.ver)

Can't I just copy and paste text from a PDF file to teach ChatGPT?

 

The purpose of this tutorial is to explain how to efficiently extract and use information from a large number of PDFs. Dealing with a 5-page PDF can be straightforward, but it's a different story when you're dealing with complex documents of 100+ pages. In these situations, the integration of Azure AI Search with Azure OpenAI enables fast and accurate information retrieval and processing. In this tutorial we handle 5 PDFs, but the same method scales to more than 10,000 files.

In this two-part series, we will explore how to build an intelligent service using Azure. In Series 1, we'll use Azure AI Search to extract key phrases from unstructured data stored in Azure Blob Storage. In Series 2, we'll create a feature to answer questions based on PDF documents using Azure OpenAI. Here is an overview of this tutorial.

 

 

This tutorial is related to the following topics

 

- AI Engineer
- Developer
- Azure Blob Storage
- Azure AI Search
- Azure OpenAI

 

Learning objectives

 

In this tutorial, you'll learn the following:
- How to store your unstructured data in Azure Blob Storage.
- How to create search experiences based on data stored in Blob Storage with Azure AI Search.
- How to teach ChatGPT to answer questions based on your PDF content using Azure AI Search and Azure OpenAI.
 

Prerequisites

 

 

Microsoft Cloud Technologies used in this Tutorial

 

- Azure Blob Storage
- Azure AI Search
- Azure OpenAI Service
 

Table of Contents

 

Series 1: Extract Key Phrases for Search Queries Using Azure AI Search
1. Create a Blob Container
2. Store PDF Documents in Azure Blob Storage
3. Create an AI Search Service
4. Connect to Data from Azure Blob Storage
5. Add Cognitive Skills
6. Customize Target Index and Create an Indexer
7. Extract Key Phrases for Search Queries Using Azure AI Search

Series 2: Implement a ChatGPT Service with Azure OpenAI
1. Change your indexer settings to use Azure OpenAI
2. Create an Azure OpenAI
3. Set up the project and install the libraries
4. Set up the project in VS Code
5. Search with Azure AI Search
6. Get answers from PDF content using Azure OpenAI and AI Search
7. Note: Full code for example.py and config.py
 

Series 1: Extract Key Phrases for Search Queries Using Azure AI Search

 

In Series 1, we'll use Azure AI Search to extract key phrases from unstructured data stored in Azure Blob Storage.
 

This series is designed to guide you through the essential steps of storing, connecting, and searching data in the cloud.
 
Overview of Series 1
 
1. Store Unstructured Data in the Cloud

We'll begin by exploring how to store unstructured data, such as PDFs, in the cloud. This section covers the basics of uploading data to Azure Blob Storage.

2. Connect Stored Data to Azure AI Search

Once our data is stored in the cloud, the next step is to connect that data to Azure AI Search.

3. Search Stored Data

In this final step of Series 1, we'll set up an index on the connected data and run searches against it in Azure AI Search, so that we can efficiently retrieve and use the data stored in Azure Blob Storage.
 

1. Create a Blob Container

 

Azure Blob Storage is a service designed for storing large amounts of unstructured data, such as PDFs.

1. To begin, create an Azure Storage account by typing `storage accounts` in the search bar and selecting Services - Storage accounts.


2. Select the +Create button.
 


3. Enter the resource group name that will serve as the folder for the storage account, enter the storage account name, and select a region. When you're done, select the Review button.
 
 
4. Select the Create button.
 

 

2. Store PDF Documents in Azure Blob Storage

 

1. After your storage account is set up, navigate to Storage Browser by typing `storage browser` in the search bar.

 

2. In the Storage Browser, select the blob storage you created.
 
 
3. Add a new container to store PDF documents.

 

- Select your storage account.
- Select the Blob containers button.
- Select the +Add Container button to create a new container.

4. Once the container is set up, upload your PDFs into this container.

- Select the container you created.
 

 

- Select the Upload button and upload your PDF documents. 
 
For the tutorial, I downloaded 5 PDF documents of recent papers on GPT from Microsoft Academic and uploaded them to the container.
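If you prefer to script the upload instead of using the portal, the snippet below is a minimal sketch using the azure-storage-blob Python package (an extra dependency: `pip install azure-storage-blob`); the connection string, container name, and local `pdfs` folder are placeholders to replace with your own.
 
# Optional: upload PDFs with the Azure Blob Storage Python SDK instead of the portal.
import os
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = 'your-storage-connection-string'  # Storage account > Access keys
CONTAINER_NAME = 'your-container-name'                # the container created above

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client(CONTAINER_NAME)

for file_name in os.listdir('pdfs'):  # local folder holding your PDF files
    if file_name.lower().endswith('.pdf'):
        with open(os.path.join('pdfs', file_name), 'rb') as data:
            # Upload (or overwrite) each PDF as a blob with the same name.
            container.upload_blob(name=file_name, data=data, overwrite=True)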
 

 

3. Create an AI Search Service

 

1. Type `ai search` into the search bar and select Services – AI Search.
 

2. Select the +Create button.

3. Create a new AI Search Service.

- Select your Resource Group.
- Specify the Service name.
- Select your Location.
 

NOTE:

The Azure OpenAI resource is currently available only in limited regions.
If the Azure OpenAI resource is not available in your region, I recommend setting your location to East US.

- Choose a Pricing tier that suits your needs; since the semantic ranker is available from the Basic tier, I recommend setting your Pricing tier to Basic for this tutorial.


NOTE:

In this tutorial we will use the Basic tier to explore semantic ranker with Azure AI Search. You can expect a cost of approximately $2.50 per 100 program runs with this tier.

If you plan to use the free tier, please note that the code demonstrated in this tutorial may differ from what you'll need.

(Azure AI Search incurs charges even while you're not using it. If you're just going through the tutorial for practice, I recommend deleting the Azure AI Search service you created once you've finished the whole tutorial.)


- Select the Review + create button.
 

 
4. Navigate to the AI Search service you created and select Semantic ranker, then select the Free plan. (If you chose the Free pricing tier, you can skip this step.)
  
 

 4. Connect to Data from Azure Blob Storage


1. Navigate to the AI Search service you created and select Import data.
 

 

2. Select Azure Blob Storage as the data source and connect it to the Blob Storage where your PDFs are stored.
 
 
3. Specify your Data source name.

4. Select Choose an existing connection and select the blob storage container you created.

5. Select Next: Add cognitive skills button.
 

5. Add Cognitive Skills


1. Attach AI Services
- To power your cognitive skills, select an existing AI Services resource or create a new one; the Free resource (default) is sufficient for this tutorial.
 

2. Specify the Skillset name.
 
TIP:
If you want to search for text in a photo, you need to check 'Enable OCR and merge all text into merged_content field'. In this tutorial, we will not check it because we will search based on the text in the papers.

3. Select Enrichment granularity level. (In this tutorial, we'll use a page-by-page granularity, so we'll select Pages (5000 characters chunk).)

4. Select Extract Key phrases. (You can select additional checkboxes depending on the type of PDF data.)

5. Select Next: Customize target index button.
 
 

NOTE:
Why set the Enrichment granularity level to Pages (5000 characters chunk)?
To get ChatGPT responses based on a PDF, we need to call the GPT-3.5-turbo model of the ChatGPT API. The GPT-3.5-turbo model can handle up to 4,096 tokens, including both the text you send as input and the answer the ChatGPT API returns. For this reason, documents that are too long cannot be processed in a single request; they must be broken into multiple chunks and handled across multiple calls to the ChatGPT API. (Tokens can be words, punctuation, spaces, etc.)
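As a rough illustration of this token limit (not part of the tutorial code), you can count tokens with the tiktoken package, which you would need to install separately (`pip install tiktoken`):
 
import tiktoken

# Load the tokenizer used by the GPT-3.5-turbo model.
encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')

text = 'Large language models such as GPT-3.5-turbo process text as tokens.'
print(len(encoding.encode(text)))  # token count of this short sentence

# A 100+ page PDF contains far more than 4,096 tokens, which is why the content
# is split into 5,000-character chunks that each fit into a single API call.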
 
TIP:

How to keep sensitive data private?

 

To ensure the privacy of sensitive data, Azure Cognitive Search provides a Personally Identifiable Information (PII) detection skill. This cognitive skill is specifically designed to identify and protect PII in your data. To learn more about Azure Cognitive Search's PII detection skill, read the following article.

Personally Identifiable Information (PII) Detection cognitive skill

 

- To enable this feature, select Extract Personally identifiable information.

 

6. Customize Target Index and Create an Indexer


1. Customize target Index.

- Specify your Index name.
- Check the boxes as shown in the image below.

TIP:

You can change the fields to suit your data. I have attached a document with a description of each field in the index. (Depending on your settings for the index fields, the code you implement may differ from the tutorial.)

Index data from Azure Blob Storage

 
 
2. Add a new field.

In this tutorial, we have selected the Enrichment granularity level of Pages (5000 characters chunk). So, we need to create a field to search the pages that are separated into 5,000-character chunks. (A sketch of the resulting field definition follows the steps below.)

- Select + Add field button.
- Create a field named `pages`.
- Select Collection(Edm.String) as the type for the `pages` field.
- Check the box Retrievable.
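For reference, this is roughly how the new `pages` field appears in the index's JSON definition once it has been added; the exact JSON generated by the portal may include additional properties:
 
{
  "name": "pages",
  "type": "Collection(Edm.String)",
  "searchable": false,
  "filterable": false,
  "retrievable": true,
  "sortable": false,
  "facetable": false,
  "key": false
}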
 
3. Delete unnecessary fields.
 
 
4. Create an Indexer.
 
- Specify your Indexer Name.
- Select the Schedule – Once.
(For data that arrives in real time, you'll need to set up a periodic schedule, but since we're dealing with unchanging PDF data in this tutorial, we only need to run the indexer once.)
- Select the Submit button.
 
 

7. Extract Key Phrases for Search Queries Using Azure AI Search

 

1. Once your indexer and index creation are complete, navigate to your AI Search service and select the Indexes page.
 
2. Select the index you created.
 
 
3. You can use a query string or simply enter text to perform a search.
Ex) In this tutorial, I entered the following question: `How to prompt GPT to be reliable?`
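If you'd rather use the query-string form in Search explorer, the same question can be written with explicit query parameters; an illustrative sketch (parameter names follow the Azure AI Search REST API):
 
search=How to prompt GPT to be reliable?&$count=true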
 
 
4. Set Semantic configurations.
- Semantic configurations are available from the Basic pricing tier onwards. If you chose the Free tier, you can skip this step.
- Select Semantic configurations, then select + Add semantic configuration.
 
 
- Specify your semantic configuration Name.
- Select the Title field – content.
- Select the Save button.
 
 
- When you've finished setting up your semantic configuration, return and select the Save button.
 

 
We have completed extracting key phrases based on our question using Azure AI Search.
In the next series, we'll connect this AI Search service with Azure OpenAI to build a ChatGPT service that answers questions based on the PDFs stored in Blob Storage.
 

Series 2: Implement a ChatGPT Service with Azure OpenAI

In Series 2, we will implement a feature that answers questions based on PDFs using Azure AI Search and Azure OpenAI, and write the code for it.
 

 

Intent of the Code Design

The primary goal of the code design in this tutorial is to construct the code in a way that is easy to understand, especially for first-time viewers. Each segment of the code is encapsulated as a separate function. This modularization ensures that the main function acts as a clear, coherent guide through the logic of the system.

 

Ex. Part of the main function. (Semantic Kernel.ver)

 

async def main():
…
    search_results = await search_documents(QUESTION)

    documents = await filter_documents(search_results)
…
    kernel = await create_kernel(sk)

    await create_embeddings(kernel)

    await create_vector_store(kernel, embeddings)

    await store_documents(kernel, documents)

    related_page = await search_with_vector_store(memory, QUESTION)

    await add_chat_service(kernel)

    answer = await answer_with_sk(kernel, QUESTION, related_page)
…
 

Overview of the code

 

Part 1: Retrieving and Scoring Documents

We'll use Azure AI Search to retrieve documents related to our question, score them for relevance to our question, and extract documents with a certain score or higher.

 

Part 2: Document Embedding and Vector Database Storage

We'll embed the documents we extracted in part 1. We will then store these embedded documents in a vector database, organizing them into pages (chunks of 5,000 characters each).

 

What is Embedding?
Embedding converts a piece of text into a numeric vector, so that semantically similar pieces of text end up close to each other in vector space.

What is a Vector Database?
A vector database stores these vectors and lets you quickly retrieve the entries whose vectors are most similar to the vector of a query.
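As a small illustration (not part of the tutorial code), the `create_embeddings` function defined later in this tutorial returns an embedding model whose `embed_query` method turns a piece of text into such a vector:
 
# Minimal sketch: turn a piece of text into an embedding vector.
embeddings = create_embeddings()  # defined in Chapter 6 of this tutorial
vector = embeddings.embed_query('effective prompting strategies')
print(len(vector))  # text-embedding-ada-002 produces a 1536-dimensional vector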

 

Part 3: Extracting Relevant Content and Implementing a Function to Answer Questions

We will extract the most relevant page from the vector database based on the question.

Then we will implement a function to generate answers from the extracted content.

 

1. Change your indexer settings to use Azure OpenAI

 

1. Navigate to your AI Search service and select the Indexers page.
 
2. Select the indexer you created.
 
 
3. Select the Indexer Definition (JSON)
 
4. In the JSON, modify the "outputFieldMappings" part as shown below.
 
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/content/pages/*/keyphrases/*",
      "targetFieldName": "keyphrases"
    },
    {
      "sourceFieldName": "/document/content/pages/*",
      "targetFieldName": "pages"
    }
]
 

5. Select the Save button.
 
NOTE:
You must click the Save button with your mouse. Using the shortcut Ctrl + S doesn't actually save your changes; it just changes the color of the icon.
 
6. Select the Reset button.
 
7. Select the Run button.
 

 

TIP:
Description of “outputFieldMappings”
"outputFieldMappings" are settings that map data processed by the Cognitive Search service to specific fields in the index.
For example, the path "/document/content/pages/*/keyphrases/*" takes the key phrases extracted from each page and maps them to the "keyphrases" field.
Similarly, for the "pages" field that we created earlier, we need to specify what data will be mapped to it. In this tutorial, we selected the Enrichment granularity level of Pages (5000 characters chunk), so we need to specify that the 5,000-character chunks from "/document/content/pages/*" are mapped to the "pages" field. Adding this JSON mapping lets us send 5,000-character chunks to OpenAI instead of entire documents.
 

2. Create an Azure OpenAI


Currently, access to the Azure OpenAI service is granted by request only. You can request access to the Azure OpenAI service by filling out the form at https://aka.ms/oai/access/.

1. Type 'azure openai' in the search bar and select Services - Azure OpenAI.
 
 
2. Select the + Create button.
 
 
3. Fill in Basics.
 
 
NOTE:
The Azure OpenAI resource is currently available only in limited regions.
If the Azure OpenAI resource is not available in your region, I recommend setting your location to East US.
 
4. Select a network security Type.
 

 
5. Select the Create button.
 
 
6. Deploy your Azure OpenAI model.
 
- Navigate to your Azure OpenAI, then Select the Go to Azure OpenAI Studio button.
 
 
- In Azure OpenAI Studio, select the Deployments button.
 
 
- Select the + Create new deployment button, then create the gpt-35-turbo and text-embedding-ada-002 models.

NOTE:
In this tutorial we will use the gpt-35-turbo and text-embedding-ada-002 models. I recommend using the same name for both the deployment name and the model name.
 
 
 

3. Set up the project and install the libraries


1. Create a folder where you can work.

- We will create an `azure-proj` folder inside your user folder, create a `gpt-proj1` folder inside it, and work inside `gpt-proj1`.
- Open a command prompt window and create a folder named `azure-proj` in the default path.
 
mkdir azure-proj

- Navigate to the `azure-proj` folder you just created.
 
cd azure-proj

- In the same way, create a `gpt-proj1` folder inside the `azure-proj` folder. Navigate to the `gpt-proj1` folder.
 
mkdir gpt-proj1
cd gpt-proj1
 
 
2. Create a virtual environment.

- Type the following command to create a virtual environment named `.venv`.
 
python -m venv .venv

- Once the virtual environment is created, type the following command to activate the virtual environment.
 
.venv\Scripts\activate.bat

- Once activated, the name of the virtual environment will appear on the far left of the command prompt window.
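- If you are working on macOS or Linux instead of Windows, activate the virtual environment with the following command instead.
 
source .venv/bin/activate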
 
 
3. Install the required packages.
- At the Command prompt, type the following command.
 
pip install openai==1.14.1
pip install langchain==0.1.11
pip install langchain_openai==0.0.8
pip install faiss-cpu==1.8.0
 
TIP:
How to use CMD in VS Code
Select TERMINAL at the bottom of VS Code, then select the + button, then select the Command Prompt.
 

 4. Set up the project in VS Code


1. In VS Code, select the folder that you have created.

- Open VS Code and select File > Open Folder from the menu. Select the gpt-proj1 folder that you created earlier, which is located at C:\Users\yourUserName\azure-proj\gpt-proj1.
 
 
2. Create a new file.

- In the left pane of VS Code, right-click and select New File to create a new file named example.py.
 
 
3. Import the required packages.

- Type the following code in the example.py file in VS Code.
 
# Library imports
from collections import OrderedDict
import requests

# Lang Chain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS

4. Create a configuration file - config.py file.

NOTE:
Complete folder structure:
└── YourUserName
         └── azure-proj
            └── gpt-proj1
                  ├── example.py
                  └── config.py

- Create a config.py file. This file will contain information about your Azure resources.
- Add the code below to your config.py file.
 
# Azure AI Search service settings
SEARCH_SERVICE_NAME = 'your-search-service-name' # 'teachchatgpt-search'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = 'your-API-version' # '2023-10-01-Preview'

# Azure AI Search service index settings
SEARCH_SERVICE_INDEX_NAME1 = 'your-search-service-index-name' # 'teachchatgpt-index'

# Azure AI Search service semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'your-semantic-configuration-name' # 'teachchatgpt-config'

# Azure OpenAI settings
AZURE_OPENAI_NAME = 'your-openai-name' # 'teachchatgpt-azureopenai'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = 'your-API-version' # '2024-02-15-preview'
 
5. Fill in your config.py file with your Azure information.

NOTE:
You'll need to include information about your Azure AI Search Service name, index name, semantic configuration name, key, and API version, and Azure OpenAI name, key, and API version.

TIP:
Find your Azure information
1. Find the Azure AI Search Keys.
 
- Navigate to your AI Search service, then select Keys, then copy and paste your key into the config.py file.
 
 
2. Find the Azure AI Search Index name.
 
- Navigate to your AI Search service, then select Indexes, then copy and paste your index name into the config.py file.
 
 
3. Find the Azure AI Search Semantic configuration name.
- Navigate to your AI Search service, select Indexes, and then click your index name.
- Select Semantic configurations and copy and paste your Semantic configuration name into the config.py file.
 
 
4. Find the Azure OpenAI Keys.
 
- Navigate to your Azure OpenAI, then select Keys and Endpoint, then copy and paste your key into the config.py file
 
 
5. Choose your Azure AI Search API and Azure OpenAI version.
 
- Select your version of the Azure AI Search API and Azure OpenAI API using the hyperlinks below.
- I have selected the latest version of the Azure AI Search API, 2023-10-01-Preview, and the Azure OpenAI API, 2024-02-15-Preview.
 

5. Search with Azure AI Search

 

In this section, we'll use Azure AI Search from within VS Code. We have already installed all the necessary packages in the previous chapter; now we will focus on how to use Azure AI Search and Azure OpenAI in VS Code.
In Chapters 5 and 6, we'll create functions that use Azure AI Search and Azure OpenAI and use them in example.py.
To use Azure AI Search and Azure OpenAI, we need to import the Azure information that we entered in config.py into the example.py file we created earlier.
All of the following code goes into example.py.
The full code is provided at the end of the tutorial for your convenience.
 
1. Add code to example.py that imports the values from config.py.
 
# Configuration imports
from config import (
    SEARCH_SERVICE_ENDPOINT,
    SEARCH_SERVICE_KEY,
    SEARCH_SERVICE_API_VERSION,
    SEARCH_SERVICE_INDEX_NAME1,
    SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
    AZURE_OPENAI_ENDPOINT,
    AZURE_OPENAI_KEY,
    AZURE_OPENAI_API_VERSION,
)
 
2. Add the Azure AI Search Service header.
 
# Azure AI Search service header settings
HEADERS = {
    'Content-Type': 'application/json',
    'api-key': SEARCH_SERVICE_KEY
}

3. Now, we will create functions related to Azure AI Search and run them from the main function.

-  Add the following code to the example.py file to create the `search_documents` function that retrieves documents related to your question.
 

def search_documents(question):
    """Search documents using Azure AI Search."""
    # Construct the Azure AI Search service access URL.
    url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
                SEARCH_SERVICE_INDEX_NAME1 + '/docs')
    
    # Create a parameter dictionary.
    params = {
        'api-version': SEARCH_SERVICE_API_VERSION,
        'search': question,
        'select': '*',
        # Extract the top 5 documents from your storage.
        '$top': 5,
        'queryLanguage': 'en-us',
        'queryType': 'semantic',
        'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
        '$count': 'true',
        'speller': 'lexicon',
        'answers': 'extractive|count-3',
        'captions': 'extractive|highlight-false'
        }
    # Make a GET request to the Azure AI Search service and store the response in a variable.
    resp = requests.get(url, headers=HEADERS, params=params)
    # Return the JSON response containing the search results.
    search_results = resp.json()

    return search_results
 
- Add the following code to the example.py file to create the `filter_documents` function that extracts documents with a certain score or higher, based on how relevant they are to your question.
 
def filter_documents(search_results):
    """Filter documents with a reranker score above a certain threshold."""
    documents = OrderedDict()
    for result in search_results['value']:
        # The '@search.rerankerScore' range is 0 to 4.00, where a higher score indicates a stronger semantic match.
        if result['@search.rerankerScore'] > 0.8:
            documents[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }

    return documents
 
4. Now we'll run the code using the above functions in the main function.

Add the following code to the example.py file to create the `main` function.
- When you run it, you'll see the total number of PDFs in the blob storage, the top few documents adopted, and the number of chunks.
- I asked the question, 'Tell me about effective prompting strategies' based on the paper I had stored on the blob storage.
- If you want to see the full search results, add `print(search_results)` to your main function.

 

def main():

    QUESTION = 'Tell me about effective prompting strategies'

    # Search for documents with Azure AI Search

    search_results = search_documents(QUESTION)

    documents = filter_documents(search_results)

    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))


    # 'chunks' is the value that corresponds to the Pages field that you set up in the AI Search service.
    docs = []
    for key,value in documents.items():
        for page in value['chunks']:
            docs.append(Document(page_content = page,
                                metadata={"source": value["file_name"]}))
    # Print the number of 5,000-character chunks extracted from the top documents.
    print('Number of chunks: ', len(docs))

# execute the main function
if __name__ == "__main__":
    main()
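At this point you can already run the search half of the program from the terminal (with the virtual environment activated and example.py and config.py in the project folder):
 
python example.py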

6. Get answers from PDF content using Azure OpenAI and AI Search

 

Now that Azure AI Search is working well in VS Code, it's time to start using Azure OpenAI. In this chapter, we'll create functions related to Azure OpenAI and ultimately create and run a program in the example.py file that answers a question with Azure OpenAI based on the search information from Azure AI Search.


1. We will create functions related to Azure OpenAI and Lang Chain and run them from the main function.


Add the following code to the example.py file to create the `create_embeddings` function that creates an embedding model.

 

def create_embeddings():
    """Create an embedding model."""
    embeddings = AzureOpenAIEmbeddings(
        openai_api_key = AZURE_OPENAI_KEY,
        azure_endpoint = AZURE_OPENAI_ENDPOINT,
        openai_api_version = AZURE_OPENAI_API_VERSION,
        openai_api_type = 'azure',
        azure_deployment = 'text-embedding-ada-002',
        model = 'text-embedding-ada-002',
        chunk_size=1
    )
    return embeddings

 

Add the following code to the example.py file to create the `store_documents` function that stores the extracted documents in a vector database.

 


def store_documents(docs, embeddings):
    """Create vector store and store documents in the vector store."""
    return FAISS.from_documents(docs, embeddings)

 

-  Add the following code to the example.py file to create the `answer_with_langchain` function that extracts the chunk associated with a question stored in the vector database and answers the question based on it.

 

def answer_with_langchain(vector_store, question):
    """Search for documents related to your question from the vector store
    and answer question with search result using Lang Chain."""

    # Add a chat service.
    llm = AzureChatOpenAI(
        openai_api_key = AZURE_OPENAI_KEY,
        azure_endpoint = AZURE_OPENAI_ENDPOINT,
        openai_api_version= AZURE_OPENAI_API_VERSION,
        azure_deployment = 'gpt-35-turbo',
        temperature=0.0,
        max_tokens=500
    )

    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )

    answer = chain.invoke({'question': question})

    return answer


2. Modify the `main` function to match this code.

 
def main():

    QUESTION = 'Tell me about effective prompting strategies'

    # Search for documents with Azure AI Search.

    search_results = search_documents(QUESTION)

    documents = filter_documents(search_results)

    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))


    # 'chunks' is the value that corresponds to the Pages field that you set up in the AI Search service.
    docs = []
    for key,value in documents.items():
        for page in value['chunks']:
            docs.append(Document(page_content = page,
                                metadata={"source": value["file_name"]}))

    # Answer your question using Lang Chain.

    embeddings = create_embeddings()

    vector_store = store_documents(docs, embeddings)

    result = answer_with_langchain(vector_store, QUESTION)

    print('Question: ', QUESTION)
    print('Answer: ', result['answer'])
    print('Reference: ', result['sources'].replace(",","\n"))
 

3. Now let's run it and see if it answers your question.
- The result of executing the code.

Total Documents Found: 5, Top Documents: 3
Number of chunks: 10
Question: Tell me about effective prompting strategies
Answer: Effective prompting strategies for improving the reliability of GPT-3 include
establishing simple prompts that improve GPT-3's reliability in terms of generalizability,
social biases, calibration, and factuality. These strategies include prompting with
randomly sampled examples from the source domain, using examples sampled from
a balanced demographic distribution and natural language intervention to reduce
social biases, calibrating output probabilities, and updating the LLM's factual
knowledge and reasoning chains. Natural language intervention can also effectively
guide model predictions towards better fairness.
Reference: Prompting GPT-3 To Be Reliable.pdf

 

NOTE: Full code for example.py and config.py

 
This section provides, for your convenience, the full code used in the tutorial. It is separate from the step-by-step walkthrough above.
 
1. config.py
 
# Azure AI Search service settings
SEARCH_SERVICE_NAME = 'teachchatgpt-search' # 'teachchatgpt-search'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = '2023-10-01-Preview' # '2023-10-01-Preview'

# Azure AI Search service index settings
SEARCH_SERVICE_INDEX_NAME1 = 'teachchatgpt-index' # 'teachchatgpt-index'

# Azure AI Search service semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'teachchatgpt-config' # 'teachchatgpt-config'

# Azure OpenAI settings
AZURE_OPENAI_NAME = 'teachchatgpt-azureopenai' # 'teachchatgpt-azureopenai'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = '2024-02-15-preview' # '2024-02-15-preview'
 
 
2. example.py
 
# Library imports
from collections import OrderedDict
import requests

# Lang Chain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS

# Configuration imports
from config import (
    SEARCH_SERVICE_ENDPOINT,
    SEARCH_SERVICE_KEY,
    SEARCH_SERVICE_API_VERSION,
    SEARCH_SERVICE_INDEX_NAME1,
    SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
    AZURE_OPENAI_ENDPOINT,
    AZURE_OPENAI_KEY,
    AZURE_OPENAI_API_VERSION,
)

# Azure AI Search service header settings
HEADERS = {
    'Content-Type': 'application/json',
    'api-key': SEARCH_SERVICE_KEY
}

def search_documents(question):
    """Search documents using Azure AI Search."""
    # Construct the Azure AI Search service access URL.
    url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
                SEARCH_SERVICE_INDEX_NAME1 + '/docs')
    
    # Create a parameter dictionary.
    params = {
        'api-version': SEARCH_SERVICE_API_VERSION,
        'search': question,
        'select': '*',
        # Extract the top 5 documents from your storage.
        '$top': 5,
        'queryLanguage': 'en-us',
        'queryType': 'semantic',
        'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
        '$count': 'true',
        'speller': 'lexicon',
        'answers': 'extractive|count-3',
        'captions': 'extractive|highlight-false'
        }
    # Make a GET request to the Azure AI Search service and store the response in a variable.
    resp = requests.get(url, headers=HEADERS, params=params)
    # Return the JSON response containing the search results.
    search_results = resp.json()

    return search_results

def filter_documents(search_results):
    """Filter documents with a reranker score above a certain threshold."""
    documents = OrderedDict()
    for result in search_results['value']:
        # The '@search.rerankerScore' range is 0 to 4.00, where a higher score indicates a stronger semantic match.
        if result['@search.rerankerScore'] > 0.8:
            documents[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }

    return documents

def create_embeddings():
    """Create an embedding model."""
    embeddings = AzureOpenAIEmbeddings(
        openai_api_key = AZURE_OPENAI_KEY,
        azure_endpoint = AZURE_OPENAI_ENDPOINT,
        openai_api_version = AZURE_OPENAI_API_VERSION,
        openai_api_type = 'azure',
        azure_deployment = 'text-embedding-ada-002',
        model = 'text-embedding-ada-002',
        chunk_size=1
    )
    return embeddings

def store_documents(docs, embeddings):
    """Create vector store and store documents in the vector store."""
    return FAISS.from_documents(docs, embeddings)

def answer_with_langchain(vector_store, question):
    """Search for documents related to your question from the vector store
    and answer question with search result using Lang Chain."""

    # Add a chat service.
    llm = AzureChatOpenAI(
        openai_api_key = AZURE_OPENAI_KEY,
        azure_endpoint = AZURE_OPENAI_ENDPOINT,
        openai_api_version= AZURE_OPENAI_API_VERSION,
        azure_deployment = 'gpt-35-turbo',
        temperature=0.0,
        max_tokens=500
    )

    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )

    answer = chain.invoke({'question': question})

    return answer

def main():
    QUESTION = 'Tell me about effective prompting strategies'

    # Search for documents with Azure AI Search.

    search_results = search_documents(QUESTION)

    documents = filter_documents(search_results)

    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))

    
    # 'chunks' is the value that corresponds to the Pages field that you set up in the AI Search service.
    docs = []
    for key,value in documents.items():
        for page in value['chunks']:
            docs.append(Document(page_content = page,
                                metadata={"source": value["file_name"]}))
    # Print the number of 5,000-character chunks extracted from the top documents.
    print('Number of chunks: ', len(docs))

    # Answer your question using Lang Chain.

    embeddings = create_embeddings()

    vector_store = store_documents(docs, embeddings)

    result = answer_with_langchain(vector_store, QUESTION)

    print('Question: ', QUESTION)
    print('Answer: ', result['answer'])
    print('Reference: ', result['sources'].replace(",","\n"))

# Execute the main function.
if __name__ == "__main__":
    main()

Congratulations!

You've completed this tutorial and learned how to integrate Azure AI Search with Azure OpenAI.
 

In this tutorial, we have navigated through a practical journey of integrating Azure Blob Storage, Azure AI Search, and Azure OpenAI to create a powerful search and response mechanism.

1. Storing Data in Azure Blob Storage

Our first step was to store PDF files efficiently in Azure Blob Storage, a scalable and secure store for unstructured data. This storage served as the foundation, housing the search material that would later be indexed and queried to retrieve relevant information.

2. Implementing Azure AI Search

Next, we used Azure AI Search to index and search the data we had stored in Azure Blob Storage.

3. Integrating Azure OpenAI with VS Code

The final step of our tutorial was to integrate Azure OpenAI through a program created in VS Code. This program uses the search information processed and refined by Azure AI Search to generate accurate and contextually relevant answers. The synergy between these technologies illustrates the seamless interplay of storage, search, and response.

I hope that the knowledge and skills imparted here will serve as valuable tools in your future projects. The integration of Azure Blob Storage, Azure AI Search, and Azure OpenAI is a powerful approach to managing and using unstructured data.

Thank you for your commitment and hard work throughout this learning journey.
 
 
 

Clean Up Azure Resources

Clean up your Azure resources to avoid additional charges to your account. Go to the Azure portal and delete the following resources (or delete the entire resource group from the CLI, as shown after the list):
 
- The Azure AI Search resource
- The Azure Storage resource
- The Azure OpenAI resource
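If you created everything in a single resource group, as in this tutorial, you can alternatively delete it all at once from the Azure CLI. A minimal sketch, assuming the hypothetical resource group name `teachchatgpt-rg`:
 
az group delete --name teachchatgpt-rg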

Next Steps


Documentation


Training Content

Updated Mar 26, 2024
Version 11.0
  • AmitNilajkar (Copper Contributor)

    Thank you. What options do I have to control which PDF data the chat can access? For example, I don't want someone to be able to query sensitive data within the organisation that is meant only for specific users or departments. Also, for certain PDFs I want to chat on the fly without storing them in the blob. Any suggestions appreciated.

  • Minseok_Song (Iron Contributor)

    AmitNilajkar  Thank you for your question.

     

    In this answer, I'm going to talk about how to keep sensitive data private and how to chat with a PDF without saving it to storage.

     

    1. How to keep sensitive data private

     

    Cognitive Search's PII feature allows you to keep sensitive data private.

     

    The following article describes the PII features of Cognitive Search. You may find this article useful.

    Personally Identifiable Information (PII) Detection cognitive skill

     

    1. Create a new Cognitive Search Service.


    2. Select +Import data.


    3. Navigate to Add Cognitive Skills 

     

    You can refer to Series 1, step 5 (Add Cognitive Skills) in this post.

     

    - Select Extract Key phrases and Extract Personally identifiable information.

     Selecting Extract personally identifiable information allows you to detect sensitive organizational and individual data and make it private.

     

    4.  Change the index settings as shown below.

     

     

    5. Navigate to the Indexers page and select the indexer you created.

     

     

    6. In the indexer's JSON, modify the outputFieldMappings as shown below.

     

     

     

    "outputFieldMappings": [
            {
          "sourceFieldName": "/document/content/pii_entities",
          "targetFieldName": "pii_entities"
        },
        {
          "sourceFieldName": "/document/content/masked_text",
          "targetFieldName": "masked_text"
        },
       {
        "sourceFieldName": "/document/content/pages/*/keyphrases/*",
        "targetFieldName": "keyphrases"
      },
      {
        "sourceFieldName": "/document/content/pages/*",
        "targetFieldName": "pages"
      }
    ]

     

     

     

     

    The rest of the steps are exactly the same.

     

    Below is an example of a sentence that would appear if sensitive data were private.

    "Microsoft employee with ssn *********** is using our awesome API's."

     

    2. How to chat with a PDF without saving it to storage

     

    1. If you are currently using ChatGPT's Plus plan, you can attach a PDF directly in ChatGPT's Advanced Data Analysis mode.

     

    2. If not, you can also use private services that people have created.

     

    Is this the answer you were looking for? 

     

    Please leave a comment if you have any questions. Thank you.

     

  • maxmtl (Copper Contributor)

    Thank you for your tutorial. I have a few questions I hope you can help me with. I did all the steps and created the Python files. Once I run example.py, it pauses for a few seconds and then returns to the prompt line (see image). I uploaded a PDF with questions and answers about our business services and want to be able to ask Azure ChatGPT questions that use the data in the PDF. Can I add more than one PDF document?

    I don't understand how to interact with the PDF once I have done all these steps. Do I create a chatbox? Do I train the data?

     

     

    Thanks for your help.

     

    Max

     

  • Minseok_Song (Iron Contributor)

    Thank you for sharing your issue maxmtl.

    I am surprised to hear that running the code does not produce any results. First, I'd like to know if this is only happening in this tutorial code, or if it's happening in all Python code. If it's not just  this tutorial code, but other code as well, then it's most likely a local environment problem. As an experiment, please create another Python file and run print("helloWorld") as shown below and share your results.


    And I need some time to organize the answers to your questions.
    Your question is something I thought about a lot when I first designed this tutorial, so I'll try to organize it and get back to you in a few days. Thanks again.

  • prakashpatidar (Copper Contributor)

    Thanks for making a detailed doc. While creating an app using this reference,

    I am facing the below exception while executing the code:

    def store_documents(docs, embeddings):
        """Create vector store and store documents in the vector store"""
        return FAISS.from_documents(docs, embeddings)

     

    Exception has occurred: NotFoundError
    Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
    httpx.HTTPStatusError: Client error '404 Resource Not Found' for url 'https://azureai8febeus.openai.azure.com/embeddings'
    For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
    During handling of the above exception, another exception occurred:
    File "/Users/prakash.patidar/ls/example2.py", line 83, in store_documents
        return FAISS.from_documents(docs, embeddings)
    File "/Users/prakash.patidar/ls/example2.py", line 133, in main
        vector_store = store_documents(docs, embeddings)
    File "/Users/prakash.patidar/ls/example2.py", line 143, in <module>
        main()
    openai.NotFoundError: Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
     
    I have deployed the AI instance in East US and created the below deployments:

    text-embedding-ada-002

    gpt-35-turbo

    Keys, deployment names, etc. look good, but I'm still getting the resource not found error. I have checked with both API versions, the one in the value and the one in the comment:

    AZURE_OPENAI_API_VERSION = '2023-12-01-preview' # '2023-08-01-preview'
    Any help will be highly appreciated.
     
    pip list in my case :
    Package Version
    ------------------- ----------
    aiohttp 3.9.3
    aiosignal 1.3.1
    annotated-types 0.6.0
    anyio 4.2.0
    async-timeout 4.0.3
    attrs 23.2.0
    certifi 2024.2.2
    charset-normalizer 3.3.2
    dataclasses-json 0.6.4
    distro 1.9.0
    exceptiongroup 1.2.0
    faiss-cpu 1.7.4
    frozenlist 1.4.1
    h11 0.14.0
    httpcore 1.0.3
    httpx 0.26.0
    idna 3.6
    jsonpatch 1.33
    jsonpointer 2.4
    langchain 0.1.7
    langchain-community 0.0.20
    langchain-core 0.1.23
    langsmith 0.0.87
    marshmallow 3.20.2
    multidict 6.0.5
    mypy-extensions 1.0.0
    numpy 1.26.4
    openai 1.12.0
    packaging 23.2
    pip 22.2.1
    pydantic 2.6.1
    pydantic_core 2.16.2
    PyYAML 6.0.1
    regex 2023.12.25
    requests 2.31.0
    setuptools 63.2.0
    sniffio 1.3.0
    SQLAlchemy 2.0.27
    tenacity 8.2.3
    tiktoken 0.6.0
    tqdm 4.66.2
    typing_extensions 4.9.0
    typing-inspect 0.9.0
    urllib3 2.2.0
    yarl 1.9.4
  • Minseok_Song (Iron Contributor)

    prakashpatidar Thank you for sharing your issue.
    I reinstalled the packages and tried to run it and found that I got the same error. I'm currently looking into a version issue with this tutorial and working to resolve it. It looks like the problem you're experiencing is also a package version issue, which I'll get to the bottom of and improve the code for the new version.

  • Minseok_Song (Iron Contributor)

    This blog post has been updated to reflect the latest versions of langchain and openai as of March 11, 2024.

  • learning111 (Copper Contributor)

    Thanks for the tutorial. Could you please tell me how to get a response based on the previous conversation?
    For example, after getting a response from the chat model, when I ask the next question, "Could you please summarize your previous response", I get the error "IndexError: list index out of range". I am storing my conversation history in a Cosmos DB.

  • Minseok_Song (Iron Contributor)

    Thank you for contacting me with your problem  learning111.

     

    However, the situation you've described is different from the context of my post, making it difficult for me to provide a precise solution.

     

    Thank you for your understanding.