Event details
Retrieval-augmented generation (RAG) lets you build GenAI applications that use your own data to optimize LLM performance.
Join our AMA to ask us about RAG, vector databases, running RAG...
EricStarker
Updated Feb 14, 2024
Steve Jones
Feb 14, 2024
Copper Contributor
How would we get started with a public set of data, say on a public website, as opposed to building something private or semi-private with data for authorized users (internal or customers)?
- fsunavala-msft, Feb 14, 2024
Microsoft
Here's a high-level flow you can follow to get started:
- Identify the public data source: Identify the public website or dataset that you intend to use. Ensure the data is publicly accessible and adheres to legal and ethical guidelines regarding its use.
- Data Acquisition: If the data is on a website, you might need to use web scraping techniques to extract the data. Alternatively, if the website offers an API, you can use it to fetch the data more efficiently in a structured format. Additionally, if you are using a dataset from a catalog like Hugging Face, they have clear instructions for downloading the data.
- Prepare the data: It’s good practice to clean and preprocess the data so it is in a suitable format for indexing. This usually includes removing unnecessary information, converting data formats, cleaning up whitespace, etc. Then you’d want to define a schema for your Azure AI Search index that matches the structure of your data. This includes specifying fields and their data types, as well as configuring any searchable, filterable, sortable, or facetable attributes.
- Indexing the Data: You can use the Azure Portal, Azure CLI, or client SDKs to create a new search index based on the schema you defined. You can ingest data into Azure AI Search in two ways: push it through the Push API, or pull it with an indexer. See "Data import and data ingestion - Azure AI Search | Microsoft Learn".
- Query Your Index: You can now begin to create search experiences by querying your AI Search index. You can run simple keyword full-text search queries, vector search queries, or hybrid search queries. See "Search Documents - Azure AI Search | Microsoft Learn".
- Access Control and Privacy for Public Data: Since the data is public, you might not need stringent access controls, but if you do, you can implement role-based access control (RBAC) and use security filters to trim results. See "Security overview - Azure AI Search | Microsoft Learn" and "Security filters to trim results using Microsoft Entra ID - Azure AI Search | Microsoft Learn".
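To make the prepare, index, and query steps above a bit more concrete, here is a minimal Python sketch that uses only the standard library. It is not an official sample: the index name `public-site-index`, the field names (`id`, `url`, `content`, `group_ids`), and all helper functions are hypothetical, and the dictionaries follow my understanding of the JSON bodies the Azure AI Search REST API accepts, so please check them against the current docs before relying on them.

```python
import hashlib
import html
import json
import re
from typing import List, Optional


def clean_text(raw_html: str) -> str:
    """Strip tags and collapse whitespace (a crude stand-in for a real HTML parser)."""
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def make_document(url: str, raw_html: str, group_ids: Optional[List[str]] = None) -> dict:
    """Build one document; the key is a hash of the URL so it only contains hex characters."""
    return {
        "id": hashlib.sha256(url.encode("utf-8")).hexdigest(),
        "url": url,
        "content": clean_text(raw_html),
        "group_ids": list(group_ids or []),
    }


# Hypothetical index definition, shaped like a Create Index request body.
INDEX_SCHEMA = {
    "name": "public-site-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "url", "type": "Edm.String", "searchable": False, "filterable": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        # Only needed if you later trim results with a security filter.
        {"name": "group_ids", "type": "Collection(Edm.String)", "filterable": True},
    ],
}


def make_upload_batch(docs: List[dict]) -> dict:
    """Wrap documents in a push-style batch: one upload action per document."""
    return {"value": [{"@search.action": "upload", **doc} for doc in docs]}


def make_query(text: str, allowed_groups: Optional[List[str]] = None) -> dict:
    """Build a keyword query body, optionally trimmed by an OData security filter."""
    body = {"search": text, "top": 5, "select": "url,content"}
    if allowed_groups:
        # Assumes group names contain no quotes, commas, or spaces.
        joined = ",".join(allowed_groups)
        body["filter"] = f"group_ids/any(g: search.in(g, '{joined}'))"
    return body


if __name__ == "__main__":
    page = "<html><body><h1>Hello &amp; welcome</h1><p>RAG   demo</p></body></html>"
    doc = make_document("https://example.com/docs/intro", page)
    print(json.dumps(INDEX_SCHEMA, indent=2))
    print(json.dumps(make_upload_batch([doc]), indent=2))
    print(json.dumps(make_query("welcome"), indent=2))
```

You would send these dictionaries as JSON to your search service (or build the equivalent objects with one of the client SDKs). For a purely public dataset you can drop the `group_ids` field and call `make_query` without groups; the filter only matters if some content should be visible to specific users.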
Hope this helps!