The arrival of generative AI has accelerated the move of more enterprises toward AI. Data is the key to helping an LLM understand an enterprise's business and its different scenarios, and RAG is currently the most common way to bring that data in. Because LLMs and SLMs have limited context windows, text data has to be chunked, and keeping the context coherent across chunks is a technical challenge. The Microsoft Phi-3.5 family supports a 128k-token context, has image analysis capabilities, and has strong long-context understanding. This blog tries using the Microsoft Phi-3.5 family as the core of text chunking to improve the effectiveness and coherence of the chunked content.
Starting from the text content
First we need to understand the structure of the text. Daily news, contracts, and papers generally contain three kinds of content: text, images, and tables. Together they let readers understand the document. How do we extract them? From a technical perspective, we can combine Python with different AI technologies. pypdf can be used for text and images, and Azure Document Intelligence can be used for tables to obtain more accurate content.
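As a minimal sketch of the text-and-image step, the helper below walks a PDF reader page by page. It assumes a recent pypdf release in which each page exposes `extract_text()` and an `images` collection whose items carry `name` and `data`. `PageContent` and `extract_text_and_images` are illustrative names, not part of the sample code, and table extraction with Azure Document Intelligence is not covered here.

```python
from dataclasses import dataclass, field


@dataclass
class PageContent:
    """Text and embedded images found on one page (hypothetical container)."""
    page_number: int
    text: str
    images: list = field(default_factory=list)  # list of (name, raw bytes)


def extract_text_and_images(reader):
    """Collect text and images per page.

    `reader` is expected to be a pypdf.PdfReader; any object exposing
    `.pages`, `page.extract_text()` and `page.images` behaves the same way.
    """
    pages = []
    for number, page in enumerate(reader.pages, start=1):
        # extract_text() can return an empty result for scanned pages.
        text = (page.extract_text() or "").strip()
        images = [(image.name, image.data) for image in page.images]
        pages.append(PageContent(number, text, images))
    return pages
```

With pypdf installed, this could be called as `extract_text_and_images(PdfReader("report.pdf"))` after `from pypdf import PdfReader`; the file name is a placeholder.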
The role of Microsoft Phi-3.5
We have split the document into three parts: text, images, and tables. Next we use Microsoft Phi-3.5 to understand each of them.
- Text - We need to divide the text into knowledge units, which is more conducive to information retrieval. Microsoft Phi-3.5-mini-instruct acts as the text reader, summarizing and dividing the information points. The prompt is as follows:
You are an expert in content chunking. Please help me chunk user's input text according to the following requirements
1. Truncate the text content into chunks of no more than 300 tokens.
2. Each chunk part should maintain contextual coherence. The truncated content should be retained in its entirety without any additions or modifications.
3. Each chunked part is output JSON format { "chunking": "..." }
4. The final output is a JSON array [{ "chunking" : "..." },{ "chunking" :"..."},{ "chunking" : "..."} ....]
- Images - Images are presented together with the text. We can use Microsoft Phi-3.5-vision to understand the content of each image in the document. The prompt is as follows:
You are my analysis assistant, help me analyze charts, flowchart, etc. according to the following conditions
1. If it is a chart, please analyze according to the data in the chart and tell me the different details
2. If it is a flowchart, please analyze all the situations in detail according to the flow and describe all process in details, do NOT simplify. Use bullet lists with indentation to describe the process
3. The output is json {"chunking":"......"}
4. If it is not a chart or flowchart (more than single node), it does not need to be analyzed, the output is json {"chunking":"NIL"}
- Table - Tables are important content. With Microsoft Phi-3.5-mini-instruct, we can analyze the data in each table and find its trends. The prompt is as follows:
You are my markdown table assistant, who can understand all the contents of the table and give analysis.
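The table prompt above expects Markdown, so extracted tables have to be rendered as Markdown before they are sent to the model. The sketch below assumes each table has already been extracted into `(row_index, column_index, content)` cells with known row and column counts, and that the first row is the header. `cells_to_markdown` is a hypothetical helper, not part of the sample code.

```python
def cells_to_markdown(row_count, column_count, cells):
    """Render table cells as a Markdown table.

    cells: iterable of (row_index, column_index, content) tuples.
    The first row is treated as the header; missing cells stay empty.
    """
    if row_count == 0 or column_count == 0:
        return ""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for row, col, content in cells:
        # Pipes and newlines would break a Markdown row, so neutralise them.
        grid[row][col] = content.replace("|", "\\|").replace("\n", " ").strip()

    def render(row):
        return "| " + " | ".join(row) + " |"

    lines = [render(grid[0]), render(["---"] * column_count)]
    lines.extend(render(row) for row in grid[1:])
    return "\n".join(lines)
```

The resulting string can be passed as the user message next to the table prompt.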
After processing these three parts, we integrate the results, combine the output JSON, convert it to embeddings, and save it in the vector database. This completes the vectorized storage of the text content.
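The integration step could look roughly like the sketch below. It assumes all three kinds of model output have been parsed into lists of `{"chunking": "..."}` objects; the table prompt does not specify this shape, so wrapping table analyses the same way is an assumption. The record fields (`id`, `source`, `kind`, `content`) and the name `build_records` are illustrative, not the schema used in the sample code. Embedding and upload are left out.

```python
import hashlib


def build_records(source, text_chunks, image_results, table_results):
    """Merge the three kinds of model output into flat records.

    Each list holds dicts shaped like {"chunking": "..."}. Empty entries
    and the "NIL" marker from the image prompt are dropped.
    """
    records = []
    groups = (("text", text_chunks), ("image", image_results), ("table", table_results))
    for kind, items in groups:
        for item in items:
            content = str(item.get("chunking", "")).strip()
            if not content or content == "NIL":
                continue
            # A content hash gives a stable id, so re-running does not duplicate records.
            digest = hashlib.sha256(f"{source}|{kind}|{content}".encode("utf-8")).hexdigest()
            records.append({"id": digest[:16], "source": source, "kind": kind, "content": content})
    return records
```

Each record's `content` would then be passed to an embedding model, and the vector stored alongside the other fields.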
Solution
We use Prompt flow combined with GitHub Models and Azure AI Search to complete the text chunking. As shown in the figure:
Note
- The output needs to be JSON, but Phi-3.5-mini-instruct sometimes produces malformed output, so it is checked through check_json.
- Prompts need to be adjusted for different kinds of documents; otherwise the model cannot understand the document content accurately.
- This solution uses Azure AI Search for vector storage, but it can be swapped for a different vector database.
- GitHub Models lets us call models more effectively during development and verification, but switching to Azure or locally deployed models is recommended for production.
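To illustrate the kind of check the first note describes, here is one possible shape for such a validator. It is a sketch, not the `check_json` from the sample code. It assumes the model may wrap its answer in a Markdown code fence or add prose around it, and it accepts both the array produced by the text prompt and the single object produced by the image prompt.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the listing readable


def check_json(raw):
    """Return a list of {"chunking": str} dicts, or None if `raw` is unusable."""
    # Unwrap a fenced block if the model wrapped its answer in one.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    # Keep only the outermost JSON array or object, ignoring prose around it.
    starts = [i for i in (candidate.find("["), candidate.find("{")) if i != -1]
    start = min(starts, default=-1)
    end = max(candidate.rfind("]"), candidate.rfind("}"))
    if start == -1 or end < start:
        return None
    try:
        data = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict):
        data = [data]  # the image prompt returns a single object
    if not isinstance(data, list):
        return None
    if all(isinstance(d, dict) and isinstance(d.get("chunking"), str) for d in data):
        return data
    return None
```

When the function returns `None`, the caller can retry the model call or skip the input.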
We can verify the relevant results through the Chat-flow.
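Alongside checking answers interactively, the chunks themselves can be tested against the two requirements in the chunking prompt: each chunk should appear verbatim in the source, and each should stay within the 300-token limit. The sketch below is one way to do that, under stated assumptions. It counts whitespace-separated words as a rough stand-in for model tokens, and it ignores whitespace differences when matching. `verify_chunks` is a hypothetical helper.

```python
def verify_chunks(source_text, chunks, max_tokens=300):
    """Return (index, problem) pairs; an empty list means every check passed."""

    def normalise(text):
        # Collapse runs of whitespace so line breaks do not cause false alarms.
        return " ".join(text.split())

    haystack = normalise(source_text)
    problems = []
    for index, chunk in enumerate(chunks):
        needle = normalise(chunk)
        if needle not in haystack:
            problems.append((index, "not found verbatim in source"))
        # Word count only approximates the model's tokenizer.
        if len(needle.split()) > max_tokens:
            problems.append((index, "exceeds token budget"))
    return problems
```

Chunks flagged here are candidates for re-chunking or for a prompt adjustment.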
Download the sample code - Click here
Summary
How to chunk text more efficiently is an open research question, and SLMs may be one way to address it. However, more work is needed to handle the diversity of texts, and this attempt still has a lot of room for improvement.
Resources
- Learn more about Microsoft Phi-3.5 https://huggingface.co/microsoft/Phi-3.5-mini-instruct
- Read Microsoft Phi-3 Cookbook https://aka.ms/phi-3cookbook
- GitHub Models https://gh.io/models
- Learn more about Microsoft Prompt flow https://microsoft.github.io/promptflow/
- Learn more about Azure AI Search https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search