We have a use case to extract information from various types of documents, such as Excel, PDF, and Word files, and convert it into structured information. The data exists in different formats. We started building this use case with AI Builder, hit a roadblock, and are now exploring options with Copilot Studio. It would be great if someone could point us in the right direction. What technology stack should we consider for this use case? Thank you for any pointers.

For extracting structured data from Excel, PDF, and Word files, consider Azure Form Recognizer (now Azure AI Document Intelligence), Power Automate, and Copilot Studio for automation. If AI Builder falls short, use Azure Cognitive Services or Python (pandas, PyPDF2, openpyxl) for more control. Storing the data in Dataverse or Synapse can help with structuring.
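To give a feel for the "Python for more control" route, here is a minimal sketch for the Word case only. It is an assumption-heavy illustration: it uses only the standard library (not the pandas, PyPDF2, or openpyxl packages named in the reply), relies on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`, and ignores headers, footnotes, and table structure. The helper `build_sample_archive` is hypothetical and only produces a stripped-down archive for demonstration, not a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used inside word/document.xml
W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_NS = "{" + W_URI + "}"


def docx_paragraphs(source):
    """Return the non-empty paragraphs of a .docx file as plain strings.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def build_sample_archive(lines):
    """Hypothetical helper: build an in-memory archive with only the document body."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(line)}</w:t></w:r></w:p>" for line in lines
    )
    xml = f'<w:document xmlns:w="{W_URI}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


sample = build_sample_archive(["Final diameter: 12.5 mm", "", "Material: steel"])
paragraphs = docx_paragraphs(sample)
```

The resulting list of paragraph strings is the kind of free text that a later entity-extraction step would work on. PDF and Excel inputs would need their own extractors.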

Have you tried creating a custom Named Entity Recognition model from Azure AI Language service?

Using AI to convert unstructured information to structured information

18 Replies

jenapravat
Copper Contributor
Feb 13, 2025
Extracting insights from documents is key to analytical as well as research (digital) products. Information extraction can be broadly divided into two tasks: i. text extraction from the document, and ii. entity extraction from the free text.
For extracting insights from PDFs, images, and similar files, we can use the Azure AI Document Intelligence service, as it is powerful at extracting all the text (OCR capability) as well as extracting specific entities according to business needs. However, if you only need to extract entities from free text, Named Entity Recognition under Azure AI > Language service will be very helpful.
Which service fits better depends on the business use case.
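The two-task split described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `extract_text` is a stand-in for an OCR or document service (here it just joins pre-extracted page strings), and a regular expression stands in for a real Named Entity Recognition model. The `Measurement` type, the pattern, and the unit list are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Measurement:
    label: str
    value: float
    unit: str


# Matches "label: number unit" or "label = number unit" on a single line.
# The unit list is an assumption made for this sketch.
PATTERN = re.compile(
    r"(?P<label>[A-Za-z ]+?)\s*[:=]\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>mm|cm|m|kg)\b"
)


def extract_text(document):
    """Task i (stand-in): return the document's free text.

    A real pipeline would call an OCR or document service here.
    """
    return "\n".join(document["pages"])


def extract_entities(text):
    """Task ii (stand-in): pull labelled measurements out of free text."""
    return [
        Measurement(m["label"].strip().lower(), float(m["value"]), m["unit"])
        for m in PATTERN.finditer(text)
    ]


doc = {"pages": ["Final diameter: 12.5 mm", "Weight = 3 kg"]}
entities = extract_entities(extract_text(doc))
```

Keeping the two stages as separate functions means either one can be swapped for a managed service without touching the other.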
- Rahul1202
  Copper Contributor
  Feb 13, 2025
  Thank you for the pointer, Jenapravat. In the experiments we have run so far, the Azure AI Document Intelligence service yields great results. However, our initial observation is that it requires manual mapping for each document format to extract the text accurately. Ideally, we are looking for a model that can recognize equivalent terms across the various document types and extract the correct values without manual mapping. For example, a diameter in one document could be called the final diameter in another. In other words, the model should be able to train itself. Do you have experience with this, and could you guide us?
  - ml4u
    Brass Contributor
    Apr 18, 2025
    To address the challenge of extracting accurate information without manual mapping, consider using a combination of pre-trained models and custom fine-tuning. Pre-trained models can provide a good starting point, and fine-tuning them with your specific data can improve accuracy. Additionally, exploring techniques like transfer learning and embedding models can help the model learn semantic relationships between terms across different document types. This approach can reduce the need for manual mapping and improve the model's ability to generalize across various formats.
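The term-mapping problem raised in this thread ("diameter" in one document, "final diameter" in another) can be sketched as a lookup against a canonical schema. This is a simplified stand-in, not the embedding approach itself: it uses a hand-written alias table plus string similarity from Python's `difflib`, which tolerates typos and small wording differences but does not understand meaning. The field names, aliases, and the `cutoff` value are assumptions for illustration. An embedding model would replace the similarity step if truly different wordings need to match.

```python
import difflib

# Hypothetical canonical schema: each target field lists the labels seen so far.
CANONICAL_FIELDS = {
    "diameter": ["diameter", "final diameter", "dia", "outer diameter"],
    "length": ["length", "overall length"],
}

# Reverse index: alias -> canonical field name.
ALIAS_TO_FIELD = {
    alias: field
    for field, aliases in CANONICAL_FIELDS.items()
    for alias in aliases
}


def normalize(label):
    """Lowercase, replace underscores, and collapse whitespace."""
    return " ".join(label.lower().replace("_", " ").split())


def canonical_field(label, cutoff=0.8):
    """Map a label found in a document to a canonical field, or None.

    Tries an exact alias match first, then the closest alias by
    string similarity if it scores at least `cutoff`.
    """
    key = normalize(label)
    if key in ALIAS_TO_FIELD:
        return ALIAS_TO_FIELD[key]
    close = difflib.get_close_matches(key, list(ALIAS_TO_FIELD), n=1, cutoff=cutoff)
    return ALIAS_TO_FIELD[close[0]] if close else None


exact = canonical_field("Final_Diameter")
fuzzy = canonical_field("Final diametr")
unknown = canonical_field("Supplier name")
```

Labels that come back as `None` are the ones worth reviewing by hand. Adding them to the alias table once is a lightweight way to reduce per-format mapping over time.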
JamespaulG-0359
MCT
Feb 05, 2025
You could apply 'classical' NLP techniques such as entity recognition. In Power Automate, AI Builder comes with out-of-the-box models, and you also have the option to train your own custom model. If you could show some examples, I would be happy to guide you. I have experience with similar situations, extracting structured data from free-format, unstructured text in invoice descriptions.
- Rahul1202
  Copper Contributor
  Feb 18, 2025
  Thank you for the pointer, James. The community does not allow exchanging email addresses. I would be happy to share some sample documents by email, but I am wondering how I can share them with you.
  - JamespaulG-0359
    MCT
    Feb 18, 2025
    Hi Rahul, you can send a private message, including attachments, by clicking 'Message' at the top right of the profile page.
