Forum Discussion

Rahul1202
Copper Contributor
Feb 04, 2025

Using AI to convert unstructured information to structured information

We have a use case to extract information from various types of documents, such as Excel, PDF, and Word files, and convert it into structured information. The data exists in different formats.

We started building this use case with AI Builder but hit a roadblock, and we are now exploring options with Copilot Studio.
It would be great if someone could point us in the right direction.
What technology stack should we consider for this use case?
Thank you for the pointer.

18 Replies

  • jenapravat
    Copper Contributor

    Extracting insights from documents is key to both analytical and research (digital) products. Information extraction can be broadly split into two tasks: (i) text extraction from the document and (ii) entity extraction from the free text.

    For extracting insights from PDFs, images, and similar files, we can use the Azure AI Document Intelligence service, as it is powerful at extracting all the text (OCR capability) as well as specific entities according to business needs. If you only need to extract entities from free text, however, Named Entity Recognition under Azure AI > Language service will be very helpful.

    Based on the business use case, we can decide which service is the better fit.
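
    To make the two-task split concrete, here is a minimal sketch in plain Python. It is not the Azure API: `extract_text` is a hypothetical stand-in for an OCR or layout service, and `extract_entities` uses an assumed regular expression in place of a trained entity recognition model. The `Entity` type, the pattern, and the unit list are all illustrative assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    label: str
    value: float
    unit: str


# Assumed pattern for lines such as "Diameter: 12.5 mm" or "Weight = 3 kg".
# The unit list is illustrative only.
MEASUREMENT = re.compile(
    r"(?P<label>[A-Za-z][A-Za-z ]*?)\s*[:=]\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mm|cm|m|kg)\b"
)


def extract_text(raw: bytes) -> str:
    """Task 1 stand-in: a real pipeline would call an OCR/layout service here."""
    return raw.decode("utf-8")


def extract_entities(text: str) -> list[Entity]:
    """Task 2 stand-in: rule-based extraction instead of a trained NER model."""
    return [
        Entity(m["label"].strip().lower(), float(m["value"]), m["unit"])
        for m in MEASUREMENT.finditer(text)
    ]


if __name__ == "__main__":
    document = b"Diameter: 12.5 mm\nWeight = 3 kg"
    for entity in extract_entities(extract_text(document)):
        print(entity)
```

    Keeping the two stages as separate functions means either one can later be swapped for a managed service without touching the other.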

    • Rahul1202
      Copper Contributor

      Thank you for the pointer, Jenapravat. From the experiments we have run so far, the Azure AI Document Intelligence service is yielding great results. However, our initial observation is that it requires manual mapping for each format to extract the text accurately. Ideally, we are looking for a model that can recognize similar terms across document types and extract the correct values without manual mapping. For example, a diameter in one document could be named the final diameter in another. In other words, the model should be able to train itself. Do you have experience with this, and could you guide us?

      • ml4u
        Brass Contributor

        To address the challenge of extracting accurate information without manual mapping, consider using a combination of pre-trained models and custom fine-tuning. Pre-trained models can provide a good starting point, and fine-tuning them with your specific data can improve accuracy. Additionally, exploring techniques like transfer learning and embedding models can help the model learn semantic relationships between terms across different document types. This approach can reduce the need for manual mapping and improve the model's ability to generalize across various formats.
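
        As a rough illustration of matching terms across documents without a hand-written mapping, the sketch below maps an extracted label such as "final diameter" onto a canonical field name. It deliberately uses character-level similarity from Python's standard `difflib` as a stand-in for embedding similarity, so it only catches labels that share wording. True synonyms would need an actual embedding model. The `CANONICAL_FIELDS` list, the `map_label` helper, and the 0.6 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical target schema; replace with the fields your use case needs.
CANONICAL_FIELDS = ["diameter", "length", "weight"]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for embedding cosine similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_label(label: str, threshold: float = 0.6) -> Optional[str]:
    """Return the closest canonical field, or None if nothing is similar enough."""
    best = max(CANONICAL_FIELDS, key=lambda field: similarity(label, field))
    return best if similarity(label, best) >= threshold else None


if __name__ == "__main__":
    for raw_label in ["Final Diameter", "diameter", "xyz"]:
        print(raw_label, "->", map_label(raw_label))
```

        Labels that fall below the threshold come back as None, which gives a natural place to route unknown terms to a human reviewer or to a stronger model instead of guessing.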

  • You could apply 'classical' NLP techniques like entity recognition. In Power Automate, AI Builder comes with out-of-the-box models, and you also have the option to train your own custom model. If you could show some examples, I would be happy to guide you. I have experience with similar situations, extracting structured data from free-format, unstructured text in invoice descriptions.

    • Rahul1202
      Copper Contributor

      Thank you for the pointer, James. The community does not allow exchanging email addresses. I would be happy to share some sample documents via email, but I am wondering how I can share them with you.

      • Hi Rahul, you can send a private message, with attachments as well, by clicking 'Message' at the top right of the profile page.

         
