Azure Cognitive Search is the Microsoft product for Knowledge Mining, the process of extracting information from unstructured or semi-structured data. However, extracting knowledge from file names or paths is not trivial. In this article you will learn how to do it using Azure Functions for Python, which reached general availability (GA) on August 19th, 2019.
A lack of metadata is a common scenario in companies everywhere. People usually don't have the time or discipline to add tags or comments to their documents, leaving the file name and its path as the only information available for searches. Knowledge Mining addresses this problem by creating metadata for you: Azure Cognitive Search uses AI to extract insights from the content of your files.
But look at the image below. From the file name and its path alone, we can already discover the name of the project, the year, the type of the file, and its title, among other things. It is very common to see dates, times, names, and locations in file names and paths.
Figure 1: Lack of metadata and valuable information in the file name and path
The Text Merge skill consolidates the text of a document's content with the OCR text of its images into a single annotation within your Azure Cognitive Search Enrichment Pipeline. A common scenario for using Text Merge is to merge the textual representation of images (text from an OCR skill, or the caption of an image) into the content field of a document. By doing this, you will optimize insights extraction, index storage, and user interface development.
But you can't use the Merge Skill for file names or paths because of the required offsets property. Offsets are one of the OCR Skill outputs, and you don't have this information for the physical properties of the documents.
That's why you need to leverage the flexibility of Custom Skills to merge properties like metadata_storage_name and metadata_storage_path with the document content. The code for this article uses the file name, but it would work equally well with the file's dates, path, author, etc. It will also work with the outputs of other skills, like image analysis.
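To make the idea concrete, here is a minimal sketch of the merge logic such a Custom Skill could run. It is not the code from the repo: the input names file_name and content, the output name merged_text, and the helper names merge_record and run_skill are all hypothetical, chosen only for illustration. The sketch assumes the standard custom Web API skill contract, where the request and the response both carry a "values" array of records identified by "recordId".

```python
import json


def merge_record(record: dict) -> dict:
    """Merge the file name and the content of one record into a single string.

    The field names "file_name", "content" and "merged_text" are
    hypothetical; they must match whatever your skillset maps.
    """
    record_id = record.get("recordId")
    data = record.get("data") or {}
    file_name = data.get("file_name")
    content = data.get("content")

    # Report a per-record error instead of failing the whole batch.
    if not isinstance(file_name, str) or not isinstance(content, str):
        return {
            "recordId": record_id,
            "data": {},
            "errors": [{"message": "file_name and content must both be strings"}],
        }

    merged = f"{file_name} {content}".strip()
    return {"recordId": record_id, "data": {"merged_text": merged}, "errors": None}


def run_skill(request_body: str) -> str:
    """Process a custom skill request body and return the response body."""
    payload = json.loads(request_body)
    results = [merge_record(record) for record in payload.get("values", [])]
    return json.dumps({"values": results})


if __name__ == "__main__":
    sample = json.dumps({
        "values": [
            {"recordId": "1",
             "data": {"file_name": "report_2019.pdf", "content": "Quarterly results"}}
        ]
    })
    print(run_skill(sample))
```

In an HTTP-triggered Azure Function for Python, the entry point would typically pass the request body to run_skill and return the result as a JSON response. Keeping the merge logic in plain functions like these makes it easy to test locally before deploying.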
The code of the suggested solution is available in this GitHub repo, and here are some important guidelines:
Figure 2: How to connect the Custom Skill to your Skillset
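As a rough illustration of that connection, the sketch below builds a custom Web API skill entry as a Python dictionary and prints it as the JSON you would place in the "skills" array of a skillset. The function URL is a placeholder, and the input, output, and target names (file_name, content, merged_text, merged_content) are hypothetical; they are assumptions for this example, not values taken from the repo. They only need to match the names your function reads and writes.

```python
import json


def build_custom_skill(function_url: str) -> dict:
    """Return a custom Web API skill definition for a skillset.

    All input/output names here are hypothetical placeholders.
    """
    return {
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "description": "Merges the file name with the document content",
        "uri": function_url,
        "httpMethod": "POST",
        "timeout": "PT30S",
        "batchSize": 1,
        "context": "/document",
        "inputs": [
            # Blob indexers expose the file name as metadata_storage_name.
            {"name": "file_name", "source": "/document/metadata_storage_name"},
            {"name": "content", "source": "/document/content"},
        ],
        "outputs": [
            # targetName is the node later skills or field mappings can read.
            {"name": "merged_text", "targetName": "merged_content"},
        ],
    }


if __name__ == "__main__":
    # Placeholder URL: replace it with your own function endpoint and key.
    url = "https://<your-function-app>.azurewebsites.net/api/<function-name>?code=<key>"
    print(json.dumps(build_custom_skill(url), indent=2))
```

With a definition like this, later skills such as key phrase or entity extraction could take /document/merged_content as their text input instead of /document/content.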
Here is a list of good practices from our experience when creating this solution for a client:
Yes, there is. It is also not supported or documented, and it uses only built-in skills. I don't recommend this alternative, since I have never tested it and many more steps are required to achieve a similar objective.
Offsets and itemsToInsert are Merge Skill properties that expect arrays. You can submit the file name, or any other string, to the Split Skill. It will split the content into an array of 2 positions, and then you need to use the Conditional Skill to create an array with the hard-coded values. The last step is to run the Merge Skill to merge the contents of the array with the rest of the content.
For click-to-deploy C# Custom Skills created by the Azure Search Team, use the Azure Search Power Skills. No previous C# knowledge or software installation is required, since they follow the "one click deployment" concept. There are Custom Skills for lookup, distinct (duplicate removal), and more.
Useful links for those who want to know more about Knowledge Mining:
This post helps you create a Python Custom Skill for Azure Cognitive Search, based on Azure Functions for Python. It merges 2 strings into a third one. A typical use is concatenating, within an Enrichment Pipeline, the file name or path with the content. This skill is suited to scenarios where the file name or path contains dates, organizations, names, or key phrases.