Mine Knowledge on File Names and Paths with a Python Custom Skill
Azure Cognitive Search is the Microsoft product for Knowledge Mining, a process to extract information from unstructured or semi-structured data. However, extracting knowledge from file names or paths is not trivial. In this article you will learn how to do it using Azure Functions for Python, which reached GA on August 19th, 2019.
Why mine knowledge of file names and paths?
Lack of metadata is a common scenario in companies everywhere. People usually don’t have the time or discipline to add tags or comments to their documents, which leaves the file name and its path as the only information available for searches. Knowledge Mining addresses this problem by creating the metadata for you: Azure Cognitive Search uses AI to extract insights from the content of your files.
But look at the image below. From the file name and its path alone, we have already discovered the name of the project, the year, the type of the file, and its title, among other things. It is very common to see dates, hours, names, and locations in the names and paths of files.
Figure 1: Lack and valuable metadata
Why do we need a Custom Skill?
The Text Merge skill consolidates text from a document's content with the OCR text of its images into a single annotation within your Azure Cognitive Search Enrichment Pipeline. A common scenario for using Text Merge is to merge the textual representation of images (text from an OCR skill, or the caption of an image) into the content field of a document. By doing this, you will optimize insights extraction, index storage, and user interface development.
But you can't use the Merge Skill for file names or paths, because of the required offsets property. It is one of the OCR Skill outputs, and you don't have this information for physical properties of the documents.
That's why you need to leverage the flexibility of Custom Skills to merge properties like metadata_storage_name and metadata_storage_path with the document content. The code in this article uses the file name, but it would work just as well with the file’s dates, path, author, etc. It will also work with the outputs of other skills, like image analysis.
Code and Deployment
The code of the suggested solution is available in this GitHub repo, and here are some important guidelines:
If you are curious to know why I am using Python and Azure Functions, check this previous blog post.
Start with this tutorial, to create and deploy your environment
The recommended Python version is 3.7.4, so if you use the current Anaconda version (as of November 2019), you have no version concerns. If you are reading this later and have a newer version, you can use conda to create the required environment: conda create -n your-env-name python=3.6
When you create a local project with the command func init your-project-name, all necessary files are created within your project folder, including a requirements file (similar to a conda yml file) and a py file that serves as a template for your code. In the end, Azure Functions builds its environment from the requirements you specify in the requirements.txt file, much as conda would.
Please note that you need to use mimetype="application/json" for your HTTP response, since the Cognitive Search interface expects JSON in return.
You will need to pip install the Azure Functions Python library (published on PyPI as azure-functions) from your command line interface.
The code removes special characters. Please check your business requirements and the lessons learned below to define what transformations you need.
As you can see in the image below, the metadata_storage_name and the content are the input strings for the skill.
Figure 2: How to connect the Custom Skill to your Skillset
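To make the guidelines above more concrete, here is a minimal sketch of the merge logic, not the code from the repo. It assumes the skill receives two inputs named file_name and content and returns one output named merged_text; those three names and both helper functions are hypothetical, so use whatever names your skillset maps. The request and response shapes (a values array whose records carry a recordId and a data object) follow the Custom Skill interface.

```python
import json
import re

# Hypothetical input/output names: replace them with the names used in your skillset.
FILE_NAME_INPUT = "file_name"
CONTENT_INPUT = "content"
MERGED_OUTPUT = "merged_text"


def clean_file_name(file_name: str) -> str:
    """Replace special characters (including "_") with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]|_", " ", file_name)
    return re.sub(r"\s+", " ", cleaned).strip()


def merge_records(request_body: dict) -> str:
    """Return the JSON response body for a Custom Skill request payload."""
    results = []
    for record in request_body.get("values", []):
        data = record.get("data", {})
        file_name = clean_file_name(data.get(FILE_NAME_INPUT) or "")
        content = data.get(CONTENT_INPUT) or ""
        # Skip empty parts so a missing file name or content adds no stray space.
        merged = " ".join(part for part in (file_name, content) if part)
        results.append({
            "recordId": record["recordId"],
            "data": {MERGED_OUTPUT: merged},
        })
    return json.dumps({"values": results})


if __name__ == "__main__":
    sample = {
        "values": [
            {
                "recordId": "1",
                "data": {
                    "file_name": "vacation_summer_in_Brazil_01.jpg",
                    "content": "Beach photos.",
                },
            }
        ]
    }
    print(merge_records(sample))
```

Inside the Azure Function itself, the returned string would be wrapped in an HTTP response with mimetype="application/json", for example func.HttpResponse(body, mimetype="application/json") from the azure.functions package.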
Key Lessons Learned
Here is a list of good practices from our experience when creating this solution for a client:
When possible, leverage global cached data for the reference data. It is not guaranteed that the state of your app will be preserved for future executions. However, the Azure Functions runtime often reuses the same process for multiple executions of the same app. In order to cache the results of an expensive computation, declare it as a global variable.
Always prepare your code to deal with empty result sets: if a term is filtered out, an empty string is what gets added to the result set.
VS Code and Postman work great for local debugging. You just need to save the new version of your Python code and the changes take effect immediately, without requiring you to restart the service. This dynamic process allows you to quickly change your code and see the results.
In your code, use json.dumps on your output variable to validate what your skill returns to Cognitive Search. This will give you the opportunity to fix the layout in case of error.
The Text Analytics API, which is used under the hood, will remove characters like "_" or "-". Even so, if you submit "vacation_summer_in_Brazil_01.jpg", you will get "Brazil" as an entity of the location type, and nothing else. However, if you submit "vacation summer in Brazil 01 jpg", you will get:
Key Phrases: vacation summer, Brazil, jpg
Entities: Brazil (location), summer (datetime-dateRange), 01 (quantity-number).
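Several of the lessons above can be sketched together in a few lines. This is only an illustration under stated assumptions: the ignored-terms list stands in for whatever reference data your business rules need, and the function names are hypothetical. It shows a module-level cache, the replacement of "_", "-", and "." with spaces before the text reaches the downstream skills, the empty-string case, and json.dumps used to inspect the output.

```python
import json
import re

# Module-level cache. The Functions runtime often reuses the same process,
# so this value may survive across executions, but that is not guaranteed.
_IGNORED_TERMS = None


def get_ignored_terms() -> frozenset:
    """Load the reference data once per process (hypothetical hard-coded list)."""
    global _IGNORED_TERMS
    if _IGNORED_TERMS is None:
        # Stand-in for an expensive load such as a file, database, or HTTP call.
        _IGNORED_TERMS = frozenset({"copy", "final", "draft"})
    return _IGNORED_TERMS


def normalize_file_name(file_name: str) -> str:
    """Turn separators into spaces and drop ignored terms."""
    spaced = re.sub(r"[_\-.]+", " ", file_name)
    ignored = get_ignored_terms()
    kept = [term for term in spaced.split() if term.lower() not in ignored]
    # When every term is filtered out, this returns an empty string.
    return " ".join(kept)


if __name__ == "__main__":
    output = {
        "values": [
            {
                "recordId": "1",
                "data": {"text": normalize_file_name("vacation_summer_in_Brazil_01.jpg")},
            }
        ]
    }
    # Pretty-print the payload to check its layout before returning it.
    print(json.dumps(output, indent=2))
```

Keeping the normalization in a small pure function like this makes it easy to test locally with a handful of real file names before deploying.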
Is there an alternative for a Custom Skill?
Yes, there is. It is also not supported or documented, and it uses only built-in skills. I don’t recommend this alternative, since I have never tested it and many more steps are required to achieve a similar objective.
Offsets and itemsToInsert are Merge Skill properties that expect arrays. You can submit the file name, or any other string, to the Split Skill. It will split the content into an array of 2 positions, and then you need to use the Conditional Skill to create an array with the hard-coded values. The last step is to run the Merge Skill to merge the contents of the array with the rest of the content.
PowerSkills – Azure Search Team Official Custom Skills
For click-to-deploy C# Custom Skills created by the Azure Search Team, use the Azure Search Power Skills. Previous C# knowledge or software installations aren't required, because they follow the "one click deployment" concept. There are Custom Skills for Lookup, distinct (duplicates removal), and more.
This post helps you create a Python Custom Skill for Azure Cognitive Search, based on Azure Functions for Python. It merges two strings into a third one. A typical use is concatenating, within an Enrichment Pipeline, the file name or path with the content. This skill is suited to scenarios where the file name or path contains dates, organizations, names, or key phrases.