Cognitive Search is an AI-first approach to content understanding, powered by Azure Search and Cognitive Services. A custom skill is a Cognitive Search capability that extends your knowledge mining solution through REST API calls. Azure Functions is a serverless hosting platform whose Python support reached general availability (GA) on August 19th, 2019.
In this article you will learn how to combine these services to expand your knowledge mining solution by adding a custom skill built on Azure Functions for Python. The skill filters unwanted terms produced by predefined skills such as key phrase or entity extraction, and prevents those terms from being loaded into your Azure Search index.
Predefined and custom skills are used within the Azure Cognitive Search enrichment pipeline, where Microsoft AI is used to create metadata about your data. To learn more about this process, click here to see the product documentation or here to read our blog posts about it.
Figure 1 - The simplified view of the solution architecture diagram
Why a Python Custom Skill
Python has become the lingua franca of data science, a common tongue that bridges language gaps within and between organizations. Thanks to containerization, community contributions and knowledge, and open source libraries and frameworks, writing a custom skill in Python makes the code easy to deploy, migrate, and maintain.
Another key factor in the decision to use Python for a custom skill is that Azure Functions support for it is generally available. This allows you to create serverless compute with minimal effort, on-demand performance, and predictable costs. Azure Functions is the default Azure compute option for Cognitive Search custom skills, and you are no longer limited to C# to create one.
Why Azure Functions
Python code can run within containers and has many deployment options, such as Azure Kubernetes Service (AKS) and Azure Container Instances (ACI). Both are flexible and abstract the infrastructure required to run your containers. Both can also be provisioned and deployed with the AML SDK for Python, allowing you to prepare and deploy your code as a web service (REST API) right from Jupyter Notebook, Visual Studio Code, or any other IDE you prefer.
But Azure Functions has key advantages, starting with pricing. It charges you based on per-second resource consumption and on executions. Consumption plan pricing includes a monthly free grant of 1 million requests and 400,000 GB-s of resource consumption. In most cases, this free grant will be enough to run the whole workload. In contrast, AKS and ACI deployments charge you around the clock, whether they are being used or not, creating unnecessary costs for the specific scenario of this post: a big initial load followed by small daily usage.
The last decisive Azure Functions feature is automatic protocol management. Azure Search requires HTTPS for custom skills, which adds extra work for AKS or ACI deployments: certificate acquisition and management. Azure Search checks that the certificate is valid, so there is no way around using a valid certificate. Azure Functions, on the other hand, handles that for you automatically, using by default the configuration shown in the image below, which is compliant with Azure Search security requirements. It also scales out automatically, an important feature for unpredictable workloads.
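As a rough way to check whether a workload fits within the free grant quoted above, the sketch below multiplies executions by duration and memory. The workload figures (document counts, 0.5 s and 256 MB per call) are hypothetical, and the calculation deliberately ignores billing details such as rounding, so treat the result as an estimate only.

```python
# Free grant figures quoted in this post (Consumption plan, per month).
FREE_REQUESTS = 1_000_000
FREE_GB_SECONDS = 400_000


def monthly_usage(executions, avg_seconds, memory_gb):
    """Return (requests, GB-seconds) for a month of executions.

    Simplified estimate: ignores any rounding applied by actual billing.
    """
    return executions, executions * avg_seconds * memory_gb


def fits_free_grant(executions, avg_seconds, memory_gb):
    """True when both the request count and the GB-s stay within the grant."""
    requests, gb_seconds = monthly_usage(executions, avg_seconds, memory_gb)
    return requests <= FREE_REQUESTS and gb_seconds <= FREE_GB_SECONDS


# Hypothetical workload: 200,000 documents in the initial load plus
# 1,000 documents per day for 30 days, at 0.5 s and 256 MB per execution.
executions = 200_000 + 1_000 * 30
requests, gb_seconds = monthly_usage(executions, 0.5, 0.25)
print(requests, gb_seconds, fits_free_grant(executions, 0.5, 0.25))
```

Under these assumed numbers, both the request count and the GB-s consumption stay well below the free grant, which matches the "big initial load, small daily usage" pattern described above.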
Figure 2 - HTTPS settings
Code and Deployment
The code for this solution is available in this GitHub repo. Here are some important guidelines:
- Start with this tutorial to create and deploy your environment.
- The recommended Python version is 3.6. If you have a newer version, you can use conda to create the required environment: conda create -n your-env-name python=3.6
- When you create a local project with the command func init your-project-name, all necessary files are created within your project folder, including a requirements file (similar to a conda yml file) and __init__.py, which is a template for your code. In practice, Azure Functions reproduces the environment from the packages you list in the requirements.txt file.
- Note that you need to use mimetype="application/json" for your HTTP response, since Cognitive Search expects JSON in return.
- Use encoding='latin1' to avoid errors with special characters.
- The unwanted terms are in a CSV file in an Azure Storage account. To stream this file from Azure Blob Storage you need to use the get_blob_to_stream method, which is sparsely documented. To help you, here is what you need in order to use it:
Figure 3 - Details of the Python code
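To make the request and response shape concrete, here is a minimal sketch of the filtering step. It is not the code from the repo or the figure. It assumes the standard custom skill contract, a "values" array of records that each carry a "recordId" and a "data" object. The field names "terms" and "filteredTerms" and the helper names are hypothetical, and this sketch simply omits filtered terms.

```python
import json


def filter_terms(terms, unwanted):
    """Drop terms found in the unwanted set (case-insensitive match)."""
    return [t for t in terms if t.strip().lower() not in unwanted]


def run_skill(body, unwanted):
    """Build one response record for every record in the request batch."""
    results = []
    for record in body.get("values", []):
        data = record.get("data") or {}
        terms = data.get("terms") or []  # "terms" is a hypothetical input name
        results.append({
            "recordId": record["recordId"],
            # Hypothetical output name; an empty list is a valid result.
            "data": {"filteredTerms": filter_terms(terms, unwanted)},
            "errors": [],
            "warnings": [],
        })
    return {"values": results}


# A batch of two records; the second one has nothing to filter.
body = {"values": [
    {"recordId": "1", "data": {"terms": ["Azure", "page", "Search"]}},
    {"recordId": "2", "data": {"terms": []}},
]}
payload = json.dumps(run_skill(body, {"page"}))
print(payload)
```

Inside the function's entry point, the serialized payload would then be returned with something like func.HttpResponse(payload, mimetype="application/json"), in line with the guideline above.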
Key Lessons Learned
Here is a list of good practices from our experience when creating this solution for a client:
- When possible, leverage global cached data for the reference data. It is not guaranteed that the state of your app will be preserved for future executions. However, the Azure Functions runtime often reuses the same process for multiple executions of the same app. In order to cache the results of an expensive computation, declare it as a global variable.
- Instead of downloading, stream the reference data file, to increase performance and maximize decoupling.
- Always prepare your code to deal with empty result sets. If a term is filtered, an empty string is added to the result set in its place.
- VS Code and Postman work great for local debugging. You just need to save the new version of your Python code and the changes take effect immediately, without restarting the service. This dynamic process allows you to quickly change your code and see the results.
- The documentation (Azure Functions for Python) doesn't mention that you will need Docker to deploy code that depends on binaries, such as the azure-storage methods. The step-by-step process to work around this limitation is described here.
- In your code, use json.dumps on your output variable to validate what your skill returns to Cognitive Search. This will give you the opportunity to fix the layout in case of error.
- For performance, prepare your code to process multiple documents in each execution, allowing you to use a batch size larger than 1. Please check the loops within the sample code.
- For production environments, please follow Azure Functions and Azure Search best practices.
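The global caching lesson above can be sketched as follows. This is an illustration under stated assumptions, not the code from the repo: the names parse_terms, get_unwanted_terms, and fetch_bytes are hypothetical, and the CSV is assumed to hold one term per row. In the real function, fetch_bytes would stream the blob (for example with get_blob_to_stream into an in-memory stream) and return its bytes. Here a fake loader stands in so the example runs on its own.

```python
import csv
import io

# Module-level cache: it survives across invocations only while the runtime
# keeps reusing the same worker process, so the code must be able to reload.
_UNWANTED_TERMS = None


def parse_terms(raw_bytes, encoding="latin1"):
    """Decode a one-term-per-row CSV into a set of lowercase terms."""
    reader = csv.reader(io.StringIO(raw_bytes.decode(encoding)))
    return {row[0].strip().lower() for row in reader if row and row[0].strip()}


def get_unwanted_terms(fetch_bytes):
    """Return the cached set, calling fetch_bytes() only when it is empty."""
    global _UNWANTED_TERMS
    if _UNWANTED_TERMS is None:
        _UNWANTED_TERMS = parse_terms(fetch_bytes())
    return _UNWANTED_TERMS


calls = []


def fake_fetch():
    """Stand-in for the blob download; records how often it is invoked."""
    calls.append(1)
    return b"p\xe1gina\nCopyright\n"  # latin1-encoded bytes with an accent


first = get_unwanted_terms(fake_fetch)
second = get_unwanted_terms(fake_fetch)
print(sorted(first), len(calls))
```

The second call returns the same set without fetching again, which is the behavior you get when the Azure Functions runtime reuses the process. On a fresh process the global is empty and the file is loaded once more.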
PowerSkills – Official Custom Skills
Breaking news! For one-click-to-deploy C# custom skills created by the Azure Search team, click here. This initiative is called PowerSkills and doesn't require previous C# knowledge or software installations. There are custom skills for Lookup, Distinct (duplicate removal), and more.
Related Links
Useful links for those who want to know more about Knowledge Mining:
- Python Custom Skills Toolkit - GitHub
- Knowledge Mining Accelerator - aka.ms/kma
- Knowledge Mining Bootcamp - aka.ms/kmb
- Knowledge Mining posts - aka.ms/ACE-Blog
Conclusion
This solution addresses the requirement to remove unwanted terms, cleaning the general entities or key phrases created by Cognitive Search. It can also be easily adapted to keep only wanted terms (lookup) or to remove duplicates (deduplication). Our Knowledge Mining Accelerator (KMA) implements a C# version of the unwanted terms custom skill; you can clone the repo and use that open source code. Stay tuned: we will also publish posts here on how to do filtering and deduplication using Python custom skills!