Many big organisations are moving towards decentralised data teams. In a decentralised setup, where multiple data teams produce data sets covering different business areas and treat them as products, data consumers need a tool to discover the data sets available to them with minimal human interaction.
About Me
Febiyan works for Pandora as a Data Engineer in the Unified Data Infrastructure team. He originally started as a software engineer, but switched to a data engineering career in 2013. In his free time, he goes bouldering, takes photos, and geeks out on the Marvel Cinematic Universe.
Data Cataloging
Such a tool also needs to answer frequently asked questions, such as (but not limited to):
- What attributes / fields does the data contain?
- What does each attribute mean?
- Where does this data come from?
- Whom can I contact if I have further questions?
In prehistoric times, a central data ingestion or processing team would use something like Microsoft Excel to record what data they had. The file was copied everywhere, and some tribe members would end up with an outdated version at some point in their nomadic lives.
It was a mess.
Things got better in more recent times: some teams used wiki-like tools, such as SharePoint or Confluence, to store information about ingested data. It was centralised, but updates were still done manually by a central data ingestion team. That team was constantly under pressure to deliver, so sometimes the pages went stale.
Cataloging data was still seen as extra, labour-intensive manual work.
Nowadays, there are specialised tools for this: data catalogs. Such tools automate data cataloging to some degree. There are many products in this space: Alation, Atlan, DataHub, and many more. Microsoft also entered the space with Azure Data Catalog, which has now been succeeded by Azure Purview.
This post will give you a glance at what I went through to set it up for experimental purposes, the features I used during the experiments, and what I think about it.
Setup and Configuration
Azure Purview Resource Setup
The setup steps here assume that we already have a storage account containing some files. If you don't, please create one first. Once we have a data set in the storage account, we need to register it in the Azure Purview resource.
To do that, we will need at least Contributor access at the resource group scope. Ensure that the Purview resource and the storage account you want to connect it to are in the same subscription.
Here’s what we need to do:
- Open the Azure Portal
- Go to the Marketplace
- Search for "Azure Purview"
- Click the Create button on the first result and follow the instructions
- Wait for the deployment to complete
- Go to the Azure Purview web UI
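If you prefer the command line over the portal, the same deployment can be sketched with the Azure CLI. The resource names and location below are placeholders I made up for illustration, and the `purview` extension's flags may vary between versions:

```shell
# Install the Purview extension for the Azure CLI.
az extension add --name purview

# Create a resource group to hold the Purview account (names are placeholders).
az group create --name my-purview-rg --location westeurope

# Create the Azure Purview account itself; the deployment can take a few minutes.
az purview account create \
  --resource-group my-purview-rg \
  --name my-purview-account \
  --location westeurope
```

This is the manual-deployment path in script form; the third-party Infrastructure-as-Code scripts mentioned later wrap essentially the same resource creation.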
Azure Purview Configuration
The first thing that we need to do in configuring Azure Purview is registering data sources.
The data sources we will register in Azure Purview need to be organised in collections – we can think of them as folders. By default, we have a root collection where we can register new data sources. However, we’re also able to put a collection inside another collection for granular organisation of data sources.
We can also assign privileges of the collections to a group of identities registered in Azure AD. The integration with Azure AD here is fantastic.
In this example, we will be registering an Azure Data Lake Gen-2 storage account. To do that, follow the documentation here.
Once the storage account has been registered, we can scan the data source. To do that, in the Sources window, hover over the created data source, then find and click the radar-like button. Then follow the steps described here. It's super easy!
I tried scanning a storage account used by the Marketing and Technology team in the company I work for. The scan took 12 hours to complete, not as fast as I would have liked. I understand there's a lot of data (terabytes) there, but I would like to see faster scan times.
Experiencing Azure Purview
After a scan was complete, I tried searching for a data set among some samples that I uploaded. Azure Purview returned a list of data sets, as it should.
It looks okay at a glance. However, I feel that Azure Purview wastes a lot of white space on my screen: I would have loved it if the results were presented as cards that use both horizontal and vertical space.
Aside from the usual data asset name, description, and path, Azure Purview also provides the data schema and data owner information. These two are worth mentioning because they are very useful! The data schema and description help people who are new to the data understand what fields are there and whether the data set will satisfy their needs. The data owner information is useful when there are further questions that need to be clarified.
I would have loved a way to see frequently asked questions and discussions surrounding the data. I may have questions that the data owner has already answered, but with the current setup I need to send an email instead of finding the answer on the page.
I also noticed that a data summary view is missing: I would love to know what a particular column's values look like without executing queries, to quickly check whether I can use it. A way to see data freshness is also missing; I probably wouldn't want to use a sales table that hasn't been updated for 3 months.
You may see "Updated on January 17" below, but that isn't the date the data itself was last updated; it's when the metadata was last updated.
Other features that I tried and liked were data classification and the business glossary.
With the data classification feature, Azure Purview tries to classify the fields it scans into categories, like "Danish National IDs" or "Phone Numbers". This makes it much easier for data engineers and data analysts to understand what a data set contains.
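Conceptually, this kind of rule-based classification can be sketched as matching sampled column values against known patterns. The patterns and category names below are my own illustrations, not Purview's actual (far more robust) rules:

```python
import re

# Illustrative patterns only; Purview ships its own system classification rules.
CLASSIFICATION_RULES = {
    "Phone Number": re.compile(r"^\+\d{8,15}$"),          # e.g. +4512345678
    "Danish National ID (CPR)": re.compile(r"^\d{6}-?\d{4}$"),  # e.g. 010190-1234
}

def classify_values(values):
    """Return the classifications whose pattern matches every sampled value."""
    labels = set()
    for label, pattern in CLASSIFICATION_RULES.items():
        if values and all(pattern.match(v) for v in values):
            labels.add(label)
    return labels
```

In a real catalog, a scanner samples values from each scanned field, runs them through rules like these, and attaches the matching classification to the field's metadata.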
The business glossary feature is useful for ensuring that the same business terminology is linked to the fields that represent it. For example, the business term "Consumer Id" may be represented by fields with different names, like:
- mast_cons_id
- mcid
- mc_id
- consumer_id
With the business glossary feature, an SME or business department can define a company-wide glossary and determine what each term means (and what it doesn't). They can then work with the data engineering teams and data owners to tag the fields with the corresponding glossary terms.
After they do, data analysts and data scientists will be able to better understand how business activities are represented in the data.
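The idea behind glossary tagging can be sketched as a simple mapping from business terms to the physical field names that represent them. The term, definition, and field names below are taken from the example above; the data structure itself is a toy model of mine, not Purview's API:

```python
# Toy model of a business glossary: each term carries a definition and the
# set of physical field names that have been tagged with it.
GLOSSARY = {
    "Consumer Id": {
        "definition": "The unique identifier of a consumer across systems.",
        "tagged_fields": {"mast_cons_id", "mcid", "mc_id", "consumer_id"},
    },
}

def terms_for_field(field_name):
    """Find which business terms a physical field has been tagged with."""
    return {
        term
        for term, entry in GLOSSARY.items()
        if field_name in entry["tagged_fields"]
    }
```

A data analyst searching the catalog for "Consumer Id" would then surface every data set containing any of those differently named fields, which is exactly what the manual tagging workflow enables.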
Conclusion
Azure Purview was the first data catalogue I tried setting up in the cloud. It was surprisingly easy to do manually. Additionally, there is a way to set it up automatically using third-party Infrastructure-as-Code scripts.
Here are the aspects I liked, from my short exposure to Azure Purview:
- Very easy to deploy manually
- Data source scans were easy to set up manually
- Collections and sub-collections for organising data sources
- Automated data classification
- Great integration with Azure Active Directory
- Great integration with Azure data storage technologies. One click and whoosh!
- The business glossary feature
What I think could be improved in Azure Purview, and what I would like to see:
- Native support for major 3rd-party tools. Currently, sources such as Hive Metastore on Databricks need an extra Self-Hosted Integration Runtime setup
- A push-based metadata ingestion mechanism
- A way to find data freshness, i.e. the last updated time of the data. It's important for data analysts
- A way to see the distribution of numerical column values. It's valuable for data analysts
- UI improvements
The features not covered in my exploration:
- Lineage, which is currently limited to Azure Data Factory
- Updating metadata through the Apache Atlas API
- 3rd-party integrations
- Controlling data access through Azure Purview
If you want to learn more about this topic, feel free to check out the following Microsoft Learn modules: