This post describes how to flatten MGDC JSON objects so you can load them into SQL Server or Microsoft Fabric (OneLake).
When you get your data from Microsoft Graph Data Connect (MGDC), you will typically get that data as a collection of JSON objects in an Azure Data Lake Storage (ADLS) Gen2 storage account. For those handling large datasets, it might be useful to move the data to a SQL Server or to OneLake (lakehouse). In those cases, you might need to flatten the datasets. This post describes how to do that.
If you’re not familiar with MGDC for SharePoint, start with https://aka.ms/SharePointData.
1. Flattening
Most of the MGDC for SharePoint datasets come with nested objects. That means that a certain object has other objects inside it. For instance, if you have a SharePoint Groups object, it might have multiple Group Members inside. If you have a SharePoint Permissions object, you could have many Permissions Recipients (also known as Sharees). For each SharePoint File object, you will have a single Author object inside.
When you convert the datasets from JSON to other formats, it is possible that these other formats require (or perform better) if you don’t have any objects inside objects. To overcome that, you can turn those child objects into properties of the parent object. For instance, instead of having the File object with an Author object inside, you can have multiple author-related columns. For instance, you could have Author.Name and Author.Email as properties of the flattened File object.
2. Nested Objects
You can get the full list of SharePoint datasets in MGDC at https://aka.ms/SharePointDatasets.
Here is a table with a list of objects and their nested objects:
Object |
How many? |
Primary Key |
Nested Object |
How many? |
Add to Primary Key |
Sites |
1 per Site |
Id |
RootWeb |
1 per Site |
|
Sites |
1 per Site |
Id |
StorageMetrics |
1 per Site |
|
Sites |
1 per Site |
Id |
SensitivityLabelInfo |
1 per Site |
|
Sites |
1 per Site |
Id |
Owner |
1 per Site |
|
Sites |
1 per Site |
Id |
SecondaryContact |
1 per Site |
|
Groups |
1 per Group |
SiteId + GroupId |
Owner |
1 per Group |
|
Groups |
1 per Group |
SiteId + GroupId |
Members |
1 per Member |
COALESCE(AADObjectId, Email, Name) |
Permissions |
1 per Permission |
SiteId + ScopeId + RoleDefintion + LinkId |
SharedWithCount |
1 per Recipient Type |
Type |
Permissions |
1 per Permission |
SiteId + ScopeId + RoleDefintion + LinkId |
SharedWith |
1 per Recipient or Sharee |
COALESCE(AADObjectId, Email, Name) |
Files |
1 per File |
SiteId + WebId + ListId + ItemId |
Author |
1 per File |
|
Files |
1 per File |
SiteId + WebId + ListId + ItemId |
ModifiedBy |
1 per File |
|
When you flatten a dataset and there is an object with multiple objects inside (like Group Members or Permission Recipients), the number of rows will increase. You also need to add to primary key to keep it unique.
Also note that the File Actions, Sync Health and Sync Errors datasets do not have any nested objects.
3. One Object per Parent
When the nested object has only one instance, things are simple. As we described for the Author nested object inside the File object, you promote the properties of the nested object to be properties of the parent object. This is because the Author is defined as the user that initially created the file. There is always one and only one Author.
This can happen even happen multiple times for the same object. The File also has a ModifiedBy property. That is the single user that last changed the file. In that case, there is also only one ModifiedBy per File.
The Site object also includes several properties in this style, like RootWeb, StorageMetrics, SensitivityLabelInfo, Owner and SecondaryContact. Note that, in the context of the Site object, there is only one owner. Actually two, but that second one is tracked in a separate object called SecondaryContact which is effectively the secondary owner.
4. Multiple Objects per Parent
The SharePoint Permissions dataset has a special condition that might create trouble for flattening. There are two sets of nested objects with multiple objects each: SharedWith and SharedWithCount. SharedWith has the list of Recipients and SharedWithCount has a list of Recipient Types. If you just let the tools flatten it, you will end up a cross join of the two. As an example, if you have 4 recipients in an object and 2 types of recipients (internal users and external users, for instance) you will end up with 20 objects in the flattened dataset instead of the expected 10 objects (one per recipient).
To avoid this, in this specific condition, I would recommend just excluding the SharedWithCount column from the object before flattening.
5. Conclusion
I hope this clarifies how you can flatten the MGDC for SharePoint datasets, particularly SharePoint Permissions dataset.
For further details about the MGDC for SharePoint, https://aka.ms/SharePointData.
Updated Nov 27, 2024
Version 4.0Jose_Barreto
Microsoft
Joined April 02, 2018
Microsoft Graph Data Connect for SharePoint Blog
Follow this blog board to get notified when there's new activity