If you have a large SharePoint tenant (with hundreds of thousands of sites), you might run into some issues while trying to use Microsoft Graph Data Connect for SharePoint to run the Oversharing template provided for this scenario.
To help you on this project, I collected here a few tips to smooth the process and avoid a few of the common issues large enterprises might hit when running this for the first time.
To pull the data from your tenant, you will need to enable the Microsoft Graph Data Connect in the Office Admin Center, create several resources in the Azure Admin Portal and then authorize the Application to pull data back in the Office Admin Center.
In large organizations, it is likely that the Office Global Administrator role is enabled only for a few people in the organization. Also, making Office 365 configuration changes in production will require an internal approval process that could take days or even weeks.
Also, there are various Azure Administrator roles required to create an Application Registration, create a Resource Group, create a Storage Account, grant permissions to Storage Accounts, create a Microsoft Graph Data Connect Application and create a Synapse workplace. In a large enterprise, it is likely that these will be delegated to different people and internal approval processes might be required.
In general, it helps if you know exactly what configuration change you need. There is a step-by-step guide at Step-by-step: Gather a detailed dataset on SharePoint Sites using the Microsoft Graph Data Connect
This step-by-step covers the scenario for the SharePoint Sites dataset, but the pre-requisites are the same for all the Microsoft Graph Data Connect datasets.
Large corporations are more likely to use multiple Azure and Microsoft 365 regions. Microsoft Graph Data Connect requires you to setup your Azure resources in a region that matches the Microsoft 365 region for your tenant. That includes both the Azure Storage Account and the Azure Synapse Workspace. You can see details at MGDC for SharePoint FAQ: Which regions are supported?
If you try to pull from an Azure region that does not match the Microsoft 365 region, you might get an empty dataset. The SharePoint team is adding improved error messages to clearly state the situation and suggest using another region.
Large corporations are also more likely to have a Microsoft 365 tenant that have SharePoint sites in multiple regions. For those, the tenant must configure Microsoft Graph Data Connect to pull each region separately, pulling the data into separate Azure Storage Accounts.
After the data is transferred to the tenant, they will have a set of files across multiple storage accounts, entirely under their control. They could potentially consolidate the data into a single storage account, a single SQL Server or other storage locations, in accordance with their compliance requirements.
The Microsoft Graph Data Connect for SharePoint team provides a convenient Oversharing template that you can pull from the Azure Synapse Gallery. There are detailed instructions on how to get started with this template at https://go.microsoft.com/fwlink/?linkid=2207816
That includes how to import the template into a new pipeline in Azure Synapse from the Gallery.
The initial configuration for this might seem a little complex, so I would strongly recommend that you use the previously mentioned step-by-step for Sites before venturing into this one. This will also give you the two linked services needed for source (Office 365) and sink (Azure Storage), configured with the correct application credentials. You can re-use them when pulling the template from the Synapse gallery.
The Oversharing Template is a single pipeline that brings in 6 distinct datasets and combines them into an Oversharing table that includes all the permissions and group details. This is very convenient, particularly if you want to schedule it to run regularly. The pipeline includes six “Copy data” tasks to bring down the datasets required, plus two “Notebook” tasks to process the group expansion.
When you run that pipeline for a large enterprise, it might take a few hours to complete. Also, if you’re running it for the first time, you could hit some issue (a missing permission, for instance) and that could require you to troubleshoot and run the pipeline again. This might lead you to pull down a large dataset more than once.
To avoid that, you could break your pipeline into separate steps, instead of having all of them connected. This will allow you to test each individual task before running the entire process as a single pipeline. Once you have every “Copy data” task running successfully, you can run the two “Notebook” tasks to process the AAD and SPO group expansions.
One of the interesting aspects of the Oversharing template is that it uses parameters to indicate the date that you are using for dataset extraction and the location in Azure Storage where you are landing the data.
If you are bringing each dataset down one at a time, you can use these paths to verify the resulting dataset before running the next task. Note that there is also a “latest” folder where the combined data will land when you run the aggregation notebooks.
The final step for Oversharing reporting is to pull the resulting datasets (which is basically a folder with a few JSON files in it) from Power BI so you can create your own analytics. Power BI is an amazing tool that will make it easy to pull lots of data and do all sorts of aggregation and summaries, both as reports and as charts.
However, if your tenant ends up outputting tens of millions of rows or more, it is possible that Power BI won’t be able to handle that large dataset, particularly if your PC does not have a lot of memory. From my experience, Power BI Desktop running on a fast PC with 32GB of RAM can typically handle a few million rows of data. If you have more than that, you will need to do some preparation work outside Power BI.
One option is to run a Synapse Notebook to do some data aggregation before you pull the data into Power BI. For instance, you could summarize the permission information per site and pull that summary into Power BI. This will require some additional work in Synapse, and you will have less flexibility on how to pivot the data in Power BI.
Another option is to load the data from Azure Storage into an SQL Server database. Then you can point Power BI to that database, which will scale much better than using JSON files. You will also be able to write SQL queries in Power BI as you pull the data, keeping this flexible.
In Azure Synapse, you can use a “Copy Data” task to move the data from JSON to SQL. You can also use a “Data Flow” task to transform the data. Moving to SQL will also require you to flatten (normalize) the Permissions dataset, which has multiple users/groups in a single permission object.
Note that you must first land the Microsoft Graph Data Connect dataset in an Azure Storage account before you can transform it and/or move it to SQL Server.
After you download all the datasets, expand group membership information for both Active Directory and SharePoint, flatten out the permissions dataset and upload everything to a SQL Server database, you will be able to report from Power BI connecting to that database.
Once you get your first full pull from Microsoft Graph Data Connect for SharePoint, you will want to update it regularly. You can run the pipeline again, but this would cause you to redownload all the datasets in their entirety.
To reduce costs, you should investigate the use of “Delta” dataset pulls, which will reduce the number of objects downloaded, pulling just what change since your last pull. Some restrictions do apply. See details at MGDC for SharePoint FAQ: How can I use Delta State Datasets?
The Oversharing template includes three Azure Active Directory (Microsoft Entra ID) datasets to get information about AAD Groups (one object per group), AAD Group Owners (one object per group owner) and AAD Group Members (one object per group member). It is important to note that these are not SharePoint datasets.
The main difference here is that these datasets are collected at the time of the request, unlike the SharePoint datasets, which are collected and pre-processed daily, during the tenant’s off-peak hours.
There are two main consequences of that:
I hope this post helps with your deployment of the Oversharing template at your large enterprise. For more information, check the overview post at https://aka.ms/SharePointData.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.