Step-by-step: Gather a detailed dataset on SharePoint Sites using the Microsoft Graph Data Connect
Published Feb 29 2024

0. Overview

 

This blog provides a step-by-step guide to gathering SharePoint Sites information using the Microsoft Graph Data Connect for SharePoint, with detailed instructions on how to extract SharePoint and OneDrive site information and use it to run analytics for your tenant. If you follow these steps, you will have a Power BI dashboard like the one shown below, which includes total bytes per site type, number of sites by owner, and total file count by month created. You can also use the many other properties available in the SharePoint Sites dataset.

blog2-intro.png

 

To get there, you can split the process into 3 distinct parts:

  1. Set up your tenant for the Microsoft Graph Data Connect, configuring its prerequisites.
  2. Configure and run a pipeline to get SharePoint Sites using Azure Synapse.
  3. Use Power BI to read the data about SharePoint Sites and show it in a dashboard.

1. Setting up the Microsoft Graph Data Connect

 

The first step in the process is to enable the Microsoft Graph Data Connect and its prerequisites. You will need to do a few things to make sure everything is ready to run the pipeline:

  • Enable Data Connect in your Microsoft 365 Admin Center. This is where your Tenant Admin will check the boxes to enable the Data Connect and enable the use of SharePoint datasets.
  • Create an application identity to run your pipelines. This is an application created in Azure Active Directory which will be granted the right permissions to run your pipelines and access your Azure Storage account.
  • Create an Azure Resource Group for all the resources we will use for Data Connect, like the Azure Storage account and the Azure Synapse workspace.
  • Create an Azure Storage account. This is the place in your Azure account where you will store the data coming from your pipeline. This is also the place where Power BI will read the data for creating the dashboards.
  • Create a container and folder in your Storage Account. This is the location where the data will go.
  • Grant the application identity the required access to the Storage account. This makes sure that the application identity has permission to write to the storage.
  • Add your Microsoft Graph Data Connect application in the Azure Portal. Your Microsoft Graph Data Connect application needs to be associated with a subscription, resource group, storage account, application identity and datasets.
  • Finally, your Global Administrator needs to use the Microsoft Admin Center to approve the Microsoft Graph Data Connect application access.

Let us look at each one of these.

 

1a. Enable the Microsoft Graph Data Connect

 

Your first preparation step is to go into the Microsoft 365 Admin Center and enable the Microsoft Graph Data Connect.

  • Navigate to the Microsoft 365 Admin Center at https://admin.microsoft.com/ and make sure you are signed in as a Global Administrator.
  • Select the option to Show all on the left.
  • Click on Settings, then on Org settings.

blog2-1a-1.jpg

 

  • Select the settings for Microsoft Graph Data Connect.

blog2-1a-2.png

 

  • Check the box to turn Data Connect on.
  • Make sure to also check the box to enable access to the SharePoint and OneDrive datasets.
  • IMPORTANT: You must wait 48 hours for your tenant to be onboarded and another 48 hours for the initial data collection and curation. For example, if you check the boxes on August 1st, you will be able to run your first data pull on August 5th, targeting the data for August 3rd. You can continue with the configuration, but do not trigger your pipeline before then.

 

1b. Create the Application Identity

 

You will need to create an Application in Microsoft Entra ID (formerly Azure Active Directory) and set up an authentication mechanism, like a certificate or a secret. You will use this Application later when you configure the pipeline. Here are the steps:

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Microsoft Entra ID service in the list of Azure services.
  • Select the option for App Registration on the list on the left.
  • Click the link to New Registration to create a new one.

blog2-1b-1.png

 

  • Enter an app name, select “this organizational directory only” and click on the Register button.
  • On the resulting screen, select the link to Add a certificate or secret.
  • Select the “Client secrets” tab and click on the option for New client secret.
  • Enter a description, select an expiration period, and click the Add button.
  • Copy the secret value (there is a copy button next to it). We will need that secret value later.
  • Secret values can only be viewed immediately after creation. Save the secret before leaving the page.

blog2-1b-2.png

 

  • Click on the Overview link on the left to view the details about your app registration.
  • Make sure to copy the application (client) ID. We will need that value later as well.

blog2-1b-3.png
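
Optionally, before moving on, you can verify that the new application identity works. Below is a minimal Python sketch, assuming the azure-identity package; the tenant ID, client ID and secret values are placeholders you fill in with the values you just copied:

```python
# pip install azure-identity
from azure.identity import ClientSecretCredential

# Placeholders: your directory (tenant) ID, plus the application
# (client) ID and secret value captured in the steps above.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<application-client-id>",
    client_secret="<client-secret-value>",
)

# Acquiring a token for Azure Storage proves the ID and secret are valid.
token = credential.get_token("https://storage.azure.com/.default")
print("Token acquired; expires at", token.expires_on)
```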

 

 

1c. Create the Azure Resource Group

 

You will need to create an Azure Resource Group for all the resources we will use for Data Connect, including the Storage Account and Synapse Workspace. Here are the steps.

blog2-1c-1.png

 

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Resource groups service in the list of Azure services.
  • Click on the Create link to create a new resource group.
  • Select a name and a region.
  • IMPORTANT: You must use a region that matches the region of your Microsoft 365 tenant.

blog2-1c-2.png

 

  • Click on Review + Create, make sure you have everything correctly entered and click Create.
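
If you prefer scripting over the portal, the resource group can also be created with the Azure SDK for Python. This is a minimal sketch, assuming the azure-mgmt-resource package; the group name, region and subscription ID are placeholders:

```python
# pip install azure-identity azure-mgmt-resource
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Uses your own signed-in identity (e.g. from `az login`); the app
# registration from step 1b only needs access to the storage account.
credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, "<subscription-id>")

# Remember: the region must match the region of your Microsoft 365 tenant.
rg = client.resource_groups.create_or_update(
    "rg-mgdc-sharepoint",    # placeholder resource group name
    {"location": "eastus"},  # placeholder region
)
print("Created:", rg.name, rg.location)
```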

 

1d. Create the Azure Storage Account

 

You will need to create an Azure Storage Account to store the data coming from SharePoint. This should be an Azure Data Lake Gen2 storage account. You should also authorize the Application you created to write to this storage account. Here are the steps.

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Storage accounts service in the list of Azure services.

blog2-1d-1.png

 

  • Click on the Create link to create a new storage account.
  • Select a subscription, resource group (created in step 1c), account name, region, and type (standard is fine).
  • Make sure your new account name contains only lowercase letters and numbers.
  • IMPORTANT: You must use a region that matches the region of your Microsoft 365 tenant.

blog2-1d-2.png

 

  • Click on the Advanced tab. Under Data Lake Storage Gen2 check the box to Enable hierarchical namespace.
  • Click on Review, make sure you have everything correctly entered and click Create.
  • Wait until the deployment is completed and click on Go to resource.
  • Click on the Access keys option on the left to see the keys to access the storage account.
  • Click on Show for one of the two keys and use the copy icon whenever you need the key.

blog2-1d-3.png
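
As with the resource group, the storage account can also be created from code. Here is a minimal sketch, assuming the azure-mgmt-storage package and placeholder names:

```python
# pip install azure-identity azure-mgmt-storage
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

credential = DefaultAzureCredential()
client = StorageManagementClient(credential, "<subscription-id>")

# A Data Lake Gen2 account is a StorageV2 account with the hierarchical
# namespace enabled; the region must match your Microsoft 365 tenant.
poller = client.storage_accounts.begin_create(
    "rg-mgdc-sharepoint",  # resource group from step 1c (placeholder)
    "mgdcsharepointdata",  # lowercase letters and numbers only (placeholder)
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        "is_hns_enabled": True,  # enables the hierarchical namespace
    },
)
print("Created:", poller.result().name)
```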

 

 

1e. Grant access to the Storage Account

 

You will need to grant the Application Id the required access to the Storage Account. Here are those steps:

  • In the Storage account you just created, click the Access Control (IAM) option on the left.
  • Click on the link to Add on the horizontal bar.

blog2-1e-1.png

 

  • Click on the option to Add role assignment.
  • In the Role tab, select the built-in Storage Blob Data Contributor role and click on the Next button.
  • In the Members tab, select the option for user, group, or service principal and click on the Select members link.
  • In the Select members window, search for the application you created in step 1b, select it, and click the Select button.
  • Then click on the Review + Assign button.

blog2-1e-2.png

 

  • Review the role assignment and click on the Review + assign button.
  • You’ve now completed the role assignment.
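
For reference, the role assignment can also be scripted. This is a minimal sketch, assuming the azure-mgmt-authorization package; the scope uses the placeholder names from the earlier sketches, and note that the principal ID here is the service principal's object ID, not the application (client) ID:

```python
# pip install azure-identity azure-mgmt-authorization
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

credential = DefaultAzureCredential()
sub = "<subscription-id>"
client = AuthorizationManagementClient(credential, sub)

# Scope the assignment to the storage account from step 1d (placeholders).
scope = (
    f"/subscriptions/{sub}/resourceGroups/rg-mgdc-sharepoint"
    "/providers/Microsoft.Storage/storageAccounts/mgdcsharepointdata"
)

# Well-known role definition ID for Storage Blob Data Contributor.
role_id = (
    f"/subscriptions/{sub}/providers/Microsoft.Authorization"
    "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # role assignment names are GUIDs
    {
        "role_definition_id": role_id,
        "principal_id": "<service-principal-object-id>",  # NOT the client ID
        "principal_type": "ServicePrincipal",
    },
)
```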

 

1f. Create a container and folder in your Storage Account

 

The next step is to create a container and folder for the data you will bring from Data Connect. Follow these steps:

  • In the Storage account you just created, click the Containers option on the left.
  • You will see only the default $logs container in the list. Click on the + Container link on the horizontal bar, enter a name for the new container, and click Create.

blog2-1f-1.png

 

  • Click on the newly created container and in that container, click on + Add Directory.
  • With that, you have a location to later store your data with the path as container/folder.

blog2-1f-2.png
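
If you would rather script this step, here is a minimal sketch using the azure-storage-file-datalake package. The container name “mgdc” and folder name “sharepointsites” are placeholders, and the application identity can perform this itself now that it holds the Storage Blob Data Contributor role:

```python
# pip install azure-identity azure-storage-file-datalake
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: the values captured in step 1b plus your account name.
credential = ClientSecretCredential("<tenant-id>", "<client-id>", "<client-secret>")
service = DataLakeServiceClient(
    account_url="https://mgdcsharepointdata.dfs.core.windows.net",
    credential=credential,
)

file_system = service.create_file_system("mgdc")  # the container
file_system.create_directory("sharepointsites")   # the folder inside it
```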

 

 

1g. Add your Microsoft Graph Data Connect application

 

Your Microsoft Graph Data Connect application needs to be associated with a subscription, resource group, storage account, application identity and datasets. This defines everything that the app needs to run your pipelines.

blog2-1g-1.jpg

  • Navigate to the Azure Portal at https://portal.azure.com and search for Microsoft Graph Data Connect.
  • Select the option to Add a new application.

blog2-1g-2.png

 

 

  • Under Application ID, select the one from step 1b and give it a description.
  • Select Single-Tenant for Publish Type.
  • Select Azure Synapse for Compute Type.
  • Select Copy Activity for Activity Type.
  • Fill the form with the correct Subscription and Resource Group (from step 1c).
  • Under Destination Type, select Azure Storage Account.
  • Under Storage Account, select the Storage Account we created in step 1d.
  • Under Storage Account Uri, select the option with “dfs” in the name.
  • Click on “Next: Datasets”.

blog2-1g-3.jpg

 

  • In the dataset page, under Dataset, select BasicDataSet_v0.SharePointSites_v1.
  • Under Columns, select all.
  • Click on “Review + Create” and click “Create” to finish.
  • You will now see the app in the list for Graph Data Connect.

blog2-1g-4.jpg

 

 

1h. Approve the Microsoft Graph Data Connect Application

 

Your last step in this section is to have a Global Administrator approve the Microsoft Graph Data Connect application.

  • Make sure this step is performed by a Global administrator who is not the same user that created the application.
  • Navigate to the Microsoft 365 Admin Center at https://admin.microsoft.com/
  • Select the option to Show all on the left.
  • Click on Settings, then on Org settings.
  • Click on the tab for Security & privacy.

blog2-1h-1.jpg

 

  • Select the settings for Microsoft Graph Data Connect applications.

blog2-1h-2.jpg

 

  • You will see the app you defined with the status Pending Authorization.
  • Double-click the app name to start the authorization.
  • Follow the wizard to review the app data, the datasets, the columns and the destination, clicking Next after each screen.

blog2-1h-3.jpg

 

  • In the last screen, click on Approve to approve the app.

Note: The Global administrator that approves the application cannot be the same user that created the application. If it is, the tool will say “app approver and developer cannot be the same user.”


2. Run a Pipeline

 

Next, you will configure a pipeline in either Azure Data Factory or Azure Synapse. We will use Synapse here. You will trigger this pipeline to pull SharePoint data from Microsoft 365 and drop it on the Azure Storage account. Here is what you will need to do:

  • Create a new Azure Synapse workspace. This is the place where you create and run your pipelines.
  • Use the Copy Data tool in Azure Synapse. This tool will help you with the task.
  • Create a new source to get the SharePoint sites dataset from Microsoft 365.
  • Create a new destination with a storage folder in Azure Storage to receive the data.
  • Deploy and trigger the pipeline.
  • Monitor the pipeline to make sure it has finished running and that the data is available.

Let us look at each one of these.

 

2a. Create the Azure Synapse workspace

 

To get started, you need to create an Azure Synapse workspace, if you do not already have one. Here are the steps:

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Azure Synapse Analytics service in the list of Azure services.
  • Click on the Create link to create the new Azure Synapse workspace.
  • Enter the subscription, resource group (created in step 1c), the new workspace name, region, storage account name (created in step 1d) and new file system name.
  • IMPORTANT: You must use a region that matches the region of your Microsoft 365 tenant.

blog2-2a-1.png

 

  • Click on the Security tab. Select the option to Use only AAD authentication. Click on the Review+create button.
  • Click Create. Wait until the deployment is completed and click on Go to resource.

Note: After you create an Azure Synapse workspace, you might run into an error that says, “The Azure Synapse resource provider (Microsoft Synapse) needs to be registered with the selected subscription”. You might also run into a validation error later with a message like “Customer subscription GUID needs to be registered with Microsoft.Sql resource provider”. These providers might not be registered with your subscription by default. If you run into these issues, see this doc on how to register a new resource provider and make sure your subscription is registered with both the Microsoft.Synapse and the Microsoft.Sql resource providers. Thanks to Carl Grzywacz for pointing these out.

 

2b. Use the Copy Data tool in Azure Synapse

 

Our Synapse pipeline will use a data source (Microsoft 365) and a data sink (Azure Storage). Let us start by configuring the data source. Follow these steps.

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Azure Synapse Analytics service in the list of Azure services.
  • Click on the name of your Azure Synapse workspace (created in item 2a).
  • Click on the Open link inside the big box for Synapse Studio.

blog2-2b-1.png

 

  • In the Synapse Studio, select the fourth icon on the left to go to the Integrate page.
  • Click on the big + icon and select the option for the Copy Data tool to start.

blog2-2b-2.png

 

  • Keep the default options for the Built-in copy task and Run once now. Then click the Next button.

blog2-2b-3.png

 

  • You will then have to define the source and destination.

 

2c. Define the data source

 

The first step is to define your data source, which will be the Microsoft Graph Data Connect (Data Connect source). Here are the steps you should take:

  • On the Source data store page, click on the New connection option.

blog2-2c-1.png

 

  • On the New connection page, enter “365” in the search box and select Microsoft 365 (Office 365).

blog2-2c-2.png

 

  • Click the Continue button to reach the page to define the details of the new connection.
  • Enter the Name and Description for the new connection.
  • Also enter the Service principal ID and the Service principal key. These are the application (client) ID and the secret that we captured in step 1b.

blog2-2c-3.png

 

  • Click on the Test connection option on the bottom right to make sure the credentials are working.
  • Then click on the Create button to create the new connection and go back to the Source data store page.
  • This time around, the connection will be filled in and the list of datasets will be available.
  • Check the box next to BasicDataSet_v0.SharePointSites_v1 and click on the Next button.

blog2-2c-4.png

 

  • In the Apply Filter page, keep the default scope.
  • Select SnapshotDate as a column filter and select a date. Since we are doing a full pull, you should use the same date for Start and End Time.
  • IMPORTANT: Valid dates go from 23 days ago to 2 days ago.
  • IMPORTANT: You cannot query dates before the date when you enabled SharePoint dataset collection.

blog2-2c-5.png
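
If you want to double-check which snapshot dates are currently valid, this small Python sketch applies the 23-day/2-day window described above:

```python
from datetime import date, timedelta

today = date.today()
earliest = today - timedelta(days=23)  # oldest snapshot you can request
latest = today - timedelta(days=2)     # newest snapshot you can request
print(f"Valid SnapshotDate values: {earliest} through {latest}")
```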

 

  • Click on the Next button to finish the source section and move to the destination section.

 

2d. Define the data destination

 

Next, you need to point to the location where the data will go, which is an Azure Storage account. Here are the steps:

  • On the Destination data store page, click on the New connection option.

blog2-2d-1.png

 

  • Select the option for Azure Data Lake Storage Gen 2

blog2-2d-2.png

 

  • Click the Continue button to reach the page to define the details of the new connection.
  • Enter the Name and Description for the new connection.
  • Change the Authentication type to Service Principal and select the Storage account name from the drop-down list.
  • Enter the Service principal ID and the Service principal key. Again, these are the application (client) ID and the secret that we captured in step 1b.

blog2-2d-3.png

 

  • Click on the Test connection option on the bottom right to make sure the credentials are working.
  • Then click on the Create button to create the new connection and go back to the Destination data store page.
  • This time around, the connection will be filled in and a few options will be available.
  • Enter a Folder path. This is the container and folder you created in step 1f; you can browse to it.

blog2-2d-4.png

 

  • Click Next to reach the Review and finish page of the Copy Data tool.

 

2e. Deploy and trigger the pipeline

 

Now we will deploy the pipeline and run it. Follow the steps:

  • In the Review and finish page, click the Edit link on the top right to enter a name and description for your pipeline. Then click Save.

blog2-2e-1.png

 

  • Click on the Next button to start the deployment.

blog2-2e-2.png

 

  • Once it is all finished, click on the Monitor button to see how the pipeline is running.

 

2f. Monitor the pipeline

 

After the Copy Data tool finishes, you can monitor the running pipeline. You will land on the main pipeline runs page, which shows a list of pipelines. In your case, there should be only one:

blog2-2f-1.png

 

  • If you click on the Pipeline name, you will see the detail for each activity in the pipeline. In this case, you should see only one activity in the pipeline, which is the copy of the dataset.

blog2-2f-2.png

 

  • Wait until the status for the activity and pipeline reaches Succeeded. This could take a few minutes, depending on the number of sites in your environment.

blog2-2f-3.png

 

  • Once the pipeline has finished running, the data will be in Azure Storage, in the container and folder that you have specified. It shows as one or more JSON files, plus a metadata folder with information about the request.

blog2-2f-5.png
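
If you prefer to inspect the output from code rather than the portal, here is a minimal sketch using the azure-storage-file-datalake package, reusing the placeholder container and folder names from step 1f:

```python
# pip install azure-identity azure-storage-file-datalake
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential("<tenant-id>", "<client-id>", "<client-secret>")
service = DataLakeServiceClient(
    account_url="https://mgdcsharepointdata.dfs.core.windows.net",  # placeholder
    credential=credential,
)

# List what the pipeline dropped in the destination folder.
fs = service.get_file_system_client("mgdc")     # placeholder container
for p in fs.get_paths(path="sharepointsites"):  # placeholder folder
    if not p.is_directory:
        print(p.name, p.content_length, "bytes")
```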

 

 

3. Create a Power BI Dashboard

 

The last step is to use the data you just gathered to build a Power BI dashboard. You will need to:

  • Create a new Power BI file.
  • Query the data from the Azure Storage account.
  • Create your dashboard.

 

3a. Create a new Power BI file

 

Now that you have the data in Azure Storage, you can bring it into Power BI to build reports and dashboards. To get started, open Power BI Desktop and create a new, blank report, then save it with a meaningful name. This is the .pbix file you will use in the next steps.

 

3b. Query the Data

 

Now you can bring the data into Power BI, directly from Azure.

  • In your new Power BI report, in the Home tab, click on the Get Data dropdown menu and click on More.

blog2-3b-1.png

 

  • In the list of sources, select Azure, click on Azure Data Lake Storage Gen2 and click on Connect.

blog2-3b-2.png

 

blog2-3b-3.png

 

  • In the next screen you need to authenticate to the storage account.
  • Select the option to provide an account key, which you copied in step 1d.
  • Click Connect.

blog2-3b-4.png

 

  • In the following screen you will see the list of JSON files coming from the storage account.

blog2-3b-5.png

 

  • Note that you get two JSON files, but keep in mind that one of them is just the metadata file.
  • Click on the Transform Data button to load all the files into a Power Query.
  • The Power Query Editor window will show, with the files listed.
  • First, change the query Name from Query1 to a more meaningful name.
  • Next, scroll to the left until you find the Folder Path column.
  • You should see one of the paths that includes a metadata folder. We want to filter that out.

blog2-3b-6.png

 

  • On the row with the Folder Path that includes the word metadata, right click that cell, select the Text Filters option and then the Does Not Contain option. That will get rid of that row only.

blog2-3b-7.png

 

  • Now that you removed the row for the metadata, scroll all the way to the right to find the Content column.
  • On the Content column, click on the icon with two down arrows called Combine Files (see arrow below).

blog2-3b-8.png

 

  • At this point, Power BI applies a series of transformations to the data, including loading the JSON files, renaming the columns, and expanding the columns that contain structures (like StorageMetrics and Owner).
  • You can now just click on the Close and Apply button to close the Query Editor.

blog2-3b-9.png
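
Power BI is not the only way to explore this data. As an alternative, here is a minimal Python sketch that loads one of the output files with pandas, assuming the files are line-delimited JSON (one site per line, which is how Data Connect writes them); the file name is a placeholder:

```python
# pip install pandas azure-identity azure-storage-file-datalake
import io

import pandas as pd
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential("<tenant-id>", "<client-id>", "<client-secret>")
service = DataLakeServiceClient(
    account_url="https://mgdcsharepointdata.dfs.core.windows.net",  # placeholder
    credential=credential,
)

# Download one output file; each line is one SharePoint site as JSON.
fs = service.get_file_system_client("mgdc")
raw = fs.get_file_client("sharepointsites/<output-file>.json").download_file().readall()
sites = pd.read_json(io.BytesIO(raw), lines=True)
print(sites.head())
```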

 

 

3c. Create the Power BI Dashboard

 

Now that the data is available in Power BI, let’s create some dashboards.

  • After you close the Query Editor and go back to the main Power BI window, you will have all the Sites data available to you to create reports and dashboards. They will be under the Fields pane on the right.

blog2-3c-1.png

 

  • The schema for this dataset is available publicly at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/datasetschemas/BasicDataSet_v0.Sha.... This shows the data type and a brief description of each column.
  • You can now drag visualizations and fields to the main canvas. For instance, you can just double-click on the stacked bar chart in the visualizations and resize the chart to span the entire page. Then drag the RootWeb.WebTemplate field to the Y-axis and the StorageMetrics.TotalSize to the X-axis. That’s it!

blog2-3c-2.png
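
To sanity-check the chart, you could reproduce the same aggregation in pandas. This sketch continues the one from section 3b and assumes the RootWeb and StorageMetrics columns loaded as nested dictionaries:

```python
# Total storage bytes per site template, mirroring the chart above.
sites["WebTemplate"] = sites["RootWeb"].apply(lambda r: r.get("WebTemplate"))
sites["TotalSize"] = sites["StorageMetrics"].apply(lambda m: m.get("TotalSize"))
print(
    sites.groupby("WebTemplate")["TotalSize"]
    .sum()
    .sort_values(ascending=False)
)
```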

 

 

4. Conclusion

 

You have triggered your first pipeline and populated a dashboard. Now there is a lot more that you could do. Here are a few suggestions:

  1. Investigate the many datasets in Data Connect, which you can easily use in your Synapse workspace.
  2. Trigger your pipeline on a schedule, to always have fresh data in your storage account.
  3. Use a Delta pull to get only the data that has changed since your last pull.
  4. Extend your pipeline to do more, like join multiple data sources or take the data to a SQL database.
  5. Publish your Power BI dashboard to share with other people in your tenant.

You can read more about the Microsoft Graph Data Connect at https://aka.ms/mgdcdocs. There you will find many details, including a list of datasets available, complete with schema definitions and samples.

I also keep a list of MGDC for SharePoint links at https://aka.ms/SharePointData.
