Blog Post

Microsoft Graph Data Connect for SharePoint Blog
9 MIN READ

Step-by-step: (Microsoft Fabric) Gather a detailed dataset on SharePoint Sites using MGDC

Jose_Barreto's avatar
Jose_Barreto
Icon for Microsoft rankMicrosoft
Sep 30, 2024

0. Overview

 

This blog shows a step-by-step guide to getting SharePoint Sites information using the Microsoft Graph Data Connect for SharePoint and the Microsoft Fabric. This includes detailed instructions on how to extract SharePoint and OneDrive site information and use that to run analytics for your tenant.

 

If you follow these steps, you will have a Microsoft Fabric Report like the one shown below, which includes number of sites by type, and total storage used by type. You can also use the many other properties available in the SharePoint Sites dataset.

 

 

To get there, you can split the process into 3 distinct parts:

  • Set up your tenant for the Microsoft Graph Data Connect
  • Get data about SharePoint Sites using Microsoft Fabric
  • Create a report on these sites using Microsoft Fabric

 

Note: Following these instructions will create Azure resources and this will add to your tenant's Azure bill. For more details on pricing, see How can I estimate my Azure bill?

 

1. Setting up the Microsoft Graph Data Connect

 

The first step in the process is to enable the Microsoft Graph Data Connect and its prerequisites. You will need to do a few things to make sure everything is ready to run the pipeline:

  • Enable Data Connect in your Microsoft 365 Admin Center. This is where your Tenant Admin will check the boxes to enable the Data Connect and enable the use of SharePoint datasets.
  • Create an application identity to run your pipelines. This is an application created in Microsoft Entra Id which will be granted the right permission to access MGDC.
  • Create an Azure Resource Group for all the resources we will use for Data Connect, like the Azure Storage account and the Azure Synapse workspace.
  • Create a Fabric Workspace and Fabric Lakehouse to store your MGDC data.
  • Add your Microsoft Graph Data Connect application in the Azure Portal. Your Microsoft Graph Data Connect application needs to be associated with a subscription, resource group, Fabric Workspace and Fabric Lakehouse.
  • Finally, your Global Administrator needs to use the Microsoft Admin Center to approve the Microsoft Graph Data Connect application access.

Let us look at each one of these.

 

1a. Enable the Microsoft Graph Data Connect

 

The first step is to go into the Microsoft 365 Admin Center and enable the Microsoft Graph Data Connect.

  • Navigate to the Microsoft 365 Admin Center at http://admin.microsoft.com/ and make sure you are signed in as a Global Administrator.
  • Select the option to Show all options on the left.
  • Click on Settings, then on Org settings.

 

  • Select the settings for Microsoft Graph Data Connect.

 

  • Check the box to turn Data Connect on.
  • Make sure to also check the box to enable access to the SharePoint and OneDrive datasets.

IMPORTANT: You must wait 48 hours to onboard your tenant and another 48 hours for the initial data collection and curation. For example, if you check the boxes on August 1st, you will be able to run your first data pull on August 5th, targeting the data for August 3rd. You can continue with the configuration, but do not trigger your pipeline before that.

 

1b. Create the Application Identity

 

 You will need to create an Application in Microsoft Entra ID (formerly Azure Active Directory) and setup an authentication mechanism, like a certificate or a secret. You will use this Application later when you configure the pipeline. 

 

Here are the steps:

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Microsoft Entra ID service in the list of Azure services.
  • Select the option for App Registration on the list on the left.
  • Click the link to New Registration to create a new one.

 

  • Enter an app name, select “this organizational directory only” and click on the Register button.
  • On the resulting screen, select the link to Add a certificate or secret.
  • Select the “Client secrets” tab and click on the option for New client secret.
  • Enter a description, select an expiration period, and click the Add button.
  • Copy the secret value (there is a copy button next to it). We will need that secret value later. Secret values can only be viewed immediately after creation. Save the secret before leaving the page.

 

  • Click on the Overview link on the left to view the details about your app registration.
  • Make sure to copy the Directory (tenant) ID and the Application (client) ID, found on the Application’s Overview page. We will need those values later as well.

 

1c. Create the Azure Resource Group

 

You will need to create an Azure Resource Group for all the resources we will use for Data Connect, including the Storage Account and Synapse Workspace.

 

Here are the steps.

 

  • Click on the Create link to create a new resource group.
  • Select a name and a region.

IMPORTANT: You must use a region that matches the region of your Microsoft 365 tenant.

 

 

  • Click on Review + Create, make sure you have everything correctly entered and click Create.

 

1d. Create a Fabric Workspace and Fabric Lakehouse

 

Next, you will need to create a Microsoft Fabric Workspace. This is where you will you’re your data-related items like your pipelines and your Lakehouse.

 

Here are the steps:

  • Navigate to the Microsoft Fabric Portal at https://fabric.microsoft.com
  • I the main Microsoft Fabric page, select the option for “Data Engineering”
  • Find the “Workspaces” icon on the bar on the left and click on it.

 

  • At the very bottom of the workspace list, select the option for “+ New workspace”. You may already have a “My workspace”, but it generally best to have a separate workspace for MGDC so it’s easier to organize your projects and collaborate.
  • Give your workspace a name, a description and a license mode. Then click “Apply”.
  • In your empty workspace, using the option for “+ New” and then select “Lakehouse”.

 

  • Give your Lakehouse a name and create it.
  • Please note down the name of your workspace and the name of the Lakehouse. You will need those in upcoming steps.

 

1e. Add your Microsoft Graph Data Connect application

 

Your Microsoft Graph Data Connect application needs to be associated to a subscription, resource group, application identity, Fabric workspace, Fabric Lakehouse and datasets. This will define everything that the app will need to run your pipelines.

 

Here are the steps:

 

  • Select the option to Add a new application.

 

  • Under Application ID, select the one from step 1b and give it a description.

 

  • Select Microsoft Fabric for Compute Type.
  • Select Copy Activity for Activity Type. 

 

  • Fill the form with the correct Subscription and Resource Group (from step 1c).
  • Under Destination Type, select Fabric Lakehouse.
  • Fill the form with the correct Workspace and Fabric Lakehouse (from step 1d).
  • Click on “Next: Datasets”.

 

  • In the dataset page, under Dataset, select BasicDataSet_v0.SharePointSites_v1.
  • Under Columns, select all.
  • Click on “Review + Create”.

 

  • Click “Create” to finish.
  • You will now see the app in the list for Graph Data Connect.

 

1f. Approve the Microsoft Graph Data Connect Application

 

Your last step in this section is to have a Global Administrator approve the Microsoft Graph Data Connect application.

 

  • Make sure this step is performed by a Global administrator who is not the same user that created the application.
  • Navigate to the Microsoft 365 Admin Center at http://admin.microsoft.com/
  • Select the option to Show all options on the left.
  • Click on Settings, then on Org settings.
  • Click on the tab for Security & privacy.

 

  • Select the option for settings for ‎Microsoft Graph Data Connect‎ applications.

 

  • You will see the app you defined with the status Pending Authorization.
  • Double-click the app name to start the authorization.
  • Follow the wizard to review the app data, the datasets, the columns and the destination, clicking Next after each screen.

 

  • In the last screen, click on Approve to approve the app.

NOTE: The Global administrator that approves the application cannot be the same user that created the application. If it is, the tool will say “app approver and developer cannot be the same user.”

 

  1. Run a Pipeline

 

Next, you will configure a pipeline in Microsoft Fabric. We will use Synapse here. You will trigger this pipeline to pull SharePoint data from Microsoft 365 and drop it on the Azure Storage account. Here is what you will need to do:

  • Go to the Fabric Workspace and create a pipeline
  • Define the source (Dataset from MGDC)
  • Define the destination (Table in the Lakehouse)
  • Save and run the pipeline.
  • Monitor the pipeline to make sure it has finished running.

Let us look at each one of these.

 

2a. Go to the Fabric Workspace and create a pipeline

 

  • Navigate to the Microsoft Fabric Portal at https://fabric.microsoft.com
  • I the main Microsoft Fabric page, select the option for “Data Engineering”
  • Find the “Browse” icon on the bar on the left and click on it.
  • Find the Fabric workspace you defined in step 1d and click on it.

 

  • In the workspace, click on “+ New” item.

 

  • Select “Data pipeline”.
  • Give the new pipeline a name and click on “Create”

 

  • Select the “Copy data assistant” option

 

2b. Define the source (Dataset from MGDC)

 

  • In the first step of the “Copy data assistant”, search for 365 to help you find that source.

 

  • Select the source called “Microsoft365”.

 

  • To create a new connection to the Microsoft365 data source, enter a new connection name, select “Service principal” for Authentication kind and the credentials for the Application registration you created in step 1b, including Tenant ID, Service principal client ID and Service principal Key.
  • Click on “Next” and wait for the dataset information to load.
  • Select “BasicDataSet_v0.SharePointSites_v1” as the table.

 

  • Keep the default scope (SharePoint datasets do not use scope filters).
  • Select SnapshotDate as a column filter and select a date. Since we are doing a full pull, you should use the same date for Start time and End time.
  • Click on “Next

IMPORTANT: Valid dates go from 23 days ago to 2 days ago.

IMPORTANT: You cannot query dates before the date when you enabled SharePoint dataset collection.

 

2c. Define the destination (Table in the Lakehouse)

 

  • Next, you will choose a data destination.

 

  • To start choosing a destination, select the Lakehouse you created in step 1d under the OneLake data hub (not the Azure blobs or ADLS destinations).

 

  • Select the option to “Load to new table”, which will show the column mappings for the Sites dataset.
  • You can keep the default name “BasicDataSet_v0.SharePointSites_v1” or use something simpler like “Sites”.

 

2d. Save and run the pipeline

 

  • The last step in the copy data definition is to review the details

 

  • Click on the “Save + Run” button to save the new task and run it immediately.

 

2e. Monitor the pipeline to make sure it has finished running

 

  • After the assistant ends, you land on the definition of the pipeline, where you can see the copy data activity on the top and you can monitor the running pipeline at the bottom. 

 

  • At this point, your pipeline state should be “queued”, “initializing” or “in progress”. Later it will go into “extracting data” and “persisting data”.
  • Wait for the pipeline to run. This should take around 20 minutes to run, maybe more if your tenant is large.
  • After everything runs, the status will show as “Succeeded”.

 

2f. View the newly created Lakehouse table

 

  • After your pipeline finishes running, you can see the new table
  • Select “Browse” icon on the bar on the left, then click on the Lakehouse.

 

  1. Create a Fabric report

 

The last step is to use the data you just got to build a Power BI dashboard. You will need to:

  • Create a semantic model
  • Create a new Fabric report
  • Add a visualization

 

3a. Create a new semantic model

 

  • Find the “Browse” icon on the bar on the left and click on it.
  • Find the Lakehouse you defined in step 1d and click on it.
  • In the Lakehouse, select the option to create a “new semantic model”.

 

  • Give the semantic model a name and expand the tree to find the newly created “Sites” table. Make sure you select it.
  • Click “Confirm” to create the semantic model

 

3b. Create a new Fabric report

 

  • Find the “Browse” icon on the bar on the left and click on it.
  • Find the Fabric workspace you defined in step 1d and click on it.
  • In the workspace, click on “+ New” item and select “Report”.

 

 

  • Select the option to “pick a published semantic model”
  • Select the semantic model and use the option to “create a blank report”
  • You will end up with an empty report in Power BI, with panels on the right for Filters, Visualizations and Data (from the semantic model).

 

3d. Add a visualization

 

  • Click on the stacked bar chart in the Visualizations to add to the report
  • Resize the bar chart visualization
  • Add the “Id” column to the y-axis property
  • Add the “RootWeb.WebTemplate” to the x-axis property

 

  • Use the “File", "Save as” menu option to name the report “Site Summary”
  • The final report will show

 

  1. Conclusion

 

You have triggered your first pipeline and created your first report using MGDC with the Microsoft Fabric. Now there is a lot more that you could do.

 

Here are a few suggestions:

  1. Investigate the many datasets in the Microsoft Graph Data Connect, which you can easily use in your Microsoft Fabric workspace.
  2. Trigger your pipeline on a schedule, to always have fresh data in your storage account.
  3. Use a Delta pull to get only the data that has changed since your last pull.
  4. Extend your pipeline to do more, like join multiple data sources.
  5. Share your Report with other people in your tenant.

 

You can read more about the Microsoft Graph Data Connect for SharePoint at https://aka.ms/SharePointData. There you will find links to many details, including a list of datasets available, complete with schema definitions and samples.

 

 

 

Updated Oct 02, 2024
Version 2.0
  • Great questions!

     

    1. The data is delivered in the same way and expanded only when it is moved into the Lakehouse. The MGDC cost is the same for data delivered to JSON files and Lakehouse tables.

     

    2. For the permissions dataset, I would make sure to exclude the column for "Shared With Count" to avoid growing the table further. Also, in general, flattened datasets will need additional columns for a unique identifier. For the permissions dataset, the unique identifier of the flattened dataset would be: SiteId + ScopeId + RoleDefintion + LinkId + Coalesce(SharedWith.AadObjectId, SharedWith.Email, SharedWith.Name)


    We need to use Coalesce because some of the SharedWith values do not have AadObjectId or Email (like SharePoint groups).

  • poongg's avatar
    poongg
    Copper Contributor

    Hi Jose, thank you very much for the information. I have some question regarding copying the dataset to a delta lake table vs json files. I've noticed that when I copy the permission dataset to a Fabric lakehouse, the original array column "SharedWith" is automatically expanded into separate string columns, such as "SharedWith.Name."

     

    1. Is the cost associated with copying the dataset to a Delta Lake table the same as copying it to blob storage as JSON files? I ask because the number of rows in the lake table would be greater than those copied to the JSON files and you know MGDC pricing is calculated based on the number of rows extracted.
    2. Regarding the Delta merge for the permission dataset, as the rows are exploded, what will be the new unique identifier?

    Thanks!

  • +1 to above reply from Jose_Barreto .

     

    poongg  - If you are looking for a delta merge without know what is that deleted in current delta, I would suggest delete all the rows that match with SiteId + ScopeId + RoleDefintion + LinkId from target table and reload the data again for that combination from delta.