<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Microsoft Graph Data Connect for SharePoint Blog articles</title>
    <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/bg-p/Microsoft_Graph_Data_Connect_for_SharePo</link>
    <description>Microsoft Graph Data Connect for SharePoint Blog articles</description>
    <pubDate>Sun, 12 Apr 2026 13:37:15 GMT</pubDate>
    <dc:creator>Microsoft_Graph_Data_Connect_for_SharePo</dc:creator>
    <dc:date>2026-04-12T13:37:15Z</dc:date>
    <item>
      <title>MGDC for SharePoint FAQ: How to Run a PoC without Pulling Your Entire Tenant</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-to-run-a-poc-without-pulling-your/ba-p/4509716</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Overview&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;When getting started with SharePoint data in Microsoft Graph Data Connect (MGDC) for SharePoint, many teams want to validate scenarios - such as reporting or analytics &lt;EM&gt;before&lt;/EM&gt; committing to a full production deployment.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;A common first instinct is to pull a complete dataset from a production tenant. While this delivers the most comprehensive view of SharePoint usage, it also:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Requires broad administrative authorization&lt;/LI&gt;
&lt;LI&gt;Consumes the most Azure compute and storage resources&lt;/LI&gt;
&lt;LI&gt;Increases MGDC extraction and processing costs&lt;/LI&gt;
&lt;LI&gt;Adds complexity to early experimentation&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;Fortunately, MGDC for SharePoint provides multiple ways to run low‑cost experiments or proof‑of‑concept (POC) deployments using partial or scoped datasets.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;This guide presents these options using a uniform comparison model, helping you choose the right approach based on:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Cost&lt;/LI&gt;
&lt;LI&gt;Representativeness of production behavior&lt;/LI&gt;
&lt;LI&gt;Implementation effort&lt;/LI&gt;
&lt;LI&gt;Dataset completeness&lt;/LI&gt;
&lt;LI&gt;Supported datasets&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;[Image: Archimedes planning]&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Option 1: Use a Dev or Test Tenant&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Description&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;Use an existing development or test tenant (or create a new trial tenant) to enable MGDC and run initial experiments.&lt;/P&gt;
&lt;P&gt;Pros&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Smaller datasets reduce MGDC and Azure costs&lt;/LI&gt;
&lt;LI&gt;Easier to obtain administrative permissions&lt;/LI&gt;
&lt;LI&gt;Lower operational impact&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Cons&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;May not reflect production‑scale usage patterns&lt;/LI&gt;
&lt;LI&gt;Some SharePoint features or integrations may be missing&lt;/LI&gt;
&lt;LI&gt;Requires simulated user activity to generate meaningful data&lt;/LI&gt;
&lt;LI&gt;Trial tenants are time‑limited&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Learn More&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://www.microsoft.com/office" target="_blank" rel="noopener"&gt;Microsoft 365 Trial Options&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://www.microsoft.com/azure" target="_blank" rel="noopener"&gt;Azure Trial Options&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Option 2: Start with the SharePoint Sites Dataset&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Description&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;The Sites dataset is typically the smallest MGDC dataset for SharePoint and provides tenant‑wide metadata for all site collections.&lt;/P&gt;
&lt;P&gt;Pros&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Lower cost compared to Files or Permissions datasets&lt;/LI&gt;
&lt;LI&gt;Provides organization‑wide coverage&lt;/LI&gt;
&lt;LI&gt;Minimal MGDC configuration beyond standard onboarding&lt;/LI&gt;
&lt;LI&gt;Small dataset can be handled directly by a variety of analysis tools&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Cons&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Does not include permission or file details&lt;/LI&gt;
&lt;LI&gt;Limited insight compared to full datasets&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Learn More&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-how-can-i-estimate-my-azure-bill-updated/4284157" target="_blank" rel="noopener"&gt;How can I estimate my Azure bill? Updated! &lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/step-by-step-synapse-gather-a-detailed-dataset-on-sharepoint-sites-using-mgdc/4070563" target="_blank" rel="noopener"&gt;Gather a detailed dataset on SharePoint Sites &lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
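&lt;P class="lia-indent-padding-left-30px"&gt;Because the Sites dataset is small, its JSON Lines output can be summarized with simple scripts. The sketch below aggregates storage by site template; the property names (RootWeb.WebTemplate, StorageMetrics.TotalSize) are illustrative assumptions, so check the published Sites schema for the exact names.&lt;/P&gt;

```python
import json
from collections import defaultdict

# Hypothetical sample of a Sites extract in JSON Lines form.
# Property names are assumptions for illustration, not the confirmed schema.
sample_lines = """\
{"Id": "site-1", "RootWeb": {"WebTemplate": "SITEPAGEPUBLISHING"}, "StorageMetrics": {"TotalSize": 5368709120}}
{"Id": "site-2", "RootWeb": {"WebTemplate": "SPSPERS"}, "StorageMetrics": {"TotalSize": 1073741824}}
{"Id": "site-3", "RootWeb": {"WebTemplate": "SITEPAGEPUBLISHING"}, "StorageMetrics": {"TotalSize": 2147483648}}
""".splitlines()

def storage_by_template(lines):
    """Sum storage bytes per site template across all JSON Lines rows."""
    totals = defaultdict(int)
    for line in lines:
        site = json.loads(line)
        totals[site["RootWeb"]["WebTemplate"]] += site["StorageMetrics"]["TotalSize"]
    return dict(totals)

print(storage_by_template(sample_lines))
```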
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Option 3: Sample a Limited Number of Rows&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Description&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;Some MGDC SharePoint datasets support returning only a subset of rows in query results. This is supported across the top 5 SharePoint datasets in MGDC (Sites, Permissions, Groups, Files and File Actions).&lt;/P&gt;
&lt;P&gt;Pros&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Minimal and predictable extraction cost&lt;/LI&gt;
&lt;LI&gt;Enables rapid schema inspection&lt;/LI&gt;
&lt;LI&gt;Provides total dataset row count in request metadata&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Cons&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Rows are not returned in a predictable order&lt;/LI&gt;
&lt;LI&gt;The sample is not randomized, so it is not reproducible and may be biased&lt;/LI&gt;
&lt;LI&gt;Results should not be used to draw tenant‑level conclusions&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Learn More&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-how-can-i-sample-or-estimate-the-number-objects-in-a-dat/4094689" target="_blank" rel="noopener"&gt;How can I sample or estimate the number of objects in a dataset?&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Option 4: Filter by SiteId&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Description&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;Because SharePoint data is partitioned by site collection, MGDC filtering allows you to extract data from a single site or a small group of representative sites. This supports Sites, Permissions, Groups, Files and File Actions datasets.&lt;/P&gt;
&lt;P&gt;Pros&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Enables realistic workload simulation&lt;/LI&gt;
&lt;LI&gt;Reduces total extraction volume&lt;/LI&gt;
&lt;LI&gt;Simplifies downstream reporting&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Cons&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;May introduce sampling bias&lt;/LI&gt;
&lt;LI&gt;Not suitable for tenant‑wide reporting&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Learn More&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-how-can-i-filter-rows-on-a-dataset/4094712" target="_blank" rel="noopener"&gt;How can I filter rows on a dataset?&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Option 5: Filter by TemplateId&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Description&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;Instead of selecting individual sites, filter by site template to isolate specific workloads. For example, you could filter for OneDrives or SharePoint Embedded.&lt;/P&gt;
&lt;P&gt;Pros&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Consistent dataset scope&lt;/LI&gt;
&lt;LI&gt;Useful for workload‑specific analysis&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Cons&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Limited dataset support (supported only for Sites, Files and File Actions)&lt;/LI&gt;
&lt;LI&gt;May not reflect cross‑workload usage patterns&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Learn More&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-how-can-i-filter-rows-on-a-dataset/4094712" target="_blank" rel="noopener"&gt;How can I filter rows on a dataset?&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Option 6: Use Delta State Datasets&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Description&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;Delta datasets allow you to retrieve only changes since your last data transfer for supported SharePoint State datasets.&lt;/P&gt;
&lt;P&gt;Pros&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Enables recurring analytics with lower extraction costs&lt;/LI&gt;
&lt;LI&gt;Supports daily or weekly trend analysis&lt;/LI&gt;
&lt;LI&gt;Reduces data movement after initial ingestion&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Cons&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Requires an initial full dataset pull&lt;/LI&gt;
&lt;LI&gt;Adds complexity to downstream merge processing&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Learn More&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-how-can-i-use-delta-state-datasets/4078925" target="_blank" rel="noopener"&gt;How can I use Delta State Datasets?&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-how-do-i-process-deltas/4078949" target="_blank" rel="noopener"&gt;How do I process Deltas?&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
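&lt;P class="lia-indent-padding-left-30px"&gt;The downstream merge mentioned above can be sketched as a keyed upsert/delete. This is a minimal sketch: the "Operation" change-marker column and its values (Created/Updated/Deleted) are assumptions here, so consult the delta FAQ for the actual field.&lt;/P&gt;

```python
# Merge a delta pull into an existing full snapshot, keyed by object Id.
# The "Operation" column is an assumed change marker, not the confirmed schema.
def apply_delta(snapshot, delta_rows):
    """Return a new snapshot dict after applying delta rows in order."""
    merged = dict(snapshot)
    for row in delta_rows:
        if row["Operation"] == "Deleted":
            merged.pop(row["Id"], None)   # drop removed objects
        else:                             # Created or Updated: upsert the row
            merged[row["Id"]] = row
    return merged

snapshot = {"a": {"Id": "a", "Size": 10}, "b": {"Id": "b", "Size": 20}}
delta = [
    {"Id": "b", "Operation": "Updated", "Size": 25},
    {"Id": "c", "Operation": "Created", "Size": 5},
    {"Id": "a", "Operation": "Deleted"},
]
merged = apply_delta(snapshot, delta)
print(sorted(merged))  # ['b', 'c']
```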
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Summary&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;MGDC for SharePoint provides multiple approaches to extract targeted subsets of tenant data, allowing teams to:&lt;/P&gt;
&lt;UL class="lia-indent-padding-left-30px"&gt;
&lt;LI class="lia-indent-padding-left-30px"&gt;
&lt;UL class="lia-indent-padding-left-30px"&gt;
&lt;LI class="lia-indent-padding-left-30px"&gt;Run proof‑of‑concept deployments&lt;/LI&gt;
&lt;LI class="lia-indent-padding-left-30px"&gt;Validate analytics pipelines&lt;/LI&gt;
&lt;LI class="lia-indent-padding-left-30px"&gt;Test governance or migration scenarios&lt;/LI&gt;
&lt;LI class="lia-indent-padding-left-30px"&gt;Estimate ongoing MGDC and Azure costs&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;By selecting the right combination of dataset scope, filtering strategy, sampling method or delta tracking, you can balance cost, representativeness, and implementation effort before scaling to a full production deployment.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;For additional guidance on MGDC for SharePoint, visit &lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;SharePoint Data in MGDC&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Apr 2026 19:01:16 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-to-run-a-poc-without-pulling-your/ba-p/4509716</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2026-04-08T19:01:16Z</dc:date>
    </item>
    <item>
      <title>MGDC for SharePoint FAQ: Which dates to use for the Files dataset?</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-which-dates-to-use-for-the-files-dataset/ba-p/4460137</link>
      <description>&lt;P&gt;The SharePoint Files dataset became publicly available in Microsoft Graph Data Connect (MGDC) on August 20, 2024. With its release came a unique pricing model ( $0.75 per 50,000 objects, significantly lower than the standard $0.75 per 1,000 objects for other MGDC datasets) and a unique update frequency (data is refreshed weekly, not daily).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;[Image: Archimedes sculpting a file]&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Weekly Refresh&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Unlike other SharePoint datasets, the Files dataset is refreshed only once a week, over the weekend. This timing helps avoid performance conflicts with tenant activity, as the dataset is typically massive. You can learn more about dataset behavior in the &lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;MGDC for SharePoint FAQ: Dataset types and features&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;So, which dates should you use for your data pulls? There are two main scenarios: full data pulls and delta pulls.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Full Data Pulls&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;For a complete snapshot, treat the SharePoint Files dataset like other state-based datasets (Sites, Groups, Permissions). Use the same start and end date to indicate a full pull. Since MGDC data is always delayed by two days, you’ll want to target the most recent refresh.&lt;/P&gt;
&lt;P&gt;For instance, if today is Wednesday, October 8, 2025 (2025-10-08), set both your start and end date to Monday, October 6, 2025 (2025-10-06). This aligns with the weekend refresh and ensures you're pulling the latest available data as soon as possible.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;[Image: Calendar]&lt;/EM&gt;&lt;/P&gt;
&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Note: Time values don’t matter for MGDC for SharePoint. Feel free to leave them at 00:00:00 for both start and end date.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Delta Pulls&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;To keep your dataset current, use delta pulls. These capture all file additions, updates, and deletions. Details on how delta datasets work are available in the &lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;MGDC for SharePoint FAQ: How can I use Delta State Datasets&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Since the dataset updates weekly, pulling deltas more often than that won’t yield new data. Stick to a Wednesday schedule, targeting the previous Monday’s data.&lt;/P&gt;
&lt;P&gt;In summary, here's a good strategy for using deltas with Files:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Initial full pull:&lt;/STRONG&gt; Run on Wednesday, using Monday’s date for both start and end.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Ongoing delta pulls:&lt;/STRONG&gt; Every Wednesday, set the end date to the most recent Monday and the start date to the Monday before that.&lt;/LI&gt;
&lt;/UL&gt;
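&lt;P&gt;The date arithmetic in the strategy above can be sketched as follows: find the most recent Monday that is at least two days in the past (to respect the two-day data delay), then use the Monday before it as the delta start date. For a full pull, use the end Monday for both dates.&lt;/P&gt;

```python
from datetime import date, timedelta

def files_pull_dates(today):
    """Return (start, end) dates for a weekly Files delta pull.

    End date is the most recent Monday at least two days in the past
    (MGDC data is delayed by two days); start is the Monday before it.
    """
    monday = today - timedelta(days=today.weekday())  # Monday == weekday 0
    if (today - monday).days < 2:      # Monday/Tuesday: refresh not ready yet
        monday -= timedelta(days=7)
    return monday - timedelta(days=7), monday

# The article's example: running on Wednesday, October 8, 2025
start, end = files_pull_dates(date(2025, 10, 8))
print(start, end)  # 2025-09-29 2025-10-06
```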
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For more guidance and documentation, visit&amp;nbsp;&lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;MGDC for SharePoint&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Oct 2025 22:05:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-which-dates-to-use-for-the-files-dataset/ba-p/4460137</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-10-08T22:05:00Z</dc:date>
    </item>
    <item>
      <title>MGDC for SharePoint FAQ: Which regions are supported? Updated in September 2025</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-which-regions-are-supported-updated-in/ba-p/4456384</link>
      <description>&lt;P&gt;&lt;STRONG&gt;1. Overview&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;When gathering SharePoint data through Microsoft Graph Data Connect (MGDC), you need to pay attention to an important detail:&amp;nbsp;regions. That includes the region of your Microsoft 365 tenant (the data source), the region of the Azure Storage account (the data destination) and the region of the workspace where you run your pipelines.&lt;/P&gt;
&lt;P&gt;In this blog, we’ll investigate the importance of understanding regions, which regions are supported by SharePoint on MGDC, the requirement to use a matching region in Azure, and how to handle a tenant with data in multiple regions. We will also share the additional regions supported as of September 2025.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;[Image: Archimedes looking at the globe]&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2. Regions&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;When you use MGDC to pull data from your SharePoint tenant to an Azure account, you always do this in the context of a region. The region is the geographic location where Microsoft stores your data. MGDC and SharePoint support extracting data from several Microsoft 365 regions. As I write this blog, those regions include all three macro regions (Asia-Pacific, Europe and North America), plus many of the go-local regions (like Australia, Brazil and Canada).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;3. Which regions are supported by SharePoint on MGDC?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;To move data from a Microsoft 365 tenant to your Azure account, both must map to a supported region for MGDC. On the Microsoft 365 side, the data location for the SharePoint data must be supported. On the Azure side, the region for your storage account, your Synapse workspace and your Azure Data Factory instance must be supported.&lt;/P&gt;
&lt;P&gt;Here is the list of supported Microsoft 365 and Azure regions:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-align-center" border="1" style="border-width: 1px; border-spacing: 2px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Microsoft 365 region&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Azure region&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;&lt;STRONG&gt;SharePoint&lt;BR /&gt;&lt;/STRONG&gt;&lt;STRONG&gt;Datasets*&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;&lt;STRONG&gt;All Other&lt;BR /&gt;Datasets&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Asia-Pacific (APC)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;East Asia&lt;BR /&gt;Southeast Asia&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Australia (AUS)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Australia East&lt;BR /&gt;Australia Southeast&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Brazil (BRA)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Brazil South&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Canada (CAN)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Canada Central&lt;/P&gt;
&lt;P&gt;Canada East&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Europe (EUR)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;North Europe&lt;BR /&gt;West Europe&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;France (FRA)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;France Central&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Germany (DEU)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Germany West Central&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;India (IND)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;South India&lt;BR /&gt;Central India&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Israel (ISR)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Israel Central&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;N&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Italy (ITA)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Italy North&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;N&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Japan (JPN)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Japan West&lt;BR /&gt;Japan East&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Korea (KOR)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Korea Central&lt;BR /&gt;Korea South&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Mexico (MEX)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Mexico Central&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;N&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;North America (NAM)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Central US&lt;BR /&gt;East US&lt;BR /&gt;East US 2&lt;BR /&gt;North Central US&lt;BR /&gt;South Central US&lt;BR /&gt;West Central US&lt;BR /&gt;West US&lt;BR /&gt;West US 2&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Norway (NOR)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Norway East&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Poland (POL)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Poland Central&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;N&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Qatar (QAT)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Qatar Central&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;N&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;South Africa (ZAF)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;South Africa North&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;N&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Spain (ESP)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Spain Central&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;N&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Sweden (SWE)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Sweden Central&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;N&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Switzerland (CHE)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Switzerland North&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;UAE (UAE)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;UAE North&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;United Kingdom (GBR)&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;UK South&lt;BR /&gt;UK West&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;td style="padding: 2px;"&gt;
&lt;P&gt;Y&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Note: Some of the regions (the ones that were added most recently) are supported only for the SharePoint datasets in Microsoft Graph Data Connect. That includes Sites, Groups, Permissions, Files and File Actions. These recently added regions are not supported for OneDrive, Exchange, Teams and other Microsoft Graph Data Connect datasets.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;4. The regions must match&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;So, you need both your Microsoft 365 (Office) region and your Azure region to be supported (listed in the table above). Beyond that, you must also make sure that the Microsoft 365 regions and the Azure regions match.&amp;nbsp;For instance, if your Office region is Europe, you must use an Azure storage account in either North Europe or West Europe.&lt;/P&gt;
&lt;P&gt;If you try to pull data for a tenant in the Australia Office region from an Azure account in West US, it will not work: even though both regions are individually supported, they do not match, and you will get no data.&lt;/P&gt;
&lt;P&gt;Please note that Microsoft 365 offers local regions in certain geographies, like Germany, France and Norway. Although these are countries in Europe, these country-specific Microsoft 365 regions (also known as go-locals) are not included in the Microsoft 365 Europe region.&lt;/P&gt;
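&lt;P&gt;The matching rule can be expressed as a simple lookup from Microsoft 365 region to its allowed Azure regions. The sketch below uses an illustrative subset of the table above; it is a validation helper for planning, not an MGDC API.&lt;/P&gt;

```python
# Illustrative subset of the region table, mapping a Microsoft 365 region
# code to the Azure regions that match it for MGDC transfers.
SUPPORTED = {
    "EUR": {"North Europe", "West Europe"},
    "AUS": {"Australia East", "Australia Southeast"},
    "NAM": {"Central US", "East US", "East US 2", "North Central US",
            "South Central US", "West Central US", "West US", "West US 2"},
}

def regions_match(m365_region, azure_region):
    """True only when the Azure region maps to the tenant's M365 region."""
    return azure_region in SUPPORTED.get(m365_region, set())

print(regions_match("EUR", "West Europe"))  # True
print(regions_match("AUS", "West US"))      # False: both supported, no match
```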
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;5. How do I find my Microsoft 365 (Office) region?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;You can use the Microsoft 365 Admin Center to find the region for your tenant. The URL for the Admin center is&amp;nbsp;&lt;A href="https://admin.microsoft.com/" target="_blank"&gt;https://admin.microsoft.com&lt;/A&gt;.&amp;nbsp; You will find this information under “Show all”, “Settings”, “Org settings”. Then select the tab for “Organizational Settings” and select “Data location”.&lt;/P&gt;
&lt;P&gt;Here are some sample screenshots:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;[Image: Data Location in Org Settings]&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;[Image: Data Location per Service]&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;If your tenant is not in one of the supported SharePoint on MGDC regions, please reach out to help us prioritize which regions we should add in the future.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;6. How do I find my Azure account region?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;When you create Azure resources like a Storage account, you must specify the region where you want your data to be stored. This also applies to other resources in Azure like your Azure Synapse workspace or your Azure Data Factory instance.&lt;/P&gt;
&lt;P&gt;See below a list of storage accounts in Azure, showing the region (location):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;[Image: Azure Storage Account Location]&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;7. What if my tenant uses multiple regions?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;It is possible for a Microsoft 365 tenant to have data in multiple regions. This single tenant can choose to have some OneDrive and SharePoint sites in one region and another set of OneDrive and SharePoint sites in a different region. This multi-geo capability is commonly used by large, multi-national companies.&lt;/P&gt;
&lt;P&gt;For these tenants, MGDC for SharePoint will only allow you to pull data for one region at a time, using an Azure account that matches that region. For instance, if your tenant has sites in North America and Europe, you will need an Azure account in North America to get the North American sites and another Azure account in Europe to pull the European sites.&lt;/P&gt;
&lt;P&gt;The priority is to keep the data from the different regions separate, for compliance reasons. Once the multiple datasets are transferred to the multiple Azure accounts, you could join them into a single account, if your tenant’s compliance requirements allow for that.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;8. What kind of error will I see?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;If you use an unsupported region, you will see an error like this:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;Operation on target &amp;lt;ActivityName&amp;gt; failed:
ErrorCode=UserErrorOffice365DataLoaderError,
‘Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Office365 data loading failed to execute.
office365LoadErrorType: PermanentError.
Dataset: &amp;lt;DatasetName&amp;gt; is not available in the region: &amp;lt;UnsupportedRegion&amp;gt;,
Source=Microsoft.DataTransfer.ClientLibrary,’&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;9. Summary&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This blog announced additional Microsoft 365 and Azure regions supported in MGDC for SharePoint. For more information about SharePoint Data in MGDC, please visit the collection of links at &lt;A href="https://aka.ms/SharePointData" target="_blank"&gt;https://aka.ms/SharePointData&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 24 Sep 2025 20:55:12 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-which-regions-are-supported-updated-in/ba-p/4456384</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-09-24T20:55:12Z</dc:date>
    </item>
    <item>
      <title>Capacity Template v2 with Microsoft Fabric</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/capacity-template-v2-with-microsoft-fabric/ba-p/4451459</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;1. Capacity Scenario&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;One of the most common scenarios for Microsoft Graph Data Connect (MGDC) for SharePoint is Capacity. This scenario focuses on identifying which sites and files are using the most storage, along with understanding the distribution of these large sites and files by properties like type and age.&lt;/P&gt;
&lt;P&gt;The MGDC datasets for this scenario are SharePoint Sites and SharePoint Files. If you’re not familiar with these datasets, you can find details in the schema definitions at &lt;A href="https://aka.ms/SharePointDatasets" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointDatasets&lt;/A&gt;.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To assist you in using these datasets, the team has developed a Capacity Template. It was initially published as a template for Azure Synapse, and we now have a new Microsoft Fabric template that is simpler and offers more features.&lt;/P&gt;
&lt;P&gt;This SharePoint Capacity v2 Template, based on Microsoft Fabric, is now publicly available.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;2. Instructions&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;The template comes with a set of detailed instructions at&amp;nbsp;&lt;A href="https://aka.ms/fabriccapacitytemplatesteps" target="_blank" rel="noopener"&gt;https://aka.ms/fabriccapacitytemplatesteps&lt;/A&gt;. These instructions include:&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;How to install the Microsoft Fabric and Microsoft Graph Data Connect prerequisites&lt;/LI&gt;
&lt;LI&gt;How to import the pipeline template from the Microsoft Fabric gallery and set it up&lt;/LI&gt;
&lt;LI&gt;How to import the Power BI template and configure the data source settings&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Below are some additional details about the template.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;3. Microsoft Fabric Pipeline&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;After you import the pipeline template, it will look like this:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;4. Pipeline in Microsoft Fabric&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;The Capacity template for Microsoft Fabric includes a few key improvements:&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The new template uses delta datasets to update the SharePoint Sites and SharePoint Files datasets. It keeps track of the last time the datasets were pulled by this pipeline, requesting just what changed since then.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;The new template uses views to do calculations and create new properties like size bands or date bands. In our previous template, this was done in Power Query, when importing into Power BI. &lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;The new template also uses a view to aggregate file data, grouping the data by file extension. &lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;/UL&gt;
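&lt;P&gt;To illustrate how a delta pull can be applied, the merge below is a hypothetical T-SQL sketch, not the template’s actual implementation: the table names (Sites, SitesDelta) and the Operation values are assumptions based on the dataset’s Operation column.&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;-- Hypothetical sketch: apply a delta extraction to a previously pulled Sites table.
-- Assumes delta rows carry Operation values of 'Created', 'Updated' or 'Deleted'.
MERGE Sites AS t
USING SitesDelta AS d
    ON t.Id = d.Id
WHEN MATCHED AND d.Operation = 'Deleted' THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET t.StorageUsed = d.StorageUsed,
               t.SnapshotDate = d.SnapshotDate -- plus the remaining columns
WHEN NOT MATCHED AND d.Operation &amp;lt;&amp;gt; 'Deleted' THEN
    INSERT (Id, Url, StorageUsed, SnapshotDate) -- plus the remaining columns
    VALUES (d.Id, d.Url, d.StorageUsed, d.SnapshotDate);&lt;/LI-CODE&gt;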
&lt;P&gt;You can find details on how to find and deploy the Microsoft Fabric template in the instructions (see item 3).&lt;/P&gt;
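&lt;P&gt;To illustrate the views mentioned above, a size-band view could look like the hypothetical sketch below. The view name, band boundaries and column names are assumptions; the actual views in the template may differ.&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;-- Hypothetical sketch of a size-band view over the Sites table.
CREATE VIEW SitesWithSizeBands AS
SELECT
    Id,
    Url,
    StorageMetrics_TotalSize,
    CASE
        WHEN StorageMetrics_TotalSize &amp;lt; 1073741824 THEN '0. Under 1 GB'
        WHEN StorageMetrics_TotalSize &amp;lt; 10737418240 THEN '1. 1 GB to 10 GB'
        WHEN StorageMetrics_TotalSize &amp;lt; 107374182400 THEN '2. 10 GB to 100 GB'
        ELSE '3. Over 100 GB'
    END AS SizeBand
FROM Sites;&lt;/LI-CODE&gt;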
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;5. Microsoft Fabric Report&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;The typical result from this solution is a set of Power BI dashboards pulled from the Microsoft Fabric data source. Here are some examples:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;These dashboards serve as examples or starting points and can be modified as necessary for various visualizations of the data within these datasets.&lt;/P&gt;
&lt;P&gt;The instructions (see item 3) include details on how to find and deploy a few sample Power BI Capacity templates.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;6. Conclusion&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;I hope this provides a good overview of the Capacity template for Microsoft Fabric. You can read more about Microsoft Graph Data Connect for SharePoint at &lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;. There you will find many details, including a list of available datasets, other common scenarios and frequently asked questions.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 05 Sep 2025 23:26:04 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/capacity-template-v2-with-microsoft-fabric/ba-p/4451459</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-09-05T23:26:04Z</dc:date>
    </item>
    <item>
      <title>Using Copilot (GPT-5) to turn schema into SQL and write queries</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/using-copilot-gpt-5-to-turn-schema-into-sql-and-write-queries/ba-p/4450643</link>
      <description>&lt;P&gt;This article shows a few ways you can use Copilot to help with Microsoft Graph Data Connect for SharePoint. This includes turning a schema into a SQL CREATE TABLE statement and summarizing data from that. Make sure to use the Copilot option to use GPT-5 for best results.&lt;/P&gt;
&lt;img alt="Archimedes sculpting SQL in marble" /&gt;
&lt;P&gt;&lt;STRONG&gt;Scenario 1&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This first scenario just builds a CREATE TABLE statement from the dataset schema definition in markdown notation. I provide a few instructions on how to handle the flattening of the objects.&lt;/P&gt;
&lt;P&gt;Me:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Given the schema at this URL: https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointsites.md, give me a CREATE TABLE statement for a SQL Server. Just include the single SQL command in the output. Each column should have a comment with a short description. Don't include any information about keys or constraints. This dataset is flattened. One column in the table for every row, except those of type object. Column names to be flattened with a prefix. Use an underline to separate the prefix.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Copilot:&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;CREATE TABLE SharePointSites (
    ptenant NVARCHAR(255) -- Id of the tenant
    ,Id NVARCHAR(255) -- GUID of the site
    ,Url NVARCHAR(2048) -- URL for the site
    ,ArchiveState NVARCHAR(50) -- Archive state of the site: None, Archiving, Archived, or Reactivating
    ,SiteArchivedBy NVARCHAR(255) -- Who archived the site, null if not archived
    ,SiteArchivedTime DATETIME -- When the site was archived, null if not archived
    ,RootWeb_Id NVARCHAR(255) -- Root web id
    ,RootWeb_Title NVARCHAR(512) -- Root web title
    ,RootWeb_WebTemplate NVARCHAR(255) -- Root web template name
    ,RootWeb_WebTemplateId INT -- Root web template id
    ,RootWeb_Configuration INT -- Root web template configuration id
    ,RootWeb_LastItemModifiedDate DATETIME -- Date when an item in the root web was last modified
    ,WebCount BIGINT -- Number of webs (subsites) in the site
    ,StorageQuota BIGINT -- Total storage in bytes allowed for this site
    ,StorageUsed BIGINT -- Total storage in bytes used by this site
    ,StorageMetrics_MetadataSize BIGINT -- Total metadata size for the site in bytes
    ,StorageMetrics_TotalFileCount BIGINT -- Total number of files for the site
    ,StorageMetrics_TotalFileStreamSize BIGINT -- Total size of latest version of files in bytes
    ,StorageMetrics_AdditionalFileStreamSize BIGINT -- Total size of additional file streams in bytes
    ,StorageMetrics_TotalSize BIGINT -- Total size of all files for the site in bytes
    ,GroupId NVARCHAR(255) -- Id of the group associated with this site
    ,GeoLocation NVARCHAR(50) -- Geographic region where the data is stored
    ,IsInRecycleBin BIT -- Indicates if the site is in the recycle bin
    ,RecycleBinItemCount BIGINT -- Number of items in the recycle bin
    ,RecycleBinItemSize BIGINT -- Size of items in the recycle bin
    ,SecondStageRecycleBinStorageUsage BIGINT -- Size of items in the second stage recycle bin
    ,IsTeamsConnectedSite BIT -- Indicates if the site is connected to Teams
    ,IsTeamsChannelSite BIT -- Indicates if the site is a channel site
    ,TeamsChannelType NVARCHAR(50) -- Type of channel if site is a Teams channel site
    ,IsHubSite BIT -- Indicates if the site is associated with a hub site
    ,HubSiteId NVARCHAR(255) -- Id of the hub site for this site
    ,IsCommunicationSite BIT -- Indicates if the site is a communication site
    ,IsOneDrive BIT -- Indicates if the site is a OneDrive
    ,BlockAccessFromUnmanagedDevices BIT -- Site blocks access from unmanaged devices
    ,BlockDownloadOfAllFilesOnUnmanagedDevices BIT -- Site blocks download of all files from unmanaged devices
    ,BlockDownloadOfViewableFilesOnUnmanagedDevices BIT -- Site blocks download of viewable files from unmanaged devices
    ,ShareByEmailEnabled BIT -- Site enables share by e-mail
    ,ShareByLinkEnabled BIT -- Site enables share by link
    ,IsExternalSharingEnabled BIT -- Indicates if external sharing is enabled
    ,SiteConnectedToPrivateGroup BIT -- Indicates if site is connected to a Private Group
    ,Privacy NVARCHAR(50) -- Privacy of the site: Private or Public
    ,SensitivityLabelInfo_Id NVARCHAR(255) -- Id of the Sensitivity Label for the site
    ,SensitivityLabelInfo_DisplayName NVARCHAR(255) -- Display name of the Sensitivity Label for the site
    ,Classification NVARCHAR(255) -- Classification of the site
    ,IBMode NVARCHAR(50) -- Information Barriers Mode
    ,IBSegments NVARCHAR(MAX) -- List of organization segments if IB mode is Explicit
    ,Owner_AadObjectId NVARCHAR(255) -- AAD Object Id of the owner
    ,Owner_Email NVARCHAR(320) -- Email of the owner
    ,Owner_UPN NVARCHAR(320) -- User Principal Name of the owner
    ,Owner_Name NVARCHAR(255) -- Name of the owner
    ,SecondaryContact_AadObjectId NVARCHAR(255) -- AAD Object Id of the secondary contact
    ,SecondaryContact_Email NVARCHAR(320) -- Email of the secondary contact
    ,SecondaryContact_UPN NVARCHAR(320) -- User Principal Name of the secondary contact
    ,SecondaryContact_Name NVARCHAR(255) -- Name of the secondary contact
    ,ReadLocked BIT -- Whether the site is locked for read access
    ,ReadOnly BIT -- Whether the site is in read-only mode
    ,CreatedTime DATETIME -- When the site was created (UTC)
    ,LastSecurityModifiedDate DATETIME -- When security on the site was last changed (UTC)
    ,LastUserAccessDate DATETIME -- Last access by a real user (UTC)
    ,LastContentChange DATETIME -- When the site contents were last changed (UTC)
    ,SnapshotDate DATETIME -- When this site information was captured (UTC)
    ,Operation NVARCHAR(50) -- Extraction mode of this row
);
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Scenario 2&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Now I use the CREATE TABLE as the base to ask for a SELECT statement.&lt;/P&gt;
&lt;P&gt;Me:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Give me a list of sites that have more than 5GB of files. Give me a single SQL statement using the tables described below. &amp;lt;create table statements included here&amp;gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Copilot:&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;SELECT
    Id AS SiteId,
    Url AS SiteUrl,
    RootWeb_Title AS SiteTitle,
    StorageMetrics_TotalSize AS TotalSizeInBytes,
    CAST(StorageMetrics_TotalSize / (1024.0 * 1024 * 1024) AS DECIMAL(10,2)) AS TotalSizeInGB
FROM 
    SharePointSites
WHERE 
    StorageMetrics_TotalSize &amp;gt; (5 * 1024 * 1024 * 1024)
ORDER BY 
    StorageMetrics_TotalSize DESC;
&lt;/LI-CODE&gt;
&lt;P&gt;Note: Interesting choice of columns and order, without any specific prompting to that effect.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Scenario 3&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Now for a more challenging scenario using two tables and a more complex requirement.&lt;/P&gt;
&lt;P&gt;Me:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Given the tables below, write a SELECT statement to give me a list of sites with the name of the site, the total storage size and the top 3 file extensions by size. &amp;lt;create table statements for Sites and Files included here&amp;gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Copilot:&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;WITH ExtensionTotals AS (
    SELECT
        f.SiteId,
        f.Extension,
        SUM(f.SizeInBytes) AS TotalSize
    FROM SharePointFiles f
    WHERE f.Extension IS NOT NULL
    GROUP BY f.SiteId, f.Extension
),
RankedExtensions AS (
    SELECT
        et.SiteId,
        et.Extension,
        et.TotalSize,
        ROW_NUMBER() OVER (PARTITION BY et.SiteId ORDER BY et.TotalSize DESC) AS rn
    FROM ExtensionTotals et
),
TopExtensions AS (
    SELECT
        SiteId,
        STRING_AGG(Extension + ' (' + CAST(TotalSize AS VARCHAR(20)) + ')', ', ') WITHIN GROUP (ORDER BY TotalSize DESC) AS Top3Extensions
    FROM RankedExtensions
    WHERE rn &amp;lt;= 3
    GROUP BY SiteId
)
SELECT
    s.RootWeb_Title AS SiteName,
    s.StorageMetrics_TotalSize AS TotalStorageSize,
    te.Top3Extensions
FROM SharePointSites s
LEFT JOIN TopExtensions te ON s.Id = te.SiteId
ORDER BY s.StorageMetrics_TotalSize DESC;&lt;/LI-CODE&gt;
&lt;P&gt;Notes: Scenario is nicely handled, using both the Files and Sites tables. Complex aggregation is coded correctly.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Scenario 4&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;In this one, I skip the step of using a SQL CREATE TABLE statement and go directly from the markdown for the two schemas (Sites and Files) to a SQL SELECT statement. I also let GPT-5 include its comments and notes this time, to make sure it's reasoning correctly...&lt;/P&gt;
&lt;P&gt;Me:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Consider a SQL table for SharePoint Sites based on the schema at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointsites.md - This dataset is flattened. There is one column in the SQL table for every row, except those of type object. Column names are flattened with a prefix. Column names use an underline to separate the prefix. Also consider a SQL table for SharePoint Files based on the schema at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointfiles.md - Use the same assumptions as the previous one. Now write a SELECT statement to give me a list of sites with the name of the site, the total storage size and the top 3 file extensions by size.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Copilot:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;Got it — you want to join the&amp;nbsp;&lt;STRONG&gt;flattened SharePoint Sites&lt;/STRONG&gt; table with the &lt;STRONG&gt;flattened SharePoint Files&lt;/STRONG&gt; table, aggregate storage usage, and then find the &lt;STRONG&gt;top 3 file extensions by size&lt;/STRONG&gt; for each site. Here’s a SQL example that works with the flattened column naming convention you described:&lt;BR /&gt;&lt;BR /&gt;&lt;LI-CODE lang="sql"&gt;WITH FileSizeByExtension AS (
    SELECT
        f.SiteId,
        f.Extension,
        SUM(f.SizeInBytes) AS TotalExtensionSize
    FROM SharePointFiles f
    GROUP BY f.SiteId, f.Extension
),
RankedExtensions AS (
    SELECT
        SiteId,
        Extension,
        TotalExtensionSize,
        ROW_NUMBER() OVER (PARTITION BY SiteId ORDER BY TotalExtensionSize DESC) AS rn
    FROM FileSizeByExtension
),
Top3Extensions AS (
    SELECT
        SiteId,
        STRING_AGG(Extension, ', ' ORDER BY TotalExtensionSize DESC) AS TopExtensions
    FROM RankedExtensions
    WHERE rn &amp;lt;= 3
    GROUP BY SiteId
)
SELECT
    s.RootWeb_Title AS SiteName,
    s.StorageMetrics_TotalSize AS TotalStorageBytes,
    t.TopExtensions
FROM SharePointSites s
LEFT JOIN Top3Extensions t
    ON s.Id = t.SiteId
ORDER BY s.StorageMetrics_TotalSize DESC;&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;How it works&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;FileSizeByExtension: Groups files by SiteId and Extension to sum their sizes.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;RankedExtensions: Uses ROW_NUMBER() to rank extensions by size within each site.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Top3Extensions: Filters to the top 3 per site and concatenates them into a comma-separated list.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Final SELECT: Joins back to the Sites table to get the site name and total storage size.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Notes&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;RootWeb_Title comes from flattening the RootWeb object in the Sites dataset.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;StorageMetrics_TotalSize is the flattened column for total site storage (excluding recycle bin).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;STRING_AGG works in SQL Server 2017+, Azure SQL, and some other RDBMS.&lt;/P&gt;
&lt;P&gt;If your SQL dialect doesn’t support it, you’d need a different aggregation method.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;This query orders sites by total storage size so the largest sites appear first.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Note: Nicely done in one prompt with the proper explanations.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Conclusion&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;I hope these examples help you see the potential of getting Copilot's help with your queries. This is one more reason to load your MGDC for SharePoint tables directly into Microsoft Fabric, where you can run SQL queries against tables in the OneLake...&lt;/P&gt;</description>
      <pubDate>Thu, 04 Sep 2025 20:10:51 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/using-copilot-gpt-5-to-turn-schema-into-sql-and-write-queries/ba-p/4450643</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-09-04T20:10:51Z</dc:date>
    </item>
    <item>
      <title>Microsoft Graph Data Connect for SharePoint is Generally Available</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/microsoft-graph-data-connect-for-sharepoint-is-generally/ba-p/4428915</link>
      <description>&lt;P&gt;General Availability (GA) for Microsoft Graph Data Connect was announced on August 2023, about two years ago. Details can be found at&amp;nbsp;&lt;A href="https://devblogs.microsoft.com/microsoft365dev/microsoft-graph-data-connect-pricing-updates-2/" target="_blank"&gt;Microsoft Graph Data Connect updates&lt;/A&gt;. At that time, the SharePoint datasets for Microsoft Graph Data Connect entered public preview.&lt;/P&gt;
&lt;P&gt;As of July 2025, Microsoft Graph Data Connect for SharePoint, which includes datasets for OneDrive and SharePoint, has reached General Availability (GA), following an extended public preview period. There are no changes to the datasets or features at this time.&lt;/P&gt;
&lt;P&gt;For more information about the datasets, visit &lt;A href="https://aka.ms/OneDriveAndSharePointDatasets" target="_blank"&gt;https://aka.ms/OneDriveAndSharePointDatasets&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;</description>
      <pubDate>Tue, 01 Jul 2025 15:52:58 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/microsoft-graph-data-connect-for-sharepoint-is-generally/ba-p/4428915</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-07-01T15:52:58Z</dc:date>
    </item>
    <item>
      <title>MGDC for SharePoint FAQ: How to List SharePoint Webs or Subsites?</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-to-list-sharepoint-webs-or-subsites/ba-p/4423363</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Introduction&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;SharePoint uses a known object hierarchy to store your files, including, in order:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Site Collection (site, SPSite)&lt;/LI&gt;
&lt;LI&gt;Web (site, subsite, SPWeb)&lt;/LI&gt;
&lt;LI&gt;List (SPList, Document Library, SPDocumentLibrary)&lt;/LI&gt;
&lt;LI&gt;Folder (SPFolder)&lt;/LI&gt;
&lt;LI&gt;List Item (documents, files, SPListItem, SPFile)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;If you are not familiar with all those, you can read more about them at &lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-what-is-in-the-permissions-dataset/4075447" target="_blank" rel="noopener"&gt;What is in the Permissions dataset?&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2. What we have in MGDC for SharePoint&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The SharePoint Sites dataset in Microsoft Graph Data Connect offers a list of all sites (site collections) in your tenant. The SharePoint Files dataset will give you details about each file (document). However, there are no MGDC datasets that will give you one object per SharePoint Web (subsite) or SharePoint List (library).&lt;/P&gt;
&lt;P&gt;There is a workaround, though. If you pay attention to the schema for the SharePoint Sites and SharePoint Files datasets in MGDC, you will find that Sites includes a few interesting details about the Root Web (the main web in the site collection) and the total number of webs (subsites) in each site. Also, the Files dataset includes details about the web (subsite) and the list (library) where the file lives. You could group on these columns to expose more information about webs and lists.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;3. One Web&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;These days, if you create a modern SharePoint site, you will get one web (called the Root Web) and no obvious mechanism to create subsites or additional webs. For any sites created in the last few years, you likely have no additional webs outside the Root Web.&lt;/P&gt;
&lt;P&gt;Your SharePoint tenant might have existed for many years, and your company could have sites using old templates with multiple subsites. If you’re trying to find out whether that’s the case for your tenant, the information is available in MGDC, if you are willing to dig a little deeper.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;4. Finding Old Subsites&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The main task here is to create a custom query to join the Sites and Files datasets. You would then group that data to enumerate each subsite (Web). You might also tag the Webs that are a Root Web, so that you can identify subsites that are not at the top of the Site hierarchy.&lt;/P&gt;
&lt;P&gt;If you pulled the Sites and Files datasets into a SQL Server or a Lakehouse in Microsoft Fabric, this is what that query could look like:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;SELECT
    Webs.SiteId,
    Sites.RootWeb.Title AS SiteTitle,
    Sites.Url AS SiteUrl,
    Sites.WebCount,
    Webs.WebId,
    CASE
        WHEN Root.RootWeb.Id IS NOT NULL
        THEN 1
        ELSE 0
    END AS IsRootWeb,
    Webs.FileCount,
    Webs.TotalSize,
    Webs.TotalSizeWithVersions
FROM (
    SELECT
        SiteId,
        WebId,
        COUNT(*) AS FileCount,
        SUM(SizeInBytes) AS TotalSize,
        SUM(SizeInBytesWithVersions) AS TotalSizeWithVersions
    FROM Files
    GROUP BY SiteId, WebId
) AS Webs
LEFT JOIN Sites AS Sites
    ON Webs.SiteId = Sites.Id
LEFT JOIN Sites AS Root
    ON Webs.SiteId = Root.Id
    AND Webs.WebId = Root.RootWeb.Id&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;5. Empty Webs, Web properties&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This technique will get you a nice list of Webs, but there are a few problems.&lt;/P&gt;
&lt;P&gt;First, if a specific web has no files, we won’t see that web in the Files dataset and therefore the query will fail to list that particular web. You would still see that Web counted in the WebCount property of the Sites dataset, but there would be no listing of that web in the query proposed above, since it has no files.&lt;/P&gt;
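&lt;P&gt;To find sites affected by this, using the same Sites and Files tables as the query above, you could compare WebCount with the number of distinct webs that actually appear in the Files dataset. This is a hypothetical sketch:&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;-- Hypothetical sketch: sites where WebCount exceeds the webs visible in Files,
-- that is, sites that likely contain webs with no files.
SELECT
    s.Id,
    s.Url,
    s.WebCount,
    COUNT(DISTINCT f.WebId) AS WebsWithFiles
FROM Sites AS s
LEFT JOIN Files AS f
    ON f.SiteId = s.Id
GROUP BY s.Id, s.Url, s.WebCount
HAVING s.WebCount &amp;gt; COUNT(DISTINCT f.WebId);&lt;/LI-CODE&gt;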
&lt;P&gt;Another problem is that there is no way in MGDC to find the detailed properties of each Web, like its Title. You get some of those for the Root Web in the Sites dataset, but not for the other Webs. For those, you would have to query the Web object using another source like the &lt;A href="https://learn.microsoft.com/en-us/graph/api/site-get" target="_blank" rel="noopener"&gt;Graph API&lt;/A&gt; or &lt;A href="https://learn.microsoft.com/en-us/powershell/module/sharepoint-server/get-spweb" target="_blank" rel="noopener"&gt;PowerShell&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG style="color: rgb(30, 30, 30);"&gt;6. Conclusion&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;I hope this article helped you find details about enumerating SharePoint Webs using the Microsoft Graph Data Connect. For recent deployments of SharePoint Online, you should have only one web per site and the Sites dataset (with Root Web details) should give you plenty of details. However, this might come in handy if you want to manage some of the old subsites of a slightly older SharePoint deployment.&lt;/P&gt;
&lt;P&gt;Finally, if you’re looking for a challenge, I would suggest adapting the SQL code above to enumerate all the document libraries using the Files dataset. It can be done following a similar pattern…&lt;/P&gt;
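&lt;P&gt;As a hypothetical starting point for that challenge, assuming the Files dataset exposes the list identifier as a ListId column (check the schema to confirm), the grouping could look like this:&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;-- Hypothetical sketch: enumerate document libraries by grouping Files by list.
SELECT
    SiteId,
    WebId,
    ListId,
    COUNT(*) AS FileCount,
    SUM(SizeInBytes) AS TotalSize,
    SUM(SizeInBytesWithVersions) AS TotalSizeWithVersions
FROM Files
GROUP BY SiteId, WebId, ListId;&lt;/LI-CODE&gt;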
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 12 Jun 2025 19:02:39 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-to-list-sharepoint-webs-or-subsites/ba-p/4423363</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-06-12T19:02:39Z</dc:date>
    </item>
    <item>
      <title>Book of Archimedes, Copilot and SharePoint Agents</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/book-of-archimedes-copilot-and-sharepoint-agents/ba-p/4405106</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Introduction&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The Book of Archimedes is a single PDF that contains all posts in the Microsoft Graph Data Connect for SharePoint blog. It includes 40 articles across nearly 400 pages, which makes it handy for quickly searching for specific keywords across all the articles.&lt;/P&gt;
&lt;P&gt;The PDF is provided as an attachment to the main MGDC for SharePoint article at&amp;nbsp;&lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;. You can also download it directly from &lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank" rel="noopener"&gt;Book of Archimedes&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2.&amp;nbsp;&lt;/STRONG&gt;&lt;STRONG&gt;Using the Book with Copilot&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Beyond searching inside the PDF, you can feed the Book of Archimedes as grounding to your favorite AI tool to get answers to your questions about MGDC for SharePoint. Let's look at an example.&lt;/P&gt;
&lt;P&gt;You can use this with Copilot or any AI tool that accepts attached files. In my case, I used &lt;A href="https://copilot.cloud.microsoft/" target="_blank" rel="noopener"&gt;https://copilot.cloud.microsoft/&lt;/A&gt; to access my work Copilot, the secure and compliant Copilot integrated with my enterprise account.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I start with the “+” button and then the “add content” button.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then I used the up-arrow icon to “upload from this device” and selected the local copy of the Book of Archimedes I downloaded earlier.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;After that, I added my question as a prompt. I tried “What is Project Archimedes?”&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I got a summary of Microsoft Graph Data Connect for SharePoint, including key scenarios. From there, I could ask any question. If the book contains the answer, Copilot will find it for you.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 100%; border-width: 1px;"&gt;&lt;colgroup&gt;&lt;col style="width: 99.8529%" /&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="padding: 10px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Project Archimedes&lt;/STRONG&gt; is the early codename for &lt;STRONG&gt;Microsoft Graph Data Connect for SharePoint&lt;/STRONG&gt;. This project aims to deliver rich data assets to OneDrive and SharePoint tenants, enabling them to run their own analytics, derive insights from their data, and understand how they use these products. The data is transferred to an Azure account owned by the tenant, where tools like Azure Synapse, Power BI, or Microsoft Fabric can be used to transform this data into insightful reports and dashboards &lt;A href="https://microsoft-my.sharepoint-df.com/personal/josebda_microsoft_com/Documents/Microsoft%20Copilot%20Chat%20Files/Book%20of%20Archimedes.pdf" target="_blank" rel="noopener"&gt;[1]&lt;/A&gt;.&lt;/P&gt;
&lt;H3&gt;Key Scenarios of Project Archimedes:&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Security&lt;/STRONG&gt;: Understanding permissions in SharePoint and OneDrive tenants to ensure content is properly protected.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Capacity&lt;/STRONG&gt;: Understanding site lifecycle, ownership, and storage used by SharePoint sites and OneDrives.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Sync Health&lt;/STRONG&gt;: Ensuring users are properly using OneDrive for Business to protect their files by synchronizing them with the cloud &lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank" rel="noopener"&gt;[1]&lt;/A&gt;.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;Datasets Provided:&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Sites&lt;/STRONG&gt;: Information about SharePoint sites.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Groups&lt;/STRONG&gt;: Details about SharePoint groups.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Permissions&lt;/STRONG&gt;: Information about sharing permissions.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Files&lt;/STRONG&gt;: Data about files stored in SharePoint and OneDrive.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;File Actions&lt;/STRONG&gt;: Details about actions performed on files.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Sync Health&lt;/STRONG&gt;: Information about devices running OneDrive Sync.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Sync Errors&lt;/STRONG&gt;: Details about errors encountered by OneDrive Sync clients &lt;A href="https://microsoft-my.sharepoint-df.com/personal/josebda_microsoft_com/Documents/Microsoft%20Copilot%20Chat%20Files/Book%20of%20Archimedes.pdf" target="_blank" rel="noopener"&gt;[1]&lt;/A&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Example Use Cases:&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Security Scenario&lt;/STRONG&gt;: Answering questions about oversharing, external sharing, sensitive data sharing, and permissions.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Capacity Scenario&lt;/STRONG&gt;: Tracking storage usage, site lifecycle, and identifying large or inactive sites.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Sync Health Scenario&lt;/STRONG&gt;: Monitoring device health, folder backup status, and sync errors &lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank" rel="noopener"&gt;[1]&lt;/A&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;For more detailed information, you can refer to the &lt;STRONG&gt;Book of Archimedes&lt;/STRONG&gt;, which contains comprehensive guides, FAQs, and step-by-step instructions related to Microsoft Graph Data Connect for SharePoint&lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank"&gt; [1]&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank"&gt;[1]: Book of Archimedes.pdf&lt;/A&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The more impressive capabilities of Copilot are shown when you ask for something that is not just a direct quote from the Book. For instance, I asked about pricing for a specific scenario: "If a tenant has 1,000 SharePoint sites, how much would it cost to get the SharePoint Sites and SharePoint permissions for that tenant using MGDC?"&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 100%; border-width: 1px;"&gt;&lt;colgroup&gt;&lt;col style="width: 99.8529%" /&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="padding: 10px;"&gt;
&lt;P&gt;To estimate the cost of getting the SharePoint Sites and SharePoint Permissions datasets for a tenant using Microsoft Graph Data Connect (MGDC), you can follow these guidelines:&lt;/P&gt;
&lt;H3&gt;Pricing Information:&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;SharePoint Sites and Permissions&lt;/STRONG&gt;: The price to pull 1,000 objects from the available MGDC for SharePoint datasets (Sites, Groups, and Permissions) in the US is $0.75, plus the cost for infrastructure like Azure Storage, Azure Data Factory, or Azure Synapse &lt;A href="https://microsoft-my.sharepoint-df.com/personal/josebda_microsoft_com/Documents/Microsoft%20Copilot%20Chat%20Files/Book%20of%20Archimedes.pdf" target="_blank" rel="noopener"&gt;[1]&lt;/A&gt;.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;SharePoint Files&lt;/STRONG&gt;: This dataset is billed at $0.75 per 50,000 objects due to its typical high volume &lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank" rel="noopener"&gt;[1]&lt;/A&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Estimating Costs:&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Number of Objects&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Sites&lt;/STRONG&gt;: The number of objects in your SharePoint Sites dataset will be equal to the number of sites in your tenant.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Permissions&lt;/STRONG&gt;: The average number of SharePoint permissions per site is around 61 &lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank"&gt;[1].&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Cost Calculation&lt;/STRONG&gt;:&lt;/LI&gt;
&lt;/OL&gt;
&lt;img /&gt;
&lt;H3&gt;Example Calculation:&lt;/H3&gt;
&lt;P&gt;For a tenant with 1,000 sites:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Sites Dataset Cost&lt;/STRONG&gt;: $0.75&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Permissions Dataset Cost&lt;/STRONG&gt;: $45.75&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Total Cost&lt;/STRONG&gt;: $0.75 + $45.75 = $46.50&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Additional Costs:&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Storage&lt;/STRONG&gt;: Costs for storing the data in Azure Storage.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Synapse&lt;/STRONG&gt;: Costs for running pipelines to move the data.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;For more precise estimates, you can sample the dataset to get a total object count without pulling the entire dataset &lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank" rel="noopener"&gt;[1]&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank"&gt;[1]: Book of Archimedes.pdf&lt;/A&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
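&lt;P&gt;Copilot’s arithmetic above is easy to verify yourself. Here is a minimal sketch of the same estimate (assuming the US rate of $0.75 per 1,000 objects for the Sites, Groups, and Permissions datasets, and the average of 61 permissions per site cited in the answer):&lt;/P&gt;

```scala
// Hypothetical helper reproducing the estimate above; the $0.75-per-1,000-objects
// rate and the 61-permissions-per-site average come from the Book of Archimedes.
val pricePer1000 = 0.75
val avgPermissionsPerSite = 61

def datasetCost(objectCount: Long): Double =
  (objectCount / 1000.0) * pricePer1000

val sites = 1000L
val sitesCost = datasetCost(sites)                               // $0.75
val permissionsCost = datasetCost(sites * avgPermissionsPerSite) // $45.75
val totalCost = sitesCost + permissionsCost                      // $46.50
println(f"Total MGDC cost estimate: $$$totalCost%.2f")
```

&lt;P&gt;For other tenants, substitute your own site count. Note that the Files dataset is billed at a different rate ($0.75 per 50,000 objects), so it would need a separate calculation.&lt;/P&gt;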
&lt;P&gt;Copilot will also answer questions in other languages. You just ask the question in whatever language you want, and it is detected automatically. Here’s an example using Brazilian Portuguese. I asked&amp;nbsp;"Quais são os datasets que fazem parte do cenário de segurança?" ("Which datasets are part of the security scenario?")&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 100%; border-width: 1px;"&gt;&lt;colgroup&gt;&lt;col style="width: 99.8529%" /&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="padding: 10px;"&gt;
&lt;P&gt;Os datasets que fazem parte do cenário de segurança no &lt;STRONG&gt;Project Archimedes&lt;/STRONG&gt; são:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Sites&lt;/STRONG&gt;: Informações sobre os sites do SharePoint.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Permissions&lt;/STRONG&gt;: Informações sobre as permissões de compartilhamento.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Groups&lt;/STRONG&gt;: Detalhes sobre os grupos do SharePoint.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Esses datasets ajudam a entender se o conteúdo está devidamente protegido, respondendo a perguntas como:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Está ocorrendo compartilhamento excessivo?&lt;/LI&gt;
&lt;LI&gt;Está ocorrendo compartilhamento externo?&lt;/LI&gt;
&lt;LI&gt;Dados sensíveis estão sendo compartilhados?&lt;/LI&gt;
&lt;LI&gt;Quanto compartilhamento ocorre por rótulo de sensibilidade?&lt;/LI&gt;
&lt;LI&gt;Dados sensíveis são compartilhados com usuários externos?&lt;/LI&gt;
&lt;LI&gt;Quais domínios externos estão sendo compartilhados?&lt;/LI&gt;
&lt;LI&gt;Quais sites foram mais compartilhados?&lt;/LI&gt;
&lt;LI&gt;Quais níveis de compartilhamento estão sendo usados?&lt;/LI&gt;
&lt;LI&gt;Quais permissões um usuário específico possui?&lt;/LI&gt;
&lt;LI&gt;Quais extensões de arquivo são mais compartilhadas?&lt;/LI&gt;
&lt;LI&gt;Quanto compartilhamento ocorre no nível de Web, Pasta, Lista ou Arquivo?&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Esses datasets são fundamentais para garantir a segurança e a conformidade dos dados dentro do ambiente do SharePoint e OneDrive &lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank" rel="noopener"&gt;[1]&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://techcommunity.microsoft.com/t5/s/gxcuf89792/attachments/gxcuf89792/Microsoft_Graph_Data_Connect_for_SharePo/1/7/Book%20of%20Archimedes.pdf" target="_blank"&gt;[1]: Book of Archimedes.pdf&lt;/A&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;3. Using the Book to create a SharePoint agent&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can also use the book to create a SharePoint agent. This will make it easy to ask questions about Microsoft Graph Data Connect for SharePoint without having to upload the PDF every time.&lt;/P&gt;
&lt;P&gt;To start, upload the Book of Archimedes to a SharePoint document library.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Next, select the file and use the option to “create an agent”.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;That’s it. Your agent is created, and you can start using it immediately.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The agent shows up in the document library as a file with the “.agent” extension. You can simply open that file in SharePoint to run the agent.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Once you open the agent, it offers you a text box to ask questions.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is the agent response to “What are the main questions about SharePoint capacity that MGDC for SharePoint can answer?”&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 100%; border-width: 1px;"&gt;&lt;colgroup&gt;&lt;col style="width: 99.8529%" /&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="padding: 10px;"&gt;
&lt;P&gt;The &lt;STRONG&gt;Book of Archimedes&lt;/STRONG&gt; provides detailed answers to several key questions about SharePoint capacity using Microsoft Graph Data Connect (MGDC). Here are the main questions it addresses:&lt;/P&gt;
&lt;H3&gt;Key Questions about SharePoint Capacity:&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;How much storage is being used by SharePoint sites?&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;This includes understanding the total size of the latest versions of files, previous versions, metadata size, and deleted files in the recycle bin 1.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;What are the different storage sizes reported for a SharePoint site?&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Storage Used&lt;/STRONG&gt;: Includes all storage, metadata, previous versions, and recycle bins.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Total Size&lt;/STRONG&gt;: Includes metadata and previous versions but excludes recycle bins.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Metadata Size&lt;/STRONG&gt;: Storage used by file metadata.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Total File Stream Size&lt;/STRONG&gt;: Size of the latest versions of files.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Previous Versions Size&lt;/STRONG&gt;: Size of previous versions of files.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Recycle Bin Size&lt;/STRONG&gt;: Size of deleted files in the recycle bins 1.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;How can administrators check the amount of storage being used by SharePoint sites?&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;SharePoint client UX&lt;/STRONG&gt;: Viewing storage while using SharePoint as a site user or Site Collection administrator.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;PowerShell cmdlets&lt;/STRONG&gt;: Using command-line tools to query site storage information.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;SharePoint Admin Center reports&lt;/STRONG&gt;: Viewing usage reports in the Admin Center.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Microsoft Graph Data Connect&lt;/STRONG&gt;: Pulling detailed datasets for SharePoint sites and OneDrives 1.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;How can administrators create test sites to understand storage reporting tools?&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Creating different types of test sites (empty site, site with files, site with deleted files, site with previous versions) and analyzing how storage sizes are reported 1.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;How can administrators estimate Azure costs for MGDC usage?&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Estimating the number of objects (sites, groups, permissions, files) and understanding the billing rates for different datasets 1.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;How can administrators use MGDC to run detailed analysis of SharePoint and OneDrive storage metrics?&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Using detailed data in MGDC to analyze storage metrics and understand the impact of metadata, versions, and deleted files 1.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;Tools and Templates:&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Capacity Template&lt;/STRONG&gt;: Includes instructions, ARM template, Azure Synapse pipeline template, and Power BI template to help administrators use MGDC datasets for capacity analysis 1.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Conclusion:&lt;/H3&gt;
&lt;P&gt;The Book of Archimedes provides comprehensive guidance on leveraging MGDC to gather, store, and analyze SharePoint data, ensuring effective data management and security. It is a valuable resource for administrators and developers working with SharePoint and OneDrive data using MGDC 1.&lt;/P&gt;
&lt;P&gt;Would you like more specific information on any of these questions or tools?&lt;/P&gt;
&lt;P&gt;1: &amp;lt;File&amp;gt;Book of Archimedes&amp;lt;/File&amp;gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As we did with Copilot, you can also ask any question on a topic addressed in the book. You can also use languages other than English.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can experiment with editing the agent to rename it, change the description, add more sources, adjust its behavior, offer sample starter prompts, include specific instructions or even customize with Copilot Studio. Just use the option to "edit the agent". You can also share the agent just like you share any file in SharePoint.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can learn more about SharePoint agents in this YouTube video by Vesa Juvonen: &amp;nbsp;&lt;A href="https://www.youtube.com/watch?v=pYvK76SVEv4" target="_blank" rel="noopener"&gt;https://www.youtube.com/watch?v=pYvK76SVEv4&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;4. Conclusion&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I hope you enjoy using the Book of Archimedes as grounding for your AI tools.&lt;/P&gt;
&lt;P&gt;As usual, remember that AI tools may make mistakes, so you should always double-check the answers you receive.&lt;/P&gt;
&lt;P&gt;There is also a chance that a particular AI tool might not be able to handle a large document like the Book of Archimedes, which currently weighs in at around 10 MB.&lt;/P&gt;</description>
      <pubDate>Thu, 12 Jun 2025 20:02:14 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/book-of-archimedes-copilot-and-sharepoint-agents/ba-p/4405106</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-06-12T20:02:14Z</dc:date>
    </item>
    <item>
      <title>Understanding the Notebooks in the Oversharing Template v2 (Microsoft Fabric)</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/understanding-the-notebooks-in-the-oversharing-template-v2/ba-p/4402625</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Introduction&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The Microsoft Graph Data Connect for SharePoint team published two notebooks used with Microsoft Fabric in the Information Oversharing v2 template. This blog walks through those notebooks, explaining what each code block does.&lt;/P&gt;
&lt;P&gt;Note that this document was written with help from Copilot, using simple prompts like “Analyze each section of this Jupyter notebook with PySpark and Scala code.&amp;nbsp; Describe what each section does.”&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Notebook 1: Read Last Snapshot Dates&lt;/STRONG&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This first notebook runs right as the pipeline starts. It checks the environment, verifies whether the Sites and Permissions tables exist in the Lakehouse, determines the last day data was gathered from MGDC, and calculates the start and end dates to use. It also cleans the staging tables and stores a few commands that are used in later steps.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 0 – Set the Default Lakehouse for Notebook to Run&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;%%configure
{
    "defaultLakehouse": {
        "name": {
                  "parameterName": "lakehouseName",
                  "defaultValue": "defaultlakehousename"
        }
    }
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section uses the %%configure magic command to set a JSON configuration that defines a parameter (lakehouseName) with the default value "defaultlakehousename". This setting ensures that when the notebook is launched through a pipeline, it dynamically selects the target Lakehouse.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 1 – Initialize Parameters&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.UUID
import java.text.SimpleDateFormat
import java.time.{LocalDate, LocalDateTime, Period}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.Calendar
import java.sql.Timestamp
val runId  = "00000000-0000-0000-0000-000000000000"
val workspaceId = spark.conf.get("trident.workspace.id")
val workspaceName = "LakeHouseTesting"
val lakehouseId = spark.conf.get("trident.lakehouse.id")
val lakehouseName = "IMAXDefault"
val sitesStagingTableName = "Sites_Staging"
val sitesFinalTableName = "Sites"
val permissionsStagingTableName = "Permissions_Staging"
val permissionsFinalTableName = "Permissions"
val endTime  = "2024-11-15T00:00:00Z"
spark.conf.set("spark.sql.caseSensitive", true)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section imports various libraries for date/time handling and initializes key parameters for the ETL process. These include a run identifier (runId), workspace and Lakehouse information (with some values coming from Spark configuration), table names for staging and final datasets, and a fallback endTime. It also enforces case sensitivity in Spark SQL.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 2 – Checking Required Final Tables Exist or Not&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;val lakehouse  = mssparkutils.lakehouse.get(lakehouseName)
val lakehouseId  = lakehouse.id
val workspaceName = notebookutils.runtime.context("currentWorkspaceName")
val permissionsStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsStagingTableName}"
val sitesStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesStagingTableName}"
val sitesFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesFinalTableName}"
val permissionsFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsFinalTableName}"
val tables = spark.catalog.listTables()
val siteTableCount = tables.filter(col("name") === lit(sitesFinalTableName)  and array_contains(col("namespace"), lakehouseName) ).count()
val permissionsTableCount = tables.filter(col("name") === lit(permissionsFinalTableName) and array_contains(col("namespace"), lakehouseName)).count()
val siteStagingTableCount = tables.filter(col("name") === lit(sitesStagingTableName)  and array_contains(col("namespace"), lakehouseName) ).count()
val permissionsStagingTableCount = tables.filter(col("name") === lit(permissionsStagingTableName) and array_contains(col("namespace"), lakehouseName)).count()&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section retrieves the Lakehouse object and uses it to construct ABFS paths for both staging and final tables (for Sites and Permissions). It then checks for the existence of these tables by listing them in Spark’s catalog and filtering by name and namespace.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 3 – Getting Snapshot Dates from Last Successful Extracts&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;import org.apache.spark.sql.functions.{col, _}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.storage.StorageLevel
val dtCurrentDateFormatt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S")
val dtRequiredtDateFormatt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
var siteDataExists: Boolean = false
var permissionsDataExists: Boolean = false
val siteSnapshotDate = {
    if (siteTableCount == 1) {
        val dfSites = spark.sql(s"SELECT MAX(SnapshotDate) AS SnapshotDate FROM ${lakehouseName}.${sitesFinalTableName} ")
        val rowSites: Row = dfSites.select("SnapshotDate").head(1)(0)
        if (rowSites.get(0) == null)
            endTime
        else {
            siteDataExists = true
            println(s"Sites data Exists: ${siteDataExists}")
            LocalDateTime.parse(rowSites.get(0).toString(), dtCurrentDateFormatt)
              .format(dtRequiredtDateFormatt)
        }
    }
    else {
        endTime
    }
}
val permissionsSnapshotDate = {
    if (permissionsTableCount == 1) {
        val dfPermissions = spark.sql(s"SELECT MAX(SnapshotDate) AS SnapshotDate FROM ${lakehouseName}.${permissionsFinalTableName} ")
        val rowPermissions: Row = dfPermissions.select("SnapshotDate").head(1)(0)
        if (rowPermissions.get(0) == null)
            endTime
        else {
            permissionsDataExists = true
            println(s"Permissions data Exists: ${permissionsDataExists}")
            LocalDateTime.parse(rowPermissions.get(0).toString(), dtCurrentDateFormatt)
              .format(dtRequiredtDateFormatt)
        }
    }
    else {
        endTime
    }
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section queries the final tables to retrieve the latest SnapshotDate for both Sites and Permissions and reformats the date into an ISO-compliant format. If no snapshot date is found, it defaults to the predefined endTime; the two boolean flags (siteDataExists and permissionsDataExists) are set to true only when data is found.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
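&lt;P&gt;Outside of Spark, the date conversion in this section can be illustrated with a standalone sketch (using the same formatter patterns as the notebook, applied to a hypothetical sample value):&lt;/P&gt;

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Standalone sketch of the Section 3 conversion: Spark returns SnapshotDate
// as "yyyy-MM-dd HH:mm:ss.S", and MGDC expects an ISO 8601 timestamp.
val dtCurrentDateFormatt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S")
val dtRequiredtDateFormatt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")

val raw = "2024-11-15 00:00:00.0" // hypothetical value from the final table
val snapshotDate = LocalDateTime.parse(raw, dtCurrentDateFormatt)
  .format(dtRequiredtDateFormatt)
println(snapshotDate) // 2024-11-15T00:00:00Z
```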
&lt;P&gt;&lt;STRONG&gt;Section 4 – Generate View Script for Sites&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;val sitesView: String = s"""
CREATE OR ALTER VIEW vw${sitesFinalTableName}  
AS
SELECT  *,
       [StorageQuotaFriendly] =  (case
                when StorageQuota &amp;lt; 1048576 then concat(ceiling(StorageQuota / 1024.0), ' KB')
                when StorageQuota &amp;lt; 1073741824 then concat(ceiling(StorageQuota / 1048576.0), ' MB')
                when StorageQuota &amp;lt; 1099511627776  then concat(ceiling(StorageQuota / 1073741824.0), ' GB')
                when StorageQuota &amp;lt; 1125899906842624  then concat(ceiling(StorageQuota / 1099511627776.0), ' TB')
                else concat(ceiling(StorageQuota / 1125899906842624.0), ' PB')
            end ),
       [StorageUsedFriendly] =  (case
                when StorageUsed &amp;lt; 1048576 then concat(ceiling(StorageUsed / 1024.0), ' KB')
                when StorageUsed &amp;lt; 1073741824 then concat(ceiling(StorageUsed / 1048576.0), ' MB')
                when StorageUsed &amp;lt; 1099511627776  then concat(ceiling(StorageUsed / 1073741824.0), ' GB')
                when StorageUsed &amp;lt; 1125899906842624  then concat(ceiling(StorageUsed / 1099511627776.0), ' TB')
                else concat(ceiling(StorageUsed / 1125899906842624.0), ' PB')
            end )           
  FROM ${sitesFinalTableName}
""".stripMargin.replaceAll("[\n\r]"," ")
println(sitesView)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here a SQL view (vwSites) is dynamically generated for the Sites final table. The view adds two computed columns (StorageQuotaFriendly and StorageUsedFriendly) that convert byte values into more digestible units such as KB, MB, GB, etc. This script will be stored and executed later.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
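&lt;P&gt;The CASE expression’s unit logic can be sketched as a plain Scala function (an illustration mirroring the view’s thresholds, not part of the template itself):&lt;/P&gt;

```scala
// Sketch of the view's StorageQuotaFriendly/StorageUsedFriendly logic:
// pick the largest unit whose threshold the byte count stays under, round up.
def friendlySize(bytes: Long): String = {
  if (bytes < 1048576L) s"${math.ceil(bytes / 1024.0).toLong} KB"
  else if (bytes < 1073741824L) s"${math.ceil(bytes / 1048576.0).toLong} MB"
  else if (bytes < 1099511627776L) s"${math.ceil(bytes / 1073741824.0).toLong} GB"
  else if (bytes < 1125899906842624L) s"${math.ceil(bytes / 1099511627776.0).toLong} TB"
  else s"${math.ceil(bytes / 1125899906842624.0).toLong} PB"
}

println(friendlySize(500000L))      // 489 KB
println(friendlySize(26843545600L)) // 25 GB
```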
&lt;P&gt;&lt;STRONG&gt;Section 5 – Generate View Script for Permissions&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;val permissionsView: String = s"""
CREATE OR ALTER VIEW vw${permissionsFinalTableName}        
    AS     
SELECT *,
       ShareeDomain = CASE
                        WHEN CHARINDEX('@', SharedWith_Email) &amp;gt; 0
                             AND CHARINDEX('.', SharedWith_Email) &amp;gt; 0
                        THEN SUBSTRING(SharedWith_Email,CHARINDEX('@', SharedWith_Email)+1,LEN(SharedWith_Email))
                        ELSE ''
                      END,
       ShareeEMail = CASE
                       WHEN CHARINDEX('@', SharedWith_Email) &amp;gt; 0
                       THEN SharedWith_Email
                       ELSE ''
                     END,
       PermissionsUniqueKey = CONCAT(SiteId,'_',RoleDefinition,'_',ScopeId,'_',COALESCE(LinkId,'00000000-0000-0000-0000-000000000000')),
       EEEUPermissionsCount = SUM(CASE WHEN SharedWith_Name LIKE 'Everyone except external users' THEN 1 ELSE NULL END ) OVER(
                                  PARTITION BY CONCAT(SiteId,'_',RoleDefinition,'_',ScopeId,'_',COALESCE(LinkId,'00000000-0000-0000-0000-000000000000'),SharedWith_Name)
                                ),
       ExternalUserCount = SUM(CASE WHEN SharedWith_TypeV2 LIKE 'External' THEN 1 ELSE NULL END ) OVER(
                                 PARTITION BY CONCAT(SiteId,'_',RoleDefinition,'_',ScopeId,'_',COALESCE(LinkId,'00000000-0000-0000-0000-000000000000'),SharedWith_Name)
                              ),
       B2BUserCount = SUM(CASE WHEN SharedWith_TypeV2 LIKE 'B2BUser' THEN 1 ELSE NULL END ) OVER(
                          PARTITION BY CONCAT(SiteId,'_',RoleDefinition,'_',ScopeId,'_',COALESCE(LinkId,'00000000-0000-0000-0000-000000000000'),SharedWith_Name)
                        )
FROM ${permissionsFinalTableName}
""".stripMargin.replaceAll("[\n\r]"," ")
println(permissionsView)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section builds a SQL view (vwPermissions) for the Permissions final table. It derives additional columns like ShareeDomain, ShareeEMail, and a composite key (PermissionsUniqueKey) while applying window functions to compute counts (e.g., for external or B2B users). This script will also be stored and executed later.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
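&lt;P&gt;The ShareeDomain and ShareeEMail derivations can be sketched in plain Scala (an illustration of the CHARINDEX/SUBSTRING logic above, not part of the template):&lt;/P&gt;

```scala
// Sketch of the view's ShareeDomain/ShareeEMail derivations: the domain is
// everything after '@' when the value looks like an email, else empty string.
def shareeDomain(email: String): String =
  if (email.contains("@") && email.contains(".")) email.substring(email.indexOf("@") + 1)
  else ""

def shareeEmail(email: String): String =
  if (email.contains("@")) email else ""

println(shareeDomain("user@contoso.com"))            // contoso.com
println(shareeEmail("Everyone except external users")) // empty string
```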
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 6 – Truncate the Staging Tables from Previous Runs&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;if (siteStagingTableCount == 1) {
    spark.sql(s"DELETE FROM ${lakehouseName}.${sitesStagingTableName} ")
    println(s"Staging table deleted: ${lakehouseName}.${sitesStagingTableName}")
} else {
    println(s"Staging table ${lakehouseName}.${sitesStagingTableName} not found")
}
if (permissionsStagingTableCount == 1) {
    spark.sql(s"DELETE FROM ${lakehouseName}.${permissionsStagingTableName} ")
    println(s"Staging table deleted: ${lakehouseName}.${permissionsStagingTableName}")
} else {
    println(s"Staging table ${lakehouseName}.${permissionsStagingTableName} not found")
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section checks if the staging tables exist (by count) and, if found, issues a SQL DELETE command to remove existing data so that new data can be loaded. It prints messages indicating the action taken.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 7 – Return Snapshot Dates Back to Pipeline&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;import mssparkutils.notebook
val returnData = s"""{\"LakehouseId\": \"${lakehouseId}\", \"SitesStagingTableName\": \"${sitesStagingTableName}\", \"SitesFinalTableName\": \"${sitesFinalTableName}\",  \"SitesSnapshotDate\": \"${siteSnapshotDate}\", \"SitesDataExists\": ${siteDataExists}, \"SitesView\": \"${sitesView}\",  \"PermissionsStagingTableName\": \"${permissionsStagingTableName}\", \"PermissionsFinalTableName\": \"${permissionsFinalTableName}\", \"PermissionsSnapshotDate\": \"${permissionsSnapshotDate}\", \"EndSnapshotDate\": \"${endTime}\", \"PermissionsDataExists\": ${permissionsDataExists}, \"PermissionsView\": \"${permissionsView}\"}"""
println(returnData)
mssparkutils.notebook.exit(returnData)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This concluding section aggregates the key metadata (Lakehouse information, table names, snapshot dates, existence flags, and the generated view scripts) into a JSON string. It then exits the notebook, returning that JSON to the pipeline.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Notebook 2: Merge Sites and Permissions to Final Table&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This notebook runs after the Sites and Permissions data from MGDC has been collected successfully into the staging tables. If this is the first collection, it handles them as full datasets, storing the data directly in the final tables. If the pipeline uses the MGDC for SharePoint delta datasets, it merges the new, updated, or deleted objects from the staging tables into the final tables.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Note: The word "Delta" here might refer to Delta Parquet (an efficient data storage format used by tables in a Microsoft Fabric Lakehouse) or to the MGDC for SharePoint delta datasets (which return only the objects that are new, updated, or deleted between two dates). It can be a bit confusing, so be aware of the two interpretations of the word.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 0 – Set the Default Lakehouse for Notebook to Run&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;%%configure
{
    "defaultLakehouse": {
        "name": {
                  "parameterName": "lakehouseName",
                  "defaultValue": "defaultlakehousename"
        }
    }
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section uses the same Lakehouse configuration as in Notebook 1. It sets the default Lakehouse through a parameter (lakehouseName) to support dynamic running of the notebook in different environments.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 1 – Initialize Parameters&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.UUID
import java.text.SimpleDateFormat
import java.time.{LocalDate, LocalDateTime, Period}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.Calendar
val runId  = "00000000-0000-0000-0000-000000000000"
val workspaceId = spark.conf.get("trident.workspace.id")
val workspaceName = "LakeHouseTesting"
val lakehouseId = spark.conf.get("trident.lakehouse.id")
val lakehouseName = spark.conf.get("trident.lakehouse.name")
val sitesStagingTableName = "Sites_Staging"
val sitesFinalTableName = "Sites"
val permissionsStagingTableName = "Permissions_Staging"
val permissionsFinalTableName = "Permissions"
spark.conf.set("spark.sql.caseSensitive", true)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section is like Notebook 1’s Section 1, except that here lakehouseName is retrieved from the Spark configuration. It initializes the variables needed for merging, such as the run ID, workspace and Lakehouse identifiers, and table names.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 2 – Read Sites Dataset from Staging Table&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;val lakehouse  = mssparkutils.lakehouse.get(lakehouseName)
val lakehouseId  = lakehouse.id
val workspaceName = notebookutils.runtime.context("currentWorkspaceName")
println("Started reading Sites dataset")
val sitesStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesStagingTableName}"
val dfSitesStaging = spark.read.format("delta").load(sitesStagingLocation)
println("Completed reading Sites dataset")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section constructs the ABFS path for the Sites staging table and reads the dataset into a DataFrame using the Delta Parquet format. It includes print statements to track progress.&lt;/P&gt;
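&lt;P&gt;The ABFS path follows a fixed OneLake pattern, so it can be factored into a small helper. This is just a sketch of the string construction used above; the tablePath name is made up for illustration:&lt;/P&gt;

```scala
// Builds the OneLake ABFS path for a Lakehouse table, mirroring the
// interpolated string used in this section of the notebook.
def tablePath(workspaceName: String, lakehouseName: String, tableName: String): String =
  s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${tableName}"

println(tablePath("LakeHouseTesting", "defaultlakehousename", "Sites_Staging"))
```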
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 3 – Read Permissions Dataset from Staging Table&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;println("Started reading Permissions dataset")
val permissionsStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsStagingTableName}"
val dfPermissionsStaging = spark.read.format("delta").load(permissionsStagingLocation)
println("Completed reading Permissions dataset")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section performs the analogous operation for the Permissions staging table, loading the dataset into a DataFrame and providing console output for monitoring.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 4 – Check Final Tables Exist or Not&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;import io.delta.tables.DeltaTable
val sitesFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesFinalTableName}"
val permissionsFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsFinalTableName}"
val sitesFinalTableExists = DeltaTable.isDeltaTable(spark, sitesFinalLocation)
if (!sitesFinalTableExists) {
    println("Final Sites table does not exist. Creating final Sites table with schema only")
    dfSitesStaging.filter("1=2").write.format("delta").mode("overwrite").save(sitesFinalLocation)
    println("Final Sites table created")
} else {
    println("Final Sites table exists already")
}
val permissionsFinalTableExists = DeltaTable.isDeltaTable(spark, permissionsFinalLocation)
if (!permissionsFinalTableExists) {
    println("Final Permissions table does not exist. Creating final Permissions table with schema only")
    dfPermissionsStaging.filter("1=2").write.format("delta").mode("overwrite").save(permissionsFinalLocation)
    println("Final Permissions table created")
} else {
    println("Final Permissions table exists already")
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section checks whether the final tables for Sites and Permissions exist. If a table does not exist, it creates an empty table (schema only) from the staging DataFrame by applying a filter that matches no rows (filter("1=2")).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 5 – Merge Sites Data from Staging Table to Final Table&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;import io.delta.tables._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{coalesce, lit, sum, col, _}
import org.apache.spark.sql.types.{StructField, _}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.storage.StorageLevel
val deltaTableSource = DeltaTable.forPath(spark, sitesStagingLocation)
val deltaTableTarget = DeltaTable.forPath(spark, sitesFinalLocation)
import spark.implicits._
val dfSource = deltaTableSource.toDF
//Delete records that have Operation as Deleted
println("Merging Sites dataset from current staging table")
deltaTableTarget
  .as("target")
  .merge(
    dfSource.as("source"),
    "source.Id = target.Id")
  .whenMatched("source.Operation = 'Deleted'")
  .delete()
  .whenMatched("source.Operation != 'Deleted'")
  .updateAll()
  .whenNotMatched("source.Operation != 'Deleted'")
  .insertAll()
  .execute()
println("Merging of Sites dataset completed")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section performs a Delta Lake merge (upsert) operation on the Sites data. The merge logic deletes matching records when the source’s Operation is 'Deleted', updates other matching records, and inserts new records that are not marked as 'Deleted'.&lt;/P&gt;
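&lt;P&gt;To make the merge semantics concrete, here is a small pure-Scala simulation of the same delete/update/insert rules, with no Spark or Delta Lake involved. The SiteRow case class and mergeSites function are illustrative only; they model the final table as a map from Id to a single Title column:&lt;/P&gt;

```scala
// Illustrative model: the final table is a Map of Id -> Title, and the
// staging table is a sequence of rows carrying an Operation column.
case class SiteRow(id: String, title: String, operation: String)

def mergeSites(target: Map[String, String], source: Seq[SiteRow]): Map[String, String] =
  source.foldLeft(target) { (acc, row) =>
    row.operation match {
      case "Deleted" => acc - row.id                 // whenMatched(Deleted).delete()
      case _         => acc + (row.id -> row.title)  // updateAll() or insertAll()
    }
  }

val targetRows = Map("1" -> "Old title", "2" -> "Removed site")
val stagingRows = Seq(
  SiteRow("1", "New title", "Updated"),
  SiteRow("2", "", "Deleted"),
  SiteRow("3", "Brand new site", "Created"))
println(mergeSites(targetRows, stagingRows))
```

&lt;P&gt;Note that a "Deleted" row with no match in the target is simply a no-op, which is why the real merge also guards whenNotMatched with Operation != 'Deleted'.&lt;/P&gt;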
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 6 – Merge Permissions Data from Staging Table to Final Table&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;import io.delta.tables._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{coalesce, lit, sum, col, _}
import org.apache.spark.sql.types.{StructField, _}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.storage.StorageLevel
val deltaTablePermissionsSource = DeltaTable.forPath(spark, permissionsStagingLocation)
val deltaTablePermissionsTarget = DeltaTable.forPath(spark, permissionsFinalLocation)
import spark.implicits._
val dfPermissionsSource = deltaTablePermissionsSource.toDF
//Delete records that have Operation as Deleted
println("Merging Permissions dataset from current staging table")
deltaTablePermissionsTarget
  .as("target")
  .merge(
    dfPermissionsSource.as("source"),
    """source.SiteId = target.SiteId and source.ScopeId = target.ScopeId and source.LinkId = target.LinkId and source.RoleDefinition = target.RoleDefinition and
     coalesce(source.SharedWith_Name,"") = coalesce(target.SharedWith_Name,"") and coalesce(source.SharedWith_TypeV2,"") = coalesce(target.SharedWith_TypeV2,"") and
     coalesce(source.SharedWith_Email,"") = coalesce(target.SharedWith_Email,"") and coalesce(source.SharedWith_AADObjectId,"") = coalesce(target.SharedWith_AADObjectId,"") """)
  .whenMatched("source.Operation = 'Deleted'")
  .delete()
  .whenMatched("source.Operation != 'Deleted'")
  .updateAll()
  .whenNotMatched("source.Operation != 'Deleted'")
  .insertAll()
  .execute()
println("Merging of Permissions dataset completed")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This section performs a merge operation on the Permissions data. The merge condition is more complex—comparing multiple columns (including handling nulls with coalesce) to identify matching records. The operation applies deletion for rows marked as 'Deleted', updates others, and inserts records where no match exists.&lt;/P&gt;
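&lt;P&gt;The reason for the coalesce calls is that null never equals null in SQL, so two rows that both have a null SharedWith_Email would otherwise fail to match. Here is a pure-Scala sketch of the same idea, using Option with an empty-string fallback; the PermRow model and mergeKey name are illustrative, not the dataset schema:&lt;/P&gt;

```scala
// Illustrative composite merge key: optional columns fall back to "",
// mirroring coalesce(source.SharedWith_Email, "") in the merge condition.
case class PermRow(siteId: String, scopeId: String, linkId: String,
                   sharedWithName: Option[String], sharedWithEmail: Option[String])

def mergeKey(p: PermRow): (String, String, String, String, String) =
  (p.siteId, p.scopeId, p.linkId,
   p.sharedWithName.getOrElse(""), p.sharedWithEmail.getOrElse(""))

val a = PermRow("site1", "scope1", "link1", Some("Everyone"), None)
val b = PermRow("site1", "scope1", "link1", Some("Everyone"), None)
println(mergeKey(a) == mergeKey(b))  // both missing emails coalesce to ""
```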
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Section 7 – Read and Display Sample TOP 10 Rows&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="scala"&gt;var sqlQuery = s"SELECT * FROM ${lakehouseName}.${sitesFinalTableName} order by SnapshotDate DESC LIMIT 10"
val dfSitesAll = spark.sql(sqlQuery)
display(dfSitesAll)
sqlQuery = s"SELECT * FROM ${lakehouseName}.${permissionsFinalTableName}  order by SnapshotDate DESC LIMIT 10"
val dfPermissionsAll = spark.sql(sqlQuery)
display(dfPermissionsAll)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This final section executes SQL queries to retrieve and display the top 10 rows from both the Sites and Permissions final tables. The rows are ordered by SnapshotDate in descending order. This is typically used for sample or debugging purposes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Conclusion&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;I hope this article helped you understand the notebooks included in the template and makes it easier to customize them later. These templates are intended as starting points that you can adapt to many scenarios.&lt;/P&gt;
&lt;P&gt;Read more about MGDC for SharePoint at &lt;A class="lia-external-url" href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 09 Apr 2025 23:35:19 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/understanding-the-notebooks-in-the-oversharing-template-v2/ba-p/4402625</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-04-09T23:35:19Z</dc:date>
    </item>
    <item>
      <title>MGDC for SharePoint: New Columns in Q1 CY2025</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-new-columns-in-q1-cy2025/ba-p/4397779</link>
      <description>&lt;P&gt;The MGDC for SharePoint team continues to improve the available datasets.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This time around, we are introducing additional columns to some of our existing datasets.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;2 new archive-related properties in the Sites dataset&lt;/LI&gt;
&lt;LI&gt;2 new storage-related properties in the Sites dataset&lt;/LI&gt;
&lt;LI&gt;4 new owner and member properties in the Groups dataset&lt;/LI&gt;
&lt;LI&gt;1 new sharee property in the Permissions dataset&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is a summary of the 9 changes in the 3 datasets (Sites, Groups and Permissions):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 100%; height: 585.714px; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr style="height: 30.5714px;"&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Dataset&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Column&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Type&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;&lt;STRONG&gt;Description&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 30.5714px;"&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;Sites&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;SiteArchivedBy&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;String&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;Who archived the site. Null if not archived&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 30.5714px;"&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;Sites&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;SiteArchivedTime&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;DateTime&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;When the site was archived. Null if not archived&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 58.5714px;"&gt;&lt;td style="height: 58.5714px; border-width: 1px;"&gt;
&lt;P&gt;Sites&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 58.5714px; border-width: 1px;"&gt;
&lt;P&gt;LastContentChange&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 58.5714px; border-width: 1px;"&gt;
&lt;P&gt;DateTime&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 58.5714px; border-width: 1px;"&gt;
&lt;P&gt;When the site contents were last changed (in UTC)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 86.5714px;"&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;Sites&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;StorageMetrics.&lt;BR /&gt;AdditionalFileStreamSize&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;int64&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;Total size of additional file streams in bytes, excluding the latest version, previous versions and metadata&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 86.5714px;"&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;Groups&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;Owner.&lt;BR /&gt;LoginName&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;String&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;Login name of group owner. If ending with _o, this should be expanded using the owners of the group, not the members&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 30.5714px;"&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;Groups&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;Owner.&lt;BR /&gt;UPN&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;String&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 30.5714px; border-width: 1px;"&gt;
&lt;P&gt;User Principal Name of group owner&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 86.5714px;"&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;Groups&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;Member.&lt;BR /&gt;LoginName&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;String&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;Login name of group member. If ending with _o, this should be expanded using the owners of the group, not the members&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 58.5714px;"&gt;&lt;td style="height: 58.5714px; border-width: 1px;"&gt;
&lt;P&gt;Groups&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 58.5714px; border-width: 1px;"&gt;
&lt;P&gt;Member.&lt;BR /&gt;UPN&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 58.5714px; border-width: 1px;"&gt;
&lt;P&gt;String&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 58.5714px; border-width: 1px;"&gt;
&lt;P&gt;User Principal Name of group member&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 86.5714px;"&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;Permissions&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;SharedWith.&lt;BR /&gt;UserLoginName&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;String&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 86.5714px; border-width: 1px;"&gt;
&lt;P&gt;Login name for the sharee. If ending with _o, this should be expanded using the owners of the group, not the members&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;* When the LoginName ends with _o, that means that this group should be expanded using the owners of the group, not the members of the group.&lt;/P&gt;
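&lt;P&gt;In code, the rule above amounts to a simple suffix check. A hedged pure-Scala sketch, where the Group model and function name are made up for illustration:&lt;/P&gt;

```scala
// Illustrative model: expand a LoginName to the users it represents.
// A name ending in "_o" means "the owners of the group", per the note above.
case class Group(owners: Seq[String], members: Seq[String])

def expandLoginName(loginName: String, group: Group): Seq[String] =
  if (loginName.endsWith("_o")) group.owners else group.members

val sales = Group(owners = Seq("alice"), members = Seq("bob", "carol"))
println(expandLoginName("sales_o", sales))  // expands via owners
println(expandLoginName("sales", sales))    // expands via members
```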
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;These changes are non-breaking, which means that your existing pipelines will continue to run. We also let this change sit for 21 days, so that delta datasets are not impacted either.&lt;/P&gt;
&lt;P&gt;Anyone starting fresh with MGDC for SharePoint will automatically get the new columns in their pipelines. If you have existing pipelines and want to use the new columns, you will need to update your MGDC application, get that update approved and then update your pipeline to pull the new schema.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For more details about updating your pipelines to include new columns, or to see how these can be leveraged in specific scenarios, see these articles:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-how-can-i-track-the-lifecycle-of-a-sharepoint-site/4174833" target="_blank"&gt;How can I track the lifecycle of a SharePoint site?&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-how-are-sharepoint-groups-and-security-groups-used-toget/4229869" target="_blank"&gt;How are SharePoint Groups and Security Groups used together?&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-to-deal-with-schema-changes/ba-p/4128875" target="_blank"&gt;MGDC for SharePoint FAQ: How to deal with schema changes&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Main MGDC for SharePoint article: &lt;A href="https://aka.ms/SharePointData" target="_blank"&gt;https://aka.ms/SharePointData&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To view the complete schema for all SharePoint datasets in MGDC, visit &lt;A class="lia-external-url" href="https://aka.ms/SharePointDatasets" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointDatasets&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
</description>
      <pubDate>Fri, 28 Mar 2025 16:35:18 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-new-columns-in-q1-cy2025/ba-p/4397779</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-03-28T16:35:18Z</dc:date>
    </item>
    <item>
      <title>Information Oversharing v2 Template with Microsoft Fabric</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/information-oversharing-v2-template-with-microsoft-fabric/ba-p/4390870</link>
      <description>&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;1. Security Scenario&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;One of the most common scenarios for Microsoft Graph Data Connect (MGDC) for SharePoint is Information Oversharing. This security scenario focuses on identifying which items are being widely shared within the tenant and understanding how permissions are applied at each level.&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The MGDC datasets for this scenario are SharePoint Sites and SharePoint Permissions. If you’re not familiar with these datasets, you can find details in the schema definitions at &lt;/SPAN&gt;&lt;A href="https://aka.ms/SharePointDatasets" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;https://aka.ms/SharePointDatasets&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;To assist you in using these datasets, the team has developed an Information Oversharing Template. Initially published as a template for Azure Synapse, we now have a new Microsoft Fabric template that is simpler and offers more features.&lt;/P&gt;
&lt;P&gt;The SharePoint Information Oversharing v2 template, based on Microsoft Fabric, is now publicly available.&lt;/P&gt;
&lt;H3&gt;2. Instructions&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The template comes with a set of detailed instructions at&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://aka.ms/fabricoversharingtemplatesteps" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;https://aka.ms/fabricoversharingtemplatesteps&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;. These instructions include:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="22" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;How to install the Microsoft Fabric and Microsoft Graph Data Connect prerequisites&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="22" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;How to import the pipeline template from the Microsoft Fabric gallery and set it up&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="22" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="3" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;How to import the Power BI template and configure the data source settings&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{}"&gt;See below some additional details about the template.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;3. Microsoft Fabric Pipeline&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;After you import the pipeline template, it will look like this:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;[Image: Pipeline in Microsoft Fabric]&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The Information Oversharing template for Microsoft Fabric includes a few key improvements:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="-" data-font="Aptos" data-listid="24" data-list-defn-props="{&amp;quot;335551671&amp;quot;:4,&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Aptos&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;-&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="4" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;It uses the new UserCount and TotalUserCount properties in the SharePoint Permissions dataset, which means you do not need to pull the SharePoint Groups or the three Microsoft Entra ID Group datasets to calculate the number of users being granted access. This optimization will greatly reduce the cost to get a report of the sites shared with the most users.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="-" data-font="Aptos" data-listid="24" data-list-defn-props="{&amp;quot;335551671&amp;quot;:4,&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Aptos&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;-&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="5" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;The new template also uses delta datasets to update the SharePoint Sites and SharePoint Permissions datasets. It keeps track of the last time the datasets were pulled by this pipeline, requesting just what changed since then.&amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI data-leveltext="-" data-font="Aptos" data-listid="24" data-list-defn-props="{&amp;quot;335551671&amp;quot;:4,&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Aptos&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;-&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="5" data-aria-level="1"&gt;&lt;SPAN data-ccp-props="{}"&gt;As the previous template, this one also flattens the SharePoint Permissions dataset, creating one permission row for each "Shared With" inside the permission. So, if a file is shared with three people, the SharePoint dataset will show one row, but the flattened data stored in Microsoft Fabric will show three rows.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
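&lt;P&gt;The flattening described above is a classic one-to-many expansion. As a rough pure-Scala sketch (the Permission model here is illustrative, not the dataset schema), a permission shared with three people becomes three flattened rows:&lt;/P&gt;

```scala
// Illustrative flattening: one output row per "Shared With" entry.
case class Permission(itemUrl: String, roleDefinition: String, sharedWith: Seq[String])

def flattenPermissions(perms: Seq[Permission]): Seq[(String, String, String)] =
  perms.flatMap(p => p.sharedWith.map(who => (p.itemUrl, p.roleDefinition, who)))

val perms = Seq(Permission("https://contoso.sharepoint.com/doc1", "Read",
                           Seq("alice", "bob", "carol")))
println(flattenPermissions(perms).size)  // one shared file, three flattened rows
```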
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;You can find details on how to find and deploy the Microsoft Fabric template in the instructions (see item 3).&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;4. Microsoft Fabric Report&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The typical result from this solution is a set of Power BI dashboards pulled from the Microsoft Fabric data source. Here is an example:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;[Image: Power BI Sample Dashboard]&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;These dashboards serve as examples or starting points and can be modified as necessary for various visualizations of the data within these datasets.&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The instructions (see item 3) include details on how to find and deploy a few sample Power BI Information Oversharing templates.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;5. Conclusion&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;I hope this provides a good overview of the Information Oversharing template for Microsoft Fabric. You can read more about the Microsoft Graph Data Connect for SharePoint at &lt;/SPAN&gt;&lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;https://aka.ms/SharePointData&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;. There you will find many details, including a list of datasets available, other common scenarios and frequently asked questions.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Mar 2025 22:54:05 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/information-oversharing-v2-template-with-microsoft-fabric/ba-p/4390870</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-03-07T22:54:05Z</dc:date>
    </item>
    <item>
      <title>Error: Consent not found</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/error-consent-not-found/ba-p/4377896</link>
      <description>&lt;P&gt;If you have been using Microsoft Graph Data Connect for a while, you might have configured a recurring pipeline that has been running for several months. In that case, you might run into this error:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang=""&gt;Operation on target &amp;lt;name&amp;gt; failed:
ErrorCode=UserErrorOffice365DataLoaderError,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Office365 data loading failed to execute.
office365LoadErrorType: PermanentError.
Consent not found.
The consent for application [xx] was either never granted or expired.
Learn more at https://aka.ms/mgdcdocs/appconsent.,
Source=Microsoft.DataTransfer.ClientLibrary,'&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The main error is really “The consent for application [xx] was either never granted or expired.” If this is the first time you are running this pipeline, you could be missing the consent in Azure (it was not yet approved by your Microsoft 365 Administrator). If this was working before, it has likely been more than 6 months since it was approved, and the consent has expired.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The solution is simple, done in just 2 steps:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;In the Azure Portal, go to the Microsoft Graph Data Connect application and update it.&lt;/LI&gt;
&lt;LI&gt;In the Microsoft 365 Admin Center, approve the application update.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To update the application, search for the “Microsoft Graph Data Connect” service in the Azure Portal at &lt;A href="https://portal.azure.com" target="_blank" rel="noopener"&gt;https://portal.azure.com&lt;/A&gt; or navigate directly to &lt;A href="https://aka.ms/MGDCinAzure" target="_blank" rel="noopener"&gt;https://aka.ms/MGDCinAzure&lt;/A&gt; to get started. Once there, click on the MGDC Application you created previously (that is, the request for consent), click Properties in the left navigation bar and then click the “Update Properties” button. You do not need to change any of the properties.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;After you update the consent request, you need to approve the update in the Microsoft 365 Admin Center. This process is described in item “1h. Approve the Microsoft Graph Data Connect Application” in the&amp;nbsp;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/step-by-step-synapse-gather-a-detailed-dataset-on-sharepoint-sites-using-mgdc/4070563#:~:text=1h.%20Approve%20the%20Microsoft%20Graph%20Data%20Connect%20Application" target="_blank" rel="noopener"&gt;step-by-step&lt;/A&gt; guide.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Before you update the application properties in the Azure Portal, it will show in the list in the Microsoft 365 Admin Center as “Expired”. After you update the properties, it will show in the list as “Update Available”. Once you approve and everything is ready, it will show as “Approved.”&lt;/P&gt;
&lt;P&gt;Just click on the app and follow the steps to approve it.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;That is it! Now you can just run the pipeline again and everything will work as before. Until it is time to update the consent again in 6 months…&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can avoid this issue completely by keeping track of the MGDC Application consent date and updating it ahead of time. You can update the application and continue to use the earlier consent until the Microsoft 365 Administrator approves the new one, allowing for this consent rotation to be done proactively.&lt;/P&gt;
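&lt;P&gt;The proactive tracking described above can be sketched in a short script. This is a minimal illustration, not an official tool: it assumes you record each consent's approval date yourself, and it approximates the 6-month consent lifetime as 180 days.&lt;/P&gt;

```python
from datetime import date, timedelta

# Assumption: MGDC consent lifetime approximated as 180 days (6 months).
CONSENT_LIFETIME = timedelta(days=180)
RENEWAL_WARNING = timedelta(days=30)  # start renewing 30 days before expiry

def consent_status(approved_on: date, today: date) -> str:
    """Classify a consent based on the date it was approved."""
    expires_on = approved_on + CONSENT_LIFETIME
    if today >= expires_on:
        return "expired"
    if today >= expires_on - RENEWAL_WARNING:
        return "renew soon"
    return "ok"

# A consent approved about five and a half months ago is due for rotation.
print(consent_status(date(2024, 8, 1), date(2025, 1, 15)))  # renew soon
```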
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For more about MGDC for SharePoint, visit the main article at&amp;nbsp;&lt;A class="lia-external-url" href="https://aka.ms/SharePointData" target="_blank"&gt;https://aka.ms/SharePointData&lt;/A&gt;.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 13 Feb 2025 01:16:39 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/error-consent-not-found/ba-p/4377896</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-02-13T01:16:39Z</dc:date>
    </item>
    <item>
      <title>MGDC for SharePoint and SharePoint Embedded</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-and-sharepoint-embedded/ba-p/4366205</link>
      <description>&lt;P&gt;&lt;STRONG&gt;1. MGDC for SharePoint and SharePoint Embedded&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;In this post, we’ll discuss how to use data from Microsoft Graph Data Connect (MGDC) for SharePoint to understand how SharePoint Embedded is being configured and used in your tenant. If you’re new to MGDC, get an overview at &lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;If you’re not familiar with SharePoint Embedded, check this &lt;A href="https://learn.microsoft.com/en-us/sharepoint/dev/embedded/overview" target="_blank" rel="noopener"&gt;Overview of SharePoint Embedded&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2. Introduction&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;SharePoint Embedded is an API-only solution to use the power of the Microsoft 365 file and document storage platform. You can use it to build applications that expose Microsoft 365 features including those from Office, Purview and Copilot. The SharePoint Embedded files are stored in “File Storage Containers”, created by your enterprise applications or a third-party application.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;SharePoint Embedded Diagram&lt;/img&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;3. MGDC for SharePoint and SharePoint Embedded&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;One important detail is that MGDC for SharePoint sees these containers and files used by SharePoint Embedded. You can see the Containers (under Sites), the permissions granted and even each individual file inside the Containers.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;4. Web Template Id&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;To filter sites for SharePoint Embedded containers, just look for SharePoint Sites in MGDC that have the RootWeb.WebTemplateId of 70. This is used to differentiate this scenario, in the same way that OneDrives are sites with the RootWeb.WebTemplateId of 21. You can also check for the RootWeb.Template, which shows as “CSPCONTAINER” for SharePoint Embedded.&lt;/P&gt;
&lt;P&gt;A typical container for SharePoint Embedded will show in the SharePoint Sites datasets like this:&lt;/P&gt;
&lt;LI-CODE lang="json"&gt;{
    "ptenant": "00000001-0000-0000-0000-000000000000",
    "Id": "00000002-0000-0000-0000-000000000000",
    "Url": "https://tenantname.sharepoint.com/contentstorage/CSP_00000009-0000-0000-0000-000000000000",
    "ArchiveState": "None",
    "RootWeb": {
        "Configuration": 0,
        "Id": "00000003-0000-0000-0000-000000000000",
        "Title": "Ideas",
        "WebTemplate": "CSPCONTAINER",
        "WebTemplateId": 70,
        "LastItemModifiedDate": "2025-01-10T22:55:18.000Z"
    },
    "WebCount": 1,
    "StorageQuota": 27487790694400,
    "StorageUsed": 149186,
    "StorageMetrics": {
        "MetadataSize": 8506,
        "TotalFileCount": 2,
        "TotalFileStreamSize": 41924,
        "TotalSize": 149186
    },
    "GroupId": "00000000-0000-0000-0000-000000000000",
    "GeoLocation": "CAN",
    "IsInRecycleBin": false,

    "RecycleBinItemCount": 0,
    "RecycleBinItemSize": 0,
    "SecondStageRecycleBinStorageUsage": 0,
    "IsTeamsConnectedSite": false,
    "IsTeamsChannelSite": false,
    "TeamsChannelType": "None",
    "IsCommunicationSite": false,
    "IsHubSite": false,
    "HubSiteId": "00000000-0000-0000-0000-000000000000",
    "IsOneDrive": false,
    "BlockAccessFromUnmanagedDevices": false,
    "BlockDownloadOfAllFilesOnUnmanagedDevices": false,
    "BlockDownloadOfViewableFilesOnUnmanagedDevices": false,
    "IsExternalSharingEnabled": true,
    "ShareByEmailEnabled": true,
    "ShareByLinkEnabled": true,
    "SensitivityLabelInfo": {},
    "IBMode": "Open",
    "RelatedGroupId": "00000000-0000-0000-0000-000000000000",
    "Owner": {
        "Name": "System Account"
    },
    "SecondaryContact": {},
    "ReadLocked": false,
    "ReadOnly": false,
    "CreatedTime": "2025-01-09T22:54:50Z",
    "LastSecurityModifiedDate": "2025-01-09T22:57:11Z",
    "Operation": "Full",
    "LastUserAccessDate": "2025-01-09T22:54:51Z",
    "SnapshotDate": "2025-01-11T00:00:00Z"
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Note: In all the sample JSON objects shown in this post, ids were redacted with zeros.&lt;/P&gt;
&lt;P&gt;You can learn more about how SharePoint uses WebTemplateId to identify the type of site at &lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-is-onedrive-included/4073916" target="_blank" rel="noopener"&gt;MGDC for SharePoint FAQ: Is OneDrive included?&lt;/A&gt;&lt;/P&gt;
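&lt;P&gt;As a quick sketch, the WebTemplateId filter described above can be applied after download with a few lines of Python. This is illustrative only, assuming the Sites dataset was delivered as JSON files with one site object per line (the ids below are placeholders):&lt;/P&gt;

```python
import json

def embedded_containers(lines):
    """Yield only SharePoint Embedded containers (RootWeb.WebTemplateId 70)."""
    for line in lines:
        site = json.loads(line)
        if site.get("RootWeb", {}).get("WebTemplateId") == 70:
            yield site

# Two sample rows: an Embedded container (70) and a OneDrive site (21).
rows = [
    '{"Id": "site1", "RootWeb": {"WebTemplateId": 70, "WebTemplate": "CSPCONTAINER"}}',
    '{"Id": "site2", "RootWeb": {"WebTemplateId": 21, "WebTemplate": "SPSPERS"}}',
]
print([s["Id"] for s in embedded_containers(rows)])  # ['site1']
```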
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;5. Permissions&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The SharePoint Permissions dataset includes all the permissions granted for all the different types of SharePoint content, including OneDrive, Teams-connected sites, and SharePoint Embedded containers. Granular permissions for folders and individual files are also included.&lt;/P&gt;
&lt;P&gt;Here’s a typical object from the SharePoint Permissions dataset:&lt;/P&gt;
&lt;LI-CODE lang="json"&gt;{
    "ptenant": "00000001-0000-0000-0000-000000000000",
    "SiteId": "00000002-0000-0000-0000-000000000000",
    "WebId": "00000003-0000-0000-0000-000000000000",
    "ListId": "00000004-0000-0000-0000-000000000000",
    "ItemType": "File",
    "ItemURL": "contentstorage/CSP_00000009-0000-0000-0000-000000000000/Document Library/LoopAppData/ProjectBlue.loop",
    "FileExtension": "loop",
    "RoleDefinition": "Contribute",
    "LinkId": "00000005-0000-0000-0000-000000000000",
    "ScopeId": "00000006-0000-0000-0000-000000000000",
    "LinkScope": "Organization",
    "SharedWithCount": [],
    "SharedWith": [],
    "Operation": "Full",
    "SnapshotDate": "2025-01-11T00:00:00Z",
    "ShareCreatedBy": {
        "Type": "User",
        "Name": "Test User",
        "Email": "test@tenantname.onmicrosoft.com",
        "UPN": "test@tenantname.onmicrosoft.com"
    },
    "ShareCreatedTime": "2025-01-09T22:55:51Z",
    "ShareLastModifiedBy": {
        "Type": "User",
        "Name": "Test User",
        "Email": "test@tenantname.onmicrosoft.com",
        "UPN": "test@tenantname.onmicrosoft.com"
    },
    "ShareLastModifiedTime": "2025-01-09T22:55:51Z",
    "ListItemId": 4,
    "UniqueId": "00000007-0000-0000-0000-000000000000"
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Note: The example above shows a single Contribute permission to a Loop file using a sharing link. Loop uses SharePoint Embedded to store files and it’s something you can easily configure and test in your tenant.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;MGDC for SharePoint will show all permissions granted in SharePoint Embedded containers, including permissions granted at various levels of the SharePoint hierarchy (site, web, library, folder, or file) and to diverse types of security principals (users, SharePoint groups and Entra ID groups, internal or external).&lt;/P&gt;
&lt;P&gt;You can learn more about it at &lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-what-is-in-the-permissions-dataset/4075447" target="_blank" rel="noopener"&gt;MGDC for SharePoint FAQ: What is in the Permissions dataset?&lt;/A&gt;&lt;/P&gt;
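&lt;P&gt;Once downloaded, rows like the one above are easy to summarize. The snippet below is a hypothetical example that tallies permission rows by RoleDefinition; the sample objects are made up but follow the shape shown above.&lt;/P&gt;

```python
from collections import Counter

def role_counts(permissions):
    """Count permission rows per RoleDefinition (Contribute, Read, etc.)."""
    return Counter(p.get("RoleDefinition", "Unknown") for p in permissions)

# Made-up rows following the Permissions dataset shape shown above.
perms = [
    {"RoleDefinition": "Contribute", "LinkScope": "Organization"},
    {"RoleDefinition": "Read", "LinkScope": "Organization"},
    {"RoleDefinition": "Contribute", "LinkScope": "Anyone"},
]
print(role_counts(perms))  # Counter({'Contribute': 2, 'Read': 1})
```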
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;6. Files&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Finally, you can get a complete list of every file inside the SharePoint Embedded containers, including details like file name, size, author, last modifier and more.&lt;/P&gt;
&lt;P&gt;Here’s a typical object from the SharePoint Files dataset:&lt;/P&gt;
&lt;LI-CODE lang="json"&gt;{
    "ptenant": "00000001-0000-0000-0000-000000000000",
    "SiteId": "00000002-0000-0000-0000-000000000000",
    "Author": {
        "Name": "Test User",
        "Email": "test@tenantname.onmicrosoft.com"
    },
    "DirName": "Document Library/LoopAppData",
    "Extension": "loop",
    "FileName": "ProjectBlue.loop",
    "IsLabelEncrypted": false,
    "ItemId": "00000008-0000-0000-0000-000000000000",
    "ListId": "00000004-0000-0000-0000-000000000000",
    "ListServerTemplate": "DocumentLibrary",
    "MajorVersion": 4,
    "MinorVersion": 0,
    "ModifiedBy": {
        "Name": "Test User",
        "Email": "test@tenantname.onmicrosoft.com"
    },
    "Operation": "Full",
    "ScopeId": "00000006-0000-0000-0000-000000000000",
    "SensitivityLabelInfo": {},
    "SiteUrl": "https://tenantname.sharepoint.com/contentstorage/CSP_00000009-0000-0000-0000-000000000000",
    "SizeInBytes": 24301,
    "SizeInBytesWithVersions": 57765,
    "TimeCreated": "2025-01-09T22:55:08Z",
    "TimeLastModified": "2025-01-09T22:56:13Z",
    "WebId": "00000003-0000-0000-0000-000000000000",
    "WebTemplateId": 70,
    "SnapshotDate": "2025-01-11T00:00:00Z"
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Note: SharePoint Embedded does not support minor versions, so the “MinorVersion” column in the Files dataset will always show zero in this scenario.&lt;/P&gt;
&lt;P&gt;For details, check the&amp;nbsp;&lt;A href="https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointfiles.md" target="_blank" rel="noopener"&gt;schema for the SharePoint Files dataset&lt;/A&gt;.&lt;/P&gt;
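&lt;P&gt;Since the Files dataset reports both SizeInBytes and SizeInBytesWithVersions, one thing you can compute per container is how much storage older versions consume. A minimal sketch, assuming rows shaped like the sample above:&lt;/P&gt;

```python
def version_overhead(files):
    """Bytes consumed by previous versions, totaled per site (container)."""
    totals = {}
    for f in files:
        extra = f["SizeInBytesWithVersions"] - f["SizeInBytes"]
        totals[f["SiteId"]] = totals.get(f["SiteId"], 0) + extra
    return totals

# Sample rows using the sizes from the JSON example above.
files = [
    {"SiteId": "c1", "SizeInBytes": 24301, "SizeInBytesWithVersions": 57765},
    {"SiteId": "c1", "SizeInBytes": 1000, "SizeInBytesWithVersions": 1000},
]
print(version_overhead(files))  # {'c1': 33464}
```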
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;7. Filtering&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;You can easily download entire datasets and use Azure Data Tools to filter for rows where WebTemplateId is 70. For Sites and Files, there is also a mechanism to filter at the source, so you pull from Microsoft Graph Data Connect only the data related to SharePoint Embedded. The same filtering mechanism can be used to pull data just for a specific site, filtered by Id.&lt;/P&gt;
&lt;P&gt;For details, check this article about &lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-how-can-i-filter-rows-on-a-dataset/4094712" target="_blank" rel="noopener"&gt;Filtering SharePoint dataset at the source&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;8. Conclusion&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;I hope this post helped you understand how to use MGDC for SharePoint to gather information about SharePoint Embedded containers and files. For more information, visit &lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 15 Jan 2025 00:45:54 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-and-sharepoint-embedded/ba-p/4366205</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-01-15T00:45:54Z</dc:date>
    </item>
    <item>
      <title>MGDC for SharePoint FAQ: How to use Excel for analytics</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-to-use-excel-for-analytics/ba-p/4356642</link>
      <description>&lt;P&gt;&lt;STRONG&gt;MGDC for SharePoint – Dataset types and features&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;In this post, we’ll discuss how to use Microsoft Excel to consume the data from Microsoft Graph Data Connect (MGDC) for SharePoint. If you’re new to MGDC, get an overview at &lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Introduction&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;In most examples on how to analyze the data from MGDC for SharePoint, we use Azure Synapse in combination with Power BI Desktop, which is a very capable tool that can turn your data into coherent, visually immersive, and interactive insights.&lt;/P&gt;
&lt;P&gt;But keep in mind that, when pulling datasets from MGDC with Azure Synapse, the data becomes available in an Azure storage account as a set of JSON files. From there, you can use many different tools beyond Power BI. In this article, we will focus on how to read and analyze that data using Microsoft Excel.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2. Get the data into Azure storage&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The first step in the process is to get your data from MGDC. I have a blog post that describes that process at &lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/step-by-step-synapse-gather-a-detailed-dataset-on-sharepoint-sites-using-mgdc/4070563" target="_blank" rel="noopener"&gt;Step-by-step: (Synapse) Gather a detailed dataset on SharePoint Sites using MGDC&lt;/A&gt;. Section 1 of that blog shows how to set up Microsoft Graph Data Connect and section 2 describes how to run an Azure Synapse pipeline to pull the data. If you complete these first two sections, you will end up with your SharePoint Sites data in Azure.&lt;/P&gt;
&lt;P&gt;You can skip section 3 of that blog, which describes how to get the data into a Power BI dashboard. Replace that section with the information in this article, which brings the data into Microsoft Excel instead.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;3. Bring the data from Azure storage to Excel&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Next, you need to bring the data in the JSON files from the Azure storage account to your Excel spreadsheet. Fortunately, Excel has the perfect set of tools to bring in that data.&lt;/P&gt;
&lt;P&gt;You should start with a blank spreadsheet. Select the “Data” tab in the ribbon and use the option to “Get Data”. Then select the option “from Azure” and choose “From Azure Data Lake Storage Gen2”. This will make sure that you’re getting the data in your ADLS Gen2 storage account.&lt;/P&gt;
&lt;img&gt;Data --&amp;gt; Get Data --&amp;gt; From Azure --&amp;gt; From Azure Data Lake Storage Gen2&lt;/img&gt;
&lt;P&gt;Next, you will need to provide the URL to the ADLS Gen2 storage. Provide the full path, with the container and folder, in the format: &lt;A href="https://accountname.dfs.core.windows.net/container/folder" target="_blank" rel="noopener"&gt;https://accountname.dfs.core.windows.net/container/folder&lt;/A&gt;. The account name is in step 1d of the &lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/step-by-step-synapse-gather-a-detailed-dataset-on-sharepoint-sites-using-mgdc/4070563" target="_blank" rel="noopener"&gt;step-by-step guide&lt;/A&gt; mentioned before. The container and folder names are in step 1f of the guide. You can also find the URL in the Azure Portal, under the Settings/Endpoints section of the storage account.&lt;/P&gt;
&lt;img&gt;URL to Azure Data Lake Storage Gen2&lt;/img&gt;
&lt;P&gt;Once you provide the URL and click OK, you will need to authenticate. Select the option to use the storage account key and type in the key you saved in step 1d of the&amp;nbsp;&lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/step-by-step-synapse-gather-a-detailed-dataset-on-sharepoint-sites-using-mgdc/4070563" target="_blank" rel="noopener"&gt;step-by-step guide&lt;/A&gt;. You can find the key in the Azure Portal, under the “Security + Networking” / “Access Keys” section of the storage account.&lt;/P&gt;
&lt;img&gt;Account key&lt;/img&gt;
&lt;P&gt;Once you have typed the account key, click “Connect”.&lt;/P&gt;
&lt;img&gt;File List&lt;/img&gt;
&lt;P&gt;At this point, you want to click on the “Transform Data” button to filter for the specific files you want. This will open the “Power Query Editor” window, where you can apply specific steps to filter the data coming in.&lt;/P&gt;
&lt;img&gt;Power Query Editor&lt;/img&gt;
&lt;P&gt;The easiest way to filter out the metadata files is to scroll right until you find the “Folder Path” column and filter that column for paths that do not contain the word “metadata”.&lt;/P&gt;
&lt;img&gt;Folder Path --&amp;gt; down arrow --&amp;gt; Text filters --&amp;gt; Does not contain...&lt;/img&gt;&lt;img&gt;Filter rows, does not contain "metadata"&lt;/img&gt;
&lt;P&gt;After that, you can scroll left to the “Content” column and click on the button to “Combine files”. The query will then pull in the content of the files and expand all the columns from the JSON.&lt;/P&gt;
&lt;img&gt;Filtered list --&amp;gt; Content --&amp;gt; Combine files&lt;/img&gt;
&lt;P&gt;Finally, you can click on the “Close &amp;amp; Load” button to bring the data into a sheet.&lt;/P&gt;
&lt;img&gt;Close &amp;amp; Load&lt;/img&gt;
&lt;P&gt;With that, you can play with the data in Excel.&lt;/P&gt;
&lt;img&gt;Data returned to Excel&lt;/img&gt;
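&lt;P&gt;If you prefer a script to the Power Query steps above, the same filter-and-combine logic can be sketched in Python. This is a rough, hypothetical equivalent: it skips the metadata outputs and concatenates the JSON rows from the remaining files.&lt;/P&gt;

```python
import json

def combine_json_files(folder):
    """Skip metadata files, then combine rows from the remaining JSON files,
    mirroring the Folder Path filter and Combine step done in Power Query."""
    rows = []
    for path, lines in folder.items():
        if "metadata" in path:  # same test as the "does not contain" filter
            continue
        rows.extend(json.loads(line) for line in lines)
    return rows

# Hypothetical in-memory stand-in for the storage folder contents.
folder = {
    "sites/metadata/part-0.json": ['{"ignored": true}'],
    "sites/part-0.json": ['{"Id": "s1"}', '{"Id": "s2"}'],
}
print([r["Id"] for r in combine_json_files(folder)])  # ['s1', 's2']
```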
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;4. Create PivotTables&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Once the data is imported into your Excel spreadsheet, you will be able to create data visualizations from it. The most interesting way to do this is to use Excel PivotTables.&lt;/P&gt;
&lt;P&gt;For instance, here’s a pivot table with storage (Storage Used) by site type (RootWeb.WebTemplate) on the different dates you pulled data (SnapshotDate). See the sample PivotTable below after some adjustments and formatting.&lt;/P&gt;
&lt;img&gt;PivotTable in Excel&lt;/img&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;5. Tips&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Here are a few additional comments.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Distinct types of authentication: &lt;/STRONG&gt;As you might have noticed, there are multiple ways to authenticate to get access to the Azure storage account. In addition to using a key to the entire storage account as shown here, you could use Azure to grant specific permissions to a user and log in as that user. This will help restrict access to specific folders and grant limited permissions (like read-only).&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Create custom columns: &lt;/STRONG&gt;To help with your data analysis, you might want to add a few custom or calculated columns. You can do this in many ways, including transformations in Power Query, formulas in the Excel sheet or in the pivot table itself. You can read more about it at &lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/mgdc-for-sharepoint-faq-how-to-create-custom-columns-in-power-bi/4124590" target="_blank" rel="noopener"&gt;How to create custom&amp;nbsp;columns&amp;nbsp;in Power BI&lt;/A&gt; (this says Power BI but applies to Excel as well).&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;More than one million rows: &lt;/STRONG&gt;If you’re using this to visualize data from the SharePoint Sites dataset, you are likely to have fewer than one million rows, which is the Microsoft Excel limit. If you’re one of the few tenants that will have more than one million sites or if you’re analyzing a larger dataset like Permissions or Files, you can only pull the first one million rows into a sheet. To overcome this, you can load the data into a model and use that to drive your pivot tables. You can read more about it at &lt;A href="https://barreto.home.blog/2022/10/21/loading-csvtext-files-with-more-than-a-million-rows-into-excel/" target="_blank" rel="noopener"&gt;Loading CSV/text files with more than a million rows into&amp;nbsp;Excel&lt;/A&gt; (this says text files but applies to JSON as well).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;6. Conclusion&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;I hope this post helped you understand how to use Microsoft Excel to pull JSON data from an Azure storage account used by MGDC for SharePoint. For more information, visit &lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Jan 2025 17:52:34 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-to-use-excel-for-analytics/ba-p/4356642</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2025-01-22T17:52:34Z</dc:date>
    </item>
    <item>
      <title>Error: The provided client secret keys are expired</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/error-the-provided-client-secret-keys-are-expired/ba-p/4339058</link>
      <description>&lt;P&gt;If you have been using Microsoft Graph Data Connect for a while, you might have configured a recurring pipeline that has been running for several months. In that case, you might run into this error:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;PRE class="lia-indent-padding-left-30px"&gt;Operation on target task_name failed:&lt;BR /&gt;ErrorCode=UserErrorOffice365SinkValidationError,&lt;BR /&gt;'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,&lt;BR /&gt;Message=Office365 data loading job failed to validate the sink data store.&lt;BR /&gt;Possible causes are the name of the data store or its folder path is invalid,&lt;BR /&gt;or the provided credentials do not have enough permission to access the folder.&lt;BR /&gt;Detailed error: ADLS Gen2 operation failed for: Failed to get access token by using service principal.&lt;BR /&gt;Error: invalid_client, Error Message: A configuration issue is preventing authentication –&lt;BR /&gt;check the error message from the server for details.&lt;BR /&gt;You can modify the configuration in the application registration portal.&lt;BR /&gt;See &lt;A href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Faka.ms%2Fmsal-net-invalid-client&amp;amp;data=05%7C02%7CJose.Barreto%40microsoft.com%7Cd160e2a5c0b34435187208dd0e6608ff%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638682553597977198%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&amp;amp;sdata=qgMZg%2B8EhhvjlSK4CmwmOMhshU%2F%2BoKFwTjowGln0wXc%3D&amp;amp;reserved=0" target="_blank" rel="noopener"&gt;https://aka.ms/msal-net-invalid-client&lt;/A&gt; for details.&amp;nbsp;&lt;BR /&gt;Original exception: AADSTS7000222:&lt;BR /&gt;The provided client secret keys for app 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' are expired.&lt;BR /&gt;Visit the Azure portal to create new keys for your app: &lt;A 
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Faka.ms%2FNewClientSecret&amp;amp;data=05%7C02%7CJose.Barreto%40microsoft.com%7Cd160e2a5c0b34435187208dd0e6608ff%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638682553598000320%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&amp;amp;sdata=8CrV4KHkDIyCGZY9DhGQlM2W%2BFkRHF29zt4IvR9gU%2F8%3D&amp;amp;reserved=0" target="_blank" rel="noopener"&gt;https://aka.ms/NewClientSecret&lt;/A&gt;,&lt;BR /&gt;or consider using certificate credentials for added security: &lt;A href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Faka.ms%2FcertCreds&amp;amp;data=05%7C02%7CJose.Barreto%40microsoft.com%7Cd160e2a5c0b34435187208dd0e6608ff%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638682553598012111%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&amp;amp;sdata=aPrr0QajlwFJLC8m9ZtH3hrldUntmb61tDhw4vmYFwI%3D&amp;amp;reserved=0" target="_blank" rel="noopener"&gt;https://aka.ms/certCreds&lt;/A&gt;.&lt;BR /&gt;Trace ID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&lt;BR /&gt;Correlation ID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&lt;BR /&gt;Timestamp: 2024-11-26 18:35:23Z.&lt;BR /&gt;Account: 'account_name'.&lt;BR /&gt;FileSystem: 'container_name'.&lt;BR /&gt;Path: 'folder_name'.,&lt;BR /&gt;Source=Microsoft.DataTransfer.ClientLibrary,'&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The main error is really "The provided client secret keys for app 'x' are expired."&lt;BR /&gt;The solution is simple. Go back to Azure Entra ID to configure a new secret for the App Registration. You can follow the instructions at &lt;A href="https://techcommunity.microsoft.com/blog/microsoft_graph_data_connect_for_sharepo/step-by-step-synapse-gather-a-detailed-dataset-on-sharepoint-sites-using-mgdc/4070563#:~:text=1b.%20Create%20the%20Application%20Identity" target="_blank" rel="noopener" data-lia-auto-title="step 1b in our Synapse step-by-step instructions" data-lia-auto-title-active="0"&gt;step 1b in our Synapse step-by-step instructions&lt;/A&gt;&amp;nbsp;to add a new secret.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Finally, be sure to add the new secret to the linked services of the pipeline. You need to update both the source (Source/MGDC) and the destination (Sink/Storage).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can avoid this issue completely by keeping track of the secret expiration date and creating a new one (and updating the pipeline) ahead of time. You can have multiple secrets in place, allowing for this secret rotation to be done proactively.&lt;/P&gt;
      <pubDate>Wed, 27 Nov 2024 20:38:10 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/error-the-provided-client-secret-keys-are-expired/ba-p/4339058</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2024-11-27T20:38:10Z</dc:date>
    </item>
    <item>
      <title>Error: Value cannot be null - Empty Columns</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/error-value-cannot-be-null-empty-columns/ba-p/4330345</link>
      <description>&lt;P&gt;A few customers are seeing an issue with Synapse pipelines for Microsoft Graph Data Connect for SharePoint.&lt;/P&gt;
&lt;P&gt;The symptom is straightforward: you get an error message saying that the "Columns" parameter in the request is null. It shows something like this:&lt;/P&gt;
&lt;PRE class="lia-indent-padding-left-90px"&gt;Operation on target TTT failed: &lt;/PRE&gt;
&lt;PRE class="lia-indent-padding-left-90px"&gt;Activity failed because an inner activity failed; &lt;BR /&gt;Inner activity name: XXX, &lt;BR /&gt;Error:&amp;nbsp; &lt;BR /&gt;&amp;nbsp; &amp;nbsp; ErrorCode=UserErrorOffice365DataLoaderError,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; 'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; Message=Office365 data loading failed to execute. &lt;BR /&gt;&amp;nbsp; &amp;nbsp; office365LoadErrorType: PermanentError.&lt;BR /&gt;&amp;nbsp; &amp;nbsp; Value cannot be null.&lt;BR /&gt;&amp;nbsp; &amp;nbsp; Parameter name: Columns,&lt;BR /&gt;&amp;nbsp; &amp;nbsp; Source=Microsoft.DataTransfer.ClientLibrary,'&lt;/PRE&gt;
&lt;P&gt;This error is basically telling you that the Columns parameter is required and cannot be null.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The columns, defined in the source definition of the copy data task, should have been automatically populated for you, so this is not expected.&lt;/P&gt;
&lt;P&gt;To fix it, you can try these steps:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Open the pipeline that failed&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Select the copy data task in the pipeline&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Select the source tab&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Under "output columns" click on "Clear"&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Under "output columns" click on "Import Schema"&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Publish the pipeline changes&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;Trigger the pipeline again&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The picture below shows the two options you need to click on.&lt;/P&gt;
&lt;img&gt;Click on "Clear", then click on "Import Schema"&lt;/img&gt;
&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I hope this article can help you fix your pipeline. For more articles on Microsoft Graph Data Connect for SharePoint, see &lt;A class="lia-external-url" href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 27 Nov 2024 17:19:05 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/error-value-cannot-be-null-empty-columns/ba-p/4330345</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2024-11-27T17:19:05Z</dc:date>
    </item>
    <item>
      <title>MGDC for SharePoint FAQ: Why does the file count not match?</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-why-does-the-file-count-not-match/ba-p/4299057</link>
      <description>&lt;P&gt;I am frequently asked why the number of files in the Sites dataset does not match the count of the Files dataset. It is true that &lt;SPAN style="font-style: var(--lia-blog-font-style); font-weight: var(--lia-blog-font-weight); font-family: var(--lia-blog-font-family); background-color: var(--lia-rte-bg-color); color: var(--lia-bs-body-color); font-size: var(--lia-bs-font-size-base);"&gt;they sometimes don't. Here are the counts we are talking about:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Sites:&lt;/STRONG&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;SELECT Id as SiteId,
       StorageMetrics.TotalFileCount AS FilesInSite
FROM Sites&lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;Files:&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;SELECT SiteId,
       COUNT(*) AS FilesInSite
FROM Files
GROUP BY SiteId&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The main reasons for the discrepancy are:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The Files dataset is collected weekly, while the Sites dataset is collected daily, making it difficult to capture the exact same state.&lt;/LI&gt;
&lt;LI&gt;The Files dataset includes only items inside Document Libraries, whereas the Sites dataset counts all files, including those in other list types.&lt;/LI&gt;
&lt;LI&gt;The Files dataset does not include pages (files ending in .ASPX), while the Sites dataset counts all files.&lt;/LI&gt;
&lt;LI&gt;The Files dataset does not include items in the primary and secondary recycle bins, whereas the Sites dataset counts all files.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here are a few examples:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;A new site was created, and a few files were uploaded to a document library in this new site. Two days later, you get the SharePoint Sites dataset and find the new site with the right count of files. However, you cannot find these files in the SharePoint Files dataset. This is because the Files dataset may take one week to refresh. Wait a week and try pulling the Files dataset again.&lt;/LI&gt;
&lt;LI&gt;A team site was created with a few lists, where some of the items have file attachments. The SharePoint Files dataset does not show these file attachments. This is because the SharePoint Files dataset will only show files in document libraries.&lt;/LI&gt;
&lt;LI&gt;You deleted files from a document library in a SharePoint site a few weeks ago. Now the count of files in the SharePoint Sites dataset is larger than the count of files in the SharePoint Files dataset. That is because files in the recycle bins are excluded from the Files dataset but are still counted in the Sites dataset. Once these deleted files pass through the first and second level recycle bins, the count in the Sites dataset will also reflect that reduction.&lt;/LI&gt;
&lt;/UL&gt;
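&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you want to sanity-check the discrepancy yourself, here is a small sketch that counts files per site the way the Files dataset does (document-library items only, no .ASPX pages). This is just an illustration with made-up field names, not the exact MGDC schema:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Illustrative sketch: count files per site following the Files dataset rules.
# The field names below are made up for this example, not the exact MGDC schema.

def files_count_by_site(files):
    counts = {}
    for f in files:
        if not f["IsInDocumentLibrary"]:
            continue  # the Files dataset only covers document libraries
        if f["Extension"].lower() == "aspx":
            continue  # pages are excluded from the Files dataset
        counts[f["SiteId"]] = counts.get(f["SiteId"], 0) + 1
    return counts

files = [
    {"SiteId": "s1", "Extension": "docx", "IsInDocumentLibrary": True},
    {"SiteId": "s1", "Extension": "aspx", "IsInDocumentLibrary": True},
    {"SiteId": "s1", "Extension": "txt", "IsInDocumentLibrary": False},
]
print(files_count_by_site(files))  # {'s1': 1}&lt;/LI-CODE&gt;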
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;Archimedes comparing file counts...&lt;/img&gt;
&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 27 Nov 2024 17:21:36 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-why-does-the-file-count-not-match/ba-p/4299057</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2024-11-27T17:21:36Z</dc:date>
    </item>
    <item>
      <title>MGDC for SharePoint FAQ: How to flatten datasets for SQL or Fabric</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-to-flatten-datasets-for-sql-or/ba-p/4292299</link>
      <description>&lt;P&gt;When you get your data from Microsoft Graph Data Connect (MGDC), you will typically get that data as a collection of JSON objects in an Azure Data Lake Storage (ADLS) Gen2 storage account. For those handling large datasets, it might be useful to move the data to a SQL Server or to OneLake (lakehouse). In those cases, you might need to flatten the datasets. This post describes how to do that.&lt;/P&gt;
&lt;P&gt;If you’re not familiar with MGDC for SharePoint, start with &lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;1. Flattening&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Most of the MGDC for SharePoint datasets come with nested objects. That means that a certain object has other objects inside it. For instance, if you have a SharePoint Groups object, it might have multiple Group Members inside. If you have a SharePoint Permissions object, you could have many Permissions Recipients (also known as Sharees). For each SharePoint File object, you will have a single Author object inside.&lt;/P&gt;
&lt;P&gt;When you convert the datasets from JSON to other formats, those formats may require (or perform better with) a structure that has no objects nested inside other objects. To overcome that, you can turn those child objects into properties of the parent object. For instance, instead of having the File object with an Author object inside, you can have multiple author-related columns, such as Author.Name and Author.Email, as properties of the flattened File object.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;2. Nested Objects&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;You can get the full list of SharePoint datasets in MGDC at&amp;nbsp;&lt;A href="https://aka.ms/SharePointDatasets" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointDatasets&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Here is a table with a list of objects and their nested objects:&lt;/P&gt;
&lt;DIV class="lia-table-wrapper styles_table-responsive__MW0lN"&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;
&lt;TABLE border="1"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;&lt;STRONG&gt;Object&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;&lt;STRONG&gt;How many?&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;&lt;STRONG&gt;Primary Key&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;&lt;STRONG&gt;Nested Object&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;&lt;STRONG&gt;How many?&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;&lt;STRONG&gt;Add to Primary Key&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Sites&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per Site&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;Id&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;RootWeb&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per Site&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Sites&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per Site&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;Id&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;StorageMetrics&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per Site&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Sites&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per Site&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;Id&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;SensitivityLabelInfo&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per Site&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Sites&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per Site&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;Id&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;Owner&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per Site&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Sites&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per Site&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;Id&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;SecondaryContact&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per Site&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Groups&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per Group&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;SiteId + GroupId&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;Owner&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per Group&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Groups&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per Group&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;SiteId + GroupId&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;Members&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per Member&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;COALESCE(AADObjectId, Email, Name)&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Permissions&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per Permission&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;SiteId + ScopeId + RoleDefinition + LinkId&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;SharedWithCount&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per Recipient Type&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;Type&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Permissions&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per Permission&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;SiteId + ScopeId + RoleDefinition + LinkId&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;SharedWith&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per Recipient or Sharee&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;COALESCE(AADObjectId, Email, Name)&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Files&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per File&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;SiteId + WebId + ListId + ItemId&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;Author&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per File&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD width="108"&gt;
&lt;P&gt;Files&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="104"&gt;
&lt;P&gt;1 per File&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="129"&gt;
&lt;P&gt;SiteId + WebId + ListId + ItemId&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="110"&gt;
&lt;P&gt;ModifiedBy&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="92"&gt;
&lt;P&gt;1 per File&lt;/P&gt;
&lt;/TD&gt;
&lt;TD width="176"&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When you flatten a dataset and there is an object with multiple objects inside (like Group Members or Permission Recipients), the number of rows will increase. You also need to add a column to the primary key to keep it unique.&lt;/P&gt;
&lt;P&gt;Also note that the File Actions, Sync Health and Sync Errors datasets do not have any nested objects.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;3. One Object per Parent&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;When the nested object has only one instance, things are simple. As we described for the Author nested object inside the File object, you promote the properties of the nested object to be properties of the parent object. This is because the Author is defined as the user that initially created the file. There is always one and only one Author.&lt;/P&gt;
&lt;P&gt;This can even happen multiple times for the same object. The File also has a ModifiedBy property, which identifies the single user that last changed the file. In that case, there is also only one ModifiedBy per File.&lt;/P&gt;
&lt;P&gt;The Site object also includes several properties in this style, like RootWeb, StorageMetrics, SensitivityLabelInfo, Owner and SecondaryContact. Note that, in the context of the Site object, there is only one Owner. Actually, there can be two, but the second one is tracked in a separate object called SecondaryContact, which is effectively the secondary owner.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;4. Multiple Objects per Parent&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;The SharePoint Permissions dataset has a special condition that might create trouble for flattening. There are two sets of nested objects with multiple objects each: SharedWith and SharedWithCount. SharedWith has the list of Recipients and SharedWithCount has a list of Recipient Types. If you just let the tools flatten it, you will end up with a cross join of the two. As an example, if you have 10 recipients in an object and 2 types of recipients (internal users and external users, for instance), you will end up with 20 objects in the flattened dataset instead of the expected 10 objects (one per recipient).&lt;/P&gt;
&lt;P&gt;To avoid this, in this specific condition, I would recommend just excluding the SharedWithCount column from the object before flattening.&lt;/P&gt;
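&lt;P&gt;As an illustration, here is a sketch of flattening one Permissions-style object: it drops SharedWithCount, explodes SharedWith into one row per recipient, promotes the nested properties to flat columns, and extends the primary key with COALESCE(AADObjectId, Email, Name) as shown in the table above. The field names are illustrative, so check the actual schemas at &lt;A href="https://aka.ms/SharePointDatasets" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointDatasets&lt;/A&gt;:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Illustrative sketch, not the exact MGDC schema: flatten a permission
# by dropping SharedWithCount and exploding SharedWith.

def flatten_permission(perm):
    rows = []
    base = {k: v for k, v in perm.items()
            if k not in ("SharedWith", "SharedWithCount")}
    for recipient in perm.get("SharedWith", []):
        row = dict(base)
        for k, v in recipient.items():
            row["SharedWith." + k] = v  # promote nested properties to columns
        # extend the primary key, like COALESCE(AADObjectId, Email, Name)
        row["RecipientKey"] = (recipient.get("AADObjectId")
                               or recipient.get("Email")
                               or recipient.get("Name"))
        rows.append(row)
    return rows

perm = {
    "SiteId": "s1", "ScopeId": "sc1", "RoleDefinition": "Read", "LinkId": "l1",
    "SharedWith": [{"Name": "Alice", "Email": "alice@contoso.com"},
                   {"Name": "Bob", "AADObjectId": "guid-bob"}],
    "SharedWithCount": [{"Type": "Internal", "Count": 2}],
}
for row in flatten_permission(perm):
    print(row["RecipientKey"])  # alice@contoso.com, then guid-bob&lt;/LI-CODE&gt;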
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;H4&gt;&lt;STRONG&gt;5. Conclusion&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;I hope this clarifies how you can flatten the MGDC for SharePoint datasets, particularly the SharePoint Permissions dataset.&lt;/P&gt;
&lt;P&gt;For further details about the MGDC for SharePoint, see &lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Nov 2024 17:25:50 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-to-flatten-datasets-for-sql-or/ba-p/4292299</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2024-11-27T17:25:50Z</dc:date>
    </item>
    <item>
      <title>MGDC for SharePoint FAQ: How can I estimate my Azure bill? Updated!</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-can-i-estimate-my-azure-bill-updated/ba-p/4284157</link>
      <description>&lt;H2 class="wp-block-heading"&gt;Introduction&lt;/H2&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When gathering SharePoint data through Microsoft Graph Data Connect, you are billed through Azure. As I write this blog, the price to pull 1,000 objects from Microsoft Graph Data Connect in the US is $0.75, plus the cost for Azure infrastructure like Azure Storage and Azure Synapse.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;That is true for all datasets except the SharePoint Files dataset, which has a different billing rate. Because of its typical high volume, the SharePoint Files dataset is billed at $0.75 per 50,000 objects.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I wrote a blog about&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-what-counts-as-an-object/ba-p/4074274" target="_blank" rel="noopener"&gt;what counts as an object&lt;/A&gt;, but I frequently get questions about how to estimate the overall Azure bill for the Microsoft Graph Data Connect for SharePoint for a specific project. Let me try to clarify things…&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Before we start, here are a few notes and disclaimers:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;These are estimates and your specific Azure bill will vary.&lt;/LI&gt;
&lt;LI&gt;Check the official Azure links provided. Rates may vary by country and over time.&lt;/LI&gt;
&lt;LI&gt;These are Azure pay-as-you-go list prices in the US as of October 2024.&lt;/LI&gt;
&lt;LI&gt;You may benefit from Azure discounts, like savings using a pre-paid plan.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="wp-block-heading"&gt;How many objects?&lt;/H2&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To estimate the number of objects, you start by finding out the number of sites in the tenant. This should include all sites (not just active sites) in your tenant. You can find this number easily in the SharePoint Admin Center. That will be the number of objects in your SharePoint Sites dataset.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Finding the number of SharePoint Groups and SharePoint Permissions will require some estimation. I recently collected some telemetry and saw that the average number of SharePoint Groups per Site for a sample of large tenants was around 31. The average SharePoint permissions per site was around 61. The average number of files per site was 2,874.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Delta pulls (gathering just what changed) will be smaller, but that also varies depending on how much collaboration happens in your tenant (in the Delta numbers below, I am estimating a 5% change for an average collaboration level).&lt;/P&gt;
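&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To make the arithmetic concrete, here is a sketch that combines the per-site averages above with the US list rates quoted earlier ($0.75 per 1,000 objects, except Files at $0.75 per 50,000). This is only an estimate and your numbers will vary:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Sketch of the estimate: object counts from the per-site averages cited
# above, priced at the quoted US list rates. Your actual bill will vary.

PRICE_PER_UNIT = 0.75
AVG_PER_SITE = {"Sites": 1, "Groups": 31, "Permissions": 61, "Files": 2874}

def estimate_pull_cost(site_count, delta_fraction=1.0):
    """Estimated MGDC extraction cost per dataset, in USD.
    delta_fraction=1.0 is a full pull; 0.05 models a 5% weekly delta."""
    costs = {}
    for dataset, avg in AVG_PER_SITE.items():
        objects = site_count * avg * delta_fraction
        unit = 50000 if dataset == "Files" else 1000  # Files is billed per 50k
        costs[dataset] = round(objects / unit * PRICE_PER_UNIT, 2)
    return costs

print(estimate_pull_cost(10000))        # full pull for a 10,000-site tenant
print(estimate_pull_cost(10000, 0.05))  # weekly delta at 5% change&lt;/LI-CODE&gt;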
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here’s a table to help you estimate your Microsoft Graph Data Connect for SharePoint costs:&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Notes for the table above:&lt;/P&gt;
&lt;P&gt;* Higher collaboration level assumes twice the average in terms of groups, permissions and files.&lt;BR /&gt;** Security scenario includes Sites, Groups and Permissions. Capacity Scenario includes Sites and Files. &lt;BR /&gt;*** Delta assumes 5% change for average collaboration and 10% change for high collaboration. These are on the high side for one week's worth of changes. Your numbers will likely be smaller.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As you can see, smaller tenants with average collaboration will see costs below $10 for the smaller Sites dataset and below $1,000 for larger datasets like Permissions or Files.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you want to estimate the number of SharePoint objects more precisely, there is an option to sample the dataset and get a total object count without pulling the entire dataset. For more information, see&amp;nbsp;&lt;A href="https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-can-i-sample-or-estimate-the-number/ba-p/4094689" target="_self"&gt;MGDC for SharePoint FAQ: How can I sample or estimate the number of objects in a dataset?&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The official information about Microsoft Graph Data Connect pricing is at&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://azure.microsoft.com/en-us/pricing/details/graph-data-connect/" target="_blank" rel="noopener"&gt;https://azure.microsoft.com/en-us/pricing/details/graph-data-connect/&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2 class="wp-block-heading"&gt;How much storage?&lt;/H2&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The SharePoint information you get from Microsoft Graph Data Connect will be stored in an Azure Storage account. That also incurs some cost, but it’s usually small when compared to the Microsoft Graph Data Connect costs for data pulls. The storage will be proportional to the number of objects and to the size of these objects.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Again, this will vary depending on the amount of collaboration in the tenant. More sharing means more members in groups and more people in the permissions, which will result in more objects and also larger objects.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I also did some estimating of object size and arrived at around 2KB per SharePoint Site object, 20KB per SharePoint Group object, 3KB per Permission object and 1KB per file object. There are several Azure storage options including Standard vs. Premium, LRS vs. GRS, v1 vs. v2 and Hot vs. Cool. For Microsoft Graph Data Connect, you can go with a Standard + LRS + V2 + Cool blob storage account, which costs $0.01 per GB per month.&lt;/P&gt;
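&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is a small sketch of that storage arithmetic, combining the per-object sizes above with the per-site averages from the earlier section. Again, treat this as a rough estimate:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Sketch: monthly storage cost from the per-object sizes and per-site
# averages cited in this post. Treat as a rough estimate only.

KB_PER_OBJECT = {"Sites": 2, "Groups": 20, "Permissions": 3, "Files": 1}
AVG_PER_SITE = {"Sites": 1, "Groups": 31, "Permissions": 61, "Files": 2874}
USD_PER_GB_MONTH = 0.01  # Standard + LRS + v2 + Cool blob storage

def estimate_storage_cost(site_count):
    """Estimated monthly Azure Storage cost in USD for a full pull."""
    total_kb = sum(site_count * AVG_PER_SITE[d] * KB_PER_OBJECT[d]
                   for d in KB_PER_OBJECT)
    gb = total_kb / (1024 * 1024)
    return round(gb * USD_PER_GB_MONTH, 2)

print(estimate_storage_cost(100000))  # a 100,000-site tenant&lt;/LI-CODE&gt;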
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here’s a table to help you estimate your Azure Storage costs:&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The same notes from the previous table apply here.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As you can see, smaller tenants with average collaboration will see storage costs below $1000/month, most of it going to storing the larger Files dataset. The cost for delta dataset storage is also fairly small, even for the largest of tenants. There are additional costs per storage operation like read and write but those are negligible at this scale (for instance, $0.065 per 10,000 writes and $0.005 per 10,000 reads).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The official information about Azure Storage pricing is at&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://azure.microsoft.com/en-us/pricing/details/storage/blobs/" target="_blank" rel="noopener"&gt;https://azure.microsoft.com/en-us/pricing/details/storage/blobs/&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2 class="wp-block-heading"&gt;What about Synapse?&lt;/H2&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You will also typically use Azure Synapse to move the SharePoint data from Microsoft 365 to your Azure account. You could run a pipeline daily to get the information and do some basic processing, like computing deltas or creating aggregations.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here are a few of the items that are billed for Azure Synapse when running Microsoft Graph Data Connect pipelines:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Azure Hosted – Integration Runtime – Data Movement – $0.25/DIU-hour&lt;/LI&gt;
&lt;LI&gt;Azure Hosted – Integration Runtime – Pipeline Activity (Azure Hosted) – $0.005/hour&lt;/LI&gt;
&lt;LI&gt;Azure Hosted – Integration Runtime – Orchestration Activity Run – $1 per 1,000 runs&lt;/LI&gt;
&lt;LI&gt;vCore – $0.15 per vCore-hour&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As with Azure Storage, the costs here are small. You will likely need one pipeline run per day and it will typically run in less than one hour for a small tenant. Large tenants might need a few hours per run to gather all their SharePoint datasets. You should expect less than $10/month for smaller tenants and less than $100/month for larger and/or more collaborative tenants.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The official information about Azure Synapse pricing is at&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/" target="_blank" rel="noopener"&gt;https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2 class="wp-block-heading"&gt;Closing notes&lt;/H2&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;These are the main meters in Azure to get you started with costs related to Microsoft Graph Data Connect for SharePoint. I suggest experimenting with a small test/dev tenant to get familiar with Azure billing.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For more information about Microsoft Graph Data Connect for SharePoint, see the links at&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Nov 2024 22:03:16 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/mgdc-for-sharepoint-faq-how-can-i-estimate-my-azure-bill-updated/ba-p/4284157</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2024-11-01T22:03:16Z</dc:date>
    </item>
    <item>
      <title>Information Oversharing Template - MGDC for SharePoint</title>
      <link>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/information-oversharing-template-mgdc-for-sharepoint/ba-p/4277811</link>
      <description>&lt;H3&gt;&lt;STRONG&gt;1. Security Scenario&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;One of the most common scenarios for Microsoft Graph Data Connect for SharePoint is Information Oversharing. This security scenario is about knowing which items are being shared widely in the tenant and understanding how permissions are applied at each level.&lt;/P&gt;
&lt;P&gt;The main MGDC datasets for this scenario are SharePoint Sites and SharePoint Permissions, plus several group-related datasets (SharePoint Groups, AAD Group Details, AAD Group Owners and AAD Group Members). If you’re not familiar with these datasets, you can find details in the schema definitions at&amp;nbsp;&lt;A href="https://aka.ms/SharePointDatasets" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointDatasets&lt;/A&gt;.&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;2. Information Oversharing Template&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;To help you use these datasets, the team has worked on an Information Oversharing Template. This includes the following:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;A set of instructions&lt;/LI&gt;
&lt;LI&gt;An ARM template&lt;/LI&gt;
&lt;LI&gt;An Azure Synapse pipeline template&lt;/LI&gt;
&lt;LI&gt;A Power BI template&lt;/LI&gt;
&lt;LI&gt;A sample Power BI dashboard&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Let’s investigate each one.&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;3. Instructions&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;The template comes with a set of instructions with details about registering an application, creating a secret, deploying an ARM template, importing a pipeline template, running the pipeline and creating a Power BI dashboard.&lt;/P&gt;
&lt;P&gt;You can find the instructions for each of the items above at &lt;A href="https://go.microsoft.com/fwlink/?linkid=2207816" target="_blank" rel="noopener noreferrer"&gt;https://go.microsoft.com/fwlink/?linkid=2207816&lt;/A&gt;.&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;4. ARM Template&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;The Information Oversharing template includes an Azure Resource Manager template that helps you set up the resource group, Synapse workspace, Spark pool, storage account, and storage permissions. You deploy the ARM template to automatically provision those resources.&lt;/P&gt;
&lt;P&gt;You can find details on how to find and deploy the ARM template in the instructions (see item 3).&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;5. Pipeline Template&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;This Security scenario includes a pipeline template that you can use to simplify your deployment. In Azure Synapse, when you need to create a pipeline, you can use the Synapse Gallery to find the pipeline template.&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;You can find details on how to create and run the pipeline in the instructions (see item 3).&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;6. Power BI Template&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;The last step is to create a set of reports and dashboards using Power BI. There is a Power BI template for that, which comes in the form of a .PBIT file. This helps you quickly turn your data into ready-to-use preconfigured dashboards and reports, which you can then customize.&lt;/P&gt;
&lt;P&gt;You can find details on how to find and use the Power BI template in the instructions (see item 3). We also provide a Power BI file with some sample data which you can get from&amp;nbsp;&lt;BR /&gt;&lt;A href="https://go.microsoft.com/fwlink/?linkid=2211101" target="_blank" rel="noopener noreferrer"&gt;https://go.microsoft.com/fwlink/?linkid=2211101&lt;/A&gt;&amp;nbsp;(click on the download button to get the PBIX file).&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;7. Sample Dashboards and Reports&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;Here are samples of the reports and dashboards included in the Power BI Template.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;Security Main Report&lt;/img&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;Sharing by Item Type&lt;/img&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;Sharing by Role Definition&lt;/img&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;Sharing by File Extension&lt;/img&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;Sharing by Sensitivity Label&lt;/img&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;External Sharing&lt;/img&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;External Sharing by Domain&lt;/img&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;External Sharing by Sensitivity and Domain&lt;/img&gt;
&lt;P&gt;Note: These dashboards are provided as examples or starting points, which you can modify as needed. There are many more ways to visualize the data in these datasets.&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;8. Conclusion&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;I hope this provided a good overview of the Information Oversharing template. You can read more about the Microsoft Graph Data Connect for SharePoint at&amp;nbsp;&lt;A href="https://aka.ms/SharePointData" target="_blank" rel="noopener"&gt;https://aka.ms/SharePointData&lt;/A&gt;. There you will find many details, including a list of datasets available, other common scenarios and frequently asked questions.&lt;/P&gt;</description>
      <pubDate>Wed, 06 Nov 2024 19:47:31 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/microsoft-graph-data-connect-for/information-oversharing-template-mgdc-for-sharepoint/ba-p/4277811</guid>
      <dc:creator>Jose_Barreto</dc:creator>
      <dc:date>2024-11-06T19:47:31Z</dc:date>
    </item>
  </channel>
</rss>

