# MGDC for SharePoint FAQ: How to Run a PoC without Pulling Your Entire Tenant
## Overview

When getting started with SharePoint data in Microsoft Graph Data Connect (MGDC) for SharePoint, many teams want to validate scenarios, such as reporting or analytics, before committing to a full production deployment. A common first instinct is to pull a complete dataset from a production tenant. While this delivers the most comprehensive view of SharePoint usage, it also:

- Requires broad administrative authorization
- Consumes the most Azure compute and storage resources
- Increases MGDC extraction and processing costs
- Adds complexity to early experimentation

Fortunately, MGDC for SharePoint provides multiple ways to run low-cost experiments or proof-of-concept (PoC) deployments using partial or scoped datasets. This guide presents these options using a uniform comparison model, helping you choose the right approach based on:

- Cost
- Representativeness of production behavior
- Implementation effort
- Dataset completeness
- Supported datasets

## Option 1: Use a Dev or Test Tenant

**Description:** Use an existing development or test tenant (or create a new trial tenant) to enable MGDC and run initial experiments.

**Pros:**
- Smaller datasets reduce MGDC and Azure costs
- Easier to obtain administrative permissions
- Lower operational impact

**Cons:**
- May not reflect production-scale usage patterns
- Some SharePoint features or integrations may be missing
- Requires simulated user activity to generate meaningful data
- Trial tenants are time-limited

**Learn more:**
- Microsoft 365 Trial Options
- Azure Trial Options

## Option 2: Start with the SharePoint Sites Dataset

**Description:** The Sites dataset is typically the smallest MGDC dataset for SharePoint and provides tenant-wide metadata for all site collections.
**Pros:**
- Lower cost compared to the Files or Permissions datasets
- Provides organization-wide coverage
- Minimal MGDC configuration beyond standard onboarding
- Small dataset can be handled directly by a variety of analysis tools

**Cons:**
- Does not include permission or file details
- Limited insight compared to full datasets

**Learn more:**
- How can I estimate my Azure bill?
- Updated! Gather a detailed dataset on SharePoint Sites

## Option 3: Sample a Limited Number of Rows

**Description:** Some MGDC SharePoint datasets support returning only a subset of rows in query results. This is supported across the top 5 SharePoint datasets in MGDC (Sites, Permissions, Groups, Files and File Actions).

**Pros:**
- Minimal and predictable extraction cost
- Enables rapid schema inspection
- Provides the total dataset row count in the request metadata

**Cons:**
- Rows are not returned in a predictable order
- The sample is not randomized; it is not reproducible and could be biased
- Results should not be used to draw tenant-level conclusions

**Learn more:**
- How can I sample or estimate the number of objects in a dataset?

## Option 4: Filter by SiteId

**Description:** Because SharePoint data is partitioned by site collection, MGDC filtering allows you to extract data from a single site or a small group of representative sites. This supports the Sites, Permissions, Groups, Files and File Actions datasets.

**Pros:**
- Enables realistic workload simulation
- Reduces total extraction volume
- Simplifies downstream reporting

**Cons:**
- May introduce sampling bias
- Not suitable for tenant-wide reporting

**Learn more:**
- How can I filter rows on a dataset?

## Option 5: Filter by TemplateId

**Description:** Instead of selecting individual sites, filter by site template to isolate specific workloads. For example, you could filter for OneDrives or SharePoint Embedded.

**Pros:**
- Consistent dataset scope
- Useful for workload-specific analysis

**Cons:**
- Limited dataset support (supported only for Sites, Files and File Actions)
- May not reflect cross-workload usage patterns

**Learn more:**
- How can I filter rows on a dataset?
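Site-scoped extractions like these typically land in your storage account as line-delimited JSON. As a quick PoC aid, the sketch below verifies (or further narrows) the scope of an extracted file on the client side. It is illustrative only, not the MGDC filter mechanism itself: the GUIDs are placeholders, and it assumes each row carries the site collection GUID in an `Id` property (Sites dataset) or a `SiteId` property (the other datasets).

```python
import json

# Placeholder GUIDs for the representative site collections chosen for the PoC.
REPRESENTATIVE_SITES = {
    "b50f9465-0d44-4cad-9b5c-000000000001",  # hypothetical large team site
    "b50f9465-0d44-4cad-9b5c-000000000002",  # hypothetical typical OneDrive
}

def scope_rows(lines, site_key="Id"):
    """Yield only the rows that belong to the representative site collections.

    lines    -- an iterable of JSON strings, one dataset row per line
    site_key -- property holding the site GUID ("Id" for Sites, "SiteId" otherwise)
    """
    for line in lines:
        row = json.loads(line)
        if row.get(site_key) in REPRESENTATIVE_SITES:
            yield row
```

For example, `scope_rows(open("sites.json"))` would stream only the rows for the chosen sites into whatever analysis tool you are testing.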
## Option 6: Use Delta State Datasets

**Description:** Delta datasets allow you to retrieve only the changes since your last data transfer for supported SharePoint State datasets.

**Pros:**
- Enables recurring analytics with lower extraction costs
- Supports daily or weekly trend analysis
- Reduces data movement after the initial ingestion

**Cons:**
- Requires an initial full dataset pull
- Adds complexity to downstream merge processing

**Learn more:**
- How can I use Delta State Datasets?
- How do I process Deltas?

## Summary

MGDC for SharePoint provides multiple approaches to extract targeted subsets of tenant data, allowing teams to:

- Run proof-of-concept deployments
- Validate analytics pipelines
- Test governance or migration scenarios
- Estimate ongoing MGDC and Azure costs

By selecting the right combination of dataset scope, filtering strategy, sampling method or delta tracking, you can balance cost, representativeness, and implementation effort before scaling to a full production deployment.

For additional guidance on MGDC for SharePoint, visit SharePoint Data in MGDC.

# Capacity Template v2 with Microsoft Fabric
## 1. Capacity Scenario

One of the most common scenarios for Microsoft Graph Data Connect (MGDC) for SharePoint is Capacity. This scenario focuses on identifying which sites and files are using the most storage, along with understanding the distribution of these large sites and files by properties like type and age. The MGDC datasets for this scenario are SharePoint Sites and SharePoint Files. If you're not familiar with these datasets, you can find details in the schema definitions at https://aka.ms/SharePointDatasets.

To assist you in using these datasets, the team has developed a Capacity Template. Initially published as a template for Azure Synapse, we now have a new Microsoft Fabric template that is simpler and offers more features. This SharePoint Capacity v2 Template, based on Microsoft Fabric, is now publicly available.

## 2. Instructions

The template comes with a set of detailed instructions at https://aka.ms/fabriccapacitytemplatesteps. These instructions include:

- How to install the Microsoft Fabric and Microsoft Graph Data Connect prerequisites
- How to import the pipeline template from the Microsoft Fabric gallery and set it up
- How to import the Power BI template and configure the data source settings

See below for some additional details about the template.

## 3. Microsoft Fabric Pipeline

After you import the pipeline template, it will look like this:

## 4. Pipeline in Microsoft Fabric

The Capacity template for Microsoft Fabric includes a few key improvements:

- The new template uses delta datasets to update the SharePoint Sites and SharePoint Files datasets. It keeps track of the last time the datasets were pulled by this pipeline, requesting just what changed since then.
- The new template uses views to do calculations and create new properties like size bands or date bands. In our previous template, this was done in Power Query when importing into Power BI.
- The new template also uses a view to aggregate file data, grouping the data by file extension.
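The delta-based refresh mentioned above amounts to a keyed merge: keep the prior snapshot keyed by site Id and apply each delta row on top of it. A minimal sketch, assuming rows are dicts keyed by `Id` and that a `Deleted` value in the `Operation` column marks removals (the specific `Operation` values are an assumption; verify them against the dataset documentation):

```python
def apply_delta(snapshot, delta_rows):
    """Merge delta rows into a prior snapshot.

    snapshot   -- dict mapping site Id -> latest known row
    delta_rows -- iterable of delta rows, each a dict with "Id" and "Operation"
    Returns a new snapshot dict; the input snapshot is not modified.
    """
    merged = dict(snapshot)  # shallow copy so the prior snapshot stays intact
    for row in delta_rows:
        site_id = row["Id"]
        if row.get("Operation") == "Deleted":
            merged.pop(site_id, None)   # site no longer exists
        else:
            merged[site_id] = row       # created or updated row replaces the old one
    return merged
```

In the Fabric template this merge is expressed in the pipeline rather than in Python, but the logic is the same: deletions drop rows, everything else upserts by Id.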
You can find details on how to find and deploy the Microsoft Fabric template in the instructions (see item 2).

## 5. Microsoft Fabric Report

The typical result from this solution is a set of Power BI dashboards pulled from the Microsoft Fabric data source. Here are some examples:

These dashboards serve as examples or starting points and can be modified as necessary for various visualizations of the data within these datasets. The instructions (see item 2) include details on how to find and deploy a few sample Power BI Capacity templates.

## 6. Conclusion

I hope this provides a good overview of the Capacity template for Microsoft Fabric. You can read more about Microsoft Graph Data Connect for SharePoint at https://aka.ms/SharePointData. There you will find many details, including a list of available datasets, other common scenarios and frequently asked questions.

# Using Copilot (GPT-5) to turn schema into SQL and write queries
This article shows a few ways you can use Copilot to help with Microsoft Graph Data Connect for SharePoint, including turning a dataset schema into a SQL CREATE TABLE statement and writing queries against the resulting tables. Make sure to set Copilot to use GPT-5 for best results.

## Scenario 1

This first scenario simply builds a CREATE TABLE statement from the dataset schema definition in markdown notation. I provide a few instructions on how to handle the flattening of the objects.

**Me:** Given the schema at this URL: https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointsites.md, give me a CREATE TABLE statement for SQL Server. Just include the single SQL command in the output. Each column should have a comment with a short description. Don't include any information about keys or constraints. This dataset is flattened. There is one column in the table for every field, except those of type object. Column names are to be flattened with a prefix. Use an underscore to separate the prefix.
**Copilot:**

```sql
CREATE TABLE SharePointSites (
    ptenant NVARCHAR(255) -- Id of the tenant
    ,Id NVARCHAR(255) -- GUID of the site
    ,Url NVARCHAR(2048) -- URL for the site
    ,ArchiveState NVARCHAR(50) -- Archive state of the site: None, Archiving, Archived, or Reactivating
    ,SiteArchivedBy NVARCHAR(255) -- Who archived the site, null if not archived
    ,SiteArchivedTime DATETIME -- When the site was archived, null if not archived
    ,RootWeb_Id NVARCHAR(255) -- Root web id
    ,RootWeb_Title NVARCHAR(512) -- Root web title
    ,RootWeb_WebTemplate NVARCHAR(255) -- Root web template name
    ,RootWeb_WebTemplateId INT -- Root web template id
    ,RootWeb_Configuration INT -- Root web template configuration id
    ,RootWeb_LastItemModifiedDate DATETIME -- Date when an item in the root web was last modified
    ,WebCount BIGINT -- Number of webs (subsites) in the site
    ,StorageQuota BIGINT -- Total storage in bytes allowed for this site
    ,StorageUsed BIGINT -- Total storage in bytes used by this site
    ,StorageMetrics_MetadataSize BIGINT -- Total metadata size for the site in bytes
    ,StorageMetrics_TotalFileCount BIGINT -- Total number of files for the site
    ,StorageMetrics_TotalFileStreamSize BIGINT -- Total size of latest version of files in bytes
    ,StorageMetrics_AdditionalFileStreamSize BIGINT -- Total size of additional file streams in bytes
    ,StorageMetrics_TotalSize BIGINT -- Total size of all files for the site in bytes
    ,GroupId NVARCHAR(255) -- Id of the group associated with this site
    ,GeoLocation NVARCHAR(50) -- Geographic region where the data is stored
    ,IsInRecycleBin BIT -- Indicates if the site is in the recycle bin
    ,RecycleBinItemCount BIGINT -- Number of items in the recycle bin
    ,RecycleBinItemSize BIGINT -- Size of items in the recycle bin
    ,SecondStageRecycleBinStorageUsage BIGINT -- Size of items in the second stage recycle bin
    ,IsTeamsConnectedSite BIT -- Indicates if the site is connected to Teams
    ,IsTeamsChannelSite BIT -- Indicates if the site is a channel site
    ,TeamsChannelType NVARCHAR(50) -- Type of channel if site is a Teams channel site
    ,IsHubSite BIT -- Indicates if the site is associated with a hub site
    ,HubSiteId NVARCHAR(255) -- Id of the hub site for this site
    ,IsCommunicationSite BIT -- Indicates if the site is a communication site
    ,IsOneDrive BIT -- Indicates if the site is a OneDrive
    ,BlockAccessFromUnmanagedDevices BIT -- Site blocks access from unmanaged devices
    ,BlockDownloadOfAllFilesOnUnmanagedDevices BIT -- Site blocks download of all files from unmanaged devices
    ,BlockDownloadOfViewableFilesOnUnmanagedDevices BIT -- Site blocks download of viewable files from unmanaged devices
    ,ShareByEmailEnabled BIT -- Site enables share by e-mail
    ,ShareByLinkEnabled BIT -- Site enables share by link
    ,IsExternalSharingEnabled BIT -- Indicates if external sharing is enabled
    ,SiteConnectedToPrivateGroup BIT -- Indicates if site is connected to a Private Group
    ,Privacy NVARCHAR(50) -- Privacy of the site: Private or Public
    ,SensitivityLabelInfo_Id NVARCHAR(255) -- Id of the Sensitivity Label for the site
    ,SensitivityLabelInfo_DisplayName NVARCHAR(255) -- Display name of the Sensitivity Label for the site
    ,Classification NVARCHAR(255) -- Classification of the site
    ,IBMode NVARCHAR(50) -- Information Barriers Mode
    ,IBSegments NVARCHAR(MAX) -- List of organization segments if IB mode is Explicit
    ,Owner_AadObjectId NVARCHAR(255) -- AAD Object Id of the owner
    ,Owner_Email NVARCHAR(320) -- Email of the owner
    ,Owner_UPN NVARCHAR(320) -- User Principal Name of the owner
    ,Owner_Name NVARCHAR(255) -- Name of the owner
    ,SecondaryContact_AadObjectId NVARCHAR(255) -- AAD Object Id of the secondary contact
    ,SecondaryContact_Email NVARCHAR(320) -- Email of the secondary contact
    ,SecondaryContact_UPN NVARCHAR(320) -- User Principal Name of the secondary contact
    ,SecondaryContact_Name NVARCHAR(255) -- Name of the secondary contact
    ,ReadLocked BIT -- Whether the site is locked for read access
    ,ReadOnly BIT -- Whether the site is in read-only mode
    ,CreatedTime DATETIME -- When the site was created (UTC)
    ,LastSecurityModifiedDate DATETIME -- When security on the site was last changed (UTC)
    ,LastUserAccessDate DATETIME -- Last access by a real user (UTC)
    ,LastContentChange DATETIME -- When the site contents were last changed (UTC)
    ,SnapshotDate DATETIME -- When this site information was captured (UTC)
    ,Operation NVARCHAR(50) -- Extraction mode of this row
);
```

## Scenario 2

Now I use the CREATE TABLE statement as the base to ask for a SELECT statement.

**Me:** Give me a list of sites that have more than 5GB of files. Give me a single SQL statement using the tables described below.

<create table statements included here>

**Copilot:**

```sql
SELECT
    Id AS SiteId,
    Url AS SiteUrl,
    RootWeb_Title AS SiteTitle,
    StorageMetrics_TotalSize AS TotalSizeInBytes,
    CAST(StorageMetrics_TotalSize / (1024.0 * 1024 * 1024) AS DECIMAL(10,2)) AS TotalSizeInGB
FROM SharePointSites
WHERE StorageMetrics_TotalSize > (5 * 1024 * 1024 * 1024)
ORDER BY StorageMetrics_TotalSize DESC;
```

Note: Interesting choice of columns and ordering, without any specific prompting to that effect.

## Scenario 3

Now for a more challenging scenario using two tables and a more complex requirement.

**Me:** Given the tables below, write a SELECT statement to give me a list of sites with the name of the site, the total storage size and the top 3 file extensions by size.
<create table statements for Sites and Files included here>

**Copilot:**

```sql
WITH ExtensionTotals AS (
    SELECT f.SiteId, f.Extension, SUM(f.SizeInBytes) AS TotalSize
    FROM SharePointFiles f
    WHERE f.Extension IS NOT NULL
    GROUP BY f.SiteId, f.Extension
),
RankedExtensions AS (
    SELECT et.SiteId, et.Extension, et.TotalSize,
        ROW_NUMBER() OVER (PARTITION BY et.SiteId ORDER BY et.TotalSize DESC) AS rn
    FROM ExtensionTotals et
),
TopExtensions AS (
    SELECT SiteId,
        STRING_AGG(Extension + ' (' + CAST(TotalSize AS VARCHAR(20)) + ')', ', ')
            WITHIN GROUP (ORDER BY TotalSize DESC) AS Top3Extensions
    FROM RankedExtensions
    WHERE rn <= 3
    GROUP BY SiteId
)
SELECT
    s.RootWeb_Title AS SiteName,
    s.StorageMetrics_TotalSize AS TotalStorageSize,
    te.Top3Extensions
FROM SharePointSites s
LEFT JOIN TopExtensions te ON s.Id = te.SiteId
ORDER BY s.StorageMetrics_TotalSize DESC;
```

Notes:
- The scenario is nicely handled, using both the Files and Sites tables.
- The complex aggregation is coded correctly.

## Scenario 4

In this one, I skip the step of using a SQL CREATE TABLE statement and go directly from the markdown for the two schemas (Sites and Files) to a SQL SELECT statement. I also let GPT-5 add its comments and notes this time, to make sure it's reasoning correctly...

**Me:** Consider a SQL table for SharePoint Sites based on the schema at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointsites.md - This dataset is flattened. There is one column in the SQL table for every field, except those of type object. Column names are flattened with a prefix. Column names use an underscore to separate the prefix. Also consider a SQL table for SharePoint Files based on the schema at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointfiles.md - Use the same assumptions as the previous one.
Now write a SELECT statement to give me a list of sites with the name of the site, the total storage size and the top 3 file extensions by size.

**Copilot:** Got it — you want to join the flattened SharePoint Sites table with the flattened SharePoint Files table, aggregate storage usage, and then find the top 3 file extensions by size for each site. Here's a SQL example that works with the flattened column naming convention you described:

```sql
WITH FileSizeByExtension AS (
    SELECT f.SiteId, f.Extension, SUM(f.SizeInBytes) AS TotalExtensionSize
    FROM SharePointFiles f
    GROUP BY f.SiteId, f.Extension
),
RankedExtensions AS (
    SELECT SiteId, Extension, TotalExtensionSize,
        ROW_NUMBER() OVER (PARTITION BY SiteId ORDER BY TotalExtensionSize DESC) AS rn
    FROM FileSizeByExtension
),
Top3Extensions AS (
    SELECT SiteId,
        STRING_AGG(Extension, ', ' ORDER BY TotalExtensionSize DESC) AS TopExtensions
    FROM RankedExtensions
    WHERE rn <= 3
    GROUP BY SiteId
)
SELECT
    s.RootWeb_Title AS SiteName,
    s.StorageMetrics_TotalSize AS TotalStorageBytes,
    t.TopExtensions
FROM SharePointSites s
LEFT JOIN Top3Extensions t ON s.Id = t.SiteId
ORDER BY s.StorageMetrics_TotalSize DESC;
```

How it works:
- FileSizeByExtension: groups files by SiteId and Extension to sum their sizes.
- RankedExtensions: uses ROW_NUMBER() to rank extensions by size within each site.
- Top3Extensions: filters to the top 3 per site and concatenates them into a comma-separated list.
- Final SELECT: joins back to the Sites table to get the site name and total storage size.

Notes:
- RootWeb_Title comes from flattening the RootWeb object in the Sites dataset.
- StorageMetrics_TotalSize is the flattened column for total site storage (excluding recycle bin).
- STRING_AGG works in SQL Server 2017+, Azure SQL, and some other RDBMS. If your SQL dialect doesn't support it, you'd need a different aggregation method.
- This query orders sites by total storage size so the largest sites appear first.

Note: Nicely done in one prompt, with the proper explanations.
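Before running a generated query against real MGDC output, it can be worth replaying the pattern locally against a few made-up rows. The sketch below does this with Python's built-in sqlite3 module (window functions need SQLite 3.25+). SQLite has no `STRING_AGG ... WITHIN GROUP`, so the sketch substitutes `GROUP_CONCAT`; the table and column names follow the flattened schema used above, and all data is fabricated for the test.

```python
import sqlite3

# In-memory database with tiny stand-ins for the two flattened MGDC tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SharePointSites (Id TEXT, RootWeb_Title TEXT, StorageMetrics_TotalSize INTEGER);
CREATE TABLE SharePointFiles (SiteId TEXT, Extension TEXT, SizeInBytes INTEGER);
INSERT INTO SharePointSites VALUES ('s1', 'Finance', 600);
INSERT INTO SharePointFiles VALUES
    ('s1', 'pptx', 300), ('s1', 'xlsx', 200), ('s1', 'docx', 80), ('s1', 'txt', 20);
""")

# Same CTE structure as the generated query, with GROUP_CONCAT in place of STRING_AGG.
query = """
WITH ExtensionTotals AS (
    SELECT SiteId, Extension, SUM(SizeInBytes) AS TotalSize
    FROM SharePointFiles GROUP BY SiteId, Extension
),
RankedExtensions AS (
    SELECT SiteId, Extension, TotalSize,
           ROW_NUMBER() OVER (PARTITION BY SiteId ORDER BY TotalSize DESC) AS rn
    FROM ExtensionTotals
),
TopExtensions AS (
    SELECT SiteId, GROUP_CONCAT(Extension, ', ') AS Top3Extensions
    FROM (SELECT * FROM RankedExtensions WHERE rn <= 3 ORDER BY SiteId, TotalSize DESC)
    GROUP BY SiteId
)
SELECT s.RootWeb_Title, s.StorageMetrics_TotalSize, t.Top3Extensions
FROM SharePointSites s LEFT JOIN TopExtensions t ON s.Id = t.SiteId
"""
print(conn.execute(query).fetchall())
```

With the sample rows above, the one returned site should list its three largest extensions (pptx, xlsx, docx) and omit txt, which is a cheap way to confirm the top-3 logic before pointing the real query at production-sized tables.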
## Conclusion

I hope these examples help you see the potential of getting Copilot's help with your queries. This is one more reason to load your MGDC for SharePoint tables directly into Microsoft Fabric, where you can run SQL queries against tables in OneLake...