MGDC schema markdown to CREATE TABLE, plus queries you can use with Fabric
This article shows a few ways you can use Copilot to help with Microsoft Graph Data Connect for SharePoint. This includes turning a schema into a SQL CREATE TABLE statement and summarizing data from that. Make sure to use the Copilot option to use GPT-5 for best results.
Archimedes sculpting SQL in marbleScenario 1
This first scenario just builds a CREATE TABLE statement from the dataset schema definition in markdown notation. I provide a few instructions on how to handle the flattening of the objects.
Me:
Given the schema at this URL: https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointsites.md, give me a CREATE TABLE statement for a SQL Server. Just include the single SQL command in the output. Each column should have a comment with a short description. Don't include any information about keys or constraints. This dataset is flattened. One column in the table for every row, except those of type object. Column names to be flattened with a prefix. Use an underline to separate the prefix.
Copilot:
CREATE TABLE SharePointSites (
ptenant NVARCHAR(255) -- Id of the tenant
,Id NVARCHAR(255) -- GUID of the site
,Url NVARCHAR(2048) -- URL for the site
,ArchiveState NVARCHAR(50) -- Archive state of the site: None, Archiving, Archived, or Reactivating
,SiteArchivedBy NVARCHAR(255) -- Who archived the site, null if not archived
,SiteArchivedTime DATETIME -- When the site was archived, null if not archived
,RootWeb_Id NVARCHAR(255) -- Root web id
,RootWeb_Title NVARCHAR(512) -- Root web title
,RootWeb_WebTemplate NVARCHAR(255) -- Root web template name
,RootWeb_WebTemplateId INT -- Root web template id
,RootWeb_Configuration INT -- Root web template configuration id
,RootWeb_LastItemModifiedDate DATETIME -- Date when an item in the root web was last modified
,WebCount BIGINT -- Number of webs (subsites) in the site
,StorageQuota BIGINT -- Total storage in bytes allowed for this site
,StorageUsed BIGINT -- Total storage in bytes used by this site
,StorageMetrics_MetadataSize BIGINT -- Total metadata size for the site in bytes
,StorageMetrics_TotalFileCount BIGINT -- Total number of files for the site
,StorageMetrics_TotalFileStreamSize BIGINT -- Total size of latest version of files in bytes
,StorageMetrics_AdditionalFileStreamSize BIGINT -- Total size of additional file streams in bytes
,StorageMetrics_TotalSize BIGINT -- Total size of all files for the site in bytes
,GroupId NVARCHAR(255) -- Id of the group associated with this site
,GeoLocation NVARCHAR(50) -- Geographic region where the data is stored
,IsInRecycleBin BIT -- Indicates if the site is in the recycle bin
,RecycleBinItemCount BIGINT -- Number of items in the recycle bin
,RecycleBinItemSize BIGINT -- Size of items in the recycle bin
,SecondStageRecycleBinStorageUsage BIGINT -- Size of items in the second stage recycle bin
,IsTeamsConnectedSite BIT -- Indicates if the site is connected to Teams
,IsTeamsChannelSite BIT -- Indicates if the site is a channel site
,TeamsChannelType NVARCHAR(50) -- Type of channel if site is a Teams channel site
,IsHubSite BIT -- Indicates if the site is associated with a hub site
,HubSiteId NVARCHAR(255) -- Id of the hub site for this site
,IsCommunicationSite BIT -- Indicates if the site is a communication site
,IsOneDrive BIT -- Indicates if the site is a OneDrive
,BlockAccessFromUnmanagedDevices BIT -- Site blocks access from unmanaged devices
,BlockDownloadOfAllFilesOnUnmanagedDevices BIT -- Site blocks download of all files from unmanaged devices
,BlockDownloadOfViewableFilesOnUnmanagedDevices BIT -- Site blocks download of viewable files from unmanaged devices
,ShareByEmailEnabled BIT -- Site enables share by e-mail
,ShareByLinkEnabled BIT -- Site enables share by link
,IsExternalSharingEnabled BIT -- Indicates if external sharing is enabled
,SiteConnectedToPrivateGroup BIT -- Indicates if site is connected to a Private Group
,Privacy NVARCHAR(50) -- Privacy of the site: Private or Public
,SensitivityLabelInfo_Id NVARCHAR(255) -- Id of the Sensitivity Label for the site
,SensitivityLabelInfo_DisplayName NVARCHAR(255) -- Display name of the Sensitivity Label for the site
,Classification NVARCHAR(255) -- Classification of the site
,IBMode NVARCHAR(50) -- Information Barriers Mode
,IBSegments NVARCHAR(MAX) -- List of organization segments if IB mode is Explicit
,Owner_AadObjectId NVARCHAR(255) -- AAD Object Id of the owner
,Owner_Email NVARCHAR(320) -- Email of the owner
,Owner_UPN NVARCHAR(320) -- User Principal Name of the owner
,Owner_Name NVARCHAR(255) -- Name of the owner
,SecondaryContact_AadObjectId NVARCHAR(255) -- AAD Object Id of the secondary contact
,SecondaryContact_Email NVARCHAR(320) -- Email of the secondary contact
,SecondaryContact_UPN NVARCHAR(320) -- User Principal Name of the secondary contact
,SecondaryContact_Name NVARCHAR(255) -- Name of the secondary contact
,ReadLocked BIT -- Whether the site is locked for read access
,ReadOnly BIT -- Whether the site is in read-only mode
,CreatedTime DATETIME -- When the site was created (UTC)
,LastSecurityModifiedDate DATETIME -- When security on the site was last changed (UTC)
,LastUserAccessDate DATETIME -- Last access by a real user (UTC)
,LastContentChange DATETIME -- When the site contents were last changed (UTC)
,SnapshotDate DATETIME -- When this site information was captured (UTC)
,Operation NVARCHAR(50) -- Extraction mode of this row
);
Scenario 2
Now I use the CREATE TABLE as the base to ask for a SELECT statement.
Me:
Give me a list of sites that have more than 5GB of files. Give me a single SQL statement using the tables described below. <create table statements included here>
Copilot:
SELECT
Id AS SiteId,
Url AS SiteUrl,
RootWeb_Title AS SiteTitle,
StorageMetrics_TotalSize AS TotalSizeInBytes,
CAST(StorageMetrics_TotalSize / (1024.0 * 1024 * 1024) AS DECIMAL(10,2)) AS TotalSizeInGB
FROM
SharePointSites
WHERE
StorageMetrics_TotalSize > (5 * 1024 * 1024 * 1024)
ORDER BY
StorageMetrics_TotalSize DESC;
Note: Interesting choice of columns and order, without any specific prompting to that effect.
Scenario 3
Now for a more challenging scenario using two tables and a more complex requirement.
Me:
Given the tables below, write a SELECT statement to give me a list of sites with the name of the site, the total storage size and the top 3 file extensions by size. <create table statements for Sites and Files included here>
Copilot:
WITH ExtensionTotals AS (
SELECT
f.SiteId,
f.Extension,
SUM(f.SizeInBytes) AS TotalSize
FROM SharePointFiles f
WHERE f.Extension IS NOT NULL
GROUP BY f.SiteId, f.Extension
),
RankedExtensions AS (
SELECT
et.SiteId,
et.Extension,
et.TotalSize,
ROW_NUMBER() OVER (PARTITION BY et.SiteId ORDER BY et.TotalSize DESC) AS rn
FROM ExtensionTotals et
),
TopExtensions AS (
SELECT
SiteId,
STRING_AGG(Extension + ' (' + CAST(TotalSize AS VARCHAR(20)) + ')', ', ') WITHIN GROUP (ORDER BY TotalSize DESC) AS Top3Extensions
FROM RankedExtensions
WHERE rn <= 3
GROUP BY SiteId
)
SELECT
s.RootWeb_Title AS SiteName,
s.StorageMetrics_TotalSize AS TotalStorageSize,
te.Top3Extensions
FROM SharePointSites s
LEFT JOIN TopExtensions te ON s.Id = te.SiteId
ORDER BY s.StorageMetrics_TotalSize DESC;
Notes: Scenario is nicely handled, using both the Files and Sites tables. Complex aggregation is coded correctly.
Scenario 4
In this one, I skip the step of using a SQL CREATE TABLE statement and go directly from the markdown for the two schemas (Sites and Files) to a SQL SELECT statement. I also let GPT-5 add his comments and notes this time, to make sure it's thinking correctly...
Me:
Consider a SQL table for SharePoint Sites based on the schema at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointsites.md - This dataset is flattened. There is one column in the SQL table for every row, except those of type object. Column names are flattened with a prefix. Column names use an underline to separate the prefix. Also consider a SQL table for SharePoint Files based on the schema at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointfiles.md - Use the same assumptions as the previous one. Now write a SELECT statement to give me a list of sites with the name of the site, the total storage size and the top 3 file extensions by size.
Copilot:
Got it — you want to join the flattened SharePoint Sites table with the flattened SharePoint Files table, aggregate storage usage, and then find the top 3 file extensions by size for each site. Here’s a SQL example that works with the flattened column naming convention you described:WITH FileSizeByExtension AS ( SELECT f.SiteId, f.Extension, SUM(f.SizeInBytes) AS TotalExtensionSize FROM SharePointFiles f GROUP BY f.SiteId, f.Extension ), RankedExtensions AS ( SELECT SiteId, Extension, TotalExtensionSize, ROW_NUMBER() OVER (PARTITION BY SiteId ORDER BY TotalExtensionSize DESC) AS rn FROM FileSizeByExtension ), Top3Extensions AS ( SELECT SiteId, STRING_AGG(Extension, ', ' ORDER BY TotalExtensionSize DESC) AS TopExtensions FROM RankedExtensions WHERE rn <= 3 GROUP BY SiteId ) SELECT s.RootWeb_Title AS SiteName, s.StorageMetrics_TotalSize AS TotalStorageBytes, t.TopExtensions FROM SharePointSites s LEFT JOIN Top3Extensions t ON s.Id = t.SiteId ORDER BY s.StorageMetrics_TotalSize DESC;
How it works
FileSizeByExtension: Groups files by SiteId and Extension to sum their sizes.
RankedExtensions: Uses ROW_NUMBER() to rank extensions by size within each site.
Top3Extensions: Filters to the top 3 per site and concatenates them into a comma-separated list.
Final SELECT: Joins back to the Sites table to get the site name and total storage size.
Notes
RootWeb_Title comes from flattening the RootWeb object in the Sites dataset.
StorageMetrics_TotalSize is the flattened column for total site storage (excluding recycle bin).
STRING_AGG works in SQL Server 2017+, Azure SQL, and some other RDBMS.
If your SQL dialect doesn’t support it, you’d need a different aggregation method.
This query orders sites by total storage size so the largest sites appear first.
Note: Nicely done in one prompt with the proper explanations.
Conclusion
I hope these examples help you see the potential of getting Copilot's help with your queries. This is one more reason to load your MGDC for SharePoint tables directly into Microsoft Fabric, where you can run SQL queries against tables in the OneLake...