Capacity Template v2 with Microsoft Fabric
1. Capacity Scenario

One of the most common scenarios for Microsoft Graph Data Connect (MGDC) for SharePoint is Capacity. This scenario focuses on identifying which sites and files are using the most storage, along with understanding the distribution of these large sites and files by properties like type and age. The MGDC datasets for this scenario are SharePoint Sites and SharePoint Files. If you’re not familiar with these datasets, you can find details in the schema definitions at https://aka.ms/SharePointDatasets.

To assist you in using these datasets, the team has developed a Capacity Template. Initially published as a template for Azure Synapse, it is now also available as a new Microsoft Fabric template that is simpler and offers more features. This SharePoint Capacity v2 Template, based on Microsoft Fabric, is now publicly available.

2. Instructions

The template comes with a set of detailed instructions at https://aka.ms/fabriccapacitytemplatesteps. These instructions include:
- How to install the Microsoft Fabric and Microsoft Graph Data Connect prerequisites
- How to import the pipeline template from the Microsoft Fabric gallery and set it up
- How to import the Power BI template and configure the data source settings

The sections below add some additional details about the template.

3. Microsoft Fabric Pipeline

After you import the pipeline template, it will look like this:

4. Pipeline in Microsoft Fabric

The Capacity template for Microsoft Fabric includes a few key improvements:
- The new template uses delta datasets to update the SharePoint Sites and SharePoint Files datasets. It keeps track of the last time the datasets were pulled by this pipeline, requesting just what changed since then.
- The new template uses views to do calculations and create new properties like size bands or date bands. In the previous template, this was done in Power Query when importing into Power BI. (A sketch of this kind of view appears at the end of this article.)
- The new template also uses a view to aggregate file data, grouping the data by file extension.

You can find details on how to find and deploy the Microsoft Fabric template in the instructions (see item 2).

5. Microsoft Fabric Report

The typical result from this solution is a set of Power BI dashboards pulled from the Microsoft Fabric data source. Here are some examples:

These dashboards serve as examples or starting points and can be modified as necessary for various visualizations of the data within these datasets. The instructions (see item 2) include details on how to find and deploy a few sample Power BI Capacity templates.

6. Conclusion

I hope this provides a good overview of the Capacity template for Microsoft Fabric. You can read more about Microsoft Graph Data Connect for SharePoint at https://aka.ms/SharePointData. There you will find many details, including a list of the datasets available, other common scenarios and frequently asked questions.
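As mentioned in item 4, the template uses views to add computed properties like size bands. Here is a rough, simplified sketch of what such a view can look like. It is not the exact view shipped in the template: the view name, the band boundaries and the assumption of a final table named Sites with a StorageUsed column in bytes are all illustrative, so adjust them to your own Lakehouse.

CREATE OR ALTER VIEW vwSitesWithSizeBands AS
SELECT *,
    StorageUsedBand = CASE                              -- illustrative band boundaries
        WHEN StorageUsed < 1048576      THEN 'Under 1 MB'
        WHEN StorageUsed < 1073741824   THEN '1 MB to 1 GB'
        WHEN StorageUsed < 10737418240  THEN '1 GB to 10 GB'
        ELSE 'Over 10 GB'
    END
FROM Sites;

With a band column available in the view, the Power BI report can group or slice by it directly, instead of recomputing the bands in Power Query.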
Using Copilot (GPT-5) to turn schema into SQL and write queries

This article shows a few ways you can use Copilot to help with Microsoft Graph Data Connect for SharePoint. This includes turning a dataset schema into a SQL CREATE TABLE statement and then writing queries against the resulting tables. Make sure to use the Copilot option to use GPT-5 for best results.

Scenario 1

This first scenario simply builds a CREATE TABLE statement from the dataset schema definition in markdown notation. I provide a few instructions on how to handle the flattening of the objects.

Me: Given the schema at this URL: https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointsites.md, give me a CREATE TABLE statement for a SQL Server. Just include the single SQL command in the output. Each column should have a comment with a short description. Don't include any information about keys or constraints. This dataset is flattened. One column in the table for every row, except those of type object. Column names to be flattened with a prefix. Use an underline to separate the prefix.

Copilot:

CREATE TABLE SharePointSites (
     ptenant NVARCHAR(255) -- Id of the tenant
    ,Id NVARCHAR(255) -- GUID of the site
    ,Url NVARCHAR(2048) -- URL for the site
    ,ArchiveState NVARCHAR(50) -- Archive state of the site: None, Archiving, Archived, or Reactivating
    ,SiteArchivedBy NVARCHAR(255) -- Who archived the site, null if not archived
    ,SiteArchivedTime DATETIME -- When the site was archived, null if not archived
    ,RootWeb_Id NVARCHAR(255) -- Root web id
    ,RootWeb_Title NVARCHAR(512) -- Root web title
    ,RootWeb_WebTemplate NVARCHAR(255) -- Root web template name
    ,RootWeb_WebTemplateId INT -- Root web template id
    ,RootWeb_Configuration INT -- Root web template configuration id
    ,RootWeb_LastItemModifiedDate DATETIME -- Date when an item in the root web was last modified
    ,WebCount BIGINT -- Number of webs (subsites) in the site
    ,StorageQuota BIGINT -- Total storage in bytes allowed for this site
    ,StorageUsed BIGINT -- Total storage in bytes used by this site
    ,StorageMetrics_MetadataSize BIGINT -- Total metadata size for the site in bytes
    ,StorageMetrics_TotalFileCount BIGINT -- Total number of files for the site
    ,StorageMetrics_TotalFileStreamSize BIGINT -- Total size of latest version of files in bytes
    ,StorageMetrics_AdditionalFileStreamSize BIGINT -- Total size of additional file streams in bytes
    ,StorageMetrics_TotalSize BIGINT -- Total size of all files for the site in bytes
    ,GroupId NVARCHAR(255) -- Id of the group associated with this site
    ,GeoLocation NVARCHAR(50) -- Geographic region where the data is stored
    ,IsInRecycleBin BIT -- Indicates if the site is in the recycle bin
    ,RecycleBinItemCount BIGINT -- Number of items in the recycle bin
    ,RecycleBinItemSize BIGINT -- Size of items in the recycle bin
    ,SecondStageRecycleBinStorageUsage BIGINT -- Size of items in the second stage recycle bin
    ,IsTeamsConnectedSite BIT -- Indicates if the site is connected to Teams
    ,IsTeamsChannelSite BIT -- Indicates if the site is a channel site
    ,TeamsChannelType NVARCHAR(50) -- Type of channel if site is a Teams channel site
    ,IsHubSite BIT -- Indicates if the site is associated with a hub site
    ,HubSiteId NVARCHAR(255) -- Id of the hub site for this site
    ,IsCommunicationSite BIT -- Indicates if the site is a communication site
    ,IsOneDrive BIT -- Indicates if the site is a OneDrive
    ,BlockAccessFromUnmanagedDevices BIT -- Site blocks access from unmanaged devices
    ,BlockDownloadOfAllFilesOnUnmanagedDevices BIT -- Site blocks download of all files from unmanaged devices
    ,BlockDownloadOfViewableFilesOnUnmanagedDevices BIT -- Site blocks download of viewable files from unmanaged devices
    ,ShareByEmailEnabled BIT -- Site enables share by e-mail
    ,ShareByLinkEnabled BIT -- Site enables share by link
    ,IsExternalSharingEnabled BIT -- Indicates if external sharing is enabled
    ,SiteConnectedToPrivateGroup BIT -- Indicates if site is connected to a Private Group
    ,Privacy NVARCHAR(50) -- Privacy of the site: Private or Public
    ,SensitivityLabelInfo_Id NVARCHAR(255) -- Id of the Sensitivity Label for the site
    ,SensitivityLabelInfo_DisplayName NVARCHAR(255) -- Display name of the Sensitivity Label for the site
    ,Classification NVARCHAR(255) -- Classification of the site
    ,IBMode NVARCHAR(50) -- Information Barriers Mode
    ,IBSegments NVARCHAR(MAX) -- List of organization segments if IB mode is Explicit
    ,Owner_AadObjectId NVARCHAR(255) -- AAD Object Id of the owner
    ,Owner_Email NVARCHAR(320) -- Email of the owner
    ,Owner_UPN NVARCHAR(320) -- User Principal Name of the owner
    ,Owner_Name NVARCHAR(255) -- Name of the owner
    ,SecondaryContact_AadObjectId NVARCHAR(255) -- AAD Object Id of the secondary contact
    ,SecondaryContact_Email NVARCHAR(320) -- Email of the secondary contact
    ,SecondaryContact_UPN NVARCHAR(320) -- User Principal Name of the secondary contact
    ,SecondaryContact_Name NVARCHAR(255) -- Name of the secondary contact
    ,ReadLocked BIT -- Whether the site is locked for read access
    ,ReadOnly BIT -- Whether the site is in read-only mode
    ,CreatedTime DATETIME -- When the site was created (UTC)
    ,LastSecurityModifiedDate DATETIME -- When security on the site was last changed (UTC)
    ,LastUserAccessDate DATETIME -- Last access by a real user (UTC)
    ,LastContentChange DATETIME -- When the site contents were last changed (UTC)
    ,SnapshotDate DATETIME -- When this site information was captured (UTC)
    ,Operation NVARCHAR(50) -- Extraction mode of this row
);

Scenario 2

Now I use the CREATE TABLE as the base to ask for a SELECT statement.

Me: Give me a list of sites that have more than 5GB of files. Give me a single SQL statement using the tables described below. <create table statements included here>

Copilot:

SELECT
    Id AS SiteId,
    Url AS SiteUrl,
    RootWeb_Title AS SiteTitle,
    StorageMetrics_TotalSize AS TotalSizeInBytes,
    CAST(StorageMetrics_TotalSize / (1024.0 * 1024 * 1024) AS DECIMAL(10,2)) AS TotalSizeInGB
FROM SharePointSites
WHERE StorageMetrics_TotalSize > (5 * 1024 * 1024 * 1024)
ORDER BY StorageMetrics_TotalSize DESC;

Note: Interesting choice of columns and order, without any specific prompting to that effect.

Scenario 3

Now for a more challenging scenario, using two tables and a more complex requirement.

Me: Given the tables below, write a SELECT statement to give me a list of sites with the name of the site, the total storage size and the top 3 file extensions by size. <create table statements for Sites and Files included here>

Copilot:

WITH ExtensionTotals AS (
    SELECT f.SiteId, f.Extension, SUM(f.SizeInBytes) AS TotalSize
    FROM SharePointFiles f
    WHERE f.Extension IS NOT NULL
    GROUP BY f.SiteId, f.Extension
),
RankedExtensions AS (
    SELECT et.SiteId, et.Extension, et.TotalSize,
           ROW_NUMBER() OVER (PARTITION BY et.SiteId ORDER BY et.TotalSize DESC) AS rn
    FROM ExtensionTotals et
),
TopExtensions AS (
    SELECT SiteId,
           STRING_AGG(Extension + ' (' + CAST(TotalSize AS VARCHAR(20)) + ')', ', ')
               WITHIN GROUP (ORDER BY TotalSize DESC) AS Top3Extensions
    FROM RankedExtensions
    WHERE rn <= 3
    GROUP BY SiteId
)
SELECT
    s.RootWeb_Title AS SiteName,
    s.StorageMetrics_TotalSize AS TotalStorageSize,
    te.Top3Extensions
FROM SharePointSites s
LEFT JOIN TopExtensions te ON s.Id = te.SiteId
ORDER BY s.StorageMetrics_TotalSize DESC;

Notes: The scenario is nicely handled, using both the Files and Sites tables. The complex aggregation is coded correctly.

Scenario 4

In this one, I skip the step of using a SQL CREATE TABLE statement and go directly from the markdown for the two schemas (Sites and Files) to a SQL SELECT statement. I also let GPT-5 add its comments and notes this time, to make sure it's thinking correctly...

Me: Consider a SQL table for SharePoint Sites based on the schema at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointsites.md - This dataset is flattened. There is one column in the SQL table for every row, except those of type object. Column names are flattened with a prefix. Column names use an underline to separate the prefix. Also consider a SQL table for SharePoint Files based on the schema at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointfiles.md - Use the same assumptions as the previous one. Now write a SELECT statement to give me a list of sites with the name of the site, the total storage size and the top 3 file extensions by size.

Copilot: Got it — you want to join the flattened SharePoint Sites table with the flattened SharePoint Files table, aggregate storage usage, and then find the top 3 file extensions by size for each site. Here’s a SQL example that works with the flattened column naming convention you described:

WITH FileSizeByExtension AS (
    SELECT f.SiteId, f.Extension, SUM(f.SizeInBytes) AS TotalExtensionSize
    FROM SharePointFiles f
    GROUP BY f.SiteId, f.Extension
),
RankedExtensions AS (
    SELECT SiteId, Extension, TotalExtensionSize,
           ROW_NUMBER() OVER (PARTITION BY SiteId ORDER BY TotalExtensionSize DESC) AS rn
    FROM FileSizeByExtension
),
Top3Extensions AS (
    SELECT SiteId,
           STRING_AGG(Extension, ', ' ORDER BY TotalExtensionSize DESC) AS TopExtensions
    FROM RankedExtensions
    WHERE rn <= 3
    GROUP BY SiteId
)
SELECT
    s.RootWeb_Title AS SiteName,
    s.StorageMetrics_TotalSize AS TotalStorageBytes,
    t.TopExtensions
FROM SharePointSites s
LEFT JOIN Top3Extensions t ON s.Id = t.SiteId
ORDER BY s.StorageMetrics_TotalSize DESC;

How it works
- FileSizeByExtension: Groups files by SiteId and Extension to sum their sizes.
- RankedExtensions: Uses ROW_NUMBER() to rank extensions by size within each site.
- Top3Extensions: Filters to the top 3 per site and concatenates them into a comma-separated list.
- Final SELECT: Joins back to the Sites table to get the site name and total storage size.

Notes
- RootWeb_Title comes from flattening the RootWeb object in the Sites dataset.
- StorageMetrics_TotalSize is the flattened column for total site storage (excluding recycle bin).
- STRING_AGG works in SQL Server 2017+, Azure SQL, and some other RDBMS. If your SQL dialect doesn’t support it, you’d need a different aggregation method.
- This query orders sites by total storage size so the largest sites appear first.

Note: Nicely done in one prompt, with the proper explanations.

Conclusion

I hope these examples help you see the potential of getting Copilot's help with your queries. This is one more reason to load your MGDC for SharePoint tables directly into Microsoft Fabric, where you can run SQL queries against tables in the OneLake...
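To try queries like these end to end, you can paste them into the SQL analytics endpoint of the Lakehouse where your MGDC tables were loaded. Below is a minimal sketch based on the Scenario 2 query; the table name SharePointSites is an assumption, so use whatever name your pipeline created. One detail worth noting: depending on the SQL dialect, a literal like 5 * 1024 * 1024 * 1024 may be evaluated with 32-bit integer arithmetic and overflow, so casting one factor to BIGINT is a safe habit.

-- Sketch: sites over 5 GB, run against the Lakehouse SQL analytics endpoint
-- (the table name SharePointSites is illustrative)
SELECT
    Id AS SiteId,
    Url AS SiteUrl,
    RootWeb_Title AS SiteTitle,
    StorageMetrics_TotalSize AS TotalSizeInBytes
FROM SharePointSites
WHERE StorageMetrics_TotalSize > CAST(5 AS BIGINT) * 1024 * 1024 * 1024   -- 5 GB, forced to BIGINT
ORDER BY StorageMetrics_TotalSize DESC;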
Book of Archimedes, Copilot and SharePoint Agents

1. Introduction

The Book of Archimedes is a single PDF that contains all posts in the Microsoft Graph Data Connect for SharePoint blog. It includes 40 articles and nearly 400 pages, which makes it handy for quickly searching for specific keywords across the many articles. The PDF is provided as an attachment to the main MGDC for SharePoint article at https://aka.ms/SharePointData. You can also download it directly from the Book of Archimedes link.

2. Using the Book with Copilot

Beyond searching inside the PDF, you can feed the Book of Archimedes as grounding to your favorite AI tool to get answers to your questions about MGDC for SharePoint. Let's look at an example. You can use this with Copilot or any AI tool that accepts attached files. In my case, I used https://copilot.cloud.microsoft/ to access my work Copilot, the secure and compliant Copilot integrated with my enterprise account.

I started with the “+” button and then the “add content” button. Then I used the up-arrow icon to “upload from this device” and selected the local copy of the Book of Archimedes I had downloaded earlier. After that, I added my question as a prompt. I tried “What is Project Archimedes?” and got a summary of Microsoft Graph Data Connect for SharePoint, including key scenarios. From there, I could ask any question. If the book contains the answer, Copilot will find it for you.

Project Archimedes is the early codename for Microsoft Graph Data Connect for SharePoint. This project aims to deliver rich data assets to OneDrive and SharePoint tenants, enabling them to run their own analytics, derive insights from their data, and understand how they use these products. The data is transferred to an Azure account owned by the tenant, where tools like Azure Synapse, Power BI, or Microsoft Fabric can be used to transform this data into insightful reports and dashboards [1].

Key Scenarios of Project Archimedes:
- Security: Understanding permissions in SharePoint and OneDrive tenants to ensure content is properly protected.
- Capacity: Understanding site lifecycle, ownership, and storage used by SharePoint sites and OneDrives.
- Sync Health: Ensuring users are properly using OneDrive for Business to protect their files by synchronizing them with the cloud [1].

Datasets Provided:
- Sites: Information about SharePoint sites.
- Groups: Details about SharePoint groups.
- Permissions: Information about sharing permissions.
- Files: Data about files stored in SharePoint and OneDrive.
- File Actions: Details about actions performed on files.
- Sync Health: Information about devices running OneDrive Sync.
- Sync Errors: Details about errors encountered by OneDrive Sync clients [1].

Example Use Cases:
- Security Scenario: Answering questions about oversharing, external sharing, sensitive data sharing, and permissions.
- Capacity Scenario: Tracking storage usage, site lifecycle, and identifying large or inactive sites.
- Sync Health Scenario: Monitoring device health, folder backup status, and sync errors [1].

For more detailed information, you can refer to the Book of Archimedes, which contains comprehensive guides, FAQs, and step-by-step instructions related to Microsoft Graph Data Connect for SharePoint [1].

[1]: Book of Archimedes.pdf

The more impressive capabilities of Copilot show up when you ask for something that is not just a direct quote from the Book. For instance, I asked about pricing for a specific scenario: "If a tenant has 1,000 SharePoint sites, how much would it cost to get the SharePoint Sites and SharePoint permissions for that tenant using MGDC?"

To estimate the cost of getting the SharePoint Sites and SharePoint Permissions datasets for a tenant using Microsoft Graph Data Connect (MGDC), you can follow these guidelines:

Pricing Information:
- SharePoint Sites and Permissions: The price to pull 1,000 objects from the available MGDC for SharePoint datasets (Sites, Groups, and Permissions) in the US is $0.75, plus the cost for infrastructure like Azure Storage, Azure Data Factory, or Azure Synapse [1].
- SharePoint Files: This dataset is billed at $0.75 per 50,000 objects due to its typically high volume [1].

Estimating Costs:
- Number of Objects:
  - Sites: The number of objects in your SharePoint Sites dataset will be equal to the number of sites in your tenant.
  - Permissions: The average number of SharePoint permissions per site is around 61 [1].
- Example Calculation, for a tenant with 1,000 sites:
  - Sites Dataset Cost: 1,000 objects at $0.75 per 1,000 objects = $0.75
  - Permissions Dataset Cost: 1,000 sites x 61 permissions per site = about 61,000 objects, at $0.75 per 1,000 objects = $45.75
  - Total Cost: $0.75 + $45.75 = $46.50
- Additional Costs:
  - Azure Storage: Costs for storing the data in Azure Storage.
  - Azure Synapse: Costs for running pipelines to move the data.

For more precise estimates, you can sample the dataset to get a total object count without pulling the entire dataset [1].

[1]: Book of Archimedes.pdf

Copilot will also answer questions in other languages. You just ask the question in whatever language you want, and it gets detected. Here’s an example using Brazilian Portuguese. I asked "Quais são os datasets que fazem parte do cenário de segurança?" ("Which datasets are part of the security scenario?") and got this answer, also in Portuguese:

Os datasets que fazem parte do cenário de segurança no Project Archimedes são:
- Sites: Informações sobre os sites do SharePoint.
- Permissions: Informações sobre as permissões de compartilhamento.
- Groups: Detalhes sobre os grupos do SharePoint.

Esses datasets ajudam a entender se o conteúdo está devidamente protegido, respondendo a perguntas como:
- Está ocorrendo compartilhamento excessivo?
- Está ocorrendo compartilhamento externo?
- Dados sensíveis estão sendo compartilhados?
- Quanto compartilhamento ocorre por rótulo de sensibilidade?
- Dados sensíveis são compartilhados com usuários externos?
- Quais domínios externos estão sendo compartilhados?
- Quais sites foram mais compartilhados?
- Quais níveis de compartilhamento estão sendo usados?
- Quais permissões um usuário específico possui?
- Quais extensões de arquivo são mais compartilhadas?
- Quanto compartilhamento ocorre no nível de Web, Pasta, Lista ou Arquivo?

Esses datasets são fundamentais para garantir a segurança e a conformidade dos dados dentro do ambiente do SharePoint e OneDrive [1].

[1]: Book of Archimedes.pdf

(In short, the answer lists the Sites, Permissions and Groups datasets and the oversharing and external sharing questions they help answer.)

3. Using the Book to create a SharePoint agent

You can also use the book to create a SharePoint agent. This makes it easy to ask questions about Microsoft Graph Data Connect for SharePoint without having to upload the PDF every time. To start, upload the Book of Archimedes to a SharePoint document library. Next, select the file and use the option to “Create an agent”. That’s it. Your agent is created, and you can start using it immediately.

The agent shows up in the document library as a file with the “.agent” extension. You can simply open that file in SharePoint to run the agent. Once you open the agent, it offers you a text box to ask questions. Here is the agent response to “What are the main questions about SharePoint capacity that MGDC for SharePoint can answer?”

The Book of Archimedes provides detailed answers to several key questions about SharePoint capacity using Microsoft Graph Data Connect (MGDC).
Here are the main questions it addresses:

Key Questions about SharePoint Capacity:
- How much storage is being used by SharePoint sites? This includes understanding the total size of the latest versions of files, previous versions, metadata size, and deleted files in the recycle bin [1].
- What are the different storage sizes reported for a SharePoint site?
  - Storage Used: Includes all storage, metadata, previous versions, and recycle bins.
  - Total Size: Includes metadata and previous versions but excludes recycle bins.
  - Metadata Size: Storage used by file metadata.
  - Total File Stream Size: Size of the latest versions of files.
  - Previous Versions Size: Size of previous versions of files.
  - Recycle Bin Size: Size of deleted files in the recycle bins [1].
- How can administrators check the amount of storage being used by SharePoint sites?
  - SharePoint client UX: Viewing storage while using SharePoint as a site user or Site Collection administrator.
  - PowerShell cmdlets: Using command-line tools to query site storage information.
  - SharePoint Admin Center reports: Viewing usage reports in the Admin Center.
  - Microsoft Graph Data Connect: Pulling detailed datasets for SharePoint sites and OneDrives [1].
- How can administrators create test sites to understand storage reporting tools? Creating different types of test sites (empty site, site with files, site with deleted files, site with previous versions) and analyzing how storage sizes are reported [1].
- How can administrators estimate Azure costs for MGDC usage? Estimating the number of objects (sites, groups, permissions, files) and understanding the billing rates for different datasets [1].
- How can administrators use MGDC to run detailed analysis of SharePoint and OneDrive storage metrics? Using detailed data in MGDC to analyze storage metrics and understand the impact of metadata, versions, and deleted files [1].

Tools and Templates:
- Capacity Template: Includes instructions, ARM template, Azure Synapse pipeline template, and Power BI template to help administrators use MGDC datasets for capacity analysis [1].

Conclusion:
The Book of Archimedes provides comprehensive guidance on leveraging MGDC to gather, store, and analyze SharePoint data, ensuring effective data management and security. It is a valuable resource for administrators and developers working with SharePoint and OneDrive data using MGDC [1].

Would you like more specific information on any of these questions or tools?

[1]: <File>Book of Archimedes</File>

As we did with Copilot, you can ask any question on a topic addressed in the book. You can also use languages other than English.

You can experiment with editing the agent to rename it, change the description, add more sources, adjust its behavior, offer sample starter prompts, include specific instructions or even customize it with Copilot Studio. Just use the option to "edit the agent". You can also share the agent just like you share any file in SharePoint. You can learn more about SharePoint agents in this YouTube video by Vesa Juvonen: https://www.youtube.com/watch?v=pYvK76SVEv4

4. Conclusion

I hope you enjoy using the Book of Archimedes as grounding for your AI tools. As usual, remember that AI tools may make mistakes and you should always double-check the answers you receive. There is also a chance that a particular AI tool might not be able to handle a large document like the Book of Archimedes, which currently weighs in at around 10MB.

Understanding the Notebooks in the Oversharing Template v2 (Microsoft Fabric)

Introduction

The Microsoft Graph Data Connect for SharePoint team published two notebooks used with Microsoft Fabric in the Information Oversharing v2 template. This blog explains what each code block inside these notebooks does, so you can understand them before customizing them. Note that this document was written with help from Copilot, using simple prompts like “Analyze each section of this Jupyter notebook with PySpark and Scala code. Describe what each section does.”

Notebook 1: Read Last Snapshot Dates

This first notebook runs right as the pipeline starts. It checks the environment, verifies whether the Sites and Permissions tables exist in the Lakehouse, checks the last day data was gathered from MGDC and calculates the start and end dates to use. It also cleans the staging tables and stores a few commands that are used in later steps.

Section 0 – Set the Default Lakehouse for Notebook to Run

%%configure
{
    "defaultLakehouse": {
        "name": {
            "parameterName": "lakehouseName",
            "defaultValue": "defaultlakehousename"
        }
    }
}

This section uses the %%configure magic command to set a JSON configuration that defines a parameter (lakehouseName) with the default value "defaultlakehousename". This setting ensures that when the notebook is launched through a pipeline, it dynamically selects the target Lakehouse.

Section 1 – Initialize Parameters

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.UUID
import java.text.SimpleDateFormat
import java.time.{LocalDate, LocalDateTime, Period}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.Calendar
import java.sql.Timestamp

val runId = "00000000-0000-0000-0000-000000000000"
val workspaceId = spark.conf.get("trident.workspace.id")
val workspaceName = "LakeHouseTesting"
val lakehouseId = spark.conf.get("trident.lakehouse.id")
val lakehouseName = "IMAXDefault"
val sitesStagingTableName = "Sites_Staging"
val sitesFinalTableName = "Sites"
val permissionsStagingTableName = "Permissions_Staging"
val permissionsFinalTableName = "Permissions"
val endTime = "2024-11-15T00:00:00Z"

spark.conf.set("spark.sql.caseSensitive", true)

This section imports various libraries for date/time handling and initializes key parameters for the ETL process. These include a run identifier (runId), workspace and Lakehouse information (with some values coming from the Spark configuration), table names for staging and final datasets, and a fallback endTime. It also enforces case sensitivity in Spark SQL.

Section 2 – Checking Required Final Tables Exist or Not

val lakehouse = mssparkutils.lakehouse.get(lakehouseName)
val lakehouseId = lakehouse.id
val workspaceName = notebookutils.runtime.context("currentWorkspaceName")

val permissionsStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsStagingTableName}"
val sitesStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesStagingTableName}"
val sitesFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesFinalTableName}"
val permissionsFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsFinalTableName}"

val tables = spark.catalog.listTables()
val siteTableCount = tables.filter(col("name") === lit(sitesFinalTableName) and array_contains(col("namespace"), lakehouseName)).count()
val permissionsTableCount = tables.filter(col("name") === lit(permissionsFinalTableName) and array_contains(col("namespace"), lakehouseName)).count()
val siteStagingTableCount = tables.filter(col("name") === lit(sitesStagingTableName) and array_contains(col("namespace"), lakehouseName)).count()
val permissionsStagingTableCount = tables.filter(col("name") === lit(permissionsStagingTableName) and array_contains(col("namespace"), lakehouseName)).count()

This section retrieves the Lakehouse object and uses it to construct ABFS paths for both staging and final tables (for Sites and Permissions). It then checks for the existence of these tables by listing them in Spark’s catalog and filtering by name and namespace.

Section 3 – Getting Snapshot Dates from Last Successful Extracts

import org.apache.spark.sql.functions.{col, _}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.storage.StorageLevel

val dtCurrentDateFormatt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S")
val dtRequiredtDateFormatt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")

var siteDataExists: Boolean = false
var permissionsDataExists: Boolean = false

val siteSnapshotDate = {
  if (siteTableCount == 1) {
    val dfSites = spark.sql(s"SELECT MAX(SnapshotDate) AS SnapshotDate FROM ${lakehouseName}.${sitesFinalTableName} ")
    val rowSites: Row = dfSites.select("SnapshotDate").head(1)(0)
    if (rowSites.get(0) == null) endTime
    else {
      siteDataExists = true
      println(s"Sites data Exists: ${siteDataExists}")
      LocalDateTime.parse(rowSites.get(0).toString(), dtCurrentDateFormatt).format(dtRequiredtDateFormatt)
    }
  } else {
    endTime
  }
}

val permissionsSnapshotDate = {
  if (permissionsTableCount == 1) {
    val dfPermissions = spark.sql(s"SELECT MAX(SnapshotDate) AS SnapshotDate FROM ${lakehouseName}.${permissionsFinalTableName} ")
    val rowPermissions: Row = dfPermissions.select("SnapshotDate").head(1)(0)
    if (rowPermissions.get(0) == null) endTime
    else {
      permissionsDataExists = true
      println(s"Permissions data Exists: ${permissionsDataExists}")
      LocalDateTime.parse(rowPermissions.get(0).toString(), dtCurrentDateFormatt).format(dtRequiredtDateFormatt)
    }
  } else {
    endTime
  }
}

This section queries the final tables to retrieve the latest SnapshotDate for both Sites and Permissions. It then reformats the date into an ISO-compliant format. If no snapshot date is found, it defaults to the predefined endTime, and two boolean flags (siteDataExists and permissionsDataExists) are toggled accordingly.

Section 4 – Generate View Script for Sites

val sitesView: String = s"""
CREATE OR ALTER VIEW vw${sitesFinalTableName} AS
SELECT *,
[StorageQuotaFriendly] =
    (case
        when StorageQuota < 1048576 then concat(ceiling(StorageQuota / 1024.0), ' KB')
        when StorageQuota < 1073741824 then concat(ceiling(StorageQuota / 1048576.0), ' MB')
        when StorageQuota < 1099511627776 then concat(ceiling(StorageQuota / 1073741824.0), ' GB')
        when StorageQuota < 1125899906842624 then concat(ceiling(StorageQuota / 1099511627776.0), ' TB')
        else concat(ceiling(StorageQuota / 1125899906842624.0), ' PB')
    end),
[StorageUsedFriendly] =
    (case
        when StorageUsed < 1048576 then concat(ceiling(StorageUsed / 1024.0), ' KB')
        when StorageUsed < 1073741824 then concat(ceiling(StorageUsed / 1048576.0), ' MB')
        when StorageUsed < 1099511627776 then concat(ceiling(StorageUsed / 1073741824.0), ' GB')
        when StorageUsed < 1125899906842624 then concat(ceiling(StorageUsed / 1099511627776.0), ' TB')
        else concat(ceiling(StorageUsed / 1125899906842624.0), ' PB')
    end)
FROM ${sitesFinalTableName}
""".stripMargin.replaceAll("[\n\r]", " ")

println(sitesView)

Here a SQL view (vwSites) is dynamically generated for the Sites final table. The view adds two computed columns (StorageQuotaFriendly and StorageUsedFriendly) that convert byte values into more digestible units such as KB, MB, GB and so on. This script will be stored and executed later.

Section 5 – Generate View Script for Permissions

val permissionsView: String = s"""
CREATE OR ALTER VIEW vw${permissionsFinalTableName} AS
SELECT *,
ShareeDomain =
    CASE WHEN CHARINDEX('@', SharedWith_Email) > 0 AND CHARINDEX('.', SharedWith_Email) > 0
         THEN SUBSTRING(SharedWith_Email, CHARINDEX('@', SharedWith_Email) + 1, LEN(SharedWith_Email))
         ELSE '' END,
ShareeEMail =
    CASE WHEN CHARINDEX('@', SharedWith_Email) > 0 THEN SharedWith_Email
         ELSE '' END,
PermissionsUniqueKey =
    CONCAT(SiteId, '_', RoleDefinition, '_', ScopeId, '_', COALESCE(LinkId, '00000000-0000-0000-0000-000000000000')),
EEEUPermissionsCount =
    SUM(CASE WHEN SharedWith_Name LIKE 'Everyone except external users' THEN 1 ELSE NULL END)
    OVER (PARTITION BY CONCAT(SiteId, '_', RoleDefinition, '_', ScopeId, '_', COALESCE(LinkId, '00000000-0000-0000-0000-000000000000'), SharedWith_Name)),
ExternalUserCount =
    SUM(CASE WHEN SharedWith_TypeV2 LIKE 'External' THEN 1 ELSE NULL END)
    OVER (PARTITION BY CONCAT(SiteId, '_', RoleDefinition, '_', ScopeId, '_', COALESCE(LinkId, '00000000-0000-0000-0000-000000000000'), SharedWith_Name)),
B2BUserCount =
    SUM(CASE WHEN SharedWith_TypeV2 LIKE 'B2BUser' THEN 1 ELSE NULL END)
    OVER (PARTITION BY CONCAT(SiteId, '_', RoleDefinition, '_', ScopeId, '_', COALESCE(LinkId, '00000000-0000-0000-0000-000000000000'), SharedWith_Name))
FROM ${permissionsFinalTableName}
""".stripMargin.replaceAll("[\n\r]", " ")

println(permissionsView)

This section builds a SQL view (vwPermissions) for the Permissions final table. It derives additional columns like ShareeDomain, ShareeEMail and a composite key (PermissionsUniqueKey), while applying window functions to compute counts (for example, of external or B2B users). This script will also be stored and executed later.
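Once the pipeline has created this view, the derived columns make common oversharing questions easy to express from the Lakehouse SQL analytics endpoint. Here is a minimal sketch (the filter and grouping are illustrative, not part of the template):

-- External sharing summary per site, using columns exposed by vwPermissions
SELECT SiteId, ShareeDomain, COUNT(*) AS ExternalPermissionCount
FROM vwPermissions
WHERE SharedWith_TypeV2 = 'External'
GROUP BY SiteId, ShareeDomain
ORDER BY ExternalPermissionCount DESC;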
Section 6 – Truncate the Staging Tables from Previous Runs

if (siteStagingTableCount == 1) {
  spark.sql(s"DELETE FROM ${lakehouseName}.${sitesStagingTableName} ")
  println(s"Staging table deleted: ${lakehouseName}.${sitesStagingTableName}")
} else {
  println(s"Staging table ${lakehouseName}.${sitesStagingTableName} not found")
}

if (permissionsStagingTableCount == 1) {
  spark.sql(s"DELETE FROM ${lakehouseName}.${permissionsStagingTableName} ")
  println(s"Staging table deleted: ${lakehouseName}.${permissionsStagingTableName}")
} else {
  println(s"Staging table ${lakehouseName}.${permissionsStagingTableName} not found")
}

This section checks whether the staging tables exist (by count) and, if found, issues a SQL DELETE command to remove existing data so that new data can be loaded. It prints messages indicating the action taken.

Section 7 – Return Snapshot Dates Back to Pipeline

import mssparkutils.notebook

val returnData = s"""{\"LakehouseId\": \"${lakehouseId}\", \"SitesStagingTableName\": \"${sitesStagingTableName}\", \"SitesFinalTableName\": \"${sitesFinalTableName}\", \"SitesSnapshotDate\": \"${siteSnapshotDate}\", \"SitesDataExists\": ${siteDataExists}, \"SitesView\": \"${sitesView}\", \"PermissionsStagingTableName\": \"${permissionsStagingTableName}\", \"PermissionsFinalTableName\": \"${permissionsFinalTableName}\", \"PermissionsSnapshotDate\": \"${permissionsSnapshotDate}\", \"EndSnapshotDate\": \"${endTime}\", \"PermissionsDataExists\": ${permissionsDataExists}, \"PermissionsView\": \"${permissionsView}\"}"""

println(returnData)
mssparkutils.notebook.exit(returnData)

This concluding section aggregates the key metadata (Lakehouse information, table names, snapshot dates, existence flags and the generated view scripts) into a JSON string. It then exits the notebook by returning that JSON to the pipeline.

Notebook 2: Merge Sites and Permissions to Final Table

This notebook runs after the Sites and Permissions data from MGDC has been collected successfully into the staging tables. If this is the first collection, it handles them as full datasets, storing the data directly in the final tables. If the pipeline is using MGDC for SharePoint delta datasets, it merges the new, updated or deleted objects from the staging tables into the final tables.

Note: The word "Delta" here might refer to Delta Parquet (an efficient data storage format used by tables in a Microsoft Fabric Lakehouse) or to the MGDC for SharePoint delta datasets (how MGDC can return only the objects that are new, updated or deleted between two dates). It can be a bit confusing, so be aware of the two interpretations of the word.

Section 0 – Set the Default Lakehouse for Notebook to Run

%%configure
{
    "defaultLakehouse": {
        "name": {
            "parameterName": "lakehouseName",
            "defaultValue": "defaultlakehousename"
        }
    }
}

This section uses the same Lakehouse configuration as in Notebook 1. It sets the default Lakehouse through a parameter (lakehouseName) to support dynamic running of the notebook in different environments.

Section 1 – Initialize Parameters

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.UUID
import java.text.SimpleDateFormat
import java.time.{LocalDate, LocalDateTime, Period}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.Calendar

val runId = "00000000-0000-0000-0000-000000000000"
val workspaceId = spark.conf.get("trident.workspace.id")
val workspaceName = "LakeHouseTesting"
val lakehouseId = spark.conf.get("trident.lakehouse.id")
val lakehouseName = spark.conf.get("trident.lakehouse.name")
val sitesStagingTableName = "Sites_Staging"
val sitesFinalTableName = "Sites"
val permissionsStagingTableName = "Permissions_Staging"
val permissionsFinalTableName = "Permissions"

spark.conf.set("spark.sql.caseSensitive", true)

This section is similar to Notebook 1’s Section 1, but here lakehouseName is retrieved from the Spark configuration. It initializes variables needed for merging, such as the run ID, workspace and Lakehouse identifiers, and table names.

Section 2 – Read Sites Dataset from Staging Table

val lakehouse = mssparkutils.lakehouse.get(lakehouseName)
val lakehouseId = lakehouse.id
val workspaceName = notebookutils.runtime.context("currentWorkspaceName")

println("Started reading Sites dataset")
val sitesStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesStagingTableName}"
val dfSitesStaging = spark.read.format("delta").load(sitesStagingLocation)
println("Completed reading Sites dataset")

This section constructs the ABFS path for the Sites staging table and reads the dataset into a DataFrame using the Delta Parquet format. It includes print statements to track progress.

Section 3 – Read Permissions Dataset from Staging Table

println("Started reading Permissions dataset")
val permissionsStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsStagingTableName}"
val dfPermissionsStaging = spark.read.format("delta").load(permissionsStagingLocation)
println("Completed reading Permissions dataset")

This section performs the analogous operation for the Permissions staging table, loading the dataset into a DataFrame and providing console output for monitoring.

Section 4 – Check Final Tables Exist or Not

import io.delta.tables.DeltaTable

val sitesFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesFinalTableName}"
val permissionsFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsFinalTableName}"

val sitesFinalTableExists = DeltaTable.isDeltaTable(spark, sitesFinalLocation)
if (!sitesFinalTableExists) {
  println("Final Sites table not exists. Creating final Sites table with schema only")
  dfSitesStaging.filter("1=2").write.format("delta").mode("overwrite").save(sitesFinalLocation)
  println("Final Sites table created")
} else {
  println("Final Sites table exists already")
}

val permissionsFinalTableExists = DeltaTable.isDeltaTable(spark, permissionsFinalLocation)
if (!permissionsFinalTableExists) {
  println("Final Permissions table not exists. Creating final Permissions table with schema only")
  dfPermissionsStaging.filter("1=2").write.format("delta").mode("overwrite").save(permissionsFinalLocation)
  println("Final Permissions table created")
} else {
  println("Final Permissions table exists already")
}

This section checks whether the final tables for Sites and Permissions exist. If a table does not exist, it creates an empty table (schema only) from the staging DataFrame by filtering out all data (filter("1=2")).

Section 5 – Merge Sites Data from Staging Table to Final Table

import io.delta.tables._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{coalesce, lit, sum, col, _}
import org.apache.spark.sql.types.{StructField, _}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.storage.StorageLevel

val deltaTableSource = DeltaTable.forPath(spark, sitesStagingLocation)
val deltaTableTarget = DeltaTable.forPath(spark, sitesFinalLocation)

import spark.implicits._
val dfSource = deltaTableSource.toDF

// Delete records that have Operation as Deleted
println("Merging Sites dataset from current staging table")
deltaTableTarget
  .as("target")
  .merge(
    dfSource.as("source"),
    "source.Id = target.Id")
  .whenMatched("source.Operation = 'Deleted'")
  .delete()
  .whenMatched("source.Operation != 'Deleted'")
  .updateAll()
  .whenNotMatched("source.Operation != 'Deleted'")
  .insertAll()
  .execute()
println("Merging of Sites dataset completed")

This section performs a Delta Lake merge (upsert) operation on the Sites data. The merge logic deletes matching records when the source’s Operation is 'Deleted', updates other matching records, and inserts new records that are not marked as 'Deleted'.

Section 6 – Merge Permissions Data from Staging Table to Final Table

import io.delta.tables._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{coalesce, lit, sum, col, _}
import org.apache.spark.sql.types.{StructField, _}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.storage.StorageLevel

val deltaTablePermissionsSource = DeltaTable.forPath(spark, permissionsStagingLocation)
val deltaTablePermissionsTarget = DeltaTable.forPath(spark, permissionsFinalLocation)

import spark.implicits._
val dfPermissionsSource = deltaTablePermissionsSource.toDF

// Delete records that have Operation as Deleted
println("Merging Permissions dataset from current staging table")
deltaTablePermissionsTarget
  .as("target")
  .merge(
    dfPermissionsSource.as("source"),
    """source.SiteId = target.SiteId
       and source.ScopeId = target.ScopeId
       and source.LinkId = target.LinkId
       and source.RoleDefinition = target.RoleDefinition
       and coalesce(source.SharedWith_Name,"") = coalesce(target.SharedWith_Name,"")
       and coalesce(source.SharedWith_TypeV2,"") = coalesce(target.SharedWith_TypeV2,"")
       and coalesce(source.SharedWith_Email,"") = coalesce(target.SharedWith_Email,"")
       and coalesce(source.SharedWith_AADObjectId,"") = coalesce(target.SharedWith_AADObjectId,"") """)
  .whenMatched("source.Operation = 'Deleted'")
  .delete()
  .whenMatched("source.Operation != 'Deleted'")
  .updateAll()
  .whenNotMatched("source.Operation != 'Deleted'")
  .insertAll()
  .execute()
println("Merging of Permissions dataset completed")

This section performs a merge operation on the Permissions data. The merge condition is more complex: it compares multiple columns (handling nulls with coalesce) to identify matching records. The operation applies deletion for rows marked as 'Deleted', updates others, and inserts records where no match exists.

Section 7 – Read and Display Sample TOP 10 Rows

var sqlQuery = s"SELECT * FROM ${lakehouseName}.${sitesFinalTableName} order by SnapshotDate DESC LIMIT 10"
val dfSitesAll = spark.sql(sqlQuery)
display(dfSitesAll)

sqlQuery = s"SELECT * FROM ${lakehouseName}.${permissionsFinalTableName} order by SnapshotDate DESC LIMIT 10"
val dfPermissionsAll = spark.sql(sqlQuery)
display(dfPermissionsAll)

This final section executes SQL queries to retrieve and display the top 10 rows from both the Sites and Permissions final tables, ordered by SnapshotDate in descending order. This is typically used for sampling or debugging purposes.

Conclusion

I hope this article helped you understand the notebooks included in the template, which should also help you customize them later. These templates are intended as starting points for your work with many scenarios. Read more about MGDC for SharePoint at https://aka.ms/SharePointData.
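In the same spirit as Section 7, a quick way to sanity-check a run is to look at row counts per snapshot in the final tables. Here is a minimal sketch you could run from the Lakehouse SQL analytics endpoint (or adapt as a spark.sql call in the notebook); it is not part of the template itself:

-- Row counts per snapshot in the final Sites table
SELECT SnapshotDate, COUNT(*) AS SiteCount
FROM Sites
GROUP BY SnapshotDate
ORDER BY SnapshotDate DESC;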
MGDC for SharePoint FAQ: How to use Excel for analytics

In this post, we’ll discuss how to use Microsoft Excel to consume the data from Microsoft Graph Data Connect (MGDC) for SharePoint. If you’re new to MGDC, get an overview at https://aka.ms/SharePointData.

1. Introduction

In most examples of how to analyze the data from MGDC for SharePoint, we use Azure Synapse in combination with Power BI Desktop, which is a very capable tool that can turn your data into coherent, visually immersive and interactive insights. Keep in mind, though, that when pulling datasets from MGDC with Azure Synapse, the data becomes available in an Azure storage account as a set of JSON files. From there, you can use many different tools beyond Power BI. In this article, we will focus on how to read and analyze that data using Microsoft Excel.

2. Get the data into Azure storage

The first step in the process is to get your data from MGDC. I have a blog post that describes that process: Step-by-step: (Synapse) Gather a detailed dataset on SharePoint Sites using MGDC. Section 1 of that blog shows how to set up Microsoft Graph Data Connect and section 2 describes how to run an Azure Synapse pipeline to pull the data. If you complete these first two sections, you will end up with your SharePoint Sites data in Azure. You can skip section 3 of that blog, which describes how to get the data into a Power BI dashboard. You will replace that section with the information in this article, which takes the data to Microsoft Excel instead.

3. Bring the data from Azure storage to Excel

Next, you need to bring the data in the JSON files from the Azure storage account to your Excel spreadsheet. Fortunately, Excel has the perfect set of tools to bring in that data.

You should start with a blank spreadsheet. Select the “Data” tab in the ribbon and use the option to “Get Data”. Then select the option “From Azure” and choose “From Azure Data Lake Storage Gen2”. This will make sure that you’re getting the data in your ADLS Gen2 storage account.

> Get Data --> From Azure --> From Azure Data Lake Storage Gen2

Next, you will need to provide the URL to the ADLS Gen2 storage. Provide the full path, with the container and folder, in the format https://account.dfs.core.windows.net/container/folder. The account name is in step 1d of the step-by-step guide mentioned before. The container and folder names are in step 1f of the guide. You can also find the URL in the Azure Portal, under the Settings / Endpoints section of the storage account.

Once you provide the URL and click OK, you will need to authenticate. Select the option to use the storage account key and type in the key you saved in step 1d of the step-by-step guide. You can find the key in the Azure Portal, under the “Security + Networking” / “Access Keys” section of the storage account. Once you have typed the account key, click “Connect”.

At this point, you want to click on the “Transform Data” button to filter for the specific files you want. This will open the “Power Query Editor” window, where you can apply specific steps to filter the data coming in. The easiest way to filter out the metadata files is to scroll right until you find the “Folder Path” column and filter that column for paths that do not contain the word “metadata”.

> down arrow --> Text filters --> Does not contain...

After that, you can scroll left to the “Content” column and click on the button to “Combine files”. The query will then pull in the content of the files and expand all the columns from the JSON.

> Content --> Combine files

Finally, you can click on the “Close & Load” button to bring the data into a sheet. With that, you can play with the data in Excel.

4. Create PivotTables

Once the data is imported into your Excel spreadsheet, you will be able to create data visualizations from it. The most interesting way to do this is to use Excel PivotTables. For instance, here’s a PivotTable with storage (Storage Used) by site type (RootWeb.WebTemplate) on the different dates you pulled data (SnapshotDate). See the sample PivotTable below, after some adjustments and formatting.

5. Tips

Here are a few additional comments.

- Different types of authentication: As you might have noticed, there are multiple ways to authenticate to get access to the Azure storage account. In addition to using a key to the entire storage account as shown here, you could use Azure to grant specific permissions to a user and log in as that user. This will help restrict access to specific folders and grant limited permissions (like read-only).
- Create custom columns: To help with your data analysis, you might want to add a few custom or calculated columns. You can do this in many ways, including transformations in Power Query, formulas in the Excel sheet or calculations in the PivotTable itself. You can read more about it at How to create custom columns in Power BI (this says Power BI but applies to Excel as well).
- More than one million rows: If you’re using this to visualize data from the SharePoint Sites dataset, you are likely to have fewer than one million rows, which is the Microsoft Excel limit. If you’re one of the few tenants that has more than one million sites, or if you’re analyzing a larger dataset like Permissions or Files, you can only pull the first one million rows into a sheet. To overcome this, you can load the data into a data model and use that to drive your PivotTables. You can read more about it at Loading CSV/text files with more than a million rows into Excel (this says text files but applies to JSON as well).

6. Conclusion

I hope this post helped you understand how to use Microsoft Excel to pull JSON data from an Azure storage account used by MGDC for SharePoint. For more information, visit https://aka.ms/SharePointData.