Capacity Template v2 with Microsoft Fabric
1. Capacity Scenario

One of the most common scenarios for Microsoft Graph Data Connect (MGDC) for SharePoint is Capacity. This scenario focuses on identifying which sites and files are using the most storage, along with understanding the distribution of these large sites and files by properties like type and age.

The MGDC datasets for this scenario are SharePoint Sites and SharePoint Files. If you're not familiar with these datasets, you can find details in the schema definitions at https://aka.ms/SharePointDatasets.

To assist you in using these datasets, the team has developed a Capacity Template. Initially published as a template for Azure Synapse, we now have a new Microsoft Fabric template that is simpler and offers more features. This SharePoint Capacity v2 Template, based on Microsoft Fabric, is now publicly available.

2. Instructions

The template comes with a set of detailed instructions at https://aka.ms/fabriccapacitytemplatesteps. These instructions include:

- How to install the Microsoft Fabric and Microsoft Graph Data Connect prerequisites
- How to import the pipeline template from the Microsoft Fabric gallery and set it up
- How to import the Power BI template and configure the data source settings

See below for some additional details about the template.

3. Microsoft Fabric Pipeline

After you import the pipeline template, it will look like this:

4. Pipeline in Microsoft Fabric

The Capacity template for Microsoft Fabric includes a few key improvements:

- The new template uses delta datasets to update the SharePoint Sites and SharePoint Files datasets. It keeps track of the last time the datasets were pulled by this pipeline, requesting just what changed since then.
- The new template uses views to do calculations and create new properties like size bands or date bands. In our previous template, this was done in Power Query, when importing into Power BI. (A simplified example of such a view appears at the end of this article.)
- The new template also uses a view to aggregate file data, grouping the data by file extension.

You can find details on how to find and deploy the Microsoft Fabric template in the instructions (see item 2).

5. Microsoft Fabric Report

The typical result from this solution is a set of Power BI dashboards pulled from the Microsoft Fabric data source. Here are some examples:

These dashboards serve as examples or starting points and can be modified as necessary for various visualizations of the data within these datasets. The instructions (see item 2) include details on how to find and deploy a few sample Power BI Capacity templates.

6. Conclusion

I hope this provides a good overview of the Capacity template for Microsoft Fabric. You can read more about Microsoft Graph Data Connect for SharePoint at https://aka.ms/SharePointData. There you will find many details, including a list of datasets available, other common scenarios and frequently asked questions.
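To give a concrete idea of the kind of view mentioned in item 4, here is a simplified size-band view in T-SQL. This is only an illustration I added, not the template's actual view definition: the table name, view name and band thresholds are assumptions, while StorageMetrics_TotalSize follows the SharePoint Sites schema.

-- Illustrative only: a size-band view similar in spirit to the views used by the template.
-- "Sites" and the band thresholds are assumptions; adjust to your own table names.
CREATE OR ALTER VIEW vwSitesWithSizeBand AS
SELECT
    s.*,
    CASE
        WHEN s.StorageMetrics_TotalSize < 1073741824   THEN 'Under 1 GB'
        WHEN s.StorageMetrics_TotalSize < 10737418240  THEN '1 GB to 10 GB'
        WHEN s.StorageMetrics_TotalSize < 107374182400 THEN '10 GB to 100 GB'
        ELSE 'Over 100 GB'
    END AS SizeBand   -- a report can group sites by this storage band
FROM Sites AS s;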
Using Copilot (GPT-5) to turn schema into SQL and write queries

This article shows a few ways you can use Copilot to help with Microsoft Graph Data Connect (MGDC) for SharePoint. This includes turning a dataset schema into a SQL CREATE TABLE statement and then writing queries against the resulting tables. Make sure to select the GPT-5 option in Copilot for best results.

Scenario 1

This first scenario just builds a CREATE TABLE statement from the dataset schema definition in markdown notation. I provide a few instructions on how to handle the flattening of the objects.

Me: Given the schema at this URL: https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointsites.md, give me a CREATE TABLE statement for a SQL Server. Just include the single SQL command in the output. Each column should have a comment with a short description. Don't include any information about keys or constraints. This dataset is flattened. One column in the table for every row, except those of type object. Column names to be flattened with a prefix. Use an underline to separate the prefix.

Copilot:

CREATE TABLE SharePointSites (
 ptenant NVARCHAR(255) -- Id of the tenant
,Id NVARCHAR(255) -- GUID of the site
,Url NVARCHAR(2048) -- URL for the site
,ArchiveState NVARCHAR(50) -- Archive state of the site: None, Archiving, Archived, or Reactivating
,SiteArchivedBy NVARCHAR(255) -- Who archived the site, null if not archived
,SiteArchivedTime DATETIME -- When the site was archived, null if not archived
,RootWeb_Id NVARCHAR(255) -- Root web id
,RootWeb_Title NVARCHAR(512) -- Root web title
,RootWeb_WebTemplate NVARCHAR(255) -- Root web template name
,RootWeb_WebTemplateId INT -- Root web template id
,RootWeb_Configuration INT -- Root web template configuration id
,RootWeb_LastItemModifiedDate DATETIME -- Date when an item in the root web was last modified
,WebCount BIGINT -- Number of webs (subsites) in the site
,StorageQuota BIGINT -- Total storage in bytes allowed for this site
,StorageUsed BIGINT -- Total storage in bytes used by this site
,StorageMetrics_MetadataSize BIGINT -- Total metadata size for the site in bytes
,StorageMetrics_TotalFileCount BIGINT -- Total number of files for the site
,StorageMetrics_TotalFileStreamSize BIGINT -- Total size of latest version of files in bytes
,StorageMetrics_AdditionalFileStreamSize BIGINT -- Total size of additional file streams in bytes
,StorageMetrics_TotalSize BIGINT -- Total size of all files for the site in bytes
,GroupId NVARCHAR(255) -- Id of the group associated with this site
,GeoLocation NVARCHAR(50) -- Geographic region where the data is stored
,IsInRecycleBin BIT -- Indicates if the site is in the recycle bin
,RecycleBinItemCount BIGINT -- Number of items in the recycle bin
,RecycleBinItemSize BIGINT -- Size of items in the recycle bin
,SecondStageRecycleBinStorageUsage BIGINT -- Size of items in the second stage recycle bin
,IsTeamsConnectedSite BIT -- Indicates if the site is connected to Teams
,IsTeamsChannelSite BIT -- Indicates if the site is a channel site
,TeamsChannelType NVARCHAR(50) -- Type of channel if site is a Teams channel site
,IsHubSite BIT -- Indicates if the site is associated with a hub site
,HubSiteId NVARCHAR(255) -- Id of the hub site for this site
,IsCommunicationSite BIT -- Indicates if the site is a communication site
,IsOneDrive BIT -- Indicates if the site is a OneDrive
,BlockAccessFromUnmanagedDevices BIT -- Site blocks access from unmanaged devices
,BlockDownloadOfAllFilesOnUnmanagedDevices BIT -- Site blocks download of all files from unmanaged devices
,BlockDownloadOfViewableFilesOnUnmanagedDevices BIT -- Site blocks download of viewable files from unmanaged devices
,ShareByEmailEnabled BIT -- Site enables share by e-mail
,ShareByLinkEnabled BIT -- Site enables share by link
,IsExternalSharingEnabled BIT -- Indicates if external sharing is enabled
,SiteConnectedToPrivateGroup BIT -- Indicates if site is connected to a Private Group
,Privacy NVARCHAR(50) -- Privacy of the site: Private or Public
,SensitivityLabelInfo_Id NVARCHAR(255) -- Id of the Sensitivity Label for the site
,SensitivityLabelInfo_DisplayName NVARCHAR(255) -- Display name of the Sensitivity Label for the site
,Classification NVARCHAR(255) -- Classification of the site
,IBMode NVARCHAR(50) -- Information Barriers Mode
,IBSegments NVARCHAR(MAX) -- List of organization segments if IB mode is Explicit
,Owner_AadObjectId NVARCHAR(255) -- AAD Object Id of the owner
,Owner_Email NVARCHAR(320) -- Email of the owner
,Owner_UPN NVARCHAR(320) -- User Principal Name of the owner
,Owner_Name NVARCHAR(255) -- Name of the owner
,SecondaryContact_AadObjectId NVARCHAR(255) -- AAD Object Id of the secondary contact
,SecondaryContact_Email NVARCHAR(320) -- Email of the secondary contact
,SecondaryContact_UPN NVARCHAR(320) -- User Principal Name of the secondary contact
,SecondaryContact_Name NVARCHAR(255) -- Name of the secondary contact
,ReadLocked BIT -- Whether the site is locked for read access
,ReadOnly BIT -- Whether the site is in read-only mode
,CreatedTime DATETIME -- When the site was created (UTC)
,LastSecurityModifiedDate DATETIME -- When security on the site was last changed (UTC)
,LastUserAccessDate DATETIME -- Last access by a real user (UTC)
,LastContentChange DATETIME -- When the site contents were last changed (UTC)
,SnapshotDate DATETIME -- When this site information was captured (UTC)
,Operation NVARCHAR(50) -- Extraction mode of this row
);

Scenario 2

Now I use the CREATE TABLE as the base to ask for a SELECT statement.

Me: Give me a list of sites that have more than 5GB of files. Give me a single SQL statement using the tables described below. <create table statements included here>

Copilot:

SELECT
    Id AS SiteId,
    Url AS SiteUrl,
    RootWeb_Title AS SiteTitle,
    StorageMetrics_TotalSize AS TotalSizeInBytes,
    CAST(StorageMetrics_TotalSize / (1024.0 * 1024 * 1024) AS DECIMAL(10,2)) AS TotalSizeInGB
FROM SharePointSites
WHERE StorageMetrics_TotalSize > (5 * 1024 * 1024 * 1024)
ORDER BY StorageMetrics_TotalSize DESC;

Note: Interesting choice of columns and order, without any specific prompting to that effect.

Scenario 3

Now for a more challenging scenario using two tables and a more complex requirement.

Me: Given the tables below, write a SELECT statement to give me a list of sites with the name of the site, the total storage size and the top 3 file extensions by size.
<create table statements for Sites and Files included here>

Copilot:

WITH ExtensionTotals AS (
    SELECT f.SiteId, f.Extension, SUM(f.SizeInBytes) AS TotalSize
    FROM SharePointFiles f
    WHERE f.Extension IS NOT NULL
    GROUP BY f.SiteId, f.Extension
),
RankedExtensions AS (
    SELECT et.SiteId, et.Extension, et.TotalSize,
           ROW_NUMBER() OVER (PARTITION BY et.SiteId ORDER BY et.TotalSize DESC) AS rn
    FROM ExtensionTotals et
),
TopExtensions AS (
    SELECT SiteId,
           STRING_AGG(Extension + ' (' + CAST(TotalSize AS VARCHAR(20)) + ')', ', ')
               WITHIN GROUP (ORDER BY TotalSize DESC) AS Top3Extensions
    FROM RankedExtensions
    WHERE rn <= 3
    GROUP BY SiteId
)
SELECT
    s.RootWeb_Title AS SiteName,
    s.StorageMetrics_TotalSize AS TotalStorageSize,
    te.Top3Extensions
FROM SharePointSites s
LEFT JOIN TopExtensions te ON s.Id = te.SiteId
ORDER BY s.StorageMetrics_TotalSize DESC;

Notes: Scenario is nicely handled, using both the Files and Sites tables. Complex aggregation is coded correctly.

Scenario 4

In this one, I skip the step of using a SQL CREATE TABLE statement and go directly from the markdown for the two schemas (Sites and Files) to a SQL SELECT statement. I also let GPT-5 add its comments and notes this time, to make sure it's thinking correctly...

Me: Consider a SQL table for SharePoint Sites based on the schema at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointsites.md - This dataset is flattened. There is one column in the SQL table for every row, except those of type object. Column names are flattened with a prefix. Column names use an underline to separate the prefix. Also consider a SQL table for SharePoint Files based on the schema at https://github.com/microsoftgraph/dataconnect-solutions/blob/main/Datasets/data-connect-dataset-sharepointfiles.md - Use the same assumptions as the previous one. Now write a SELECT statement to give me a list of sites with the name of the site, the total storage size and the top 3 file extensions by size.

Copilot: Got it — you want to join the flattened SharePoint Sites table with the flattened SharePoint Files table, aggregate storage usage, and then find the top 3 file extensions by size for each site. Here's a SQL example that works with the flattened column naming convention you described:

WITH FileSizeByExtension AS (
    SELECT f.SiteId, f.Extension, SUM(f.SizeInBytes) AS TotalExtensionSize
    FROM SharePointFiles f
    GROUP BY f.SiteId, f.Extension
),
RankedExtensions AS (
    SELECT SiteId, Extension, TotalExtensionSize,
           ROW_NUMBER() OVER (PARTITION BY SiteId ORDER BY TotalExtensionSize DESC) AS rn
    FROM FileSizeByExtension
),
Top3Extensions AS (
    SELECT SiteId,
           STRING_AGG(Extension, ', ' ORDER BY TotalExtensionSize DESC) AS TopExtensions
    FROM RankedExtensions
    WHERE rn <= 3
    GROUP BY SiteId
)
SELECT
    s.RootWeb_Title AS SiteName,
    s.StorageMetrics_TotalSize AS TotalStorageBytes,
    t.TopExtensions
FROM SharePointSites s
LEFT JOIN Top3Extensions t ON s.Id = t.SiteId
ORDER BY s.StorageMetrics_TotalSize DESC;

How it works

- FileSizeByExtension: Groups files by SiteId and Extension to sum their sizes.
- RankedExtensions: Uses ROW_NUMBER() to rank extensions by size within each site.
- Top3Extensions: Filters to the top 3 per site and concatenates them into a comma-separated list.
- Final SELECT: Joins back to the Sites table to get the site name and total storage size.

Notes

- RootWeb_Title comes from flattening the RootWeb object in the Sites dataset.
- StorageMetrics_TotalSize is the flattened column for total site storage (excluding recycle bin).
- STRING_AGG works in SQL Server 2017+, Azure SQL, and some other RDBMS. If your SQL dialect doesn't support it, you'd need a different aggregation method. (A sketch of one alternative appears at the end of this article.)
- This query orders sites by total storage size so the largest sites appear first.

Note: Nicely done in one prompt, with the proper explanations.

Conclusion

I hope these examples help you see the potential of getting Copilot's help with your queries. This is one more reason to load your MGDC for SharePoint tables directly into Microsoft Fabric, where you can run SQL queries against tables in OneLake...
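One more note on the STRING_AGG point above: on SQL Server versions older than 2017, a similar top-3 list can be built with the classic FOR XML PATH pattern. This is a sketch I added for illustration (it is not Copilot output); it reuses the SharePointSites and SharePointFiles tables and columns from the scenarios above.

-- Alternative to STRING_AGG for older SQL Server versions, using the FOR XML PATH pattern.
WITH ExtensionTotals AS (
    SELECT f.SiteId, f.Extension, SUM(f.SizeInBytes) AS TotalSize
    FROM SharePointFiles AS f
    WHERE f.Extension IS NOT NULL
    GROUP BY f.SiteId, f.Extension
),
RankedExtensions AS (
    SELECT SiteId, Extension, TotalSize,
           ROW_NUMBER() OVER (PARTITION BY SiteId ORDER BY TotalSize DESC) AS rn
    FROM ExtensionTotals
)
SELECT
    s.RootWeb_Title AS SiteName,
    s.StorageMetrics_TotalSize AS TotalStorageBytes,
    -- Concatenate the top 3 extensions per site; STUFF removes the leading ', '
    STUFF((
        SELECT ', ' + r.Extension
        FROM RankedExtensions AS r
        WHERE r.SiteId = s.Id AND r.rn <= 3
        ORDER BY r.TotalSize DESC
        FOR XML PATH(''), TYPE
    ).value('.', 'NVARCHAR(MAX)'), 1, 2, '') AS Top3Extensions
FROM SharePointSites AS s
ORDER BY s.StorageMetrics_TotalSize DESC;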
Understanding the Notebooks in the Oversharing Template v2 (Microsoft Fabric)

Introduction

The Microsoft Graph Data Connect for SharePoint team published two notebooks used with Microsoft Fabric in the Information Oversharing v2 template. This blog explains what each code block inside these notebooks does, to help you understand how the notebooks work. Note that this document was written with help from Copilot, using simple prompts like "Analyze each section of this Jupyter notebook with PySpark and Scala code. Describe what each section does."

Notebook 1: Read Last Snapshot Dates

This first notebook runs right as the pipeline starts. It checks the environment, verifies if the Sites and Permissions tables exist in the Lakehouse, checks the last date data was gathered from MGDC, and calculates the start and end dates to use. It also cleans the staging tables and stores a few commands that are used in later steps.

Section 0 – Set the Default Lakehouse for Notebook to Run

%%configure
{
    "defaultLakehouse": {
        "name": {
            "parameterName": "lakehouseName",
            "defaultValue": "defaultlakehousename"
        }
    }
}

This section uses the %%configure magic command to set a JSON configuration that defines a parameter (lakehouseName) with the default value "defaultlakehousename". This setting ensures that when the notebook is launched through a pipeline, it dynamically selects the target Lakehouse.

Section 1 – Initialize Parameters

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.UUID
import java.text.SimpleDateFormat
import java.time.{LocalDate, LocalDateTime, Period}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.Calendar
import java.sql.Timestamp

val runId = "00000000-0000-0000-0000-000000000000"
val workspaceId = spark.conf.get("trident.workspace.id")
val workspaceName = "LakeHouseTesting"
val lakehouseId = spark.conf.get("trident.lakehouse.id")
val lakehouseName = "IMAXDefault"
val sitesStagingTableName = "Sites_Staging"
val sitesFinalTableName = "Sites"
val permissionsStagingTableName = "Permissions_Staging"
val permissionsFinalTableName = "Permissions"
val endTime = "2024-11-15T00:00:00Z"

spark.conf.set("spark.sql.caseSensitive", true)

This section imports various libraries for date/time handling and initializes key parameters for the ETL process. These include a run identifier (runId), workspace and Lakehouse information (with some values coming from the Spark configuration), table names for staging and final datasets, and a fallback endTime. It also enforces case sensitivity in Spark SQL.
Section 2 – Checking Required Final Tables Exist or Not

val lakehouse = mssparkutils.lakehouse.get(lakehouseName)
val lakehouseId = lakehouse.id
val workspaceName = notebookutils.runtime.context("currentWorkspaceName")

val permissionsStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsStagingTableName}"
val sitesStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesStagingTableName}"
val sitesFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesFinalTableName}"
val permissionsFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsFinalTableName}"

val tables = spark.catalog.listTables()
val siteTableCount = tables.filter(col("name") === lit(sitesFinalTableName) and array_contains(col("namespace"), lakehouseName)).count()
val permissionsTableCount = tables.filter(col("name") === lit(permissionsFinalTableName) and array_contains(col("namespace"), lakehouseName)).count()
val siteStagingTableCount = tables.filter(col("name") === lit(sitesStagingTableName) and array_contains(col("namespace"), lakehouseName)).count()
val permissionsStagingTableCount = tables.filter(col("name") === lit(permissionsStagingTableName) and array_contains(col("namespace"), lakehouseName)).count()

This section retrieves the Lakehouse object and uses it to construct ABFS paths for both staging and final tables (for Sites and Permissions). It then checks for the existence of these tables by listing them in Spark's catalog and filtering by name and namespace.

Section 3 – Getting Snapshot Dates from Last Successful Extracts

import org.apache.spark.sql.functions.{col, _}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.storage.StorageLevel

val dtCurrentDateFormatt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S")
val dtRequiredtDateFormatt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")

var siteDataExists: Boolean = false
var permissionsDataExists: Boolean = false

val siteSnapshotDate = {
    if (siteTableCount == 1) {
        val dfSites = spark.sql(s"SELECT MAX(SnapshotDate) AS SnapshotDate FROM ${lakehouseName}.${sitesFinalTableName} ")
        val rowSites: Row = dfSites.select("SnapshotDate").head(1)(0)
        if (rowSites.get(0) == null) endTime
        else {
            siteDataExists = true
            println(s"Sites data Exists: ${siteDataExists}")
            LocalDateTime.parse(rowSites.get(0).toString(), dtCurrentDateFormatt).format(dtRequiredtDateFormatt)
        }
    } else {
        endTime
    }
}

val permissionsSnapshotDate = {
    if (permissionsTableCount == 1) {
        val dfPermissions = spark.sql(s"SELECT MAX(SnapshotDate) AS SnapshotDate FROM ${lakehouseName}.${permissionsFinalTableName} ")
        val rowPermissions: Row = dfPermissions.select("SnapshotDate").head(1)(0)
        if (rowPermissions.get(0) == null) endTime
        else {
            permissionsDataExists = true
            println(s"Permissions data Exists: ${permissionsDataExists}")
            LocalDateTime.parse(rowPermissions.get(0).toString(), dtCurrentDateFormatt).format(dtRequiredtDateFormatt)
        }
    } else {
        endTime
    }
}

This section queries the final tables to retrieve the latest SnapshotDate for both Sites and Permissions. It then reformats the date into an ISO-compliant format. If no snapshot date is found, it defaults to the predefined endTime, and two boolean flags (siteDataExists and permissionsDataExists) are toggled accordingly.
Section 4 – Generate View Script for Sites

val sitesView: String = s"""
CREATE OR ALTER VIEW vw${sitesFinalTableName} AS
SELECT *,
[StorageQuotaFriendly] = (case
    when StorageQuota < 1048576 then concat(ceiling(StorageQuota / 1024.0), ' KB')
    when StorageQuota < 1073741824 then concat(ceiling(StorageQuota / 1048576.0), ' MB')
    when StorageQuota < 1099511627776 then concat(ceiling(StorageQuota / 1073741824.0), ' GB')
    when StorageQuota < 1125899906842624 then concat(ceiling(StorageQuota / 1099511627776.0), ' TB')
    else concat(ceiling(StorageQuota / 1125899906842624.0), ' PB')
end),
[StorageUsedFriendly] = (case
    when StorageUsed < 1048576 then concat(ceiling(StorageUsed / 1024.0), ' KB')
    when StorageUsed < 1073741824 then concat(ceiling(StorageUsed / 1048576.0), ' MB')
    when StorageUsed < 1099511627776 then concat(ceiling(StorageUsed / 1073741824.0), ' GB')
    when StorageUsed < 1125899906842624 then concat(ceiling(StorageUsed / 1099511627776.0), ' TB')
    else concat(ceiling(StorageUsed / 1125899906842624.0), ' PB')
end)
FROM ${sitesFinalTableName}
""".stripMargin.replaceAll("[\n\r]"," ")

println(sitesView)

Here a SQL view (vwSites) is dynamically generated for the Sites final table. The view adds two computed columns (StorageQuotaFriendly and StorageUsedFriendly) that convert byte values into more digestible units such as KB, MB, GB, etc. This script will be stored and executed later.

Section 5 – Generate View Script for Permissions

val permissionsView: String = s"""
CREATE OR ALTER VIEW vw${permissionsFinalTableName} AS
SELECT *,
ShareeDomain = CASE
    WHEN CHARINDEX('@', SharedWith_Email) > 0 AND CHARINDEX('.', SharedWith_Email) > 0
    THEN SUBSTRING(SharedWith_Email,CHARINDEX('@', SharedWith_Email)+1,LEN(SharedWith_Email))
    ELSE ''
END,
ShareeEMail = CASE
    WHEN CHARINDEX('@', SharedWith_Email) > 0 THEN SharedWith_Email
    ELSE ''
END,
PermissionsUniqueKey = CONCAT(SiteId,'_',RoleDefinition,'_',ScopeId,'_',COALESCE(LinkId,'00000000-0000-0000-0000-000000000000')),
EEEUPermissionsCount = SUM(CASE WHEN SharedWith_Name LIKE 'Everyone except external users' THEN 1 ELSE NULL END)
    OVER( PARTITION BY CONCAT(SiteId,'_',RoleDefinition,'_',ScopeId,'_',COALESCE(LinkId,'00000000-0000-0000-0000-000000000000'),SharedWith_Name) ),
ExternalUserCount = SUM(CASE WHEN SharedWith_TypeV2 LIKE 'External' THEN 1 ELSE NULL END)
    OVER( PARTITION BY CONCAT(SiteId,'_',RoleDefinition,'_',ScopeId,'_',COALESCE(LinkId,'00000000-0000-0000-0000-000000000000'),SharedWith_Name) ),
B2BUserCount = SUM(CASE WHEN SharedWith_TypeV2 LIKE 'B2BUser' THEN 1 ELSE NULL END)
    OVER( PARTITION BY CONCAT(SiteId,'_',RoleDefinition,'_',ScopeId,'_',COALESCE(LinkId,'00000000-0000-0000-0000-000000000000'),SharedWith_Name) )
FROM ${permissionsFinalTableName}
""".stripMargin.replaceAll("[\n\r]"," ")

println(permissionsView)

This section builds a SQL view (vwPermissions) for the Permissions final table. It derives additional columns like ShareeDomain, ShareeEMail, and a composite key (PermissionsUniqueKey), while applying window functions to compute counts (e.g., for external or B2B users). This script will also be stored and executed later.
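To show how these derived columns might be consumed downstream, here is a small query I added against the generated vwPermissions view. It is only a sketch; the 'External' filter value follows the SharedWith_TypeV2 convention used in the view script above, and the column choice is illustrative.

-- Illustrative query against the vwPermissions view: distinct external sharees per site.
SELECT DISTINCT
    SiteId,
    ShareeDomain,
    ShareeEMail
FROM vwPermissions
WHERE SharedWith_TypeV2 = 'External'
ORDER BY SiteId, ShareeDomain;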
Section 6 – Truncate the Staging Tables from Previous Runs

if (siteStagingTableCount == 1) {
    spark.sql(s"DELETE FROM ${lakehouseName}.${sitesStagingTableName} ")
    println(s"Staging table deleted: ${lakehouseName}.${sitesStagingTableName}")
} else {
    println(s"Staging table ${lakehouseName}.${sitesStagingTableName} not found")
}

if (permissionsStagingTableCount == 1) {
    spark.sql(s"DELETE FROM ${lakehouseName}.${permissionsStagingTableName} ")
    println(s"Staging table deleted: ${lakehouseName}.${permissionsStagingTableName}")
} else {
    println(s"Staging table ${lakehouseName}.${permissionsStagingTableName} not found")
}

This section checks if the staging tables exist (by count) and, if found, issues a SQL DELETE command to remove existing data so that new data can be loaded. It prints messages indicating the action taken.

Section 7 – Return Snapshot Dates Back to Pipeline

import mssparkutils.notebook

val returnData = s"""{\"LakehouseId\": \"${lakehouseId}\", \"SitesStagingTableName\": \"${sitesStagingTableName}\", \"SitesFinalTableName\": \"${sitesFinalTableName}\", \"SitesSnapshotDate\": \"${siteSnapshotDate}\", \"SitesDataExists\": ${siteDataExists}, \"SitesView\": \"${sitesView}\", \"PermissionsStagingTableName\": \"${permissionsStagingTableName}\", \"PermissionsFinalTableName\": \"${permissionsFinalTableName}\", \"PermissionsSnapshotDate\": \"${permissionsSnapshotDate}\", \"EndSnapshotDate\": \"${endTime}\", \"PermissionsDataExists\": ${permissionsDataExists}, \"PermissionsView\": \"${permissionsView}\"}"""

println(returnData)
mssparkutils.notebook.exit(returnData)

This concluding section aggregates the key metadata (Lakehouse information, table names, snapshot dates, existence flags, and the generated view scripts) into a JSON string. It then exits the notebook by returning that JSON to the pipeline.

Notebook 2: Merge Sites and Permissions to Final Table

This notebook runs after the Sites and Permissions data from MGDC has been collected successfully into the staging tables. If this is the first collection, it handles the data as full datasets, storing it directly in the final tables. If the pipeline is using MGDC for SharePoint delta datasets, it merges the new, updated or deleted objects from the staging tables into the final tables.

Note: The word "Delta" here might refer to Delta Parquet (an efficient data storage format used by tables in a Microsoft Fabric Lakehouse) or to the MGDC for SharePoint Delta datasets (how MGDC can return only the objects that are new, updated or deleted between two dates). It can be a bit confusing, so be aware of the two interpretations of the word.

Section 0 – Set the Default Lakehouse for Notebook to Run

%%configure
{
    "defaultLakehouse": {
        "name": {
            "parameterName": "lakehouseName",
            "defaultValue": "defaultlakehousename"
        }
    }
}

This section uses the same Lakehouse configuration as in Notebook 1. It sets the default Lakehouse through a parameter (lakehouseName) to support dynamic running of the notebook in different environments.
Section 1 – Initialize Parameters

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.UUID
import java.text.SimpleDateFormat
import java.time.{LocalDate, LocalDateTime, Period}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.Calendar

val runId = "00000000-0000-0000-0000-000000000000"
val workspaceId = spark.conf.get("trident.workspace.id")
val workspaceName = "LakeHouseTesting"
val lakehouseId = spark.conf.get("trident.lakehouse.id")
val lakehouseName = spark.conf.get("trident.lakehouse.name")
val sitesStagingTableName = "Sites_Staging"
val sitesFinalTableName = "Sites"
val permissionsStagingTableName = "Permissions_Staging"
val permissionsFinalTableName = "Permissions"

spark.conf.set("spark.sql.caseSensitive", true)

This section is similar to Notebook 1's Section 1, but here lakehouseName is retrieved from the Spark configuration. It initializes variables needed for merging, such as run IDs, workspace/Lakehouse identifiers, and table names.

Section 2 – Read Sites Dataset from Staging Table

val lakehouse = mssparkutils.lakehouse.get(lakehouseName)
val lakehouseId = lakehouse.id
val workspaceName = notebookutils.runtime.context("currentWorkspaceName")

println("Started reading Sites dataset")
val sitesStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesStagingTableName}"
val dfSitesStaging = spark.read.format("delta").load(sitesStagingLocation)
println("Completed reading Sites dataset")

This section constructs the ABFS path for the Sites staging table and reads the dataset into a DataFrame using the Delta Parquet format. It includes print statements to track progress.

Section 3 – Read Permissions Dataset from Staging Table

println("Started reading Permissions dataset")
val permissionsStagingLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsStagingTableName}"
val dfPermissionsStaging = spark.read.format("delta").load(permissionsStagingLocation)
println("Completed reading Permissions dataset")

This section performs the analogous operation for the Permissions staging table, loading the dataset into a DataFrame and providing console output for monitoring.

Section 4 – Check Final Tables Exist or Not

import io.delta.tables.DeltaTable

val sitesFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${sitesFinalTableName}"
val permissionsFinalLocation = s"abfss://${workspaceName}@onelake.dfs.fabric.microsoft.com/${lakehouseName}.Lakehouse/Tables/${permissionsFinalTableName}"

val sitesFinalTableExists = DeltaTable.isDeltaTable(spark, sitesFinalLocation)
if (!sitesFinalTableExists) {
    println("Final Sites table not exists. Creating final Sites table with schema only")
    dfSitesStaging.filter("1=2").write.format("delta").mode("overwrite").save(sitesFinalLocation)
    println("Final Sites table created")
} else {
    println("Final Sites table exists already")
}

val permissionsFinalTableExists = DeltaTable.isDeltaTable(spark, permissionsFinalLocation)
if (!permissionsFinalTableExists) {
    println("Final Permissions table not exists. Creating final Permissions table with schema only")
    dfPermissionsStaging.filter("1=2").write.format("delta").mode("overwrite").save(permissionsFinalLocation)
    println("Final Permissions table created")
} else {
    println("Final Permissions table exists already")
}

This section checks whether the final tables for Sites and Permissions exist. If a table does not exist, it creates an empty table (schema only) from the staging DataFrame by filtering out all data (filter("1=2")).

Section 5 – Merge Sites Data from Staging Table to Final Table

import io.delta.tables._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{coalesce, lit, sum, col, _}
import org.apache.spark.sql.types.{StructField, _}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.storage.StorageLevel

val deltaTableSource = DeltaTable.forPath(spark, sitesStagingLocation)
val deltaTableTarget = DeltaTable.forPath(spark, sitesFinalLocation)

import spark.implicits._
val dfSource = deltaTableSource.toDF

// Delete records that have Operation as Deleted
println("Merging Sites dataset from current staging table")
deltaTableTarget
    .as("target")
    .merge(
        dfSource.as("source"),
        "source.Id = target.Id")
    .whenMatched("source.Operation = 'Deleted'")
    .delete()
    .whenMatched("source.Operation != 'Deleted'")
    .updateAll()
    .whenNotMatched("source.Operation != 'Deleted'")
    .insertAll()
    .execute()
println("Merging of Sites dataset completed")

This section performs a Delta Lake merge (upsert) operation on the Sites data. The merge logic deletes matching records when the source's Operation is 'Deleted', updates other matching records, and inserts new records that are not marked as 'Deleted'.

Section 6 – Merge Permissions Data from Staging Table to Final Table

import io.delta.tables._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{coalesce, lit, sum, col, _}
import org.apache.spark.sql.types.{StructField, _}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.storage.StorageLevel

val deltaTablePermissionsSource = DeltaTable.forPath(spark, permissionsStagingLocation)
val deltaTablePermissionsTarget = DeltaTable.forPath(spark, permissionsFinalLocation)

import spark.implicits._
val dfPermissionsSource = deltaTablePermissionsSource.toDF

// Delete records that have Operation as Deleted
println("Merging Permissions dataset from current staging table")
deltaTablePermissionsTarget
    .as("target")
    .merge(
        dfPermissionsSource.as("source"),
        """source.SiteId = target.SiteId
           and source.ScopeId = target.ScopeId
           and source.LinkId = target.LinkId
           and source.RoleDefinition = target.RoleDefinition
           and coalesce(source.SharedWith_Name,"") = coalesce(target.SharedWith_Name,"")
           and coalesce(source.SharedWith_TypeV2,"") = coalesce(target.SharedWith_TypeV2,"")
           and coalesce(source.SharedWith_Email,"") = coalesce(target.SharedWith_Email,"")
           and coalesce(source.SharedWith_AADObjectId,"") = coalesce(target.SharedWith_AADObjectId,"")
        """)
    .whenMatched("source.Operation = 'Deleted'")
    .delete()
    .whenMatched("source.Operation != 'Deleted'")
    .updateAll()
    .whenNotMatched("source.Operation != 'Deleted'")
    .insertAll()
    .execute()
println("Merging of Permissions dataset completed")

This section performs a merge operation on the Permissions data.
The merge condition is more complex, comparing multiple columns (including handling nulls with coalesce) to identify matching records. The operation applies deletion for rows marked as 'Deleted', updates others, and inserts records where no match exists. (A Spark SQL equivalent of this merge logic is sketched at the end of this article.)

Section 7 – Read and Display Sample TOP 10 Rows

var sqlQuery = s"SELECT * FROM ${lakehouseName}.${sitesFinalTableName} order by SnapshotDate DESC LIMIT 10"
val dfSitesAll = spark.sql(sqlQuery)
display(dfSitesAll)

sqlQuery = s"SELECT * FROM ${lakehouseName}.${permissionsFinalTableName} order by SnapshotDate DESC LIMIT 10"
val dfPermissionsAll = spark.sql(sqlQuery)
display(dfPermissionsAll)

This final section executes SQL queries to retrieve and display the top 10 rows from both the Sites and Permissions final tables. The rows are ordered by SnapshotDate in descending order. This is typically used for sampling or debugging purposes.

Conclusion

I hope this article helped you understand the notebooks included in the template, which might help you customize them later. These templates are intended as starting points that you can adapt to many scenarios. Read more about MGDC for SharePoint at https://aka.ms/SharePointData.
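For reference, the Delta merges in Sections 5 and 6 of Notebook 2 can also be expressed in Spark SQL. The sketch below is my own simplified equivalent for the Sites case only (the published notebook uses the Scala DeltaTable API, not this statement); the table names assume the Sites_Staging and Sites tables described above.

-- Spark SQL (Delta Lake) equivalent of the Sites merge: delete, update or insert based on Operation.
MERGE INTO Sites AS target
USING Sites_Staging AS source
    ON source.Id = target.Id
WHEN MATCHED AND source.Operation = 'Deleted' THEN DELETE
WHEN MATCHED AND source.Operation != 'Deleted' THEN UPDATE SET *   -- like updateAll()
WHEN NOT MATCHED AND source.Operation != 'Deleted' THEN INSERT *   -- like insertAll()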
MGDC for SharePoint FAQ: How to flatten datasets for SQL or Fabric

When you get your data from Microsoft Graph Data Connect (MGDC), you will typically get that data as a collection of JSON objects in an Azure Data Lake Storage (ADLS) Gen2 storage account. For those handling large datasets, it might be useful to move the data to a SQL Server or to OneLake (lakehouse). In those cases, you might need to flatten the datasets. This post describes how to do that. If you're not familiar with MGDC for SharePoint, start with https://aka.ms/SharePointData.

1. Flattening

Most of the MGDC for SharePoint datasets come with nested objects. That means that a certain object has other objects inside it. For instance, if you have a SharePoint Groups object, it might have multiple Group Members inside. If you have a SharePoint Permissions object, you could have many Permissions Recipients (also known as Sharees). For each SharePoint File object, you will have a single Author object inside.

When you convert the datasets from JSON to other formats, these other formats may require (or perform better with) a structure that has no objects nested inside other objects. To overcome that, you can turn those child objects into properties of the parent object. For instance, instead of having the File object with an Author object inside, you can have multiple author-related columns, such as Author.Name and Author.Email, as properties of the flattened File object.

2. Nested Objects

You can get the full list of SharePoint datasets in MGDC at https://aka.ms/SharePointDatasets. Here is a table with a list of objects and their nested objects:

Object      | How many?        | Primary Key                                | Nested Object        | How many?               | Add to Primary Key
------------|------------------|--------------------------------------------|----------------------|-------------------------|-----------------------------------
Sites       | 1 per Site       | Id                                         | RootWeb              | 1 per Site              |
Sites       | 1 per Site       | Id                                         | StorageMetrics       | 1 per Site              |
Sites       | 1 per Site       | Id                                         | SensitivityLabelInfo | 1 per Site              |
Sites       | 1 per Site       | Id                                         | Owner                | 1 per Site              |
Sites       | 1 per Site       | Id                                         | SecondaryContact     | 1 per Site              |
Groups      | 1 per Group      | SiteId + GroupId                           | Owner                | 1 per Group             |
Groups      | 1 per Group      | SiteId + GroupId                           | Members              | 1 per Member            | COALESCE(AADObjectId, Email, Name)
Permissions | 1 per Permission | SiteId + ScopeId + RoleDefinition + LinkId | SharedWithCount      | 1 per Recipient Type    | Type
Permissions | 1 per Permission | SiteId + ScopeId + RoleDefinition + LinkId | SharedWith           | 1 per Recipient (Sharee)| COALESCE(AADObjectId, Email, Name)
Files       | 1 per File       | SiteId + WebId + ListId + ItemId           | Author               | 1 per File              |
Files       | 1 per File       | SiteId + WebId + ListId + ItemId           | ModifiedBy           | 1 per File              |

When you flatten a dataset and there is an object with multiple objects inside (like Group Members or Permission Recipients), the number of rows will increase. You also need to add to the primary key to keep it unique. Also note that the File Actions, Sync Health and Sync Errors datasets do not have any nested objects.

3. One Object per Parent

When the nested object has only one instance, things are simple. As we described for the Author nested object inside the File object, you promote the properties of the nested object to be properties of the parent object. This is because the Author is defined as the user that initially created the file. There is always one and only one Author.

This can even happen multiple times for the same object. The File also has a ModifiedBy property. That is the single user that last changed the file. In that case, there is also only one ModifiedBy per File. The Site object also includes several properties in this style, like RootWeb, StorageMetrics, SensitivityLabelInfo, Owner and SecondaryContact. Note that, in the context of the Site object, there is only one owner.
Actually, there are two, but the second one is tracked in a separate nested object called SecondaryContact, which is effectively the secondary owner.

4. Multiple Objects per Parent

The SharePoint Permissions dataset has a special condition that might create trouble for flattening. There are two sets of nested objects with multiple objects each: SharedWith and SharedWithCount. SharedWith has the list of Recipients and SharedWithCount has a list of Recipient Types. If you just let the tools flatten it, you will end up with a cross join of the two.

As an example, if you have 10 recipients in an object and 2 types of recipients (internal users and external users, for instance), you will end up with 20 objects in the flattened dataset instead of the expected 10 objects (one per recipient). To avoid this, in this specific condition, I would recommend just excluding the SharedWithCount column from the object before flattening (see the sketch at the end of this post).

5. Conclusion

I hope this clarifies how you can flatten the MGDC for SharePoint datasets, particularly the SharePoint Permissions dataset. For further details about MGDC for SharePoint, see https://aka.ms/SharePointData.
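To make the flattening concrete, here is a small T-SQL sketch I added (it is not part of the original FAQ). It uses OPENJSON to flatten a tiny, made-up Permissions payload, expanding the SharedWith array while ignoring SharedWithCount to avoid the cross join described in section 4. The property names are illustrative and follow the flattening conventions discussed above.

-- Flattening a sample Permissions JSON payload with OPENJSON (SQL Server 2016+).
-- SharedWithCount is deliberately not expanded, so each recipient produces exactly one row.
DECLARE @json NVARCHAR(MAX) = N'[
  {"SiteId":"s1","ScopeId":"sc1","RoleDefinition":"Read","LinkId":null,
   "SharedWith":[{"Name":"User A","Email":"a@contoso.com"},{"Name":"User B","Email":"b@fabrikam.com"}],
   "SharedWithCount":[{"Type":"Internal","Count":1},{"Type":"External","Count":1}]}
]';

SELECT
    p.SiteId,
    p.ScopeId,
    p.RoleDefinition,
    p.LinkId,
    sw.SharedWith_Name,
    sw.SharedWith_Email
FROM OPENJSON(@json)
WITH (
    SiteId         NVARCHAR(255) '$.SiteId',
    ScopeId        NVARCHAR(255) '$.ScopeId',
    RoleDefinition NVARCHAR(255) '$.RoleDefinition',
    LinkId         NVARCHAR(255) '$.LinkId',
    SharedWith     NVARCHAR(MAX) '$.SharedWith' AS JSON   -- keep the nested array as JSON text
) AS p
CROSS APPLY OPENJSON(p.SharedWith)
WITH (
    SharedWith_Name  NVARCHAR(255) '$.Name',
    SharedWith_Email NVARCHAR(320) '$.Email'
) AS sw;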