MGDC for SharePoint FAQ: How can I use Delta State Datasets?
Published Mar 07 2024 12:23 PM 1,266 Views
Microsoft

1. Overview

 

When gathering SharePoint data through Microsoft Graph Data Connect in a large tenant, you might be pulling millions of objects or even hundreds of millions. Also, if you want to always have the latest information available, you will likely run your pipeline daily.

 

In this blog, we’ll investigate how you can use “Delta State Datasets”, a feature that helps you optimize recurring data transfers for SharePoint State datasets to tenants by tracking what has changed since the last data transfer. You can keep up to date with the latest changes and pull a smaller number of objects.

 

2. State Datasets

 

The SharePoint datasets for Data Connect include several State datasets, including Sites, Groups, Sharing Permissions and Files. These datasets show the current situation (state) for these objects. Take the sites dataset, for instance. It includes one object for each site in the tenant, regardless of when it was created or modified.

 

With the regular “full pull” mechanism, if you want to update your tenant’s Sites dataset daily, you will need to get a complete Sites dataset every day. Since most of the sites don’t change so frequently, a lot of what you will get daily will be the same as the day before. This is a waste of resources, since you will need to download, store, and process a lot of repeated data. Also, because you pay by the number of objects pulled, this is also going to reflect in your Azure bill.

 

3. Delta State Datasets

 

The Delta feature offers a mechanism to pull a partial dataset that includes just the objects that changed since a specific date. Using Sites as an example again, a Delta State Dataset would include only the sites that have been created, modified, or deleted between two dates that you specify.

Delta State DatasetsDelta State Datasets

 

Early observations from one of the preview tenants showed that less than 10% of their sites changed daily (this number will vary widely, depending on how busy your tenant is). This means you have the potential to reduce the number of objects transferred daily by an order of magnitude while keeping a complete and up-to-date record of all sites you have in SharePoint.

This will require you to process this Delta State Dataset to update the last full copy of the State dataset in question. You would combine the data in your previous State dataset with the data in the Delta results to produce an updated full State dataset.

 

The advantages of using Delta State Datasets include:

  • Fewer objects transferred daily, resulting in more optimal use of resources.
  • Smaller Azure bill for tenants that want to keep a State Dataset updated frequently.

The disadvantages of using Delta State Datasets include:

  • More complex workflow to keep a State Dataset updated daily.
  • Increased processing (compute) to update a previous State Dataset with a Delta State Dataset.

Note: This Delta feature is only offered with the SharePoint datasets for Sites, Groups and Sharing Permissions (these are all State Datasets).

 

4. Delta State Datasets and Usage

 

Delta State Datasets look like they provide detailed usage data (all actions performed in each object), but that is not correct. They just show the difference between the state for the two dates you specify.

 

For example, imagine a Sites dataset where you pulled the state for 100 sites on 01/01. A Delta State Dataset from 01/01 to 01/03 will have a single object for every site that changed between the two dates. If 5 sites were added, 10 were modified and 2 were deleted, the Delta State Dataset will have 17 objects. After processing, the updated state for 01/03 would include 103 sites.

 

You would not have all the usage details between 01/01 and 01/03. Keep in mind that a single site could have more than one action performed during that period. A site that was updated 20 times between 01/01 and 10/03 will have only 1 object in the Delta State Dataset. It is also possible that a site was created and modified during that day and the Delta will show only that the site was created, showing the latest update.

 

All Data Connect dataset pulls allow you to select which columns you want to include. Keep in mind that the Delta State Dataset will only look at changes to columns that you requested. If you exclude the site owner column from your pull and the site owner is the only thing that changed in the site between those dates, that site will not show up in the Delta Dataset.

 

5. Requesting a Delta Dataset

 

Requesting a delta dataset is simple. Just make sure that start date and end date for the SnapshotDate filter are different. Also make sure that the start date is before the end date. You will receive objects that were created, updated or deleted in that time period. For each object, the “Operation” column will show as “Created”, “Updated” or “Deleted”. If no objects are returned, there were no changes during this period. 

 

Here’s a screenshot of a simple pipeline, with the date filter highlighted:

 

Requesting a Delta DatasetRequesting a Delta Dataset

 

To request a full dataset, you just need to make sure that the start date and end date are the same. You will receive all objects from SharePoint as they existed on that date, regardless of when they were created, updated or deleted. For all objects in a full pull, the “Operation” column will show as “Full”.

 

The “Operation” column is defined in the schema as “Extraction mode of this row. Gives info about row extracted with full mode (‘Full’) or delta mode (‘Created’, ‘Updated’ or ‘Deleted’)”. For a delta pull, you must include the “Operation” column in your output. For a full pull, the “Operation” column can be excluded, since it will always show the same value.

 

Here’s an example of the Groups dataset output columns, as shown in a Synapse pipeline, with the “Operation” column highlighted:

 

The Operation ColumnThe Operation Column

 

Note: If you don’t see the “Operation” column in your dataset, that means that:

  • The dataset is not a SharePoint dataset.
  • The dataset is not a State dataset..

 

6. Frequency of Delta Datasets

 

You must always start with an initial “full pull” and then do “delta pulls” after that. You can do deltas at different frequencies, including daily, every other day, weekly or every other week.

You can’t do monthly deltas because SharePoint datasets in Data Connect don’t go back that far. The dates you specify must be between 23 days ago and 2 days ago. For details, see this blog: MGDC for SharePoint FAQ: Which date should I query?

 

SharePoint datasets on Data Connect are intended for analytics and are only updated daily, so you can’t do delta pulls more frequently than once a day. There is no way to do delta pulls hourly or multiple times a day. In fact, if you specify a time in your start date or end date with a SharePoint dataset, that information will be ignored and only the date portion will be used.

 

7. Merging Delta Datasets

 

After you pulled a delta dataset, you need to combine it with a previous full to create a new, up-to-date full dataset.

 

You basically need to loop through all the objects in the delta and perform the required operation in the full dataset:

  • Created: You need to add this record from the delta to the full
  • Updated: You need to replace this record from the delta in the full
  • Deleted: You need to remove this record from the delta in the full

 

For the Updated and Deleted objects, you will need a unique identifier (one or more columns) to find the exact existing object in your full dataset. Here is what you need for each dataset.

  • Sites: Id
  • Groups: SiteId + GroupId
  • Sharing Permissions: SiteId + ScopeId + RoleDefintion + LinkId

 

This will require some coding, but it is common pattern. If you use Synapse, SQL or similar tools, you can transform an entire dataset at once. For more details on how to process the delta datasets, see this blog: SharePoint on MGDC FAQ: How do I process Deltas?

 

8. Summary

 

In summary, you can use Delta State Datasets for SharePoint in Data Connect to reduce the number of objects you need to pull to keep an up-to-date State Dataset. This feature will help you reduce your use of resources while keeping the latest information always available. I hope this blog post will help you get started with Deltas.

 

For more information about SharePoint Data in Data Connect, please visit the collection of links I keep at Links about SharePoint on MGDC.

Co-Authors
Version history
Last update:
‎Mar 07 2024 03:55 PM
Updated by: