1. Overview
When gathering SharePoint data through Microsoft Graph Data Connect in a large tenant, you might be pulling many thousands, millions or even billions of objects. If you want to estimate the number of objects in the results, there is a mechanism for that.
You can use the option to sample a dataset, which returns a limited number of objects plus a count of all objects in the dataset. For instance, you can pull just the top 100 objects to get started with a large tenant. This way you can have some examples of the objects without having to pull the entire dataset. This will also allow you to know the total count of objects without pulling all the data.
In this blog, we’ll investigate how to sample a dataset in Microsoft Graph Data Connect for SharePoint.
2. What is sampling?
When you request data from the Microsoft Graph Data Connect for SharePoint, you get the full dataset. For instance, if you request the Sites dataset, you get all the sites in the tenant. This is described in the yellow paths below.
If you provide a "top x" parameter in the request, you can get a sample dataset. You could, for instance, request a "top 1000" to get only up to 1000 objects from that dataset. This is shown by the green paths above.
In short, sampling delivers a partial dataset based on the top number you specified in the request.
The main uses for filtering include:
- Reducing the size of the transfers when you're just inspecting the schema.
- Find out the total number of objects in the dataset without pulling all the data.
Sampling is available only in these specific SharePoint datasets:
- SharePoint Sites
- SharePoint Permissions
- SharePoint Group
- SharePoint Files (coming soon)
- SharePoint File Actions (coming soon)
3. Including the "top x" parameter in the request
To request a sample in your request, you must specify a "top" property in the JSON representation of the request. Start by selecting the Copy Data task you are using and going to the "Source" tab of the Copy Task properties.
Click on the “{ }” button on the top right (indicated above with a red arrow), which brings up the JSON definition for the request.
In the JSON definition, find the "source" section under "typeProperties", where you can find the "dateFilterColumn", the "startTime" and the "endTime" properties. Add a new "AdditionalDataSetProperties" with a “top” property with the number of rows you want to sample. Save the changes to the JSON and make sure to publish the pipeline to apply the changes.
In the screenshot above, you see the definition indicated by the arrow:
>> activity >> typeProperties >> source >> AdditionalDataSetProperties >> top >> 10
Note: After you save the JSON config, the properties might be ordered differently and the "AdditionalDataSetProperties" might show further down the document, after the list of output columns.
Be very careful. If you specify the filtering expression in the wrong place in the JSON, the request may fail, or it could just run without filtering anything from the dataset. If you’re running this in a large production tenant, it is recommended to try this first in a small test or dev tenant.
4. More Request Metadata
As a result of this extra property in your request, the results will be sampled, and you will see fewer rows/objects in your output. You can also look at the metadata file that is produced with any Microsoft Graph Data Connect run to find the number of objects. Look for a folder called “metadata” in the same place you assigned to receive the data (your Azure account container and folder).
In the job metadata file associated with this request, you will see a few additional properties:
- top – Shows the size of the sampling that was requested.
- NumberOfRowsExtracted – Shows the number of rows returned with this request, after the sampling was applied. This number could be the same as the "top" value or smaller.
- TotalRowCount – Shows the total number of rows in the dataset.
Here’s a sample of a job metadata file:
{
"CopyActivityId": "00000000-0000-0000-0000-000000000000",
"JobSubmissionTime": "2024-02-29T20:40:09Z",
"JobCompletionTime": "2024-02-29T20:45:49Z",
"RequestStartDate": "2024-02-24T00:00:00Z",
"RequestEndDate": "2024-02-24T00:00:00Z",
"ColumnsRequested": "ptenant,Id,Url,RootWeb,WebCount,StorageQuota,StorageUsed,StorageMetrics,GroupId,GeoLocation,IsInRecycleBin,IsTeamsConnectedSite,IsTeamsChannelSite,TeamsChannelType,IsHubSite,HubSiteId,BlockAccessFromUnmanagedDevices,BlockDownloadOfAllFilesOnUnmanagedDevices,BlockDownloadOfViewableFilesOnUnmanagedDevices,ShareByEmailEnabled,ShareByLinkEnabled,SensitivityLabelInfo,Classification,IBMode,IBSegments,Owner,SecondaryContact,ReadLocked,ReadOnly,CreatedTime,LastSecurityModifiedDate,Operation,SnapshotDate",
"ExtractionMode": "Full",
"IsFilterApplied": false,
"Filter": "",
"NumberOfRowsExtracted": 10,
"TableName": "BasicDataSet_v0.SharePointSites_v1",
"ApplicationId": "00000000-0000-0000-0000-000000000000",
"OfficeGeo": "NAM",
"DataFactoryName": "mgdc-synapse",
"top": 10,
"TotalRowCount": 128
}
5. Combining Top with Deltas
It is possible to combine the sampling and the delta features of the SharePoint datasets in Microsoft Graph Data Connect.
In that case, when you ask for sampling, the differences between the two dates will be calculated first and then the dataset will be sampled. The total count you see in the metadata file will be the total included in the delta.
You cannot combine sampling and filtering at the source in the same request.
For more details about Delta datasets, see this blog post: MGDC for SharePoint FAQ: How can I use Delta State Datasets?
6. Top Errors
If you ask for sampling and there is a problem with it, Microsoft Graph Data Connect for SharePoint will fail the request and no data will be returned. An error message will be returned so you can understand what happened.
Here are some of the common error conditions:
Cannot filter and sample at the same time. If you specify both a DataFilter and the Top properties, you will get an error like this:
Operation on target Copy_Sites failed:
ErrorCode=UserErrorOffice365DataLoaderError,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Office365 data loading failed to execute.
office365LoadErrorType: PermanentError.
Not Supported: Both DataFilter and AdditionalDataSetProperties with 'top' can not be specified at the same time,
Source=Microsoft.DataTransfer.ClientLibrary,'
7. Conclusion
I hope this blog post will help you get started with Sampling. You now also have a mechanism to find out the number of rows in a dataset without pulling the entire dataset.
For more information about Microsoft Graph Data Connect for SharePoint, please visit the collection of links I keep at Links about SharePoint on Microsoft Graph Data Connect.