1. Overview
When gathering SharePoint data through Microsoft Graph Data Connect in a large tenant, you might be pulling many thousands, millions or even billions of objects. If you want to get just a subset of the results, there is a mechanism to filter the results at the source.
A common example is to pull the data just for one site (filtering for the site id) to get started with a large tenant. This way your results will be smaller, cheaper, and easier to handle. You can also use this to get just specific types of sites, like OneDrive or Communication sites.
In this blog, we’ll investigate how you can apply a filter to a dataset in Microsoft Graph Data Connect for SharePoint.
2. What is filtering?
When you request data from the Microsoft Graph Data Connect for SharePoint, you get the full dataset. For instance, if you request the Sites dataset, you get all the sites in the tenant. This is described in the yellow paths below.
If you provide a filter with the request, you can get a partial dataset based on a filter expression. You could, for instance, request all sites with a template id of 21 (that means a OneDrive). This is shown by the green paths above.
In short, filtering delivers a partial dataset based on a filter expression specified in the request. Only objects that meet the criteria are delivered.
The main uses for filtering include:
- Reducing the size of the transfers
- Excluding sensitive data from the results
Filtering is available only in these specific SharePoint datasets:
- SharePoint Sites
- SharePoint Permissions
- SharePoint Group
- SharePoint Files (coming soon)
- SharePoint File Actions
Note: Microsoft Graph Data Connect also offers an option to exclude specific columns from the results, which is another mechanism to protect sensitive data.
3. Datasets and Columns
This filtering feature applies only to the SharePoint datasets listed below. You can filter by Site Id in all SharePoint datasets. You can also filter by Template Id in the Sites, Files and File Actions datasets:
Dataset |
Site Id column |
Template Id column |
Sites |
Id |
RootWeb.WebTemplateId |
Permissions |
SiteId |
N/A |
Group |
SiteId |
N/A |
Files |
SiteId |
WebTemplateId |
File Actions |
SiteId |
WebTemplateId |
For WebTemplateId, refer to this blog post for details on the codes used:
https://barreto.home.blog/2023/04/17/sharepoint-on-mgdc-faq-is-onedrive-included/
If you're interested in filtering for additional SharePoint dataset columns, please let us know in the comments.
4. Expressions and Operators
Filters are specified using an expression. The expression usually includes a column (as described in the previous section), an operator and a constant.
The main operators supported include:
Type |
Operator |
Description |
Equality |
eq |
Equal |
Equality |
ne |
Not Equal |
Equality |
in |
is IN a list |
Relational |
lt |
Less Than |
Relational |
gt |
Greater Than |
Relational |
le |
Less than or Equal |
Relational |
ge |
Greater than or Equal |
Logical |
not |
Not |
Logical |
and |
And |
Logical |
or |
Or |
Here are a few sample expressions you could use:
Dataset |
Description |
Filtering Expression |
Sites |
Include only a specific site |
Id eq 'a123' |
Sites |
Include all sites except a specific site |
Id ne 'a123' |
Sites |
Include only a specific list of sites |
Id in ('a123', 'b456', 'c789') |
Sites |
Include everything but OneDrive sites |
RootWeb.WebTemplateId ne 21 |
Permissions |
Include only a specific site |
SiteId eq 'a123' |
The expression language is patterned after the Microsoft Graph filter query parameter. You can read more about it at https://learn.microsoft.com/en-us/graph/filter-query-parameter.
5. Including the Filter expression in the request - Synapse/ADF
To add a filter to your request, you must specify a DataFilter property in the JSON representation of the request. In Synapse or Azure Data Factory, start by selecting the Copy Data task you are using in your pipeline and going to the "Source" tab of the Copy Task properties.
Click on the “{ }” button on the top right (indicated above with a red arrow), which brings up the JSON definition for the request.
In the JSON definition, find the "source" section under "typeProperties", where you can find the "dateFilterColumn", the "startTime" and the "endTime" properties. Add a new “DataFilter” property with the expression you want to apply. Save the changes to the JSON and make sure to publish the pipeline to apply the changes.
In the screenshot above, you see the definition indicated by the red arrow:
>> activity >> typeProperties >> source >> DataFilter >> RootWeb.WebTemplateId eq 21
Be very careful. If you specify the filtering expression in the wrong place in the JSON, the request may fail, or it could just run without filtering anything from the dataset. If you’re running this in a large production tenant, it is recommended to try this first in a small test or dev tenant.
6. Filter expression in Microsoft Fabric
You have a similar mechanism to add the DataFilter property for your pipelines in Microsoft Fabric as well. When editing your pipeline, just find the "Edit JSON code" option in the "View" menu.
The JSON in Microsoft Fabric is similar to what you will find in Synapse and ADF. Just make sure to insert the DataFilter property in the right place.
7. More Request Metadata
As a result of this extra property in your request, the results will be filtered, and you will see fewer rows/objects in your output. Another way to see this is looking at the metadata file that is produced with any Microsoft Graph Data Connect run. Look for a folder called “metadata” in the same place you assigned to receive the data (your Azure account container and folder).
In the job metadata file associated with this request, you will see a few additional properties:
- IsFilterApplied – Shows true if the request included a filter expression.
- Filter – The expression passed by in the DataFilter property.
- NumberOfRowsExtracted – Shows the number of rows after the filter was applied.
Here’s a sample of a job metadata file:
{
"CopyActivityId": "00000000-0000-0000-0000-000000000000",
"JobSubmissionTime": "2024-01-10T21:27:18Z",
"JobCompletionTime": "2024-01-10T21:32:44Z",
"RequestStartDate": "2024-01-06T00:00:00Z",
"RequestEndDate": "2024-01-06T00:00:00Z",
"ColumnsRequested":"ptenant, Id, Url, RootWeb, WebCount, StorageQuota, StorageUsed, StorageMetrics, GroupId, GeoLocation, IsInRecycleBin, IsTeamsConnectedSite, IsTeamsChannelSite, TeamsChannelType, IsHubSite, HubSiteId, BlockAccessFromUnmanagedDevices, BlockDownloadOfAllFilesOnUnmanagedDevices, BlockDownloadOfViewableFilesOnUnmanagedDevices, ShareByEmailEnabled, ShareByLinkEnabled, SensitivityLabelInfo, Classification, IBMode, IBSegments, Owner, SecondaryContact, ReadLocked, ReadOnly, CreatedTime, LastSecurityModifiedDate, Operation, SnapshotDate",
"ExtractionMode": "Full",
"IsFilterApplied": true,
"Filter": "RootWeb.WebTemplateId eq 21",
"NumberOfRowsExtracted": 4,
"TableName": "BasicDataSet_v0.SharePointSites_v1",
"ApplicationId": "00000000-0000-0000-0000-000000000000",
"OfficeGeo": "NAM",
"DataFactoryName": "mgdc-synapse"
}
8. Combining Filters with Deltas
It is possible to combine the filtering and the delta features of the SharePoint datasets in Microsoft Graph Data Connect.
When you specify a filtering expression and two different dates, the dataset will be first filtered and then the differences between the two dates will be calculated.
For more details about Delta datasets, see this blog post:
MGDC for SharePoint FAQ: How can I use Delta State Datasets?
9. Filtering Errors
If you pass an expression for filtering and there is a problem with it, Microsoft Graph Data Connect for SharePoint will fail the request, and no data will be returned. An error message will be returned so you can understand what happened.
Here are some of the common error conditions:
- Cannot filter and sample at the same time. If you specify both a DataFilter and the Top properties, you will get an error like this:
Operation on target Copy_Sites failed:
ErrorCode=UserErrorOffice365DataLoaderError,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Office365 data loading failed to execute.
office365LoadErrorType: PermanentError.
Not Supported: Both DataFilter and AdditionalDataSetProperties with 'top' can not be specified at the same time,
Source=Microsoft.DataTransfer.ClientLibrary,'
- You specified a column name that is not one the columns supported for that dataset. If you specify a non-existent or unsupported column, you will get an error like this:
Operation on target Copy_Sites failed:
ErrorCode=UserErrorOffice365DataLoaderError,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Office365 data loading failed to execute.
office365LoadErrorType: PermanentError.
[1]: Invalid schema specified in filters.
Only few columns are supported for filters.
Supported Columns: ['Id', 'RootWeb.WebTemplateId'].
Usage Example: Id eq '00000000-0000-0000-0000-000000000000'
or RootWeb.WebTemplateId eq 21'
,Source=Microsoft.DataTransfer.ClientLibrary,'
- You specified a bad expression. If you specify a malformed expression in your DataFilter expression, you will get an error.
10. Conclusion
I hope this blog post will help you get started with Filtering. For more information about Microsoft Graph Data Connect for SharePoint, please visit the collection of links I keep at Links about SharePoint on Microsoft Graph Data Connect.