SharePoint: Report to fetch content older than X amount of time


Hello all,

 

My M365 tenant has a content size of 1.7 PB with close to 90,000 site collections. My requirement is to fetch a report of content older than 3 years (it should crawl through all document libraries in all sites). The report should include document type, filename, size, created, created by, modified, and modified by details.

 

I tried to use ShareGate custom reports for this purpose, but given the number of sites and the data size, the program becomes unresponsive a few minutes after the task kicks off. I also tried SharePoint PowerShell, but no luck.

 

Is there any other way that this report can be generated? Any help is greatly appreciated!

7 Replies
You should be able to use the Compliance Center and do a content search with conditions set for dates. Then you can export the "Results" Excel file with all the information. You will need a Global admin or an elevated role for this, obviously, but that's the only easy option I know of.

@ellan1537 

Sounds like a nice challenge.

I would first try the Content search in the Compliance Center, as Chris Webb suggested. It will be interesting to see whether it can handle such a large data volume. Your tenant may hold several hundred million documents, and depending on how it is used, you may get tens of millions of "stale" documents.
Hopefully there are no "blind spots" in the search index where part of the content has not been indexed.

If the above OOTB method does not work you will need to look at alternative approaches.

We use a Node.js application to update metadata for SharePoint documents. It loops over all sites and libraries, selects documents using a CAML query, downloads the documents, extracts the properties (keyword, created date within the document, sent date of an email, ...), and then updates the SharePoint column(s). The key challenges you face are:
- use credentials that have access to all sites
- data volume: you will need to be able to scale out (multiple threads, multiple systems)
- handle list view threshold
- handle throttling
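On the throttling point: SharePoint Online responds with HTTP 429 (or 503) and a Retry-After header when you exceed its limits, and any long-running crawl must honor that. A minimal retry sketch in Node.js, assuming `fetchFn` is a placeholder for whatever HTTP client call you actually make:

```javascript
// Minimal retry-with-backoff sketch for throttled HTTP calls.
// `fetchFn` is any async function returning { status, headers, ... }.
// On 429/503 we wait for Retry-After seconds (or an exponential
// fallback) and retry, up to `maxRetries` times.
async function withThrottleRetry(fetchFn, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetchFn();
    if (res.status !== 429 && res.status !== 503) return res;
    if (attempt === maxRetries) throw new Error("Still throttled after retries");
    const header = res.headers && res.headers["retry-after"];
    const waitSec = header != null ? Number(header) : 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, waitSec * 1000));
  }
}
```

The key design choice is to always prefer the server-supplied Retry-After value; ignoring it tends to get the crawl throttled harder.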

Your case is a bit simpler than ours: you can stop right after executing the CAML query.
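To make the CAML part concrete, here is a sketch of building the date-filtered query. The `Modified` field and the ISO date format are from the standard SharePoint list schema; the row limit of 2,000 is an illustrative choice to stay under the 5,000-item list view threshold:

```javascript
// Build a CAML query selecting items whose Modified date is older
// than `years` years. RowLimit keeps each page under the list view
// threshold; combine with ListItemCollectionPosition for paging.
function buildStaleItemsCaml(years, rowLimit = 2000) {
  const cutoff = new Date();
  cutoff.setFullYear(cutoff.getFullYear() - years);
  const iso = cutoff.toISOString();
  return (
    "<View Scope='RecursiveAll'><Query><Where>" +
    "<Lt><FieldRef Name='Modified'/>" +
    `<Value Type='DateTime' IncludeTimeValue='TRUE'>${iso}</Value>` +
    "</Lt></Where>" +
    "<OrderBy><FieldRef Name='ID'/></OrderBy>" +
    "</Query>" +
    `<RowLimit Paged='TRUE'>${rowLimit}</RowLimit></View>`
  );
}
```

Ordering by the indexed ID field matters: paging a large library on an unindexed column is exactly where the list view threshold bites.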

 

In short, use the OOTB features from the Compliance Center. If that does not work, estimate the effort to perform the task successfully and whether it can be justified. The most likely drivers are compliance or storage costs.

Good luck!

Hi ellan1537,

My thought here is to use the Search API to get this info, but not to execute it against everything at once. Run the script so that it processes at most one site in memory at a time.
Using the Search API gives you more control than using Content search from the Compliance Center, but it requires more knowledge of the API.
It has pros and cons. I personally think it is great: it can be used from many different clients (PowerShell, CSOM, Node.js, ...), but it assumes the content has been indexed with no blind spots, and the user must have at least read access.
Also make sure to iterate over the search results, because they are only returned in sets of up to 500 items, and make sure TrimDuplicates is set to false.
Search will only return the latest version of each document; earlier versions are not exposed.
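To make the paging concrete, here is a minimal sketch of the loop against the search endpoint (/_api/search/query). The `executeQuery` function is a placeholder assumption for whatever HTTP client you use; a KQL query such as `LastModifiedTime<2021-01-01` would supply the age filter:

```javascript
// Page through SharePoint search results 500 rows at a time.
// `executeQuery({ querytext, startrow, rowlimit, trimduplicates })`
// is assumed to call /_api/search/query and resolve to
// { rows: [...], totalRows: n }.
async function collectAllResults(executeQuery, querytext) {
  const rowlimit = 500; // search returns at most 500 rows per call
  let startrow = 0;
  const all = [];
  while (true) {
    const page = await executeQuery({
      querytext,
      startrow,
      rowlimit,
      trimduplicates: false, // otherwise near-duplicate files are dropped
    });
    all.push(...page.rows);
    startrow += page.rows.length;
    if (page.rows.length < rowlimit || startrow >= page.totalRows) break;
  }
  return all;
}
```

For a 1.7 PB tenant you would not accumulate everything in one array as this sketch does; stream each page to disk (e.g. append to a CSV) instead.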

@ellan1537 

Were you able to run a successful report? I am looking for a similar report to demonstrate the impact of setting up a retention policy that will delete files older than 5 years, and it is challenging to get data for the whole tenant. Reading through the suggestions for the Compliance Center (now Purview), the audit only allows a 180-day report range, so that's already a big limitation. And I don't see anything specific to a content last-modified date. I would love to hear if you were able to get the data successfully.

There is no direct way of fetching this report. I'm still investigating a solution.

We have a 10-year retention policy and would like to warn users with an alert when retention is met, and/or pull a list of what's about to be deleted soon so we can warn them.

 

Some files have already started being deleted by the system account as their retention has been met.