Jul 24 2023 08:04 AM
Hello all,
My M365 tenant holds about 1.7 PB of content across close to 90,000 site collections. My requirement is to produce a report of all content older than 3 years (it should crawl through every document library in every site). The report should include document type, filename, size, created, created by, modified and modified by details.
I tried to use ShareGate for this with custom reports, but given the number of sites and the data size, the program goes into a non-responding state a few minutes after the task kicks off. I also tried SharePoint PowerShell, but no luck.
Is there any other way that this report can be generated? Any help is greatly appreciated!
Jul 24 2023 09:58 AM
Jul 25 2023 01:09 AM
Sounds like a nice challenge.
I would first try to use the Content search in the Compliance Center as Chris Webb suggested. It will be interesting to see whether it can handle such a large data volume. Your tenant may hold several hundred million documents and, depending on how it is used, you may end up with tens of millions of "stale" documents.
Hopefully there are no "blind spots" in the search index where part of the content has not been indexed.
If the above OOTB method does not work you will need to look at alternative approaches.
We use a node.js application to update metadata for SharePoint documents. It loops over all sites and libraries, selects documents using a CAML query, downloads the documents, extracts the properties (keywords, the created date stored inside the document, the sent date of an email, ...) and then updates the SharePoint column(s). The key challenges you face are:
- use credentials that have access to all sites
- data volume: you will need to be able to scale out (multiple threads, multiple systems)
- handle list view threshold
- handle throttling
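
To make the throttling point concrete, here is a minimal Node.js sketch, not our production code. It assumes SharePoint Online signals throttling with HTTP 429 or 503 and usually sends a Retry-After header in seconds. The helper names (retryDelayMs, callWithRetry) and the limits (5 retries, 60 s cap) are made up for illustration.

```javascript
// Hypothetical helpers for illustration; not part of any SharePoint SDK.

// Decide how long to wait before retrying a request.
// Returns a delay in milliseconds, or null when no retry should happen
// (request was not throttled, or the retry budget is used up).
function retryDelayMs(status, retryAfter, attempt, maxAttempts = 5) {
  const throttled = status === 429 || status === 503;
  if (!throttled || attempt >= maxAttempts) return null;

  // Prefer the server's hint when it is a plain number of seconds.
  const seconds = Number.parseInt(retryAfter, 10);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;

  // No usable header: exponential backoff (1 s, 2 s, 4 s, ...), capped at 60 s.
  return Math.min(60000, 1000 * 2 ** attempt);
}

// Wrap any request function that resolves to a fetch-style response
// ({ status, headers.get(name) }). Once retries are exhausted, the last
// response is returned as-is, so the caller still has to check its status.
async function callWithRetry(send, sleep = (ms) => new Promise((r) => setTimeout(r, ms))) {
  for (let attempt = 0; ; attempt++) {
    const res = await send();
    const delay = retryDelayMs(res.status, res.headers.get("retry-after"), attempt);
    if (delay === null) return res;
    await sleep(delay);
  }
}
```

Each worker thread or machine would wrap its SharePoint calls in something like callWithRetry, so that scaling out does not simply multiply the number of throttled requests.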
Your case is a bit simpler than our case: you can already stop after executing the CAML query.
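
As a rough illustration of that CAML step, the sketch below builds a paged query for files last modified before a cutoff date. It assumes "older than 3 years" refers to the Modified date, and it uses the standard internal field names for the columns you listed. The function names and the row limit of 2000 are my own choices. Keep in mind that on a library above the list view threshold, a filter on a non-indexed column may be rejected; in that case you would have to page through the items ordered by ID and apply the date filter in your own code.

```javascript
// Hypothetical helpers for illustration only.

// Cutoff instant: `years` years before `now`, computed in UTC.
function cutoffDate(now, years = 3) {
  const d = new Date(now.getTime());
  d.setUTCFullYear(d.getUTCFullYear() - years);
  return d;
}

// Build the CAML view: files only (FSObjType = 0), in all folders,
// modified before the cutoff, returned in pages of `rowLimit` items.
function buildStaleFilesCaml(cutoff, rowLimit = 2000) {
  // CAML expects an ISO 8601 value; drop the milliseconds.
  const iso = cutoff.toISOString().replace(/\.\d{3}Z$/, "Z");

  // Internal names for: filename, type, size, created, created by,
  // modified, modified by.
  const fields = [
    "FileLeafRef", "File_x0020_Type", "File_x0020_Size",
    "Created", "Author", "Modified", "Editor",
  ];
  const viewFields = fields.map((f) => `<FieldRef Name="${f}" />`).join("");

  return (
    `<View Scope="RecursiveAll">` +
      `<Query><Where><And>` +
        `<Eq><FieldRef Name="FSObjType" /><Value Type="Integer">0</Value></Eq>` +
        // StorageTZ="TRUE" is intended to mark the value as UTC (assumption to verify).
        `<Lt><FieldRef Name="Modified" />` +
          `<Value Type="DateTime" IncludeTimeValue="TRUE" StorageTZ="TRUE">${iso}</Value></Lt>` +
      `</And></Where></Query>` +
      `<ViewFields>${viewFields}</ViewFields>` +
      `<RowLimit Paged="TRUE">${rowLimit}</RowLimit>` +
    `</View>`
  );
}

const caml = buildStaleFilesCaml(cutoffDate(new Date()));
console.log(caml);
```

The resulting string would be sent as the view XML of a list query for each library, following the paging position until no more pages come back, and each returned row written straight to the report.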
In short, use the OOTB features of the Compliance Center. If that does not work, estimate the effort needed to complete the task successfully and decide whether it can be justified. The most likely drivers are compliance or storage costs.
Good luck!
Jul 25 2023 03:53 AM
Jul 25 2023 04:36 AM