Forum Discussion
Pre-migration queries related to data discovery and file analysis
Hi Team,
A scenario involves migrating approximately 25 TB of data from on‑premises file shares to SharePoint. Before the migration, a discovery phase is required to understand the composition of the data. The goal is to identify file types (Microsoft Office documents, PDFs, images, etc.) without applying any labels at this stage. The discovery requirements include:
- Identification of file types
- Detection of duplicate or redundant files
- Identification of embedded UNC paths, macros, and document links
- Detection of applications running directly from file shares
Guidance is needed on which Microsoft Purview components—such as the on‑premises scanner or the Data Map—can support these discovery requirements. Clarification is also needed on whether Purview is capable of meeting all the above needs.
In particular, can Purview detect duplicate or redundant files, and if so, which module or capability enables this?
Additionally, since Purview allows downloading only up to 10,000 logs at a time, what would be the best approach to obtain discovery logs for a dataset of this size (25 TB)?
Thank you!
1 Reply
- Prathista Ilango
Microsoft
Hello pallavirajak,
I hope the details below help.
- Identification of file types - The Information Protection Scanner discovery report will have details about the files scanned. To know more about IP Scanner, refer to: Learn about the Microsoft Purview Information Protection scanner | Microsoft Learn
- Detection of duplicate or redundant files - Currently, there is no direct method within Purview to identify duplicate files.
- Identification of embedded UNC paths, macros, and document links - If you mean links or embedded paths mentioned inside documents, this can be approached by creating custom Sensitive Information Types (SITs) to detect UNC paths and hyperlinks. However, identifying macros specifically is not supported by Purview. Attack surface reduction rules in Microsoft Defender for Endpoint could help with this; see: Attack surface reduction rules reference - Microsoft Defender for Endpoint | Microsoft Learn
- Detection of applications running directly from file shares - This is not possible with Purview. For this requirement, Microsoft Defender for Endpoint offers better capabilities through Advanced Hunting queries.
Refer to: Overview - Advanced hunting - Microsoft Defender XDR | Microsoft Learn
Microsoft-365-Defender-Hunting-Queries/Discovery/SMB shares discovery.txt at master · microsoft/Microsoft-365-Defender-Hunting-Queries · GitHub
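On the duplicate-files point above: since Purview has no direct method, duplicates would need to be found outside it. Below is a minimal Python sketch, not a Purview feature, assuming the share is readable as a path from the machine running the script. The function names (`walk_files`, `inventory_extensions`, `find_duplicates`) are hypothetical. It counts files per extension and groups files with matching SHA-256 digests, hashing only files that share a size with another file, which should reduce reads on a dataset of this size.

```python
import hashlib
import os
from collections import Counter, defaultdict


def walk_files(root):
    """Yield the full path of every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def inventory_extensions(root):
    """Count files per lower-cased extension ('' for files without one)."""
    return Counter(os.path.splitext(p)[1].lower() for p in walk_files(root))


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of paths whose SHA-256 digests match."""
    by_size = defaultdict(list)
    for path in walk_files(root):
        try:
            by_size[os.path.getsize(path)].append(path)
        except OSError:
            continue  # unreadable or vanished file: skip it

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            try:
                by_hash[sha256_of(path)].append(path)
            except OSError:
                continue
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

For example, `inventory_extensions(share).most_common(20)` would list the twenty most frequent extensions, where `share` is whatever path the file share is mounted or addressed at. This only finds exact copies; near-duplicates (for example, the same document saved twice with different metadata) would not be grouped.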
For macros and detecting apps running from shares, Defender for Endpoint would provide more effective solutions than Purview.
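As a complement to the custom SIT and Defender routes, macro-enabled Office files and embedded UNC paths can also be spotted offline during discovery. The sketch below is an assumption-laden illustration in Python, not something Purview or Defender provides: it relies on the fact that modern Office files (.docx, .xlsx, .pptx and their macro-enabled variants) are ZIP archives and that a VBA project is stored in a part named `vbaProject.bin`. The function name `inspect_ooxml` and the deliberately simple UNC pattern are hypothetical and would likely need tuning. Legacy binary formats (.doc, .xls) are not ZIP archives and would raise `zipfile.BadZipFile`.

```python
import re
import zipfile

# Matches \\server\share only; illustrative pattern, expect to tune it.
UNC_PATTERN = re.compile(rb"\\\\[A-Za-z0-9._$-]+\\[A-Za-z0-9._$-]+")


def inspect_ooxml(path):
    """Return (has_macros, unc_paths) for a ZIP-based Office file."""
    has_macros = False
    unc_paths = set()
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            lower = name.lower()
            if lower.endswith("vbaproject.bin"):
                has_macros = True  # a VBA project part is present
            elif lower.endswith((".xml", ".rels")):
                # Scan XML and relationship parts for UNC-looking strings.
                data = archive.read(name)
                for match in UNC_PATTERN.findall(data):
                    unc_paths.add(match.decode("utf-8", errors="replace"))
    return has_macros, sorted(unc_paths)
```

The pattern here uses Python's `re` syntax; a regular expression intended for a custom SIT should be validated separately against Purview's own regex support.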
On the 10,000‑row export limit, I assume you're referring to portal exports (e.g., Activity Explorer). For IP Scanner discovery, reports are available locally on the scanner host. Refer to: https://learn.microsoft.com/en-us/purview/deploy-scanner-manage#run-a-discovery-cycle-and-view-reports-for-the-scanner
If a long discovery run stops mid‑way, address the root cause first, then resume/partition the scan and consolidate the per‑job CSVs.
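Consolidating the per-job CSVs could look roughly like the Python sketch below. It assumes the reports have been copied into a single folder and are comma-separated with a header row; it does not assume any particular scanner column names, so it simply takes the union of whatever headers it finds. The function name `consolidate_reports` and the added `source_report` column are hypothetical.

```python
import csv
import glob
import os


def consolidate_reports(report_dir, output_path, pattern="*.csv"):
    """Merge all CSV reports in report_dir into one file; return the row count."""
    # Exclude the output file itself in case it lives in report_dir.
    output_abs = os.path.abspath(output_path)
    sources = sorted(
        p for p in glob.glob(os.path.join(report_dir, pattern))
        if os.path.abspath(p) != output_abs
    )

    # First pass: collect the union of column names in first-seen order.
    fieldnames = ["source_report"]
    for path in sources:
        with open(path, newline="", encoding="utf-8-sig") as handle:
            for name in csv.DictReader(handle).fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)

    # Second pass: stream rows into the combined file without loading it all.
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(
            out, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        for path in sources:
            with open(path, newline="", encoding="utf-8-sig") as handle:
                for row in csv.DictReader(handle):
                    row["source_report"] = os.path.basename(path)
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Tagging each row with its source report keeps it possible to trace a finding back to the scan job that produced it. If the reports use a different encoding or delimiter, the `encoding` argument and the `csv` reader settings would need adjusting.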
When an IP Scanner run stops mid‑scan, refer to: https://learn.microsoft.com/en-us/purview/deploy-scanner#stopped-scanner-processes
Settings that can be Modified to Improve Network Performance - BizTalk Server | Microsoft Learn
Please mark this as the solution if you find the answer helpful. This will help others in the community who encounter a similar issue find the solution quickly.