Identify Excel files sans extensions

Copper Contributor

A new client handed me a removable hard drive with several gigabytes of files that were recovered from a server that crashed/failed years ago. The files have generic/random names and no extensions. The client wants any/all Excel files from the device copied onto a different device and wants to know the authors of each Excel file.

 

I would like to put together a PS script that opens each file on this device to see if it is an Excel file, and if so, copy it to another location and append an extension to the filename, either XLS or XLSX. I would also like the script to output a CSV with the results of each file checked (i.e. full path of file checked, yes/no for Excel validation, and if Excel validated then last modified by person, etc.).

I suspect there are also Word files, PowerPoint files, and PDFs, but the client doesn't care about any of those (yet?). So, I'm thinking it would be a bonus if the PS script also identified and handled these in the same manner.

 

Thoughts on best way to handle this would be appreciated.

3 Replies
I said either XLS or XLSX but should have been more specific. The owner of the files believes that there's a mix of the older XLS format and the XLSX format, so ideally, the script would identify them and append the appropriate suffix accordingly.

@KevinSawyer 

I created a script for you, did some quick test and it seemed to work but only for XLSX files. XLSX files are in fact archive files so you can open them for example whit 7-zip or even better whit powershell. I will use this possibility to check if there is a specific folder (named xl) inside the (archive)file. If the folder is inside the archive I know pretty sure that we are dealing with an XLSX file. Use this script whit conscience  and first whit a small subset of your data. Use this script at your own risk. The script does the following:

 

Checks (recursively) if there are without extension in the sourcefolder. In sequence copy the files to the processfolder and add the .zip extension. Then expand the archive to the archive folder and check if a folder named xl is there. If it’s there I set type to XLSX, if not I set type to unknow and if the file is to large I set type to file_to_big. Then I add the results to the outputarray and convert that array to a nice CSV output. Hopefully you can use this CSV to plan your next move!

 

Once again be careful using this script. Hope it’s useful to you! Take care!

 

Grtz, Manfred de Laat

 

The script: 

# Use script whit caution!

# Variables
$sourcefolder = "c:\temp"
$processfolder = "c:\temp\process"
$maxfilesize = "50000" # 50 MB


# Get all files in $sourcefolder (and subfolders), whitout extension
$items = Get-ChildItem -Path $sourcefolder -File  | ?{$_.Extension -eq ""}
$outputarray = @()
foreach($item in $items){
    # Create and/or use the processfolder
    if(-NOT(Test-Path $processfolder)){
        New-Item -Path $processfolder -ItemType Directory | Out-Null
    }

    # Create Archive folder to extract possible XLSX file
    $archivefolder = ($processfolder + "\archive")    
    New-Item -ItemType Directory $archivefolder | Out-Null
    
    if($item.LinkType -lt $maxfilesize){
        # Copy and Extract file
        $target = ($processfolder + "\" + $item.BaseName + ".zip")  
        Copy-Item $item.FullName $target
        Expand-Archive $target -DestinationPath $archivefolder -ErrorAction SilentlyContinue
    
        # Check if Archive folder contains a childfolder named "xl"
        if(Test-Path ($archivefolder + "\xl")){
            $type = "XLSX"
        } else {
            $type = "unknown"
        }
    } else {
        $type = "file_to_large"
    }
    $outputarray += New-Object PSOBJECT -Property @{Name = $item.FullName; Type = $type}
    Remove-Item $target -Force
    Remove-Item $archivefolder -Recurse -Force
}
Remove-Item $processfolder -Recurse -Force
$outputarray | ConvertTo-Csv -NoTypeInformation

 

@KevinSawyer How are things going? Did the solution work?