Jun 15 2022 09:47 AM
Are there limitations to array sizes? I'm trying to get the line count of very large .txt files (so far the largest file is 782 MB) and also look for duplicates.
The script works on small test files, but when I run it against a larger file it neither finishes nor errors. I'm monitoring CPU and memory: powershell_ise.exe, where I am running it, is using 25% CPU with a memory working set of 880,568. Overall CPU on the VM is running around 27% and memory around 50%, so it isn't topping out physical resources.
The script builds a few arrays to capture folders and subfolders, checks the subfolder names, and then counts the lines of the .txt files found in the subfolders. This is a snippet of the section in question.
ForEach ($sCARDHIST in $aCARDHISTFILES)
{
    $sCARDHIST1 = $sCARDHIST.FullName
    ### get line counts in txt ###
    $sCARDHISTLINES = (Get-Content $sCARDHIST1).length
    ### Look for duplicates in file ###
    $aDUPS = Get-Content $sCARDHIST1 | group-Object | Where-Object {$_.Count -gt 1} | Select -ExpandProperty Name
    If ($aDUPS.count -gt 0 )
    {
        echo "Duplicates found in $sCARDHIST1" >>$sFLAGFILE
        Write-Host ""
        Write-Host "================================================="
        Write-Host "duplicate records found in $sCARDHIST1"
        Write-Host "================================================="
        ForEach ($sDUP in $aDUPS)
        {
            FuncLogWrite "$sDUP"
            Write-Host "$sDUP"
        }
    }
}
Does anyone have information on array limitations, or is there a better way in PowerShell to look for duplicates and get line counts?
Jun 15 2022 04:24 PM
There are no real "limitations" as such in the classes themselves.
From a resources perspective, I'd be wary of looking at CPU overall as, for example, a four core machine with one core running at 100% utilisation (likely indicating a problem) will report overall utilisation of 25%.
Given many processes do not run as parallel workloads over multiple cores, this is something to be mindful of.
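To see whether a single core is pegged, one option is to read the per-core counters rather than the overall figure. This is only a sketch: it assumes a Windows machine with English counter names, where Get-Counter and the '\Processor(*)\% Processor Time' counter path are available.

```powershell
# Per-core CPU load. The "_total" instance is the overall average,
# the numbered instances (0, 1, 2, ...) are the individual cores.
# Assumption: Windows with English (non-localised) counter names.
Get-Counter '\Processor(*)\% Processor Time' |
    Select-Object -ExpandProperty CounterSamples |
    Sort-Object InstanceName |
    Format-Table InstanceName, @{ Name = 'CPU %'; Expression = { [math]::Round($_.CookedValue, 1) } }
```

If one numbered instance sits near 100% while "_total" reads about 25% on a four core machine, the script is most likely CPU bound on a single thread.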
Windows PowerShell memory limits are reasonably low per shell. You can read more about how to change this (whether it should be changed is a separate consideration) here.
Learn How to Configure PowerShell Memory - Scripting Blog (microsoft.com)
You'd want to be confident you've optimised the script before increasing the per shell limit or else you'll just run into the same issue again at the higher level.
Cheers,
Lain
Jun 20 2022 12:30 AM
The first thing I noticed is that you are running
Get-Content $sCARDHIST1
twice. You would save time by reading the file into a variable once and then using that variable for both the line count and the duplicate check:
$Content = Get-Content $sCARDHIST1
# get the count
$Content.count
# and then the duplicate rows
$Content | Group-Object ...
As the number and size of the files grow, though, you will run into more and more performance problems. Loading 782 MB of text into memory and then grouping every row will not be a fast process in PowerShell. Depending on row size, that is a _lot_ of rows.
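If that turns out to be too slow, one possible alternative is to stream the file once and track the lines already seen in a hash set, so the count and the duplicates come out of a single pass. The following is a minimal sketch, not a drop-in replacement: Get-LineStats is a hypothetical helper name, it expects a full path (such as the FullName value used in the question), and it compares lines case-sensitively, whereas Group-Object ignores case by default. It still keeps one copy of every distinct line in memory.

```powershell
# Hypothetical helper: counts lines and collects duplicated lines in one pass.
function Get-LineStats {
    param([Parameter(Mandatory = $true)][string]$Path)

    $seen  = New-Object 'System.Collections.Generic.HashSet[string]'
    $dups  = New-Object 'System.Collections.Generic.HashSet[string]'
    $count = 0

    # ReadLines yields one line at a time instead of building an array of all lines.
    foreach ($line in [System.IO.File]::ReadLines($Path)) {
        $count++
        # Add() returns $false when the line is already in the set, i.e. a duplicate.
        if (-not $seen.Add($line)) { [void]$dups.Add($line) }
    }

    # Each duplicated line is reported once, however many times it repeats.
    New-Object psobject -Property @{ LineCount = $count; Duplicates = @($dups) }
}
```

In the loop from the question this could be called as $stats = Get-LineStats -Path $sCARDHIST.FullName, after which $stats.LineCount and $stats.Duplicates would stand in for $sCARDHISTLINES and $aDUPS.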