Forum Discussion

richardsmc
Copper Contributor
Jul 16, 2024
Solved

Help with Using PowerShell to split a file by a particular string and saving as specific name

Good afternoon,  I hope someone can assist.

 

I need help splitting a file into multiple files, each with a specific name.

 

I use Get-Content to read the file:

16/07/2024  MessageA    - 1

***Line 1

***Line 2

*End of Message

 

16/07/2024  MessageB    - 2

***Line 1

***Line 2

*End of Message

 

16/07/2024  MessageC    - 3

***Line 1

***Line 2

*End of Message

 

As you can see, there are three messages in the object when I use Get-Content. I wish to split this file into separate files (three in this case), closing each file once the string *End of Message is seen. Each file should be named after the message name on its first line, for example:

filename - MessageA.txt

16/07/2024  MessageA    - 1

***Line 1

***Line 2

*End of Message

 

filename - MessageB.txt

16/07/2024  MessageB    - 2

***Line 1

***Line 2

*End of Message

 

  • richardsmc 

     

    Hi, Marvin.

     

    Here's a basic working template to get you started, where I've assumed the example data you've provided is consistently formed.

     

    If, for example, "MessageA" could in fact feature a space, such as "Message A", then some logic would have to be added to the script to allow for that.

     

    Input data file

     

    Script

    $SourceFile = "D:\Data\Temp\Forum\forum.txt";
    
    $SourceDirectory = [System.IO.Path]::GetDirectoryName($SourceFile) + "\";
    $FileOpen = $false;
    $Timestamp = [datetime]::MinValue;
    
    Get-Content -Path $SourceFile |
        ForEach-Object {
            $Line = $_;
    
            # Check this isn't empty space in between files. If so, skip it.
            if ((-not $FileOpen) -and [string]::IsNullOrWhiteSpace($Line))
            {
                # We do nothing here, meaning we skip the empty space between files.
            }
            # Check if we've hit a well-formed line that indicates the start of a new message-aligned file.
            elseif ((-not $FileOpen) -and
                    ($Line.Length -gt 10) -and
                    ([datetime]::TryParse($Line.Substring(0, 10), [ref] $Timestamp) -and
                    (4 -eq ($Parts = [regex]::Split($Line, "\s+")).Count)))
            {
                $NewFileName = [string]::Concat($SourceDirectory, $Parts[1], ".txt");
    
                Out-File -FilePath $NewFileName -InputObject $Line -ErrorAction:Stop;
                $FileOpen = $true;
            }
            # Check if we've hit a well-formed line indicating the end of a file.
            elseif ("*End of message" -eq $Line)
            {
                # This is more of a safety check, since outside of an error condition, $FileOpen should always be $true.
                if ($FileOpen)
                {
                    Out-File -FilePath $NewFileName -InputObject $Line -Append;
                }
    
                $FileOpen = $false;
                $NewFileName = $null;
            }
            # Otherwise, if a file is considered "open", write the line to it. (Mechanically, the file isn't really open - it's just easier to conceptualise it that way.)
            elseif ($FileOpen)
            {
                Out-File -FilePath $NewFileName -InputObject $Line -Append;
            }
        }

     

    Output

     

    Cheers,

    Lain

3 Replies

  • LainRobertson
    Silver Contributor

    richardsmc 

     

    Hi, Marvin.

     

    Thanks for your private message.

     

    I'll provide a breakdown out here where everyone can see it, as that may benefit other newcomers to PowerShell, too.

     

    Before we get into the script, it's important to be aware that PowerShell is built on top of Microsoft's .NET framework. Not only is it built on top of it, it allows you to access .NET directly through it.

     

    So, when you're running a PowerShell cmdlet, it's really just a wrapper for a bunch of .NET code. And if you can't find the perfect cmdlet or you just want a more efficient solution, you are free to reference .NET directly to achieve what you're after.

     

    Armed with that, let's look at the script.

     

    Line 1 is simply specifying where to find our original input file.

     

    Lines 3 to 5 are setting up some variables for later use in the script, and it's here on lines 3 and 5 you see some .NET references for the first time.

     

    "System.IO.Path" is a .NET class, where amongst the many different methods and properties it contains is one named GetDirectoryName. GetDirectoryName takes a full file path and returns just the directory component.

     

    So, what we're doing on line 3 is picking out just the directory from the file path specified back on line 1. We'll use this directory name later in the script to ensure the new, broken-up files are placed in the same directory as the source file.
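    As a small illustration of GetDirectoryName, here is a minimal sketch. The temp-folder location and the "forum.txt" and "MessageA.txt" names are just stand-ins, and Join-Path is shown only as an alternative way of gluing a directory and file name together; the script above appends "\" by hand instead.

```powershell
# Build a sample path in a platform-neutral way (the file does not need to exist).
$SampleFile = Join-Path -Path ([System.IO.Path]::GetTempPath()) -ChildPath 'forum.txt'

# GetDirectoryName returns just the directory component, without a trailing separator.
$SampleDirectory = [System.IO.Path]::GetDirectoryName($SampleFile)

# Join-Path inserts the separator for us when building the new file's path.
$NewFile = Join-Path -Path $SampleDirectory -ChildPath 'MessageA.txt'

$SampleDirectory
$NewFile
```

    Either approach places the new file next to the source file; Join-Path simply avoids having to remember the trailing separator.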

     

    Line 5 is the next .NET reference where we are simply creating a new "DateTime" variable by setting it equal to the DateTime.MinValue property. I could have also used a PowerShell alternative like:

     

     

    $Timestamp = Get-Date;

     

     

    But I prefer not to do that since it would make $Timestamp less referenceable in certain scenarios where you need to make comparisons between dates. Not that we're doing that here in your scenario, but I'm a creature of habit and re-use the same approaches where I can.

     

    Line 7 is where we start doing things, specifically, calling Get-Content.

     

    Let's get rid of the bulk of the script and reduce it down to this:

     

     

    Get-Content | <rest of script>;

     

     

    You already know that Get-Content reads a file, but what does it return? The answer is: an array of strings, where each line from the file constitutes an additional array member.

     

    So, what is effectively happening either side of the pipe symbol ("|") is that Get-Content reads a line, then sends that line through to the "<rest of script>" for further processing, before reading the next line and doing the same, and so on until the file is done.
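    To see this line-by-line behaviour in isolation, here is a minimal sketch that writes a throwaway three-line file and pipes it through a script block. The file contents and the "Line received" text are made up for the illustration.

```powershell
# Create a small throwaway file containing three lines.
$TempFile = New-TemporaryFile
Set-Content -Path $TempFile.FullName -Value 'first', 'second', 'third'

# Each line arrives in the script block on its own, as a string.
# ($_ is the line currently coming down the pipeline.)
$Seen = Get-Content -Path $TempFile.FullName |
    ForEach-Object { "Line received: $_" }

$Seen

# Clean up the throwaway file.
Remove-Item -Path $TempFile.FullName
```

    The script block runs three times here, once per line, which is the same mechanism the main script relies on.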

     

    Following on, on line 8, the "ForEach-Object" statement is simply saying "for each line passed through from Get-Content (where the line is the "object"), do <rest of script>".

     

    "$_" is a special variable in PowerShell that holds the current object on the pipeline. So, line 9 is simply assigning the current line from the file to a variable named $Line. We can then safely use $Line later in the script without having to worry about whether the value of "$_" has since changed (which would only happen in nested ForEach-Object calls, which there are none of in this script, but again, it's a habit for me to do this).

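    As a minimal sketch of why capturing "$_" can matter, the following nests one ForEach-Object inside another. The letters, numbers and variable names are invented purely for the illustration.

```powershell
$Combined = 'A', 'B' | ForEach-Object {
    # Capture the outer pipeline object before another pipeline reuses $_.
    $Letter = $_

    1, 2 | ForEach-Object {
        # In here, $_ is the number from the inner pipeline, not the letter.
        "$Letter$_"
    }
}

$Combined   # A1, A2, B1, B2
```

    Without the $Letter assignment, the inner block would have no way to reach the outer value, because "$_" has been replaced by the inner pipeline's object.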
     

    The rest of the script is now focused on deciding what to do with $Line:

     

    • Lines 12 to 15 effectively discard $Line - i.e. nothing is done with it, as it represents the unwanted padding between your "message" blocks;
    • Lines 17 to 26 are performing various checks (I'll come back to these later) to see if $Line represents the beginning of a new "message" block;
    • Lines 28 to 38 are checking if $Line represents the end of a "message" block;
    • Lines 40 to 43 are responsible for writing the content in the middle of each "message" block, between the starting and ending lines.

     

    Of these various checks, only the second (lines 17 to 26) is worth digging into.

     

    The "elseif" condition is spread out over lines 17 to 20 for readability:

    • Line 17 begins with checking that we are not already reading a "message" block;
    • Line 18 is checking the length of $Line, since if it's ten characters or shorter, it can't possibly hold the timestamp that features in your file at the start of each new "message", followed by the rest of the header;
    • Line 19 is simply checking if the first ten characters are a date. Again, we're leveraging .NET here to perform that check for us, where DateTime.TryParse will return $true if the first ten characters of $Line are indeed a valid date, or $false if they are not. You'll also notice that the $Timestamp from earlier is used here, preceded by the "[ref]" keyword - but I'll come back to this later. For now, it's enough to know that we are not going to use $Timestamp for anything - it was only necessary to be able to call DateTime.TryParse;
    • Line 20 is splitting apart $Line using runs of whitespace (the "\s+" pattern) as the delimiter. Based on the way you presented your source file, the first line denoting the start of a new "message" block should have four parts, which is where the "4 -eq <rest of line>" comes from. The call to .NET's Regex.Split is simply what's performing the splitting of $Line. Regex.Split spits out an array of strings, which we're assigning to the $Parts variable all on the same line. If $Line really is the beginning of a new "message" block, then $Parts will have four strings in it:
      1. The timestamp;
      2. The message value, such as "MessageA" from your example;
      3. The hyphen;
      4. The finishing number;
    • If we pass all the "if" checks, we get to line 22, where we're constructing the new file name to dump the $Line values into. We use the second value from $Parts (the "MessageA" value) to construct the outbound filename;
    • Line 24 simply exports the value of $Line to this new file;
    • Line 25 sets $FileOpen to $true, so we can determine elsewhere in the script what the current status is.
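    The checks described above can be tried on a single sample header line, as in the minimal sketch below. One caveat worth stating loudly: DateTime.TryParse reads dates according to the machine's regional settings, so "16/07/2024" may be rejected where month-first dates are expected (en-US, for example). The sketch therefore swaps in DateTime.TryParseExact with an explicit "dd/MM/yyyy" format, which is my assumption about the data and not what the script above does.

```powershell
$Line = '16/07/2024  MessageA    - 1'
$Timestamp = [datetime]::MinValue

# Assumption: dates are always day/month/year. TryParseExact with an explicit
# format does not depend on the machine's regional settings.
$IsDate = [datetime]::TryParseExact(
    $Line.Substring(0, 10),
    'dd/MM/yyyy',
    [cultureinfo]::InvariantCulture,
    [System.Globalization.DateTimeStyles]::None,
    [ref] $Timestamp)

# Split on runs of whitespace: date, message name, hyphen, number.
$Parts = [regex]::Split($Line, '\s+')

if ($IsDate -and ($Parts.Count -eq 4)) {
    # The second part is the message name used for the file name.
    $NewFileName = $Parts[1] + '.txt'
}

$NewFileName   # MessageA.txt
```

    If the message name could itself contain spaces, the count of four would no longer hold, which is the limitation mentioned alongside the original script.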

     

    Coming back to the "[ref]" keyword used on line 19, if we call DateTime.TryParse, we can see there are two flavours of the method we can call:

     

     

    In the script, we've used the first definition where only two parameters are needed:

     

    1. The string we're parsing (i.e. "string s");
    2. The already-existing DateTime variable that will be set to the parsed value by DateTime.TryParse.

     

    So, in other words, we must have already defined a DateTime variable prior to calling DateTime.TryParse, so that DateTime.TryParse has something to return the parsed string's value into.

     

    We met that requirement by defining $Timestamp on line 5.

     

    In summary, DateTime.TryParse provides two things:

    1. Provides a $true or $false as the return value of the function;
    2. Updates a pre-existing DateTime variable to contain the parsed string's value.
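    Both effects can be observed directly, as in this minimal sketch. The sample strings are invented, and the first call uses the longer flavour of TryParse (with an explicit culture) only so the result doesn't depend on regional settings.

```powershell
$Timestamp = [datetime]::MinValue

# Success: the method returns $true and $Timestamp is overwritten with the parsed date.
$Succeeded = [datetime]::TryParse(
    '2024-07-16',
    [cultureinfo]::InvariantCulture,
    [System.Globalization.DateTimeStyles]::None,
    [ref] $Timestamp)
$Parsed = $Timestamp

# Failure: the method returns $false and $Timestamp is reset to DateTime.MinValue.
$Failed = [datetime]::TryParse('not a date', [ref] $Timestamp)

$Succeeded   # True
$Parsed      # 16 July 2024, displayed in your local date format
$Failed      # False
```

    Note that no exception is thrown in the failure case, which is what makes TryParse convenient inside an "if" or "elseif" condition.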

     

    Keeping in mind that PowerShell is built on .NET, as you become more proficient, you'll frequently find yourself referring directly to the .NET documentation, where, if we look up DateTime.TryParse, we can see the following definition:

     

     

    You'll notice the "out" keyword in front of "DateTime result", which in PowerShell terms is noted as [ref]. i.e. C# "out" = PowerShell [ref], as illustrated in the earlier PowerShell screenshot of the TryParse definitions.

     

    You can also have a read of the following, which more formally explains the difference between passing in a parameter by reference (a pointer to the original variable) rather than by value (a copy of the original variable, meaning the original never gets updated):

     

     

    But at least to begin with, this isn't something you'll come across very often in PowerShell while you're learning, as most of the cmdlets you'll be using don't expose these subtleties to you.
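    For the curious, here is a minimal sketch of the by-value versus by-reference difference using two hypothetical functions (Add-OneByValue and Add-OneByReference are names made up for this illustration, not built-in commands).

```powershell
function Add-OneByValue {
    param([int] $Number)
    # Changes only the function's own copy of the value.
    $Number++
}

function Add-OneByReference {
    param([ref] $Number)
    # .Value reaches through the reference to the caller's variable.
    $Number.Value++
}

$Count = 1

Add-OneByValue -Number $Count
$AfterValue = $Count          # still 1

Add-OneByReference -Number ([ref] $Count)
$AfterReference = $Count      # now 2
```

    DateTime.TryParse behaves like the second function: it needs a reference so that it can write the parsed date back into the caller's variable.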

     

    Let me know if there are other sections you'd like me to expand on.

     

    Cheers,

    Lain


    • richardsmc
      Copper Contributor

      LainRobertson 

      Good day. Thank you again so much for your assistance. Thanks to your explanation, I was able to understand the script and make some slight changes that don't use the .NET references.

       

      With regard to the file header, the pattern is as follows:

      16/07/24-07:35:54      Printer-5592-000034     34

       

      The middle portion is a set length, so I was able to match the pattern using a regular expression. The comparison for the line being empty was also done using more PowerShell-style methods. I hope that isn't an issue; it was just easier for me to read and I wanted to keep the solution consistent.

       

      Here's the modified code.

       

      $INPRINTPATH = " # this is the path where the files to be split are stored"
      # The output path must end with a path separator, because the file name is appended to it directly.
      $OUTPRINTPATH = "#this is the path where the resulting files will be saved "

      # $FILEPATH is the full path of every file in the $INPRINTPATH directory (so if it's one file or multiple, it will list them).
      $FILEPATH = Get-ChildItem -Path $INPRINTPATH | ForEach-Object { $_.FullName }

      # This is the pattern we are looking for to save the file name as once found (it's a mandatory line, so it WILL be in the files).
      $PATTERN = 'Printer-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]'
      $FILEOPEN = $false;

      Get-Content -Path $FILEPATH |
          ForEach-Object {
              $LINE = $_;

              # Skip the empty lines between messages, using a plain string comparison.
              if ((-not $FILEOPEN) -and ($LINE.ToString() -eq ""))
              {
              }
              # Look for the header pattern with -match. $Matches[0] holds the matched text, which becomes the file name.
              elseif ($LINE -match $PATTERN)
              {
                  $NEWFILENAME = $OUTPRINTPATH + $Matches[0] + ".txt";

                  Out-File -FilePath $NEWFILENAME -InputObject $LINE -ErrorAction:Stop;
                  $FILEOPEN = $true;
              }
              elseif ($LINE.ToString() -eq "*End of Message")
              {
                  if ($FILEOPEN)
                  {
                      Out-File -FilePath $NEWFILENAME -InputObject $LINE -Append;
                  }

                  $FILEOPEN = $false;
                  $NEWFILENAME = $null;
              }
              elseif ($FILEOPEN)
              {
                  Out-File -FilePath $NEWFILENAME -InputObject $LINE -Append;
              }
          }
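      For anyone following along, here is a minimal sketch of how -match and $Matches behave on the sample header line quoted earlier in this reply. The {4} and {6} quantifiers are a shorthand for the repeated [0-9] groups, and the variable names are only for illustration.

```powershell
$Header = '16/07/24-07:35:54      Printer-5592-000034     34'

# Four digits, a hyphen, then six digits after the "Printer-" prefix.
$Pattern = 'Printer-[0-9]{4}-[0-9]{6}'

if ($Header -match $Pattern) {
    # After a successful -match, $Matches[0] holds the text that matched the whole pattern.
    $FileName = $Matches[0] + '.txt'
}

$FileName   # Printer-5592-000034.txt
```

      Because -match searches anywhere within the string, there is no need to work out where in the header line the printer identifier starts.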

       
