SOLVED

Replace parts of text that meets certain criteria in a text file

Copper Contributor

I have a text file with approx 58000 lines. The lines holds different types of content for different purposes.

Some numbers that meets certain criterias should change to always have a fixed number of digits and remove any non-numeric characters as per the examples below.

I have struggled with the 3 scenarios below for some time now but I'm not getting there so any assistance would be much appreciated.

 

Scenario # 1: If exist #TRANS and "6" the number following "6", first example "4002", should have leading zeros added to always make it 8 digits within double quotes. Any non-numeric character should be removed.

Current contentWhat it shlould change to
   #TRANS 4319 {"6" "4002"} -22625.00 20220207 "Entreprenad AB"   #TRANS 4319 {"6" "00004002"} -22625.00 20220207 "Entreprenad AB"
   #TRANS 4319 {"6" "63-22"} -22625.00 20220207 "Entreprenad AB"

   #TRANS 4319 {"6" "00006322"} -22625.00 20220207 "Entreprenad AB"

 

Scenario # 2: If exist #TRANS and "25" (no "6") the number following "25", within double quotes, "162" should have leading zeros added to always be 4 digits within double quotes.

   #TRANS 1930 {"25" "1"} -21940.00 20220207 "Levbet"   #TRANS 1930 {"25" "0001"} -21940.00 20220207 "Levbet"

 

Scenario # 3: If exist #TRANS and both "6" and "25" the number following "6" should have any non-numeric characters removed and leading zeroes added for a total of 8 digits. Same as scenario # 1.

When both "6" and "25" exists the number following "25" should hold the 8 digit number following "6" and the 4 digit number following "25" as per scenario 2 in one sequence. 8+4=12 digits within double quotes.

   #TRANS 4319 {"6" "4002-22" "25" "03"} 22625.00 20220207 "Entreprenad AB"   #TRANS 4319 {"6" "00400222" "25" "004002220003"} 22625.00 20220207 "Entreprenad AB"

 

If someone can put together a formula that manages all 3 scenarios then he/she is my hero!

14 Replies

Hi @Hans_Olsson 

Clear problem description + expected results :-))). With Get & Transform aka Power Query:

 

Sample.png

 

NB: This is highly dependent on the quality & consistency of the text between {}. Also, Power Query is case-sensitive so i.e. #TRANS <> #Trans and so far the In table is filtered on [String] starting with #TRANS (easy to adapt if necessary)

 

In attached file could you check that the Out table reflects what you expect against the In table

If fine add some more actuals in In > Right-click in Out > Refresh

(optimizations remain possible but I did not look at this specifically)

@Hans_Olsson 

 

Please re-download attachment from the previous reply. Query updated to not replace records like below if they exist:

 

#TRANS 4324 {"6" "4002-22" "28" "03"} 22625.00 20220207 "Entreprenad AB"

In previous version such records were updated as if the 28 = 25

@L z. Thank you very very much. The attached file clearly does the job. There's just one minor glitch. It requires #TRANS to start in position 1. In the actual file the #TRANS-lines begins with 3 blanks and #TRANS starting in pos 4. If I remove the 3 blanks the query works like a dream.

 

This is the first time I'm in contact with Power Query. I don't manage to bring up the actual query, to see what the code looks like and add the three 3 blanks or even better, make the query look for #TRANS independent of where it starts.

If it's an easy change it would be very much appreciated if you could make it and attach the updated file.

 

Regardless, big thanks for this!

 

@Hans_Olsson 

 

Doable no problem. Question though: do you want to keep the <spaces> before #TRANS in the Out table? Or do you prefer to remove those leading <spaces> in ALL records, or only on the #TRANS record?

If possible the preferred solution is to keep the <spaces> before #TRANS in the Out table so beginning <spaces> in the In and Out tables are identical.
Going forward the current files (In table) will be made obsolete and the Out tables will be the new "current" files.

@Hans_Olsson 

Updated version attached. Whatever leading <spaces> exist in In they're kept it Out

Records starting with #TRANS remain selected as expected

@Hans_Olsson This is embarrassing but the #TRANS-records are not the only type that needs to be modified. In addition the #OBJEKT 6 and #OBJEKT 25 records needs modification, too.

These are the declarations for the #TRANS "6" and "25" records already processed.

 

Current contentModify to
#OBJEKT  6  "2003" "Kastrup 11.1"#OBJEKT  6  "00002003" "Kastrup 11.1"
#OBJEKT  6  "1001-04" "Welcome Hotel"#OBJEKT  6  "00100104" "Welcome Hotel"
#OBJEKT  25 "001" "ŽTA 001"#OBJEKT  25 "0001" "ŽTA 0001"
#OBJEKT  25 "220" "ŽTA 220"#OBJEKT  25 "0220" "ŽTA 0220"

 

This is the last addition. The case won't continue to grow after this.

I'm truly sorry I didn't provide the full picture from start.

@Hans_Olsson 

Quite embarrassing :))

 

Let's make sure we're on the same picture re. #OBJEKT 6 and #OBJEKT 25 as they look different, what  has to be updated doesn't appear to be between {}, correct? They're in your worksheet exactly as below? (leading <spaces> ???)

 

Sample.png

@L z. 

In the previous post I used colors to show what to search for and what should be modiifed.
I didn't explain the colorcoding though. Sorry for that.
- Red: The text in red is what to search for
- Blue: The text to modify and what it should modify to
Please refer to my previous post for the colored text

 

#OBJEKT 6 relates to and follows the same rule as #TRANS "6": Modified text should have leading zeros added to always be 8 digits within double quotes, other characters than digits should be removed.

E.g. Modify to: #OBJEKT 6 "00100104" "Welcome hotel"

 

#OBJEKT 25 relates to #TRANS "25": Modified text should have leading zeros added to always be 4 digits within double quotes. The first blue field in the Modify to cell.
For #OBJEKT 25 the last field within double quotes is a description field. The second/last blue field in the Modify to cell. The same 4 digit number should be mirrored here as well.
E.g. Modify to: #OBJEKT 25 "0001" "ŽTA 0001"

 

Please let me know if I'm not making sense.

@Hans_Olsson 

 

I understand what has to be done re. padding with 0. The key information I still miss is the exact  "formatting" of the #OBJEKT 6 and #OBJEKT 25 strings. Do they have {} or not? If yes, where?

 

Please post a picture showing some examples on a worksheet, this will clarify things

@L z. 

The #OBJEKT 6 and #OBJEKT 25 strings do not have any {}. No.

 

Picture from current worksheet section with #OBJEKT 6 (In obj)

Hans_Olsson_0-1687370676940.png

As an example Lines 468 and 469 should modify to (Out obj):

#OBJEKT  6  "00004023" "Iljansbodaskolan"
#OBJEKT  6  "49012023" "Service 2023"

 

Picture from current worksheet section with #OBJEKT 25 (In obj)

Hans_Olsson_1-1687370760549.png

As an example Line 579 should modify to (Out obj):

#OBJEKT  25 "0048" "ŽTA 0048"

 

best response confirmed by Hans_Olsson (Copper Contributor)
Solution

Hi @Hans_Olsson 

 

Thanks for the pictures. According to the few examples you posted earlier #OBJEKT strings appear to be formatted as below but with copy/paste on the Web we're never sure so please check this first with your actual data

The attached update highly/only relies on those consistent patterns (seems very consistent on your side according to your pictures :))

 

#OBJEKT<2 spaces>6<2 spaces>"2003"<1 space>"Kastrup 11.1"
#OBJEKT<2 spaces>25<1 space>"001"<1 space>"ŽTA 001"

 

Sample.png

@L z. Heureka. You did it. You are the hero!!!

I'm deeply impressed by your skills and your commitment. One of a kind.

Tomorrow it's Midsummer in Sweden where I'm from. Now I can sit back, relax and enjoy the festivities. Big thanks to you!

 

Again, big thanks and all the best,

Hans

@Hans_Olsson 

Glad this worked 1st time & thanks for the feedback

(re. commitment IMHO smart people deserve I do/try my best)

 

Enjoy your long week-end in Sweden...

All the best

1 best response

Accepted Solutions
best response confirmed by Hans_Olsson (Copper Contributor)
Solution

Hi @Hans_Olsson 

 

Thanks for the pictures. According to the few examples you posted earlier #OBJEKT strings appear to be formatted as below but with copy/paste on the Web we're never sure so please check this first with your actual data

The attached update highly/only relies on those consistent patterns (seems very consistent on your side according to your pictures :))

 

#OBJEKT<2 spaces>6<2 spaces>"2003"<1 space>"Kastrup 11.1"
#OBJEKT<2 spaces>25<1 space>"001"<1 space>"ŽTA 001"

 

Sample.png

View solution in original post