Jan 04 2023 01:00 PM
Hey everyone! @Excel
I am new to Power Query and need some help. I am trying to automate the data cleaning process as I have source data from a csv file that has repeated matching errors. I need to retrieve new updated sales data every week. Unfortunately, we don't have the ability to change the source data as it from a third party.
As the source data is so large, cleaning it manually over and over every week is not an option as it would take many hours on repeat. The current team have just left the errors and accepted the sales figures being slightly off, but I want to amend this.
Issue #1 - A variety of different products without a category assigned - Several hundred of the products don't have a category assigned to them. I want to replace the 'null' value with the correct category, but there is 5 different categories depending on the group the product is in. I have about 20 wider sub-categories which are all correct, so in a normal excel table I would filter sub-category one by one and add the correct category to the visible products, then repeat until all the null values were filled. However, once I do this is Power Query, I lose all the other data, so only the filtered sub-category is available. Is there any way to replace 'null' values with a condition attached (e.g. replace 'null' with 'Dolls' if sub-category matches 'Barbies'? Or is there a way to stop the filter being irreversible?
Issue #2 - A variety of different products with no brand assigned - Most of the products have a brand in the brand column, but some also have a 'null' value. Every single product name is complete and has a brand within the name. Is there any way to do the replacement of 'null' values based on a partial match in the name column?
Issue #3 - Competitor products accidently included in the data - A handful of competitor products have accidently been included in the data. I would like to delete these as the make a small but false contribution to our sales summary. Is there any way to delete rows that include a certain value? (e.g. Delete rows that contain 'compname' in the brand column).
If none of the above is possible, I will do a work around by making 20 separate 'filtered' queries that combine together into a larger final data query (essentially break them up to automate the cleaning and then bring the cleaned data back together).
I am happy to commit the time upfront to do this, however I am concerned about the complexity and data loading time this would create.
If anyone has any creative solutions to avoid splitting out the query, I would be eternally grateful!!
TIA!
Jan 04 2023 02:53 PM
SolutionIn the attached file you can enter data in the blue dynamic table. Then you can click in any cell of the green table and right-click with the mouse and select refresh to update the green result table.
The Query deletes all the rows that contain "Compname" in the "Brand" column. The filter can be expanded in order to delete rows that contain "Compname 1", "Compname 2" or "Compname 3" in the "Brand" column.
The Query adds "Doll" in the "Category added" column if "Barbie" is the sub-category and if the category is null. This is currently done with an IF THEN ELSE statement which can be adapted for 20 sub-categories.
The Query adds the brand Alpha, Beta, Gamma, Delta, Lambda, Epsilon or Zeta in the "Brand completed" column if there is a partial match in the "Product name" column and if the "Brand" column is empty in the blue table. This is currently done with a nested IF THEN ELSE IF statement and can be adapted if there are e.g 30 brands. However there should be a more suitable solution if you have e.g. 400 brands.
Jan 05 2023 01:27 AM
Issue #3 - Competitor products accidently included in the data
I would like to delete these...Is there any way to delete rows that include a certain value? (e.g. Delete rows that contain 'compname' in the brand column)
Not quite sure this is the best approach. Let's assume that in Week1 you have unexpected [Brand]s A, B and C and you setup your cleaining query to exclude A,B,C. In Week2 your next CSV brings unexpected [Brand]s A, C, E, F => The last 2 won't be excluded by your query
In an ideal world you should know what your [Brand]s are. With that assumption a possible approach would be:
- Setup a table with your [Brand]s
- Merge that table with the Table from CSV
In the attached example I used Table.FuzzyNestedJoin (instead of Table.NestedJoin) as the function has options that help you better control the matching (Power Query is case-sensitive: a <> A):
// CompetitorsRemoved
let
Source = TableFromCSV,
FuzzyMatch = Table.FuzzyNestedJoin(
Source, {"Brand"},
TableBrands, {"Brand"},
"BRANDS", JoinKind.Inner,
[IgnoreCase=true, IgnoreSpace=false, Threshold=1]
),
RemovedColumn = Table.RemoveColumns(FuzzyMatch,{"BRANDS"})
in
RemovedColumn
Hope this makes sense, helps a bit and don't confuse you too much
Jan 05 2023 08:39 AM
Jan 04 2023 02:53 PM
SolutionIn the attached file you can enter data in the blue dynamic table. Then you can click in any cell of the green table and right-click with the mouse and select refresh to update the green result table.
The Query deletes all the rows that contain "Compname" in the "Brand" column. The filter can be expanded in order to delete rows that contain "Compname 1", "Compname 2" or "Compname 3" in the "Brand" column.
The Query adds "Doll" in the "Category added" column if "Barbie" is the sub-category and if the category is null. This is currently done with an IF THEN ELSE statement which can be adapted for 20 sub-categories.
The Query adds the brand Alpha, Beta, Gamma, Delta, Lambda, Epsilon or Zeta in the "Brand completed" column if there is a partial match in the "Product name" column and if the "Brand" column is empty in the blue table. This is currently done with a nested IF THEN ELSE IF statement and can be adapted if there are e.g 30 brands. However there should be a more suitable solution if you have e.g. 400 brands.