Forum Discussion

Lauren_Michelle's avatar
Lauren_Michelle
Copper Contributor
Jan 04, 2023

HELP! How do I automate data cleaning from a continually updating source

I am working on a sales project that involves many products and several different categories. I have been asked to deliver top line summaries and insights, as well as allowing other staff to check specific week by week product performance.

The data we are receiving from a third party source is mostly sound but has some errors (a few accidental competitor products being included in the list of products sold, some products have 'null' assigned as a category, etc). This means our numbers are slightly off and some products are not being included in overall sales as they are in a 'null' category.

I am able to clean the data manually to assign categories and delete outliers, however, the data is received through an OLAP server feed on a daily basis (I can download and clean it manually day after day but the man hours required for this is not feasible given the sheer volume of different products and time it would take to format and sift through for errors).

I want to be able to correct the errors so that when I refresh and new data is received, the corrections automatically take place. I.e. products with predefined competitor names are automatically deleted/categories are assigned to products dependent on the product name (i.e. if product name include Dairy Whip - change category from 'null' to 'Ice Cream'.

I am fairly new to Excel, but I simply can't find a solution to make this possible and was hoping others might be able to share some advice on how to automate the cleaning of constantly refreshed OLAP data. 

Asking the third party to change the source data is not a possibility.

Thank you in advance!

  • Lorenzo's avatar
    Lorenzo
    Jan 04, 2023

    (SergeiBaklan Thanks for joining)

     

    Lauren_Michelle 

    That is why I decided to manually download the csv file instead of using the IT teams Power BI dataset, as the source data is in a tabular table form, however I will have to pull the data weekly and drop it into an excel sheet and then use Power Query to automate the cleaning process

     

    NO. Once you have a CSV you don't have to "drop it into an Excel sheet", that step is unecessary

    From Excel, you go to Data (tab) > From Text/CSV

    you then select your CSV (the Power Query Editor opens) and you begin your cleaning process

     

    Once you have completed the cleaning process and output/returned the query result to an Excel sheet, next time you download a CSV you copy it (overwrite the previous CSV) in the same folder and Refresh your query

  • Lorenzo's avatar
    Lorenzo
    Silver Contributor

    Hi Lauren_Michelle 

     

    Get & Transform aka Power Query could be what you're looking for. The key question is: can we connect Excel to your data source?

    You said: the data is received through an OLAP server feed on a daily basis (I can download... 

    Q: Where are the data actually stored (i.e. a SQL Database) so you can dowload them?

     

    • Lauren_Michelle's avatar
      Lauren_Michelle
      Copper Contributor

      Thank you for your quick reply! I have been advised to connect to the data via a Power BI database within my organisation using a Pivot Table. The team who manage the database get it from the 3rd party source and do a basic clean of it before uploading it to the database. I also have the ability to download the raw data as a csv file from the third party source, however the data isn't in great shape and the data team clean it before uploading it to their Power BI database. They just don't have the product knowledge or time to go to the level of correcting the issues I have mentioned.

      Unfortunately, I am unable to locate or connect to the database via Power Query, and I believe this is a common issue. I have considered work arounds such a loading the data into a Pivot Table as advised by the data team, and then using Power Query to transform the data to tabular form and edit it.

      I am fairly new to Excel, so want to make sure I don't end up with long complex data chains if there is a more simple solution.

      • SergeiBaklan's avatar
        SergeiBaklan
        MVP

        Lauren_Michelle 

        As variant you may ask them based on dataset to create report with visual which have raw information you need. Having access to such report you may export underlying data into csv or Excel file. Report could be in workspace or in Power BI app.

        If to create paginated report it could be exported automatically with Power Automate.

        But all of that requires additional work from your IT.

        If connect to Power BI dataset cube perhaps that's better to convert PivotTable into cube formulae (or to generate them from scratch for that connection) and adjust them for the corrections. At least that will be refreshable.

        If Power Query, formulae or PivotTable, when from another file to keep refreshing.

Resources