SOLVED

HELP! How do I automate data cleaning from a continually updating source

Copper Contributor

I am working on a sales project that involves many products and several different categories. I have been asked to deliver top line summaries and insights, as well as allowing other staff to check specific week by week product performance.

The data we are receiving from a third party source is mostly sound but has some errors (a few accidental competitor products being included in the list of products sold, some products have 'null' assigned as a category, etc). This means our numbers are slightly off and some products are not being included in overall sales as they are in a 'null' category.

I am able to clean the data manually to assign categories and delete outliers, however, the data is received through an OLAP server feed on a daily basis (I can download and clean it manually day after day but the man hours required for this is not feasible given the sheer volume of different products and time it would take to format and sift through for errors).

I want to be able to correct the errors so that when I refresh and new data is received, the corrections automatically take place. I.e. products with predefined competitor names are automatically deleted/categories are assigned to products dependent on the product name (i.e. if product name include Dairy Whip - change category from 'null' to 'Ice Cream'.

I am fairly new to Excel, but I simply can't find a solution to make this possible and was hoping others might be able to share some advice on how to automate the cleaning of constantly refreshed OLAP data. 

Asking the third party to change the source data is not a possibility.

Thank you in advance!

13 Replies

Hi @Lauren_Michelle 

 

Get & Transform aka Power Query could be what you're looking for. The key question is: can we connect Excel to your data source?

You said: the data is received through an OLAP server feed on a daily basis (I can download... 

Q: Where are the data actually stored (i.e. a SQL Database) so you can dowload them?

 

Thank you for your quick reply! I have been advised to connect to the data via a Power BI database within my organisation using a Pivot Table. The team who manage the database get it from the 3rd party source and do a basic clean of it before uploading it to the database. I also have the ability to download the raw data as a csv file from the third party source, however the data isn't in great shape and the data team clean it before uploading it to their Power BI database. They just don't have the product knowledge or time to go to the level of correcting the issues I have mentioned.

Unfortunately, I am unable to locate or connect to the database via Power Query, and I believe this is a common issue. I have considered work arounds such a loading the data into a Pivot Table as advised by the data team, and then using Power Query to transform the data to tabular form and edit it.

I am fairly new to Excel, so want to make sure I don't end up with long complex data chains if there is a more simple solution.

@Lauren_Michelle 

If your data source is actually a PowerBI Dataset not sure I can help as I have a limited understanding (mainly due to the fact that I don't have access to something like that) of what it is and how it can be used within Excel. So, instead of saying something wrong I prefer to stop there

 

However, reading this page I'm under the impression your Data Team could provide you something easier to consume (i.e. an Excel workbook with Table(s)) with Get & Transform than a PivotTable

 

Despite its name a PivoTable isn't a Table. Connecting Power Query to the worksheet containing a  PivotTable is doable but I would take this approach as the (almost) last option. If your Data Team can provide you an Excel workbook with Table(s) or a CSV file that would be better

@Lauren_Michelle 

As variant you may ask them based on dataset to create report with visual which have raw information you need. Having access to such report you may export underlying data into csv or Excel file. Report could be in workspace or in Power BI app.

If to create paginated report it could be exported automatically with Power Automate.

But all of that requires additional work from your IT.

If connect to Power BI dataset cube perhaps that's better to convert PivotTable into cube formulae (or to generate them from scratch for that connection) and adjust them for the corrections. At least that will be refreshable.

If Power Query, formulae or PivotTable, when from another file to keep refreshing.

I am pretty new to Excel so not 100% versed in everything you have mentioned, but what I can say is that our in houses data team have used the same raw data sent to them automatically from the third party to create a Power BI dataset. They have told me to access this using a Pivot Table in excel, however I am unsure how to pull the underlying data from this, and accessing the Power BI dataset (OLAP) via a Pivot table is very limiting to my ability to automate the data cleaning as I can't turn the data into tabular form (Pivot table parent group restrictions) etc.

That is why I decided to manually download the csv file instead of using the IT teams Power BI dataset, as the source data is in a tabular table form, however I will have to pull the data weekly and drop it into an excel sheet and then use Power Query to automate the cleaning process.

To be honest I am very new to excel so a bit lost :)

best response confirmed by Lauren_Michelle (Copper Contributor)
Solution

(@Sergei Baklan Thanks for joining)

 

@Lauren_Michelle 

That is why I decided to manually download the csv file instead of using the IT teams Power BI dataset, as the source data is in a tabular table form, however I will have to pull the data weekly and drop it into an excel sheet and then use Power Query to automate the cleaning process

 

NO. Once you have a CSV you don't have to "drop it into an Excel sheet", that step is unecessary

From Excel, you go to Data (tab) > From Text/CSV

Sample.png

you then select your CSV (the Power Query Editor opens) and you begin your cleaning process

 

Once you have completed the cleaning process and output/returned the query result to an Excel sheet, next time you download a CSV you copy it (overwrite the previous CSV) in the same folder and Refresh your query

Thank you that makes a lot of sense :)

Hi @Sergei Baklan 

 

A bit off topic so only if you know/have time to check...

When you connect a PowerBI Dataset to Excel for further analysis, are the Tables and Measures "loaded" to the workbook Data Model?

 

If question not clear let me know. Thanks

Hi @L z. 

Connecting to Power BI dataset you are connecting to Analysis Services cube. Thus you see all not-hided measures and tables for this cube. It is not loaded to Excel data model, we have only connection. Thus the only we can do is to create PivotTable based on such cube, or use cube formulae to generate the report. 

The only modification we could add that is MDX calculated measure or member. Having into account quite poor intellisense working with MDX in Excel I'd don't recommend this way if only you are not a MDX pro.

 

In brief, nothing is loaded into Excel data model, we are working with the connected cube.

@L z. 

Forgot to add, we may create implicit measure dragging table field into values if it is allowed for aggregation. 

@Sergei Baklan 

It does help, not for this case but for another thread I've been working on. Thanks again

1 best response

Accepted Solutions
best response confirmed by Lauren_Michelle (Copper Contributor)
Solution

(@Sergei Baklan Thanks for joining)

 

@Lauren_Michelle 

That is why I decided to manually download the csv file instead of using the IT teams Power BI dataset, as the source data is in a tabular table form, however I will have to pull the data weekly and drop it into an excel sheet and then use Power Query to automate the cleaning process

 

NO. Once you have a CSV you don't have to "drop it into an Excel sheet", that step is unecessary

From Excel, you go to Data (tab) > From Text/CSV

Sample.png

you then select your CSV (the Power Query Editor opens) and you begin your cleaning process

 

Once you have completed the cleaning process and output/returned the query result to an Excel sheet, next time you download a CSV you copy it (overwrite the previous CSV) in the same folder and Refresh your query

View solution in original post