Get Data


I'm not sure if I'm wording this right, and I'm also new to the following.

I am using the "get data from the web" in Excel. I am receiving the data no problem. but as the file gets larger it takes too long to retrieve the data. is there a way to somehow delete the data every time is retrieved or what can I do so it doesn't take so long to retrieve the data.

 


@Andytcc 

 

What's the nature of the data you're retrieving? What's the source? Does the source itself have a way to limit based, say, on start and end dates?

I'm actually getting the data from a csv file on an SD card in a PLC. I'm doing it with an IP address and the name of the file. Every hour the PLC writes 201 values in the next available row in the csv file, and I get the data from the csv file and do graphs and whatnot to my liking.

@Andytcc 

 

So maybe the answer to your question is present in that methodology. Can you not just delete what's on the SD card after each time that you bring it in and convert from CSV to XLSX? Then just append the new data to what you already have in Excel?

 

It's a question: I'm not speaking from experience in bringing data in that way. I routinely bring in data from the internet, but via a different set of steps (specifically, I export from a financial services organization, then import that "up-to-date" data on investments).

That's a good point. Thank you.
I've thought of it, but I don't know how I would delete the data without doing it manually.
If there was a way to delete the data automatically once retrieved, I've got it made.
What benefit would I have converting to XLSX? Doesn't the format have to be CSV for "Get Data"? As for the PLC, it will only write to CSV.
The other thing, I'm only the designer, not the end user, even though I'll still have access.
Just to explain a bit deeper how I manage retrieved Data:
Since the CSV file is "New data on bottom" I use "Reverse Rows" and "Keep last Rows". I was hoping "Keep last Rows" would solve the issue, not realizing that all the data is retrieved every time.
From there I use the table for all my graphs and formulas and so on. Once done and working as planned, I'll have a pretty unique and sophisticated yet user-friendly program. Any help would be appreciated. Thanks.

@Andytcc 

I've thought of it, but I don't know how I would delete the data without doing it manually.
If there was a way to delete the data automatically once retrieved, I've got it made.

 

And I can't help you with an automated method. Given your hourly volume I can see why that would be useful. Aren't there methods with Apple's and IBM-compatible systems to automate recurring steps like that? [Let's hope somebody else with experience in that area can join in with that advice.]

 

What benefit would I have converting to XLSX? Doesn't the format have to be CSV for "Get Data"? As for the PLC, it will only write to CSV.

 

I was assuming that when you bring the CSV data into Excel, you are converting to Excel format. Is that not the case? Anyway, it was the post-import stage where I was assuming a conversion.

 

The other thing, I'm only the designer, not the end user, even though I'll still have access.

 

That definitely complicates things. You'll want the process to be foolproof.

 

Just to explain a bit deeper how I manage retrieved Data:

 

Since the CSV file is "New data on bottom" I use "Reverse Rows" and "Keep last Rows". I was hoping "Keep last Rows" would solve the issue, not realizing that all the data is retrieved every time.

 

Again, not knowing what kind of data this is, not knowing the source... I can only make a guess: I would hope that there'd be a way, especially if you can delete the data from that SD card, to get only the data that is fresh since the last update.

 

From there I use the table for all my graphs and formulas and so on. Once done and working as planned, I'll have a pretty unique and sophisticated yet user-friendly program.

 

I wish you well.

Hi @Andytcc 

 

I am using the "get data from the web", later you say I'm actually getting the data from a csv file on an SD card in a PLC. Quite confusing 🙂
Q1: So CSV only, correct?
Q2: Does PLC mean 'Programmable Logic Controller'? (not necessarily important but curious)


Excel / Get Data from... = Power Query
When, within Excel (or Power BI), you Get Data from Web, CSV, Excel, SQL, PDF, JSON... there's what's called a Connector (a piece of software) between the Power Query engine and the Data Source. As you can understand, the Connector used depends on the targeted Data Source (CSV in your case).
Wonder why I talk about this? There's something else to be aware of re. query execution/refresh. Things happen in 2 steps: Query optimization, then actual query execution. I encourage you to read up on this.

To my knowledge (I can't find articles/forum posts mentioning this), your CSV is accessed twice (is it read 100% each time? I don't know):

  1. During Query optimization
  2. During Query actual execution

It might be read more times depending on what your query does (cf. Chris's blog post).
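
For reference, here is a minimal sketch of what an auto-generated query against such a CSV could look like (the IP address, file name and options are illustrative assumptions, not your actual setup):

// Hypothetical sketch - URL, file name and options are placeholders
let
    // The CSV connector parses the raw bytes returned by Web.Contents
    Source = Csv.Document(
        Web.Contents("http://192.168.0.10/LOG.CSV"),
        [Delimiter = ",", Columns = 201, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    // The first CSV row becomes the column headers
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    PromotedHeaders

Every step in that chain runs again on each refresh, which is why the whole file is retrieved every time.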

 

(Involving another XLS/XLSX won't help at all. Reading a CSV is faster than reading an Excel file, probably due to the complexity of the Excel file format/structure.)

 

As for the PLC, it will only write to CSV...every hour the PLC writes 201 values in the next available row in the csv file
Q3: As I understand it, there would be no way/chance to filter data at the source (PLC), correct?

 

Since the CSV file is "New data on bottom" I use "Reverse Rows" and "Keep last Rows"
Q4: Not sure I understand the logic here. Shouldn't this be "Reverse Rows" and "Keep first Rows"?

Not sure (to be tested, and with Power Query expect 'surprises'), but instinctively I would do only Keep Bottom Rows (201) instead.
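
As a rough sketch (the source steps are placeholders), the two variants would look something like this, with only the last step differing:

// Hypothetical sketch - previous steps shortened for illustration
let
    Source = Csv.Document(Web.Contents("http://192.168.0.10/LOG.CSV")),
    PromotedHeaders = Table.PromoteHeaders(Source),
    // Option A (your current approach): reverse all rows, then keep the first 201
    ReverseThenKeepFirst = Table.FirstN(Table.ReverseRows(PromotedHeaders), 201),
    // Option B (suggested): keep the bottom 201 rows in a single step
    KeepBottomOnly = Table.LastN(PromotedHeaders, 201)
in
    KeepBottomOnly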

 

If there was a way to delete the data automatically once retrieved
No way with Power Query. Power Query only reads data, so it won't write to your SD card or anything else.

 

Tiny optimizations you can implement right away in your context (don't expect significant improvements though). In Query Options CURRENT WORKBOOK:
- Data Load: Uncheck 'Allow data previews to download in the background'
- Privacy: Check 'Ignore the Privacy Levels and potentially improve performance'

 

EDIT (forgot)

No idea what the size of your CSV is (#rows ??) nor if this would be acceptable for the end user... Still in Query Options GLOBAL

- Data Load: Check 'Fast Data Load'

=> How Fast is Fast Data Load in Power Query?

 

Other things you should consider:
- If the CSV has, say, 10 columns and for your calc. & graph you only need 3, get rid of the other 7 columns as early as possible in your query
- When you get data from a CSV, columns are initially typed as Text in the Power Query Editor. Later in the APPLIED STEPS you probably have a Changed Type step, before you filter (keep top/bottom rows). The Changed Type step (assuming you need typed data for your calc. & graph) should be moved after the filter, or even to the end of the query (see the sketch below)
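
Putting both points together, a sketch of the reordered query might look like this (column names, types and URL are assumptions for illustration only):

// Hypothetical sketch - column names and types are placeholders
let
    Source = Csv.Document(Web.Contents("http://192.168.0.10/LOG.CSV")),
    PromotedHeaders = Table.PromoteHeaders(Source),
    // Keep only the columns actually used downstream, as early as possible
    KeptColumns = Table.SelectColumns(PromotedHeaders, {"Timestamp", "Sensor1", "Sensor2"}),
    // Reduce to the rows you need before anything expensive happens
    LastRows = Table.LastN(KeptColumns, 201),
    // Type the data at the end, on 201 rows instead of the whole file
    ChangedType = Table.TransformColumnTypes(
        LastRows,
        {{"Timestamp", type datetime}, {"Sensor1", type number}, {"Sensor2", type number}}
    )
in
    ChangedType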

 

If you post your complete query code, I can check if there's something else you could do. But don't expect miracles if there's no way to filter/limit the amount of data before the CSV is generated.

@SergeiBaklan 

 

Thanks. I didn't mention that one as Curt (in Chris's post) said they will likely only support http/https moving forward (5 years later, file: - at least - is still supported). If @Andytcc runs a version of Excel that still receives updates and they finally make that change to the Web connector...

@Andytcc 

 

Quick check with a CSV of 1M rows x 10 columns:

Reverse Rows + Keep Top 201 Rows is slower than Keep Bottom 201 Rows here.

@SergeiBaklan 

 

I might have misunderstood you. Re-reading, @Andytcc said "I'm actually getting the data from a csv file on an SD card in a PLC. I'm doing it with an IP address and the name of the file".

 

So, were you trying to say he actually uses Web.Contents (probably) and not File.Contents?

(that would explain the confusion re. I am using the "get data from the web")
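
Just to illustrate the difference (the IP address and paths are placeholders, not the actual setup):

// Hypothetical illustration of the two connectors
let
    // Web connector: the CSV is served over HTTP at an IP address
    WebSource  = Csv.Document(Web.Contents("http://192.168.0.10/LOG.CSV")),
    // File connector: the CSV is reached through a (local or UNC) file path
    FileSource = Csv.Document(File.Contents("\\192.168.0.10\Logs\LOG.CSV"))
in
    WebSource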

@Andytcc 

 

With @Lorenzo  and @SergeiBaklan having joined in, you're now in good hands, far more competent hands than mine. 

@mathetes You're too humble and I'm not that competent, don't set the bar too high 🙂

@Lorenzo 

Perhaps it's me who was confused by "get data from the web". Yes, I assumed the web connector.

Anyway, the task is not clear enough, at least for me. If refresh is once per week, that's one story; if every hour, that's another. Do we refresh manually (e.g. on file opening), or could Power Automate triggered on CSV file update work? Could we modify the CSV file update so it doesn't keep the entire history (or keeps it separately)? Etc.

@Lorenzo @mathetes 

You're too humble and I'm not that competent, don't set the bar too high 🙂

Yes, join.

@SergeiBaklan @Lorenzo 

 

Well, thank you both. I guess we have different competencies...but I do consider my expertise to be more in the region of database design, not in data retrieval with all of its (literally) ins and outs.

@SergeiBaklan 

 

Perhaps it's me who was confused by "get data from the web". Yes, I assumed the web connector

And you're probably right, as after re-reading (I overlooked it on first reading): ...with an IP address

 

Anyway, the task is not clear enough, at least for me

Same here. Let's see how it develops...

Thanks everyone for all the input.
Sorry for not replying sooner, but I was out all day yesterday until very late. Just so you know I'm still here, I'll give a quick reply.
I have read all your comments (not all the links though). I wasn't sure who to reply to, so I picked the last one.
I don't understand some of what was posted since I'm new to Power Query. I'm also a new member of the community and not sure how things are done here, so please bear with me.
You had a few questions in your first reply.
Q1: So CSV only, correct? A: Yes ("Get Data from the Web" through FTP/IP address/.CSV)
Q2: Does PLC mean 'Programmable Logic Controller'? (not necessarily important but curious) A: Yes
Q3: As I understand it, there would be no way/chance to filter data at the source (PLC), correct? A: No (no way possible). It loads 201 columns (not rows, my mistake) into the next row down, limited by the size of the SD card.
Q4: Not sure I understand the logic here. Shouldn't this be "Reverse Rows" and "Keep first Rows"? A: I could post the actual file, but you'll have to help me. I am limited online, so the easiest way would be to somehow post it right here, but I don't know how.
I reread your reply and refer to Question 4 in your first reply.

Q4: Not sure I understand the logic here. Shouldn't this be "Reverse Rows" and "Keep first Rows"?

Not sure (to be tested, and with Power Query expect 'surprises'), but instinctively I would do only Keep Bottom Rows (201) instead.

I actually "Keep last Rows" then "Reverse Rows"
Refresh is done every time file is opened could be multiple times a day to weekly.
What I mean by hourly is: PLC will do an hourly write to the csv file of the 201 values.
The reason for hourly is: The logging is for temperature in a Grain Bin. So if I do an hourly log that gives me a good deviation in a 24hr period between min/max temp and therefore I can see which sensors are in the grain and with that the number of Bushels still in the Bin (not accurately to the bushel but a very good idea).