Forum Discussion

Andytcc's avatar
Andytcc
Copper Contributor
Feb 20, 2024

Get Data

I'm not sure if I'm wording this right. and I'm also new to the following.

I am using the "get data from the web" in Excel. I am receiving the data no problem. but as the file gets larger it takes too long to retrieve the data. is there a way to somehow delete the data every time is retrieved or what can I do so it doesn't take so long to retrieve the data.

 

37 Replies

  • Lorenzo's avatar
    Lorenzo
    Silver Contributor

    Andytcc 

     

    Quick check with CSV of 1M rows x 10 columns:

    Reverse Rows + Keep Top 201 Rows is slower than Keep Bottom 201 Rows here 

  • Lorenzo's avatar
    Lorenzo
    Silver Contributor

    Hi Andytcc 

     

    I am using the "get data from the web", later you say I'm actually getting the data from a csv file on an SD card in a PLC. Quite confusing 🙂
    Q1: So CSV only, correct?
    Q2: Does PLC mean 'Programmable Logic Controller? (not necessarily important but curious)


    Excel / Get Data from... = Power Query
    When, within Excel (or Power BI), you Get Data from Web, CSV, Excel, SQL, PDF, JSON... there's what's called a Connector (a piece of software) between the Power Query engine and the Data Source. As you understand the in used Connector depends on the targeted Data Source (CSV in your case)
    Wonder why I talk about this? There's someone else to be aware of re. query exec./refresh. Things happen in 2 steps: Query optimization, then actual query execution. I encourage you to read

    To my knowledge (I can't find out articles/forum article mentioning this) your CSV is accessed twice (is it read 100% each time? I don't know):

    1. During Query optimization
    2. During Query actual execution

    It might be read other times depending on what your query does (cf. Chris Blog Post)

     

    (Involving another XLS/XLSX won't help at all. Reading a CSV is faster than reading an Excel file - probably due to the complexity of Excel file format/structure)

     

    As for the PLC, it will only write to CSV...every hour the PLC writes 201 values in the next available row in the csv file
    Q3: As I understand there would be no way/chance to Filter data at the Source (PLC), correct?

     

    Since the CSV file is "New data on bottom" I use "Reverse Rows" and "Keep last Rows"
    Q4: Not sure I understand the logic here. Shouldn't this be "Reverse Rows" and "Keep first Rows"?

    Not sure (to be tested and with Power Query expect 'surprises') but instinctively I would do - only - Keep Bottom Rows (201) instead

     

    If there was a way to delete the data automatically once retrieved
    No way with Power Query. Power Query gets data (reads only) so it won't write in any case on your SD nor anything else

     

    Tiny optimizations you can implement right away in your context (don't expect significant improvements though). In Query Options CURRENT WORKBOOK:
    - Data Load: Uncheck 'Allow data previews to download in the background'
    - Privacy: Check 'Ignore the Privacy Levels and potentially improve performance'

     

    EDIT (forgot)

    No idea what the size of your CSV is (#rows ??) nor if this would be acceptable for the end user... Still in Query Options GLOBAL

    - Data Load: Check 'Fast Data Load'

    => How Fast is Fast Data Load in Power Query?

     

    Other things you should consider:
    - If the CSV represents say 10 columns and for your calc. & graph you only need 3, get rid of the other 7 columns as early as possible in your query
    - When you get data from CSV, initially columns are typed Text in Power Query Editor. Later in the APPLIED STEPS you probably have a Changed Type step, before you filter (keep top/bottom rows). The Change Type step (assuming you need Typed data for your calc. & graph) should be moved after (or even at the end of the query)

     

    If you post your complete query code I can check if there's something else you could do. But don't expect miracles if there's no way to filter/limit the amount of data before the CSV is generated

  • mathetes's avatar
    mathetes
    Silver Contributor

    Andytcc 

     

    What's the nature of the data you're retrieving? What's the source? Does the source itself have a way to limit based, say, on start and end dates?

    • Andytcc's avatar
      Andytcc
      Copper Contributor
      I'm actually getting the data from a csv file on an SD card in a PLC. I'm doing it with an IP address and the name of the file. every hour the PLC writes 201 values in the next available row in the csv file. and i get the data from the csv file and do graphs and whatnot to my liking.
      • mathetes's avatar
        mathetes
        Silver Contributor

        Andytcc 

         

        So maybe the answer to your question is present in that methodology. Can you not just delete what's on the SD card after each time that you bring it in and convert from CSV to XLSX? Then just append the new data to what you already have in Excel?

         

        It's a question: I'm not speaking from experience in bringing data in that way. I routinely bring in data from the internet, but via a different set of steps (specifically, I export from a financial services organization, then import that "up-to-date" data on investments).

Resources