SOLVED

Help with Power Query from an online database

Occasional Contributor

Hi all,

 

I am trying to set up a dynamic Excel sheet that allows me to input certain keywords into a cell (or two) to search for matching scientific publications from Pubmed.  I managed to set up two tables using Power Query to show the total publications but the keywords are not dynamic - see picture below & attached link; where Attributes are different keywords, Value is just a text string, and Column1 from row 3 to 6 are simply increasing stringency of search parameters with their corresponding return values.  The hyperlink and xml file as a result of Power Query from B6 are also attached below.

 

YYHoe_0-1656038779680.png

 

Questions:

1. How can I turn the Attributes at A2 & L2 into dynamic text string to fit into Power Query?

2. How can I label Column1s at A3..A6 & L3..L6 into proper labels?

 

I also want to extend the dynamic query at row 6 to capture and list additional information like PMIDs and list them out (B9..B18; maybe with a limit of 20) (see below).

 

YYHoe_1-1656039889089.png

YYHoe_2-1656040050079.png

 

And finally using a Pubmed API function efetch to extract keywords associated with each of the articles defined by its PMIDs to display across the corresponding row from column D.  For example, the cell at D9 contains something like this: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=30796508&retmode=xml where the keywords I want to extract to be listed from row D9 are Clinical trial, Curcumin, Diabetes, Metabolic syndrome, NAFLD and Phosphatidylserine. (see below).

 

YYHoe_3-1656040607363.png

 

I hope someone will be able to help.  Any advice and suggestion will be very appreciated.  You can either reply to this thread or directly edit into Book1 file.  Thank you in anticipation.

 

Book1 

Power Query at B6 

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=35708857&retmode=xml 

11 Replies

Hi @YYHoe 

 

Part1 - Stress & Cortisol query
See attached file where I created a table in sheet PARAM with the appropriate links. Up to you to change the Labels

  • My [Number of Results] for Cortisol are accurate, you were summing the [Year] column
  • The Source step of queries Stress & Cortisol is:
= #sections[Section1][StressCortisolLinks]

This is a "trick" to prevent a Power Query Firewall error

 

Part2 - PMIDs

In principal a similar approach can be taken. However, no idea at all what info. you want to extract for each ID nor in which Table that information is strored. Querying i.e. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=35708557&retmode=xml returns a Table with a bunch of nested Tables, some of them having nested Tables as well

==> If you can't make it share a query that returns what you expect for one PMID

 

Part3 - extract keywords by PMID

In which nested Table are the Keywords stored assuming they exist (i.e. no keyword for 35708557 & 35707866)?

I will try out your suggestions in a few hours. Appreciate your reply very much, L z!

Hi @L z. ,

 

Thank you so much for your suggestions.  Let me respond to you in parts below.

 

Part1 - Stress & Cortisol query

Great spot there on my mistake summing the wrong information!  Those numbers looked odd to me but I ignored them until I can solve the bigger issues.  Your worksheet showed everything I wanted to do.  I just need to figure out the details of what you did there and why, so thank you!  And thank you for sharing your "trick" that I had absolutely no knowledge of.

 

Part2 - PMIDs

PMIDs are id tags for all articles that fulfil the search criteria from Part 1.  In this instance, there is actually an API using a hyperlink that returns an xml file with all the information I need nested within <IdList></IdList> (as below).  It limits the results to max 20 IDs so I was hoping Power Query can help me extend that list.  Otherwise, 20 IDs will be a good start for me if not possible. 

 

I managed to extract the IDs using a query in Excel separately.  However, this seems less elegant & lost the link to Part 1 queries.  This discontinuity will prevent me from my ultimate goal in creating a dynamic keyword search in Excel with all these information displayed with one step.

 

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=stress[tiab]+clinical%20tr...

 

YYHoe_0-1656124178323.png

Part3 - extract keywords by PMID

To me, this is the most challenging part of all because the results are varied & are nested many levels.  There are keywords for PMIDs 35708557 & 35707866, yes, but I suppose there maybe articles with no keywords too.  This is, unfortunately, also the most crucial information I need for this project.  What I hope to achieve here is to have a query linking to the ID results in Part 2 to scrape all keywords associated to each PMID.  I suspect all keywords are nested inside both <KeywordList> and <MeshHeadingList>/<MeshHeading> when a specific API is run (see link below).  I tried Power Query but all I got was blank.

 

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=35707866&retmode=xml

 

Final goal

 

What I hope to achieve is to create a dynamic keyword input from cells (perhaps A1..A3) as part of search/query resulting in total matching articles in Part 1 (A2..B6), then linking the matching article PMIDs in Part 2 ($B9..) and the specific keywords (MeshHeadings?) in each PMID in Part 3 (D..$9).  An example of the final output is shown below.

 

YYHoe_2-1656126098012.png

 

I took a closer look at the file StressCortisol.xlsx.  The numbers in the tables are a little off....  For instance, the actual numbers extracted manually using the same hyperlinks are as listed in C10..C13 (see below).  I wonder where the numbers in the table came from.  Will try to go through the query details & see if I can catch them.

 

YYHoe_0-1656131004547.png

 

YYHoe_1-1656131308528.png

 

Hi @YYHoe 

 

Clarification 

Re. "there is actually an API...It limits the results to max 20 IDs so I was hoping Power Query can help me extend that list"

Power Query pulls data through a data connector - an API in your case. In other terms, it doesn't connect directly to a/your database. So, it no way, Power Query can overcome any limitation imposed by the connector in use. If the API limits the result to 20 PMIDs, the max. you can expect is... 20

 

More to come...

 

 

best response confirmed by YYHoe (Occasional Contributor)
Solution

Hi @YYHoe 

 

I think I undestand what you ultimately expect to achieve. Hopefully the following will help

 

Part 1 to 2
Your first query that pulls numbers from i.e. https://pubmed.ncbi.nlm.nih.gov/?term=stress&filter=pubt.clinicaltrial&sort=pubdate returns only 2 fields: [Year] & [# Results] ==> No way to pull the corresponding list of PMIDs


At looked at the construction of the above link and the link to pull PMIDs but couldn't find something consistent enough to auto. construct the second link from the first


Part 2 & 3 - Get keywords for PMIDs
With the URLs you provided at looked at getting the PMIDs and their respective Keyworks from the nested Tables - see attached file


One thing to note regarding the URL for PMIDs. The one you provided is:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=stress[tiab]+clinical%20tr...]

I also tested:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=stress[tiab]+clinical%20tr...]

So, same search criteria but [pt] after clinical%20trial in case 1, [tiab] after clinical%20trial in case 2. Results are different:

 

_Screenshot.png

 

You'll find the correspoding queries in the attached. Hope this helps a bit

Hi @YYHoe 

 

Re. The numbers in the tables are a little off....

IMHO you should look at the query that's actually executed from the web page. For example - still with filter Clinical trial - if you only provide the term stress in the Search box, result = 35644, the actual query is:

("stress"[All Fields] OR "stressed"[All Fields] OR "stresses"[All Fields] OR "stressful"[All Fields] OR "stressfulness"[All Fields] OR "stressing"[All Fields]) AND (clinicaltrial[Filter])

 After reading a bit the E-utilities doc. if I translate the above query as follow and pass it to PQ, result = 35617 (pretty close):

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%28stress+OR+stressed+OR+stresses+OR+stressful+OR+stressfulness+OR+stressing%29+AND+%28clinical%20trial[Filter]%29

 

Similarly, with your example (stress AND human AND plant) OR (stress AND human AND herb), result = 919, actual query is:

((("stress"[All Fields] OR "stressed"[All Fields] OR "stresses"[All Fields] OR "stressful"[All Fields] OR "stressfulness"[All Fields] OR "stressing"[All Fields]) AND ("human s"[All Fields] OR "humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND ("plant s"[All Fields] OR "planted"[All Fields] OR "planting"[All Fields] OR "plantings"[All Fields] OR "plants"[MeSH Terms] OR "plants"[All Fields] OR "plant"[All Fields])) OR (("stress"[All Fields] OR "stressed"[All Fields] OR "stresses"[All Fields] OR "stressful"[All Fields] OR "stressfulness"[All Fields] OR "stressing"[All Fields]) AND ("human s"[All Fields] OR "humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND "herb"[All Fields])) AND (clinicaltrial[Filter])

translated as follow and passed to PQ, result = 918 (almost perfect):

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%28%28%28stress+OR+stressed+OR+stresses+OR+stressful+OR+stressfulness+OR+stressing%29+AND+%28human s+OR+humans[MeSH Terms]+OR+humans+OR+human%29+AND+%28plant%20s+OR+planted+OR+planting+OR+plantings+OR+plants[MeSH Terms]+OR+plants+OR+plant%29%29+OR+%28%28stress+OR+stressed+OR+stresses+OR+stressful+OR+stressfulness+OR+stressing%29+AND+%28human%20s+OR+humans[MeSH Terms]+OR+humans+OR+human%29+AND+herb%29%29+AND+%28clinical%20trial[Filter]%29

 

Hope this helps

Hi @L z.,

 

Thank you for your clarification.  I will look into them.

Hi @YYHoe 

 

Forgot to mention the other day... re. the limit of 20 PMIDs. This is the Default limit. You can extend the list (assuming more than 20 ID exist) with the optional retmax parameter (tested here, works no problem)

Hi @L z.,

 

You have done it perfectly!  Thank you very, very much!

 

Now I just have to understand all the steps you did in the file . . . . a daunting task for me.

Hi @YYHoe 

 

Glad I could help a bit

 

In the meantime I tested a few things and there's always a significant diff. between what the Web site says (# results), the PubMed query (https://pubmed.ncbi.nlm.nih.gov/?term=...) and the esearch query (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=...) - TBH I don't understand

 

On the other hand I figure out that the esearch query provides a bunch of information, including the PMIDs (this closes the gap between Part 1 & 2). Attached is my last version if this can help

 

Good luck with your project