Feb 21 2024 05:29 AM
I have written some Python to send a request to the Graph API to search a OneDrive for files containing keywords.
import requests

def get_results(keywords, drive_id, next_url=None, total_results=None):
    # Create a fresh list per top-level call; a mutable default ([]) would be shared across calls
    if total_results is None:
        total_results = []
    search_query = ' AND '.join(keywords)
    top = 500  # Set the desired number of results to retrieve
    select_properties = "id,name,createdDateTime"  # Example properties to select (not yet added to the URL)
    if not next_url:
        # First page: build the search URL
        base_search_url = f"https://graph.microsoft.com/v1.0/drives/{drive_id}/root/search(q='{search_query}')"
        top_search = f"$top={top}"
        search_url = f"{base_search_url}?{top_search}"
    else:
        # Later pages: use the @odata.nextLink URL as returned
        search_url = next_url
    headers = {
        'Authorization': 'Bearer {}'.format(access_token),  # access_token is assumed to be defined elsewhere
        'Content-Type': 'application/json'
    }
    response = requests.get(search_url, headers=headers)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of silently ending pagination
    search_results = response.json()
    total_results += search_results.get('value', [])
    if '@odata.nextLink' in search_results:
        next_url = search_results['@odata.nextLink']
        get_results(keywords, drive_id, next_url, total_results)
    return total_results
However, when I run this, I get around 7k results, although I know for a fact there should be around 10k. When I search the files in the OneDrive using body: keyword1 AND keyword2, all files are returned, but with this method, which follows the skip tokens, some files are not returned.
On top of this, I also get duplicates: of the 7k files, only 6k are unique.
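As a stopgap for the duplicates, the collected items can be de-duplicated by id on the client. This is only a minimal sketch: it assumes each item is a dict with an 'id' key, as in the 'value' array of the response, and dedupe_by_id is a hypothetical helper name. It does not recover the files that are missing from the results.

```python
def dedupe_by_id(items):
    """Return items with repeated 'id' values removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for item in items:
        item_id = item.get("id")  # Assumption: every item carries an 'id' key
        if item_id in seen:
            continue  # Already collected this item on an earlier page
        seen.add(item_id)
        unique.append(item)
    return unique
```

It would be called on the accumulated list, for example dedupe_by_id(get_results(keywords, drive_id)).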
I'm not sure what I've done wrong with this method, and I'm wondering if I could get some assistance with it.
Thanks 🙂
Jul 18 2024 07:15 AM - edited Jul 18 2024 11:08 AM
@hpage910 I'm commenting to bring awareness to this issue. We ran into this exact same problem and incorrectly assumed it was caused by the SDK we were using to make the API calls. However, when we tested it using the web Graph Explorer, we still ran into duplicate search results.
We uploaded a folder containing 10k unique text files (about 1 KB each) to OneDrive using the OneDrive Windows Sync Client. We then gave it a few weeks to make sure everything had synced across the Graph API service.
The test files have a modified date of April 15th, so we ran a Modified Date Range query from April 10th to April 20th. The Graph API correctly reports a "Total" of 10k documents in the response object. Using pagination of 50 files per request, it fetched 10k documents, but when we analyzed the document ids in the responses, we found 8,902 unique documents and 1,098 duplicates.
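For anyone who wants to reproduce this kind of breakdown, the unique and duplicate counts can be computed from the list of returned ids. This is a rough sketch, not the exact script we used: summarize_ids is a hypothetical helper name, and it assumes the ids from all pages have already been gathered into one list.

```python
from collections import Counter

def summarize_ids(ids):
    """Return (unique_count, duplicate_count, repeated) for a list of document ids."""
    counts = Counter(ids)
    unique_count = len(counts)                 # Number of distinct ids
    duplicate_count = len(ids) - unique_count  # Extra occurrences beyond the first
    # Ids that came back more than once, with how many times each was seen
    repeated = {doc_id: n for doc_id, n in counts.items() if n > 1}
    return unique_count, duplicate_count, repeated
```

The two counts always add up to the number of ids fetched, which matches the 8,902 + 1,098 = 10,000 split above.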
Many people are running into this issue, and there is no clarity about a solution. Microsoft Support, please fix this. We need the API to be deterministic and accurate. We tried setting the "trimDuplicates" property in the request and adding a sort property to prevent it from sorting by relevance, and we still got the same results.