
Pedro Correa
Brass Contributor
Jun 03, 2021

When to use and how to get querying proficiency

Alright, I will try to keep this short, but let me know if more info is needed.

We started a new project 3 months ago that was increasing the amount of data going into our current Elasticsearch. I suggested migrating to Azure Search for this new phase of the project, and so I started migrating.

Q1. Is there a recommended way to ingest data quickly?
 Using Azure Durable Functions to ingest about 600k documents took an absurd amount of time, over 4 hours, using the .NET SDK with batches of 100 documents. The documents are not large, but they are a bit complex.
Q2. Querying: for anything beyond the basics, it is really hard to find information, especially when using the .NET SDK 11. For example, I need to sort results by the value of a field inside an object inside the document (2nd level), where fieldX = A should come first, fieldX = B should come second, and so on. That seemed like a good case for semantic scoring, but again, I couldn't find any example of how to do that. (Suggestion: a querying playground with the hotel data set would be very helpful here.)
Q3. Maybe a bug: is there any way to control the field conversion when the result set is cast? Background info: searchClient.SearchAsync<DocumentsResult>(query, options); one of the fields inside DocumentResult is a string whose value can be "8" or "A8". I can see the data when querying the index in the portal, but when I read the value from the result, this field is always null.

Sorry for the long post....
Anyway, long story short: due to a time crunch, we migrated everything to managed SQL Server last week, where querying is extremely easy. But since the AMA was coming up, I figured I could try getting some answers for next time or for the next phase of the project.


Thanks

Pedro

 

  • Hello Pedro - thank you for your question.

    Regarding your 2nd question, you could use the "orderby" parameter to include your custom ordering functions. Please refer to https://docs.microsoft.com/en-us/azure/search/search-query-odata-orderby and https://docs.microsoft.com/en-us/rest/api/searchservice/Search-Documents for more details. If you have issues using the .Net SDK,

    Did you mean 'semantic search' when you said 'semantic scoring'? If not, could you please elaborate on that?
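One way to read the "orderby" suggestion for the "A first, B second" case is sketched below. It is only an illustration, written in Python rather than against the .NET SDK, and it makes no service calls. It assumes the custom order is precomputed at ingest time into a numeric field that the index marks as sortable; the names `sortRank`, `details`, and `PRIORITY` are hypothetical, and only `fieldX` comes from the question. The orderby string follows the comma-separated `field asc|desc` form from the linked OData page.

```python
# Hypothetical custom order: values of fieldX mapped to a numeric rank.
PRIORITY = {"A": 0, "B": 1, "C": 2}


def nested_value(doc, path):
    """Walk a nested dict along `path`; return None if any key is missing."""
    current = doc
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def with_sort_rank(doc, path=("details", "fieldX")):
    """Return a copy of `doc` with a hypothetical top-level `sortRank` field.

    Unknown or missing values rank after every known value.
    """
    rank = PRIORITY.get(nested_value(doc, path), len(PRIORITY))
    return {**doc, "sortRank": rank}


def build_orderby(*clauses):
    """Join (field, direction) pairs into an OData-style orderby string."""
    return ",".join(f"{field} {direction}" for field, direction in clauses)


docs = [
    {"id": "1", "details": {"fieldX": "B"}},
    {"id": "2", "details": {"fieldX": "A"}},
    {"id": "3", "details": {}},
]
ranked = [with_sort_rank(d) for d in docs]
orderby = build_orderby(("sortRank", "asc"), ("search.score()", "desc"))

# Client-side equivalent of the same ordering, useful for checking the ranks.
ordered_ids = [d["id"] for d in sorted(ranked, key=lambda d: d["sortRank"])]
```

Precomputing a rank keeps the query itself simple: the service sorts on one numeric field, and relevance score breaks ties. If reindexing is not an option, the same `PRIORITY` mapping can be applied to a page of results on the client, as the last line does.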
  • Hi Pedro, let me start with Q1. You are correct that pushing content in batches is the most efficient way of getting content into the search service quickly. Just as a side note, the S2 and higher tiers are backed by premium storage, which also allows indexing to happen faster. However, the added cost does not always warrant the increase in performance. You might also be interested in this sample that we have for optimizing indexing performance, which helps you find optimal batch sizes: https://github.com/Azure-Samples/azure-search-dotnet-samples/tree/master/optimize-data-indexing

    In addition, please keep in mind that you can also parallelize uploads, which can allow you to push data even faster. However, if you do this, it is important to watch for throttling and back off exponentially if you start seeing it. The above sample helps walk through this as well.

    Hope that helps!

    Liam
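The batching, parallel upload, and exponential backoff advice above can be sketched roughly as follows. This is a minimal, SDK-neutral illustration in Python, not the linked .NET sample: `upload_batch` is a hypothetical callable standing in for whatever SDK call pushes one batch, and `ThrottledError` is a hypothetical exception standing in for the service's throttling response. The batch size of 100 matches the question; the worker count and delays are arbitrary starting points.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottledError(Exception):
    """Hypothetical signal that the service throttled a batch upload."""


def chunked(docs, size):
    """Yield consecutive slices of `docs` with at most `size` items each."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def upload_with_backoff(upload_batch, batch, max_retries=5,
                        base_delay=0.5, sleep=time.sleep):
    """Upload one batch, doubling the wait after each throttled attempt."""
    for attempt in range(max_retries + 1):
        try:
            return upload_batch(batch)
        except ThrottledError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            delay = base_delay * (2 ** attempt)
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


def index_all(upload_batch, docs, batch_size=100, workers=4, sleep=time.sleep):
    """Upload all docs in parallel batches; return the total count uploaded.

    Assumes `upload_batch(batch)` returns the number of documents accepted.
    """
    batches = list(chunked(docs, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            lambda b: upload_with_backoff(upload_batch, b, sleep=sleep),
            batches,
        )
        return sum(counts)
```

The `sleep` parameter is injectable only so the retry logic can be exercised without real waiting. In practice, batch size and worker count are worth measuring rather than guessing, which is what the linked sample is for.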
  • Hey Pedro - I'm not quite sure I understand your third question. Are you trying to select a certain field but the data for it isn't being returned? Are you serializing the data and that's when it becomes null?
    • Pedro Correa
      Brass Contributor
      That is correct; during deserialization the data ends up null.
      • DerekLegenzoff
        Microsoft
        Thanks for the clarification. The data is ending up null because a decision was made that, as a best practice, we want customers to own their data models, so the built-in data types don't necessarily serialize all fields (you'll run into a similar issue with counts on facets). To get around the issue you're running into, I would recommend mapping the data from the response into classes that you create. You can do this pretty quickly with a package like Mapster: https://github.com/MapsterMapper/Mapster
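The idea of mapping response data into classes you own can be illustrated with a small hand-written mapper. This is a sketch in Python and does not show Mapster or the .NET SDK. It assumes the raw result is available as a dictionary; the field names `id` and `code` are hypothetical stand-ins for the real schema, with `code` playing the role of the string field that can hold "8" or "A8".

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class DocumentResult:
    """A model owned by the application, not by the search SDK."""
    id: str
    code: Optional[str]  # hypothetical field; may hold "8" or "A8"


def to_document_result(raw: Mapping[str, Any]) -> DocumentResult:
    """Map one raw result into DocumentResult, coercing `code` to a string.

    A value that arrives as a number (8) and one that arrives as text ("A8")
    both end up as strings; a missing or null value stays None.
    """
    code = raw.get("code")
    return DocumentResult(
        id=str(raw["id"]),
        code=None if code is None else str(code),
    )


results = [
    to_document_result({"id": 1, "code": 8}),
    to_document_result({"id": 2, "code": "A8"}),
    to_document_result({"id": 3}),
]
```

Doing the conversion explicitly makes it visible where a value would otherwise be dropped, which is the point of owning the model.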
