SOLVED

Root bot, skill bot and scaling


Hi,

 

We need help with scaling.

 

Problem statement: With 2 instances of root bot running, a skill bot invocation is unable to return results reliably to the root bot.

 

With 1 instance of the root bot and 1 instance of the skill bot, everything works as expected.

 


We configured the root bot with "Scale Out" and "Manual Scale" to set the instance count to 2.  When we repeat the interaction, we see that the skill response is not getting returned reliably.  

 

In a successful scenario where the user's skill response is routed back to the root bot that the user interacted with, the user gets the skill results.

 


 

In a failed scenario, the user's skill response was routed to the root bot instance that the user did not interact with.  Hence, the results are lost.

 


How do we ensure that the skill results are returned to the correct root bot instance?

 

We reviewed the "Skills overview" documentation at https://docs.microsoft.com/en-us/azure/bot-service/skills-conceptual?view=azure-bot-service-4.0, but it does not mention anything about scaling and persistence.

 

We had thought that the lack of persistence could be due to the bots' conversation state.  However, our root bot currently stores all conversation state in a database.  Since both instances use the same DB connection string, they should have access to the same conversation state.

 

Thank You

32 Replies
@HunaidHanfee-MSFT, would it be possible to have access to the actual web apps behind the manifest you have shared? We would like to view the code via Kudu console and examine the scale out settings that have been applied to both the root bot and skill dialog bot.

Thank you for the update.

I reviewed most of the botbuilder-dotnet code and came to a few conclusions:

- There does not seem to be much code related to ARR at all. My guess is that cookie affinity is not a part of the framework to support pinning root to skill calls.

- Root to skill using DeliveryMode.ExpectReplies looks like it should work (and it sounds like you may have tried it already; details would be great :-)). Check out the code example below; it's a good template for how it works.

https://github.com/microsoft/botbuilder-dotnet/blob/f28cad18948298f30cb7fc4973c143cf08ad7341/tests/S...

Also, check out how it's handled in the SendActivitiesAsync call:

https://github.com/microsoft/botbuilder-dotnet/blob/f28cad18948298f30cb7fc4973c143cf08ad7341/librari...
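
To make the ExpectReplies idea concrete, here is a toy model of the two delivery paths. It is only a sketch under stated assumptions: none of the names below (RootInstance, RoundRobinBalancer, callSkill) come from the Bot Framework SDK, and the round-robin balancer is just a stand-in for a scaled-out web app without session affinity. It shows where a skill reply lands in each mode, not whether ExpectReplies fixes a real deployment.

```typescript
// Toy model only: none of these names come from the Bot Framework SDK.
type Reply = { text: string };

class RootInstance {
  received: Reply[] = [];
  constructor(public name: string) {}
}

// Hypothetical stand-in for a load balancer without session affinity.
class RoundRobinBalancer {
  private next = 0;
  constructor(private instances: RootInstance[]) {}

  route(): RootInstance {
    const target = this.instances[this.next % this.instances.length];
    this.next += 1;
    return target;
  }
}

// The skill either calls back through the balancer (default delivery)
// or returns its replies in the response to the caller (expect replies).
function callSkill(
  caller: RootInstance,
  balancer: RoundRobinBalancer,
  expectReplies: boolean
): void {
  const replies: Reply[] = [{ text: "skill result" }];
  if (expectReplies) {
    // Replies travel back on the same request, so they reach the caller.
    caller.received = caller.received.concat(replies);
  } else {
    // Each reply is a new inbound request; the balancer picks the instance.
    for (const reply of replies) {
      balancer.route().received.push(reply);
    }
  }
}

const rootA = new RootInstance("A");
const rootB = new RootInstance("B");
const balancer = new RoundRobinBalancer([rootB, rootA]);

callSkill(rootA, balancer, false); // callback is routed to B, not the caller
callSkill(rootA, balancer, true);  // reply comes back to A directly

console.log(rootA.received.length); // 1
console.log(rootB.received.length); // 1
```

In the callback case the reply reaches an instance that did not start the skill call, so that instance needs some shared way to look up the original conversation.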

 

This does not explain why the 404 occurs in the first place. It would be nice to get a definitive answer on whether root bot instances are able to share skill bot responses with one another.

Thank You Sir! Although, I am really curious how Microsoft is setting up their root bot/skill dialog bot code. I have requested access to view their code and web app settings in their Azure instance. I have not heard back from @HunaidHanfee-MSFT...
Thanks for your reply. However, that is not the example we have been using. As mentioned in the title, we are encountering issues when we implement the root-bot skill-bot scenario, which is modeled after https://github.com/microsoft/BotBuilder-Samples/tree/main/samples/javascript_nodejs/81.skills-skilld.... Kindly retest the scaling issue with the aforementioned sample and let us know if you are able to reproduce the error on your end.

@voonsionglum - 

Apologies for the delayed response. We have set up the above sample and retested it with scaling. It is working fine on our end.

 

@voonsionglum @Slacked2737 - add another person to the list of those suffering from the rootbot scaling issue. the sample manifest supplied by Microsoft offers nothing helpful. it doesn't look like anything in there would have anything to do with callbacks to the right rootbot instance. it's just a manifest of skills.

we're also looking into the ExpectReplies workaround. have you tried it? does it work okay? or were you able to get the callbacks to finally work?

@padamide Unfortunately, we are still stuck.  We did not get good results from the ExpectReplies workaround.  I'd be interested to know how it goes on your end. 

@voonsionglum - that's unfortunate news.  We haven't started work on the ExpectReplies workaround yet.  We're still in the process of researching possible solutions.  That's how I came across this thread.

 

What was the result with that workaround?  Did you still run into scaling issues or did it bring on a whole new can of worms entirely?

 

We're also still using some of the deprecated types within the bot framework.  Refactoring those is a tech debt item; however, there's no indication that they have anything to do with this scaling issue.

I had to go back and double check with my counterpart, whom I have worked with to resolve the rootbot scaling issue. He was experimenting with the ExpectReplies workaround and what we found was that the workaround did not resolve the rootbot scaling issue. The skillbot was still unable to reliably return the skill results to the correct rootbot instance.

We initially thought that the deprecated types and the NodeJS bot framework version might play a part in the solution not working. However, we updated all of our dependencies and code to use the latest release and we still could not get the solution to work.

Microsoft says they found no issues on their end. I am hoping that if your team can reproduce this problem, it will be an incentive for Microsoft to dig a little deeper.
best response confirmed by voonsionglum (Brass Contributor)
Solution

@voonsionglum 

We had James check his data and found this. See if it helps. In the root bot:

 

  • Double-check and be 100% sure that you're using the SkillConversationIdFactory that is part of the MS chatbot framework (NOT one that you may have created). It should have an IStorage constructor parameter that lets you pass in whatever storage you want to use to persist the IDs used for skills communication. You probably need to use the class provided by the chatbot framework (i.e., the SkillConversationIdFactory that inherits from "Microsoft.Bot.Builder.Skills.SkillConversationIdFactoryBase").

 

  • For the IStorage object used by SkillConversationIdFactory: if you are using some kind of in-memory-only storage (e.g., a ConcurrentDictionary or other MemoryStorage-type object), that might be the problem. The code in SkillConversationIdFactory might not be persisting the conversation/skill ID lookup data (needed to talk with skills) to a place that other apps or instances can read.

 

I have found some old MS examples that give a "demo" of how to use skills and show a SkillConversationIdFactory that uses in-memory storage, which of course won't scale or work across different apps.

 

https://learn.microsoft.com/en-us/dotnet/api/microsoft.bot.builder.skills.skillconversationidfactory...
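
To illustrate why the storage behind the conversation ID factory matters, here is a minimal sketch, not the SDK's implementation. KeyValueStore, InMemoryStore, and SketchConversationIdFactory are hypothetical stand-ins; the real storage interface is asynchronous and stores richer objects than a single string. The sketch assumes instance A creates the skill conversation ID and instance B receives the skill's callback.

```typescript
// Hypothetical, simplified stand-in for a storage provider.
interface KeyValueStore {
  read(key: string): string | undefined;
  write(key: string, value: string): void;
}

class InMemoryStore implements KeyValueStore {
  private items: { [key: string]: string } = {};

  read(key: string): string | undefined {
    return Object.prototype.hasOwnProperty.call(this.items, key)
      ? this.items[key]
      : undefined;
  }

  write(key: string, value: string): void {
    this.items[key] = value;
  }
}

// Hypothetical factory: maps a skill conversation ID back to the root
// conversation it was created for, using whatever store it is given.
class SketchConversationIdFactory {
  private counter = 0;
  constructor(private instanceName: string, private store: KeyValueStore) {}

  createSkillConversationId(rootConversationId: string): string {
    this.counter += 1;
    const skillConversationId = this.instanceName + "-skill-" + this.counter;
    this.store.write(skillConversationId, rootConversationId);
    return skillConversationId;
  }

  getRootConversationId(skillConversationId: string): string | undefined {
    return this.store.read(skillConversationId);
  }
}

// Instance A starts the skill call; instance B handles the callback.
function lookupFromOtherInstance(
  storeA: KeyValueStore,
  storeB: KeyValueStore
): string | undefined {
  const instanceA = new SketchConversationIdFactory("A", storeA);
  const instanceB = new SketchConversationIdFactory("B", storeB);
  const skillConversationId =
    instanceA.createSkillConversationId("user-conversation-1");
  return instanceB.getRootConversationId(skillConversationId);
}

// Each instance has its own memory: B cannot resolve A's ID.
const separate = lookupFromOtherInstance(new InMemoryStore(), new InMemoryStore());

// Both instances use one shared store (standing in for a database or blob
// container): B resolves the ID that A created.
const sharedStore = new InMemoryStore();
const shared = lookupFromOtherInstance(sharedStore, sharedStore);

console.log(separate); // undefined
console.log(shared);   // user-conversation-1
```

With per-instance memory the lookup on instance B comes back empty, which matches the symptom of skill results getting lost whenever the callback lands on a different root bot instance.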

 

 

@Slacked2737  thanks Stussy!

@Slacked2737 Thank You for this AWESOME reply.  With your suggestion, we made sure to use the SkillConversationIdFactory that is part of the MS chatbot framework and ensured that we have the same non-memory storage passed into it.

 

Note: we have also switched from our previous CosmosDB Mongo API storage implementation to BlobsStorage.  Long story short, with CosmosDB Mongo API, we were getting this.storage.read and this.storage.write errors.  This is not related to Microsoft, but related to the DBStorage class we implemented.

 

We scaled out our root bot to 3 instances and our skill bot to 3 instances.  We restarted both bots and ALL skill results are now getting returned successfully!

 

I can't wait to share this with our Team.  

 

Thank You and Happy New Year!