Root bot, skill bot and scaling


Hi,

We need help with scaling.

Problem statement: With 2 instances of the root bot running, a skill bot invocation cannot reliably return its results to the root bot.

With 1 instance of the root bot and 1 instance of the skill bot, everything works as expected.

[Screenshot: voonsionglum_0-1638168023558.png]

We configured the root bot with "Scale Out" and "Manual Scale" to set the instance count to 2. When we repeat the interaction, the skill response is not returned reliably.

In a successful scenario where the user's skill response is routed back to the root bot that the user interacted with, the user gets the skill results.

[Screenshot: voonsionglum_1-1638168279544.png]

In a failed scenario, the skill response is routed to a root bot instance that the user did not interact with, so the results are lost.

[Screenshot: voonsionglum_2-1638168330384.png]

How do we ensure that the skill results are returned to the correct root bot instance?

We reviewed the "Skills overview" documentation at https://docs.microsoft.com/en-us/azure/bot-service/skills-conceptual?view=azure-bot-service-4.0, but it does not mention anything about scaling and persistence.

We had thought that the lack of persistence could be due to the bots' conversation state. However, our root bot currently stores all conversation state in a database. Since both instances use the same database connection string, they should have access to the same conversation state.

Thank You

25 Replies
We are looking into this, I will get back to you soon.

Thanks

@voonsionglum 
Could you please share the sample code or repo that you are following?

Hi,

Please refer to https://github.com/microsoft/BotBuilder-Samples/tree/main/samples/csharp_dotnetcore/80.skills-simple... for the sample code.

Once the root bot and skill bot have been deployed, please SCALE OUT the root bot so that it has more than 1 instance. Once scaling is done, try interacting with the root bot.

Thank You
Thanks for sharing the steps. I will get back to you once we have a repro.

Thanks
Could you please share how you scaled out? Did you create a new app registration for that? Can you share the steps?
Can you please tell us what approach you followed?

We haven't heard anything on this issue for a while. Feel free to share updates on it.
Hi,

Here are the steps we took to scale out our root bot

1. Click on "Scale out (App Service plan)"
2. Select "Manual scale"
3. Under "Override condition", set "Instance count" to any number other than 1 (e.g. 3)
4. Click "Save"
Hello,
We have been trying to reproduce this on our end, but it is working fine for us.
Here's what we did:
1. We published the skill to Azure.
2. We published the RootBot to Azure.
3. We installed the manifest and interacted with the bot; we got the response from the skill.
4. We went to Azure and scaled out to two instances. We interacted again and got the response.

We did not create a new app registration or supply a new app ID anywhere (in case that is needed). Please let us know if we are missing something.

@voonsionglum -
We have deployed a root bot instance and are attaching the manifest. Please try it on your side and check whether you see the same issue with our app as well.

Please let us know if you are still facing this issue.

That's very interesting. We were using the botbuilder-samples from 4.9.0. We have not tried the latest ones from 4.15.0. I wonder if that makes any difference. We'll try the latest root bot skill bot samples and let you know how things go.

Thank You
Yes. If the issue still persists, please let us know what we could be missing while reproducing it on our end.

We're having this exact scaling issue on 4.10. If we try to scale the root bot instances above one, communication made by a skill back to a root bot instance that is NOT the originating root bot produces a 404 (and the skill errors out).

Any luck with finding out whether this is a version issue?

We found this delivery mode option (ExpectReplies), which will tie the call and the response to the same root bot instance, but it seems like it might just be an alternate workaround.

https://docs.microsoft.com/en-us/azure/bot-service/skills-about-skill-consumers?view=azure-bot-servi...
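For anyone comparing the two delivery modes, here is a rough TypeScript sketch of why ExpectReplies sidesteps the routing problem. It is a toy model, not Bot Framework code: the `Activity` shape is cut down to three fields, and `toySkill`, `withExpectReplies` and `postToCallback` are hypothetical names made up for illustration. The only piece taken from the Activity schema is the `deliveryMode` field and its `"expectReplies"` value.

```typescript
// Toy model only: a reduced Activity shape, not the SDK's real type.
type Activity = {
  type: string;
  text: string;
  deliveryMode?: "normal" | "expectReplies";
};

// What the caller gets back on the same HTTP request.
type SkillResponse = { status: number; replies: Activity[] };

// Hypothetical skill: echoes the incoming text back to the caller.
function toySkill(
  incoming: Activity,
  postToCallback: (reply: Activity) => void
): SkillResponse {
  const reply: Activity = { type: "message", text: `echo: ${incoming.text}` };
  if (incoming.deliveryMode === "expectReplies") {
    // The reply rides back in the response body of the original request,
    // so it reaches whichever root instance made the call.
    return { status: 200, replies: [reply] };
  }
  // Otherwise the reply is a separate request to the root bot's endpoint,
  // which a load balancer may hand to any instance.
  postToCallback(reply);
  return { status: 200, replies: [] };
}

// Hypothetical helper: copy the activity and mark it as expectReplies.
function withExpectReplies(activity: Activity): Activity {
  return { ...activity, deliveryMode: "expectReplies" };
}
```

In this model the callback is never used when the activity is marked `expectReplies`, which is why the reply cannot land on the wrong instance. Whether the SDK sets and handles this for you is a separate question that this sketch does not answer.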

@h1d3m3 - Hello, did you check by installing the manifest I shared? Also, can you elaborate on the repro steps so we can make sure we are not missing anything?

Good to know we are not the only ones having this issue :) We upgraded our root bot and skill dialog bot to use the latest 4.15.0 npm packages. We scaled out both the root and skill dialog bots to 2 instances. Sadly, we still face the same error.

Our plan was to redeploy the 4.15 dialog root bot and dialog skill bot samples and scale out the instances to 2. We have been having some trouble overwriting the existing dialog root bot's web app with the 4.15 sample. I'll update again when we get this resolved and test out scaling.

We were not aware of the delivery mode option. Thank You for bringing that to our attention. A workaround is better than nothing :)

Yup, misery loves company :)

We also upgraded to 4.15.1, but it did not solve the problem (within our own application); we still see skills getting 404s when scaling above 1 root instance. We're going to try to find out whether server affinity or some kind of shared root bot state (i.e. a DB) is even available within the framework, but perhaps delivery mode is the only multi-root option.

It's also not yet clear whether ExpectReplies handling is built into the sender and receiver framework (i.e. handled automatically by the middleware or other libraries) or is something that has to be coded manually to keep things synchronous.

FYI, a few more (semi-useful) references about delivery mode:

https://github.com/microsoft/botbuilder-dotnet/pull/5142

https://github.com/microsoft/botbuilder-dotnet/pull/5162

https://github.com/microsoft/botframework-sdk/blob/main/specs/botframework-activity/botframework-act...

By server affinity, are you referring to the ARR affinity setting in the app service? If so, we have already tried setting both the root bot and the dialog skill bot's settings to ON. It did not have any effect.

We initially thought the problem could be due to the way our bots store conversation state. We were using memory as the default storage. Because each scaled bot instance refers to its own memory storage, we thought an instance might have no reference to the required conversation state, causing replies to get lost. We switched to persistent storage and now store all of the bots' conversation state in CosmosDB. However, that did not have any impact either.

Maybe we are doing it incorrectly. If you could try it out on your end, we would like to compare notes and see if we both get the same results.

FYI, I made a few edits to my last reply above.

- We always have had ARR turned on, but what technology is actually looking for and keeping track of what instance each cookie should be routed to is unclear to us at this point. Typically that is some kind of load balancer or http router...is that just part of the chatbot framework? If it is, it does not seem to be pinning requests correctly (or that cookie is not being passed around correctly).

- We use Cosmos to save conversational state, but whatever lookup operation is producing that 404, the data it wants seems to be pinned in memory for a single root instance. Like you, we were hoping there is something that can be shared between root bots to effectively make that data it is looking for available to all instances. Have not found it yet.
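To make the hypothesis in the two points above concrete, here is a toy TypeScript model of it. It is not Bot Framework code, and every name in it (`RootInstance`, `MapStore`, `callSkill`, `receiveSkillReply`) is made up for illustration. It assumes the 404 comes from a lookup against data held in one instance's memory, which nobody in this thread has confirmed.

```typescript
// Toy model only: none of these names come from the Bot Framework SDK.
type ConversationReference = { userConversationId: string };

interface ReferenceStore {
  save(skillConversationId: string, ref: ConversationReference): void;
  load(skillConversationId: string): ConversationReference | undefined;
}

// A Map-backed store. One per instance models process memory;
// one shared between instances stands in for a database.
class MapStore implements ReferenceStore {
  private readonly data = new Map<string, ConversationReference>();
  save(id: string, ref: ConversationReference): void {
    this.data.set(id, ref);
  }
  load(id: string): ConversationReference | undefined {
    return this.data.get(id);
  }
}

class RootInstance {
  private readonly store: ReferenceStore;
  constructor(store: ReferenceStore) {
    this.store = store;
  }

  // This instance forwards a user message to the skill and remembers
  // which user conversation the skill conversation belongs to.
  callSkill(userConversationId: string): string {
    const skillConversationId = `skill-${userConversationId}`;
    this.store.save(skillConversationId, { userConversationId });
    return skillConversationId;
  }

  // The skill posts its reply back; returns an HTTP-like status code.
  receiveSkillReply(skillConversationId: string): number {
    return this.store.load(skillConversationId) ? 200 : 404;
  }
}

// Each instance owns its own map: a reply that lands on the other instance fails.
function replyStatusWithPrivateStores(): number {
  const a = new RootInstance(new MapStore());
  const b = new RootInstance(new MapStore());
  return b.receiveSkillReply(a.callSkill("conv-1"));
}

// Both instances share one store: the reply resolves on either instance.
function replyStatusWithSharedStore(): number {
  const shared = new MapStore();
  const a = new RootInstance(shared);
  const b = new RootInstance(shared);
  return b.receiveSkillReply(a.callSkill("conv-1"));
}
```

If the real cause looks like this, moving conversation state to Cosmos would not help on its own, because the lookup that fails here is separate from conversation state. That would match what both of us are seeing, but it remains a guess.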

Our next steps are basically more investigation and trial/error....

I just wanted to update and affirm the following

- Updating to 4.15 did not resolve the scale-out issue.
- deliveryMode works as a workaround.