Jun 20 2021 08:15 PM
I'm trying to figure out how to get Bonsai to solve a multi-armed bandit problem in a small number of iterations.
Let's say there are two action choices, A and B. Selecting A always returns a value of 0, and selecting B always returns a value of 1. The brain should maximize the expected returned value over an episode of 5 iterations. Under these conditions, Bonsai took more than 50,000 iterations to converge using the APEX algorithm with the default configuration.
Since the state space is relatively small in this case, I wonder whether the number of required iterations can be reduced by changing the setup. I experimented with different hidden layer sizes, QLearningRate, EpisodeIterationLimit, and reward functions, but the number of required iterations stayed about the same.
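For reference, the setup described above can be sketched in a few lines of Python. This is only an illustration of the problem, not the attached simulator: the names `PAYOFF`, `run_episode`, `always_b`, and `uniform_random` are made up for the example.

```python
import random

# Payoffs as described in the post: A always returns 0, B always returns 1.
PAYOFF = {"A": 0, "B": 1}
EPISODE_LENGTH = 5  # iterations per episode, as in the post


def run_episode(policy, episode_length=EPISODE_LENGTH):
    """Play one episode and return the total value collected.

    `policy` is any callable that takes the step index and returns "A" or "B".
    """
    total = 0
    for step in range(episode_length):
        action = policy(step)
        total += PAYOFF[action]
    return total


def always_b(step):
    """The optimal policy for this problem: pick B at every step."""
    return "B"


def uniform_random(step):
    """An untrained baseline: pick A or B with equal probability."""
    return random.choice(["A", "B"])
```

With these payoffs the best possible episode total is 5 (always choosing B), and a uniformly random policy averages 2.5, so that is the gap the brain has to learn to close.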
I appreciate any suggestions to facilitate the training of Brain for such a case. I attach the Inkling file.
Jun 21 2021 11:19 AM - edited Jun 21 2021 11:27 AM
Edi here from Bonsai.
First of all, thanks for your inquiry and your interest in Bonsai.
A couple of questions for you:
- Can you please change Goal.Range to 0-5 instead of 0-1? Your ret_val is also 0-1, which means the goal needs a broader range to explore with a horizon of 5.
- Have you tried using a reward function and a terminal function, where the reward function just returns the ret_val of the state and the terminal function always returns True? (This is just to try the multi-armed bandit problem literally.)
- If you keep your horizon T bigger than 1 (like 5 in your case), have you tried modifying your simulator in a way that helps policy learning uncover the distributions more quickly? One example would be to use incremental actions and change your state space to keep track of previous actions. Then, at each step, the simulator decides how to increment your previous actions based on your horizon T. Basically, the aim here is to uncover the distributions by the end of the episode (time T) rather than in a single shot (T=1).
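To make the second question concrete, here is a minimal Python sketch of a one-step bandit, where the reward is simply the returned value and every episode ends after one step. It is not Inkling and not the actual simulator: `PAYOFF`, `reward`, `terminal`, and `run_one_step_episode` are hypothetical names, and the state is assumed to be a dict with a `ret_val` field.

```python
# Payoffs from the original question: A returns 0, B returns 1.
PAYOFF = {"A": 0, "B": 1}


def reward(state):
    """Reward is just the value the simulator returned for the last action."""
    return state["ret_val"]


def terminal(state):
    """Always end the episode, so each episode is a single arm pull."""
    return True


def run_one_step_episode(action):
    """Pull one arm and return (reward, episode_finished)."""
    state = {"ret_val": PAYOFF[action]}  # assumed state layout
    return reward(state), terminal(state)
```

With this shape there is no credit assignment across steps: each action is scored on its own return.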
The last option I'm asking about is a little more involved. If you have not tried something like that before, we can provide more details. The team has previous experience applying this approach in real, complex scenarios.
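One possible reading of the incremental-action idea, sketched in Python under heavy assumptions: the state carries a running preference that summarizes the previous actions, each action only nudges that preference up or down, and the arm is actually chosen at the end of the horizon. Everything here (`run_incremental_episode`, `preference`, `step_size`, the 0.5 cut-off) is invented for illustration and may differ from what the team uses in practice.

```python
# Payoffs from the original question: A returns 0, B returns 1.
PAYOFF = {"A": 0, "B": 1}


def run_incremental_episode(policy, horizon=5, step_size=0.25):
    """Let the policy adjust a running preference for `horizon` steps.

    `policy` takes the current state and returns an increment of -1, 0, or +1.
    The arm is only pulled once, at the end of the episode.
    """
    preference = 0.5  # assumed neutral starting point between A (0) and B (1)
    for t in range(horizon):
        # The state exposes the accumulated effect of previous actions.
        state = {"preference": preference, "steps_left": horizon - t}
        delta = policy(state)
        # Apply the increment and keep the preference inside [0, 1].
        preference = min(1.0, max(0.0, preference + delta * step_size))
    arm = "B" if preference >= 0.5 else "A"
    return arm, PAYOFF[arm]
```

For example, `run_incremental_episode(lambda state: 1)` ends with arm B, while a policy that always returns -1 ends with arm A.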
Jun 22 2021 08:00 AM
Thank you very much for your advice.
Following your suggestions, I have tried the first two options, 1) testing with Goal.Range(0, 5) and 2) testing with the reward function and terminal function (I have attached the Inkling file). Both options seem to require a similar number of iterations to converge (i.e., more than 50,000 iterations). I would be interested in learning more about the third option.
By the way, I looked at the Azure log query to see how the reward values differ between the actions (i.e., A and B). The difference in reward values between the actions was more pronounced with Goal.Range(0, 1) than with Goal.Range(0, 5). The log gives me a better idea of what kind of reward value is being calculated.
Thank you again!
Jun 22 2021 04:50 PM
Jun 22 2021 06:21 PM
I have attached the sim code. Thank you for your help!
Jun 29 2021 10:38 PM
I would like to let you know that I have figured out a workaround: using the SAC algorithm instead of APEX.
With the APEX algorithm, it seemed difficult to reduce the number of parameters to be optimized, since users can adjust only part of the network structure. With SAC or PPO, on the other hand, it seemed that users can adjust most of the network structure through the Inkling description. We gained this insight by inspecting the networks stored in the exported brains.
Since our problem has discrete actions, we applied SAC and thresholded the continuous action values (e.g., values in 0-0.5 map to action A and values in 0.5-1 map to action B). With a small network, learning converged after 4,000 iterations. We think this workaround is effective enough to address the issue.
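The thresholding step can be sketched roughly like this in Python. The function name `to_discrete_action` is made up for the example, and sending exactly 0.5 to B is an arbitrary choice, since the ranges above overlap at that point.

```python
def to_discrete_action(value, threshold=0.5):
    """Map a continuous action value in [0, 1] to a discrete action.

    Values below the threshold become "A"; the threshold and above become "B".
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"action value {value} is outside [0, 1]")
    return "A" if value < threshold else "B"
```

For example, `to_discrete_action(0.2)` gives `"A"` and `to_discrete_action(0.9)` gives `"B"`.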
Thank you for your support!