Jan 15 2021 01:10 AM
I'm building a Tic Tac Toe game in which one of the two players uses bonsai to choose its movements. I implemented a simulator that, in each step, performs the action received from bonsai and then the action for player 2, who chooses his movement randomly. In the Inkling file I define a positive reward for correct movements and for a player 1 win, and a negative reward for wrong movements, for a player 2 win, or when there are no available movements. I define a terminal function that ends the episode when player 1 wins, player 2 wins, bonsai chooses a wrong movement, or there are no available movements. The main problem is that bonsai doesn't learn the game rules: when training ends it still makes wrong movements (it chooses an occupied position). Can you help me find the problem? Or does bonsai not work well for this game? I think it is a difficult problem because bonsai isn't rewarded only for its own movement, but for its movement combined with the movement of the other player.
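To make the setup concrete, the step described above can be sketched roughly as follows. This is a minimal sketch, not the actual simulator: the board encoding (0 = empty, 1 = player 1, 2 = player 2), the function names, and the outcome labels are all assumptions made for illustration.

```python
import random

# Assumed encoding: 0 = empty, 1 = player 1 (bonsai), 2 = player 2 (random).
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def has_won(board, player):
    """True if `player` occupies every square of some winning line."""
    return any(all(board[i] == player for i in line) for line in WIN_LINES)


def step(board, action, rng=random):
    """Apply player 1's action, then a random move for player 2.

    Returns (new_board, outcome), where outcome is one of the hypothetical
    labels "illegal", "p1_win", "p2_win", "draw", or "ongoing".
    """
    board = list(board)
    if board[action] != 0:
        # Occupied square: the wrong movement that ends the episode.
        return board, "illegal"
    board[action] = 1
    if has_won(board, 1):
        return board, "p1_win"
    empty = [i for i, cell in enumerate(board) if cell == 0]
    if not empty:
        return board, "draw"
    board[rng.choice(empty)] = 2
    if has_won(board, 2):
        return board, "p2_win"
    return board, "ongoing"
```

Every outcome other than "ongoing" corresponds to one of the terminal conditions listed above.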
Jan 15 2021 09:13 AM
@victor91 Thanks for trying the platform.
You are correct that this is a difficult problem because the trajectories change based on another agent whose moves are not entirely predictable.
Would you be willing to share your inkling contents here? Perhaps there are some suggestions we could offer.
A few additional questions:
1. You said that you provide a negative reward for "wrong movements". What constitutes a wrong movement? Is it one that is illegal (i.e. choosing a square that already has an X or an O present)? Or are you applying some heuristic to indicate whether the move was strategically sound?
2. How long have you allowed this to train? How many episodes? For a problem like this, it could take millions or tens of millions of episodes. To do so in a reasonable time, you'd probably want to run many sim instances in parallel.
Jan 18 2021 12:01 AM
Yes, I attach the Inkling file.
1. A wrong movement is just an illegal movement: choosing a square that already has an X or an O present.
2. It trained for 8 hours, 108,000 episodes. It stopped at that point because the NoProgressIterationLimit control stops the training after 250,000 iterations without progress. How can I run many sim instances in parallel?
Jan 18 2021 12:34 AM
@victor91 108K episodes is probably not nearly enough to train a policy like this. If this policy is trainable, it's likely to take 10x to 100x as many episodes.
Your reward and terminal functions look reasonable to me, although you might want to boost the reward value for a win to something like +100. Otherwise it may learn to draw out the game as long as possible so it can receive multiple +4 rewards rather than quickly win and receive +20.
I also typically recommend using negative reward values for all terminal states that represent failures like "the other side won" and "it's a tie game".
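As a rough illustration of the two suggestions above, here is a Python sketch of the reward logic (not Inkling, and not the attached file). The +4 per-move reward and the +100 win reward come from this thread; the negative values and the outcome labels are hypothetical and would need tuning.

```python
# Per-move reward for a legal, non-terminal move (value from the thread).
STEP_REWARD = 4

# Terminal rewards. Only the +100 win value comes from the thread;
# the negative values are hypothetical placeholders.
TERMINAL_REWARDS = {
    "p1_win": 100,    # suggested boost from +20
    "p2_win": -100,   # hypothetical: the other side won
    "draw": -20,      # hypothetical: tie game
    "illegal": -100,  # hypothetical: occupied square chosen
}


def reward(outcome):
    """Reward for a single step, given that step's outcome label."""
    return TERMINAL_REWARDS.get(outcome, STEP_REWARD)


def episode_return(outcomes):
    """Undiscounted sum of rewards over one episode's step outcomes."""
    return sum(reward(outcome) for outcome in outcomes)
```

With these numbers, a win on the third move returns 4 + 4 + 100 = 108, while a game played out to a tie returns 4 * 4 - 20 = -4, so stalling is never worth more than winning.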
You can manually launch multiple instances of your sim. The Bonsai service will then run multiple episodes in parallel.
If you want to get really adventuresome, you could package your simulator into a docker container, upload it to Azure, and have the Bonsai service automatically manage the launching and scaling of your sim. If you try the "cartpole" or "moab" sample projects, you'll see what that looks like.
Jan 18 2021 12:58 AM
When you say "you can manually launch multiple instances of your sim. The Bonsai service will then run multiple episodes in parallel", how can I manually launch multiple instances of my local sim? When I launch a new instance, a new simulator is created, and in bonsai you can train a brain with only one selected simulator.
Jan 19 2021 04:29 PM
This is the command you probably want to use in this situation -> https://docs.microsoft.com/en-us/bonsai/cli/simulator/unmanaged/connect
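For the launching part, a small helper along these lines could start several local copies of the sim. This is only a sketch under assumptions: the helper names are made up, and it covers starting and stopping the processes only. Each instance would still need to be attached to the brain with the connect command linked above; check that page for the exact options.

```python
import subprocess
import sys


def python_sim_command(script_path):
    """Command line that runs a Python sim script with the current interpreter."""
    return [sys.executable, script_path]


def launch_sims(command, count):
    """Start `count` copies of the simulator command; return the process handles."""
    return [subprocess.Popen(command) for _ in range(count)]


def stop_sims(processes):
    """Ask every simulator process to exit, then wait for each one."""
    for proc in processes:
        proc.terminate()
    for proc in processes:
        proc.wait()
```

For example, `launch_sims(python_sim_command("main.py"), 8)` would start eight instances, where `main.py` stands in for whatever script starts your simulator.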
Jan 20 2021 11:54 PM
I have more questions about this game.
Jan 21 2021 09:31 AM
@victor91 Those are all insightful questions.
The set of state inputs that you make available to the brain is usually dictated by the problem definition. The bonsai platform is designed to solve real-world autonomous systems applications, and the constraints of these real-world problems normally dictate which state inputs are available. In many cases, this translates to a set of sensors that are available in the target deployment environment. Since your example is a "toy" example, you'll need to decide how you want to define the problem.
The sample you've chosen isn't a great example for a few reasons. First, it's a multi-agent problem, and our platform is designed for single-agent. That means the state will change in ways that are somewhat unpredictable between iterations, making it difficult for an RL algorithm to learn optimal actions. Second, it is a problem that is better solved using other traditional approaches rather than reinforcement learning. It's a bit like reaching for a screwdriver to pound in a nail. Third, the reward signals are sparse and don't allow for "reward shaping". Fourth, if you are able to get this policy to converge (and I think you will be able to with sufficient training), the policy will effectively just "memorize" the optimal action for each board state.
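To illustrate the second point: one traditional approach for a game this small is exhaustive minimax search, which computes the optimal move for every board state directly. The sketch below is my own illustration rather than anything from the platform, and the board encoding (a tuple of nine cells, 0 = empty, 1 = player 1, 2 = player 2) is an assumption.

```python
from functools import lru_cache

WIN_LINES = (
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
)


def winner(board):
    """Return 1 or 2 if that player has three in a row, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


def place(board, index, player):
    """Return a new board tuple with `player` placed at `index`."""
    return board[:index] + (player,) + board[index + 1:]


@lru_cache(maxsize=None)
def value(board, player):
    """Value for player 1 (+1 win, 0 draw, -1 loss) with `player` to move."""
    won = winner(board)
    if won:
        return 1 if won == 1 else -1
    moves = [i for i, cell in enumerate(board) if cell == 0]
    if not moves:
        return 0
    other = 2 if player == 1 else 1
    results = [value(place(board, i, player), other) for i in moves]
    # Player 1 maximizes the value, player 2 minimizes it.
    return max(results) if player == 1 else min(results)


def best_move(board, player=1):
    """Pick an empty square that is optimal for `player` under minimax."""
    moves = [i for i, cell in enumerate(board) if cell == 0]
    other = 2 if player == 1 else 1

    def score(i):
        return value(place(board, i, player), other)

    return max(moves, key=score) if player == 1 else min(moves, key=score)
```

The cached `value` table here is essentially the per-board-state lookup that a converged policy would end up memorizing.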
You might want to look at our "Moab" sample project. We designed that as an example of a problem that is well-suited for the bonsai platform — and one that can be extended in interesting ways. It's also an example that can be applied to a large class of real-world control problems.
Yes, you should be able to use goals, but it probably won't work any better than a hand-coded reward function. For many problems, goals can perform better because the platform is able to generate reward shaping that helps the policy converge faster. You are correct in noting that goals require ranges, but the range can be zero-length (`Goal.Range(1, 1)`). You can use an "avoid" objective to avoid illegal moves and loss conditions.