Bonsai Tic Tac Toe




I'm building a Tic Tac Toe game in which one of the two players gets its moves from Bonsai. I implemented a simulator that, at each step, performs the action received from Bonsai and then an action for player 2, who chooses moves at random. In the Inkling file I define a positive reward for correct movements and for a player 1 win, and a negative reward for wrong movements, for a player 2 win, and for running out of available moves. I define a terminal function that fires when player 1 wins, player 2 wins, Bonsai chooses a wrong movement, or there are no available moves. The main problem is that Bonsai doesn't learn the game rules: when training ends, it still makes wrong movements (choosing an occupied position). Can you help me find the problem? Or does Bonsai not work well for this game? I think it is a difficult problem because Bonsai is rewarded not only for its own move, but for its move combined with the other player's move.

7 Replies

@victor91 Thanks for trying the platform.


You are correct that this is a difficult problem because the trajectories change based on another agent whose moves are not entirely predictable. 


Would you be willing to share your inkling contents here? Perhaps there are some suggestions we could offer.


A few additional questions:

1. You said that you provide a negative reward for "wrong movements". What constitutes a wrong movement? Is it one that is illegal (i.e. choosing a square that already has an X or an O present)? Or are you applying some heuristic to indicate whether the move was strategically sound?

2. How long have you allowed this to train? How many episodes? For a problem like this, it could take millions or tens of millions of episodes. To do so in a reasonable time, you'd probably want to run many sim instances in parallel.

Hi @erictr,


Yes, I attach the Inkling file.


1. A wrong movement is just an illegal movement: choosing a square that already has an X or an O present.

2. It trained for 8 hours, 108,000 episodes. It stopped at that point because the NoProgressIterationLimit control ends training after 250,000 iterations without progress. How can I run many sim instances in parallel?

@victor91 108K episodes is probably not nearly enough to train a policy like this. If this policy is trainable, it's likely to take 10x to 100x as many episodes. 


Your reward and terminal functions look reasonable to me, although you might want to boost the reward value for a win to something like +100. Otherwise it may learn to draw out the game as long as possible so it can receive multiple +4 rewards rather than quickly win and receive +20.


I also typically recommend using negative reward values for all terminal states that represent failures like "the other side won" and "it's a tie game".
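
To make the suggestion concrete, here is a minimal sketch of that reward scheme, written in Python rather than Inkling so the arithmetic is easy to check. The +4 per legal move and the +100 win bonus come from the discussion above; the outcome labels and all of the negative values are assumptions chosen for illustration, not values from the attached Inkling file.

```python
# Hypothetical outcome labels; the real Inkling state fields are not shown in this thread.
WIN, LOSS, TIE, ILLEGAL, ONGOING = "win", "loss", "tie", "illegal", "ongoing"

# +4 per legal move and +100 for a win follow the discussion above.
# The negative values are illustrative assumptions, not tuned numbers.
REWARDS = {
    WIN: 100.0,
    LOSS: -100.0,
    TIE: -20.0,
    ILLEGAL: -50.0,
    ONGOING: 4.0,
}


def reward(outcome):
    """Reward for a single step, given the outcome of that step."""
    return REWARDS[outcome]


def is_terminal(outcome):
    """Every outcome except an ongoing game ends the episode."""
    return outcome != ONGOING


def episode_return(outcomes):
    """Undiscounted sum of rewards over one episode."""
    return sum(reward(o) for o in outcomes)
```

With these numbers, two legal moves followed by a win return 4 + 4 + 100 = 108, while four legal moves followed by a tie return 16 - 20 = -4, so drawing out the game no longer pays better than winning quickly.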


You can manually launch multiple instances of your sim. The Bonsai service will then run multiple episodes in parallel.
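
As a rough sketch, launching several copies of a simulator from one script could look like the following. It only starts the processes; how each instance registers with the Bonsai service depends on your simulator code and is not covered here. The helper names are made up for this example.

```python
import subprocess
import sys


def launch_sims(command, count):
    """Start `count` copies of the same simulator command as separate processes."""
    return [subprocess.Popen(command) for _ in range(count)]


def wait_all(processes):
    """Block until every process exits and return the exit codes in launch order."""
    return [p.wait() for p in processes]
```

For example, `launch_sims([sys.executable, "main.py"], 4)` would start four instances, assuming your simulator's entry point is a script called `main.py` (a hypothetical name).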


If you want to get really adventuresome, you could package your simulator into a docker container, upload it to Azure, and have the Bonsai service automatically manage the launching and scaling of your sim. If you try the "cartpole" or "moab" sample projects, you'll see what that looks like.

Hi @erictr,


You said: "You can manually launch multiple instances of your sim. The Bonsai service will then run multiple episodes in parallel." How can I manually launch multiple instances of my local sim? When I launch a new instance, a new simulator is created, and in Bonsai you can choose only one simulator to train a brain.




If you want to connect multiple local simulators, you'll have to use the bonsai-cli.

Docs are located here ->

This is the command you probably want to use in this situation ->



Hi @Navvaran_Mann,


I have more questions about this game.


  1. Does the Inkling file follow good practices? I don't know whether it is correct to calculate `repeated_move`, `num_available_moves` and `winner` in the simulator and pass these values to the brain in the state, or whether it is better to pass the brain the current and previous board and calculate in the Inkling file whether the move is correct, whether there are available moves, and whether a player has won.
  2. Is this game a good example for understanding the Bonsai platform, or are other examples better? Which ones? I understand that Bonsai may have been optimized for other types of projects. I think this game isn't very difficult, and as you told me, the brain will have to train for many episodes. If I want to train with a more complex simulator, for example an AirSim simulation of a drone, the training would be much longer, right?
  3. Would it be possible to implement this problem using goals instead of terminal and reward functions? I didn't find a way to do this, because goals only allow ranges, not a concrete value.

Best Regards,

Víctor Vicente.







@victor91 Those are all insightful questions.


The set of state inputs that you make available to the brain is usually dictated by the problem definition. The bonsai platform is designed to solve real-world autonomous systems applications, and the constraints of these real-world problems normally dictate which state inputs are available. In many cases, this translates to a set of sensors that are available in the target deployment environment. Since your example is a "toy" example, you'll need to decide how you want to define the problem.
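
If you keep the derived values in the simulator, one way to define the problem is sketched below. It assumes a flat 9-cell board and a random player 2, as described earlier in this thread. The field names `repeated_move`, `num_available_moves` and `winner` are assumptions based on the names mentioned in the question, and none of this is taken from the actual simulator code.

```python
import random

EMPTY, P1, P2 = 0, 1, 2

# The eight winning lines on a 3x3 board stored as a flat list of 9 cells.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return P1 or P2 if that player owns a full line, otherwise EMPTY."""
    for a, b, c in LINES:
        if board[a] != EMPTY and board[a] == board[b] == board[c]:
            return board[a]
    return EMPTY


def make_state(board, repeated_move):
    """Bundle the board with the derived values the brain will observe."""
    return {
        "board": list(board),
        "repeated_move": repeated_move,
        "num_available_moves": sum(1 for v in board if v == EMPTY),
        "winner": winner(board),
    }


def step(board, action, rng=random):
    """Apply the brain's move, then a random reply from player 2."""
    if board[action] != EMPTY:
        # Illegal move: leave the board unchanged and flag it.
        return make_state(board, repeated_move=1)
    board = list(board)
    board[action] = P1
    free = [i for i, v in enumerate(board) if v == EMPTY]
    # Player 2 only replies if the game is still open.
    if winner(board) == EMPTY and free:
        board[rng.choice(free)] = P2
    return make_state(board, repeated_move=0)
```

Computing these flags in the simulator keeps the Inkling reward and terminal functions down to simple comparisons on the state, at the cost of giving the brain information that a real-world sensor set might not provide.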


The sample you've chosen isn't a great example for a few reasons. First, it's a multi-agent problem, and our platform is designed for single-agent. That means the state will change in ways that are somewhat unpredictable between iterations, making it difficult for an RL algorithm to learn optimal actions. Second, it is a problem that is better solved using other traditional approaches rather than reinforcement learning. It's a bit like reaching for a screwdriver to pound in a nail. Third, the reward signals are sparse and don't allow for "reward shaping". Fourth, if you are able to get this policy to converge (and I think you will be able to with sufficient training), the policy will effectively just "memorize" the optimal action for each board state.


You might want to look at our "Moab" sample project. We designed that as an example of a problem that is well-suited for the bonsai platform — and one that can be extended in interesting ways. It's also an example that can be applied to a large class of real-world control problems.


Yes, you should be able to use goals, but it probably won't work any better than a hand-coded reward function. For many problems, goals can perform better because the platform is able to generate reward shaping that helps the policy converge faster. You are correct in noting that goals require ranges, but the range can be zero-length (`Goal.Range(1, 1)`). You can use an "avoid" objective to avoid illegal moves and loss conditions.