How does training resume after reconnecting from a connection failure?

I am doing some training with a simulator running locally on my computer. The training takes a few days of episode collection, so I run it overnight, but somehow the simulator almost always gets disconnected by the next morning (I often see the message "No simulators have connected for training for 3600 seconds. This training session will stop but can be resumed at any time.").

My question is: when I resume training after this connection loss, from where does it resume? Are all the episodes collected before the disconnection saved somewhere, or does training resume from some internal save point?

I ask because, when I resume the training, my agent seems to struggle in a state that it had become good at the day before. In the last 1000 episodes before the disconnection, my agent reached a difficult-to-achieve state 96.4% of the time, but after resuming training it now reaches that state only 0.5% of the time.

For a bit more detail on how I am doing the training: I am using PPO with goals written in Inkling.

Before the disconnection, the simulator ran for 400k iterations; the agent could not reach the difficult-to-achieve state at the beginning of training but eventually learned to reach it. However, the graphical UI showed no training progress while this was happening (which makes me curious about why, and about how progress in goal satisfaction is calculated and updated).

4 Replies
Thanks for posting your questions.

First, let me address the point about the occasional simulator disconnect that you're seeing. Running a simulator locally is great for initial debugging, but we recommend using our "managed simulator" feature when training at scale. With this feature, the platform will run multiple instances of your simulator in the cloud, auto-scale to minimize overall training time, monitor disconnects and crashed sim instances, and restart/reconnect as required. That said, we have received reports from others who want to use locally-run simulators and are struggling with occasional disconnects. We're looking at ways we can mitigate this issue. In the meantime, if you're able to use the "managed simulator" feature, it should provide a much better experience for you.

To answer your question about how the platform resumes training, I need to explain a few concepts. While training a policy, the platform periodically (on the order of every 5 minutes) creates a snapshot of the latest policy. We refer to this as the "challenger" policy. The platform also maintains a snapshot of the "champion" policy; more on that in a second.

Periodically during training (roughly every 20 episodes), an assessment is performed on the latest challenger policy. If its performance is better than that of the current champion, the challenger becomes the new champion. This technique prevents backsliding ("unlearning") during training.
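
To make these mechanics a bit more concrete, here is a minimal sketch of the snapshot and champion/challenger bookkeeping described above. It is purely illustrative: the class, the intervals, and the assessment function are stand-ins, not the platform's actual implementation.

```python
import copy
import time

SNAPSHOT_INTERVAL_S = 5 * 60       # "on the order of every 5 minutes"
ASSESSMENT_INTERVAL_EPISODES = 20  # "roughly every 20 episodes"

class ChampionChallengerTrainer:
    """Toy illustration of the snapshot / champion-challenger flow."""

    def __init__(self, policy, assess_fn):
        self.policy = policy                     # policy currently being trained
        self.assess_fn = assess_fn               # returns a scalar score for a policy
        self.challenger = copy.deepcopy(policy)  # most recent snapshot
        self.champion = copy.deepcopy(policy)    # best-performing snapshot so far
        self.champion_score = assess_fn(self.champion)
        self._last_snapshot = time.monotonic()
        self._episodes_since_assessment = 0

    def after_policy_update(self):
        # Periodically snapshot the latest policy as the new "challenger".
        if time.monotonic() - self._last_snapshot >= SNAPSHOT_INTERVAL_S:
            self.challenger = copy.deepcopy(self.policy)
            self._last_snapshot = time.monotonic()

    def after_episode(self):
        self._episodes_since_assessment += 1
        if self._episodes_since_assessment >= ASSESSMENT_INTERVAL_EPISODES:
            self._episodes_since_assessment = 0
            score = self.assess_fn(self.challenger)
            # The challenger is promoted only if it beats the current champion,
            # which is what prevents backsliding from sticking.
            if score > self.champion_score:
                self.champion = copy.deepcopy(self.challenger)
                self.champion_score = score
```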

The RL algorithms we support in the platform (including PPO) make use of something called a "replay buffer", which records the results of recent episodes. The contents of the replay buffer are used when updating the policy. We do not persist the replay buffer, so after resuming training it can take a little while to collect new samples and repopulate it. Until that happens, you may see training making little or no apparent progress, because it's difficult for the platform to improve upon the previous champion policy without the benefit of samples in the replay buffer. The time it takes to sufficiently populate the replay buffer will vary with your sim speed, the number of sim instances you're running, the RL algorithm you're using, and the size of your state space.
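
As a rough illustration of why a resume can look like a step backwards, here is a minimal replay-buffer sketch. The capacity and minimum-fill threshold are made-up values, not the platform's configuration; the point is simply that the buffer starts empty after a restart, so policy updates (and visible progress) have to wait for fresh samples.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of recent episode results; not persisted across restarts."""

    def __init__(self, capacity=100_000, min_fill=5_000):
        self.buffer = deque(maxlen=capacity)
        self.min_fill = min_fill  # illustrative threshold before updates are useful

    def add(self, transition):
        self.buffer.append(transition)

    def ready(self):
        # Right after a resume the buffer is empty, so this returns False until
        # enough new episodes have been collected to support policy updates.
        return len(self.buffer) >= self.min_fill

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)
```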

I'm not sure I understand what you mean by "there was no training progress during this state achievement". Are you saying that the policy didn't appear to improve after you resumed training? Or are you saying that it appears to have plateaued at 96.4% rather than reaching 100%? It's quite likely that with additional training, you will see further improvements, but achieving the last few percent on a difficult-to-achieve goal may require significantly more training.

I hope that information is helpful. Let us know if anything was unclear or you have any follow-up questions.
Thank you very much for the detailed answers. That information has cleared up the questions I had.

What I meant by "there was no training progress during this state achievement" was that it seemed (to the human eye) like the agent was choosing better actions after 400k iterations of training (before the disconnection), but despite the improved action choices, I was not seeing any rise in goal satisfaction.

Reading your answers, I think what happened is that the "challenger policy" after 400k iterations of training was not yet beating the "champion policy" found early in the training. I guess this could happen when there are multiple goals without weights indicating which goals are more important. I may have thought an action was getting better because it scored better on the goal that matters most to me (reaching the difficult-to-achieve state), but perhaps the "champion policy" was doing better on overall performance (I also gave the agent some other goals, such as avoiding certain inappropriate states).
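
As a made-up illustration of what I mean (the goal names and numbers below are invented for the example, not taken from my actual brain), an unweighted average across goals can stay flat or even drop while the goal I care about most improves:

```python
# Hypothetical per-goal satisfaction rates (fraction of episodes satisfying each goal).
early_champion  = {"reach_target_state": 0.10, "avoid_bad_state_a": 0.95, "avoid_bad_state_b": 0.95}
late_challenger = {"reach_target_state": 0.90, "avoid_bad_state_a": 0.55, "avoid_bad_state_b": 0.50}

def unweighted_score(goal_rates):
    # Simple average with no per-goal weights.
    return sum(goal_rates.values()) / len(goal_rates)

print(unweighted_score(early_champion))   # ~0.667
print(unweighted_score(late_challenger))  # 0.65
# The challenger is far better at reaching the target state, yet its unweighted
# overall score is still slightly lower, so the old champion keeps defending.
```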

One thing that I am still uncertain about is which policy the platform uses to restart the sampling when training is resumed. Does it use the "challenger policy" saved before the disconnection, something close to the "champion policy", or something else?
Training resumes with the most-recently-saved challenger policy. At the next point of assessment, that challenger may replace the saved champion depending on how it performs.
Thank you for confirming. If training resumes with the most-recently-saved challenger policy, I would expect similar action choices before and after the disconnection. Unfortunately, the action choices seemed to revert to some older policy after the disconnection, but perhaps that had something to do with the replay buffer and with the champion defending its position at the point of assessment.

Nonetheless, the connection was great over the weekend and I was able to successfully train my brain! I have also tried the "managed simulator" feature as suggested (I wasn't sure how to do this for my own Python-based simulator, but then I found some helpful code in https://github.com/microsoft/cartpole-py ). Indeed, the training was much faster and more stable. Thank you once again for the suggestion and answers!
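
In case it helps anyone else, the part of the sample I adapted is roughly the register-and-advance loop below. I am reconstructing it from memory, so the exact class and parameter names may differ from the current microsoft-bonsai-api release, and MySim is a stand-in for my own simulator class.

```python
import time

# Sketch based on my recollection of the cartpole-py sample; check the repo for
# the exact, up-to-date API names before relying on this.
from microsoft_bonsai_api.simulator.client import BonsaiClient, BonsaiClientConfig
from microsoft_bonsai_api.simulator.generated.models import SimulatorInterface, SimulatorState

from my_sim import MySim  # hypothetical wrapper around my own Python simulator

config = BonsaiClientConfig()  # reads workspace and access key from the environment
client = BonsaiClient(config)

# Register this simulator instance with the platform.
registration = SimulatorInterface(
    name="MySim",
    timeout=60,
    simulator_context=config.simulator_context,
)
session = client.session.create(workspace_name=config.workspace, body=registration)

sim = MySim()
sequence_id = 1
while True:
    # Report the current state and ask the platform what to do next.
    sim_state = SimulatorState(sequence_id=sequence_id, state=sim.get_state(), halted=sim.halted())
    event = client.session.advance(
        workspace_name=config.workspace,
        session_id=session.session_id,
        body=sim_state,
    )
    sequence_id = event.sequence_id

    if event.type == "EpisodeStart":
        sim.episode_start(event.episode_start.config)
    elif event.type == "EpisodeStep":
        sim.episode_step(event.episode_step.action)
    elif event.type == "EpisodeFinish":
        pass  # per-episode bookkeeping, if any
    elif event.type == "Idle":
        time.sleep(event.idle.callback_time)
    elif event.type == "Unregister":
        client.session.delete(workspace_name=config.workspace, session_id=session.session_id)
        break
```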