Feb 26 2021 12:41 AM
I am training with a simulator running locally on my computer. Collecting episodes takes a few days, so I run the training overnight, but the simulator is almost always disconnected by the next morning. I often see the message "No simulators have connected for training for 3600 seconds. This training session will stop but can be resumed at any time."
My question is: when I resume training after this connection loss, where does it resume from? Are all episodes collected before the connection loss saved somewhere, or does it resume from some internal save point?
I ask because, after resuming, my agent struggles to reach a state that it had become good at reaching the day before. In the last 1000 episodes before the disconnection, my agent reached a difficult-to-achieve state 96.4% of the time, but since I resumed the training, it reaches that state only 0.5% of the time.
For a bit more detail on my setup: I am using PPO, with goals written in Inkling.
Before the disconnection, the simulator ran for 400k iterations. The agent could not reach the difficult-to-achieve state at the beginning of the training, and then learned to reach it. However, as far as I can see in the graphical UI, there was no training progress while the agent was learning to reach this state. I am curious why that is, and how progress in goal satisfaction is calculated and updated.
Feb 26 2021 08:10 AM
Feb 26 2021 10:38 AM
Feb 26 2021 10:54 AM
Feb 28 2021 08:27 PM