SOLVED

Trying to make sense of the assessment statistics

Contributor

I trained a brain and was looking at the assessment statistics, but I couldn't make heads or tails of what they were actually showing, as there were seemingly conflicting results. I tried looking at the help pages for the Assessment UI and the Goal definitions, but they didn't resolve my confusion. Maybe someone has an idea.

 

I trained using a curriculum that had 6 goals: 2 avoids, 3 drives (all using "within"), and a minimize. At the end, the automatic assessment statistics reported the following (the same as in the attached image of the charts):

 

====================================================
|          | Success rate | Satisfy min | Satisfy max |
====================================================
| Overall  | 60%          | 95.22%      | 100%        |
| Avoid #1 | 100%         | 100%        | 100%        |
| Avoid #2 | 100%         | 100%        | 100%        |
| Drive #1 | 100%         | 100%        | 100%        |
| Drive #2 | 86.67%       | 99.78%      | 100%        |
| Drive #3 | 100%         | 100%        | 100%        |
| Minimize | 66.67%       | 71.3%       | 100%        |
====================================================

How is it possible to have an overall success rate of 60% when none of the individual objectives has a success rate that low, let alone a minimum satisfaction at or below 60%?

I have a similar confusion about the Drive #2 and Minimize rows.

 

From the help pages, I know that the success rate is "the fraction of episodes in an assessment where the AI achieves the objective" and that a "satisfaction of 100% means the AI successfully completed the objective". But I don't see why the two aren't more closely correlated. For the overall stats, does this mean the brain "completes" the objectives [95.22, 100]% of the time but can only "achieve" them in 60% of the episodes...?

 

Thanks for any input.

 

 

2 Replies
best response confirmed by TWolfeAdam (Contributor)
Solution

@TWolfeAdam thanks for posting the question.

 

Let me start by defining a couple of terms, in case our documentation wasn't sufficiently clear.

 

The "success rate" is the percentage of episodes in which that objective was achieved. Each assessment pass typically involves 30 episodes, so if an objective is achieved in 15 of those 30 episodes, its success rate would be 50%. Note that success versus failure is a binary determination within each episode.

 

The "overall success rate" indicates the percentage of episodes where _all_ objectives are achieved. In the stats that you posted, your overall success rate was 60%, which means that in 18 of the 30 episodes all six objectives succeeded, while in the other 12 episodes at least one of them failed. Looking at the per-objective stats, we can see that these failures were all due to the "Drive #2" and "Minimize" objectives, which the policy is still struggling with.
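To make the arithmetic concrete, here's a tiny illustrative sketch in Python. The episode data is made up (not from your assessment); the point is just that each objective can individually succeed in most episodes, yet the overall rate ends up lower because it only counts episodes where every objective succeeds at the same time.

```python
# Illustrative only: made-up per-episode success flags for two objectives.
episodes = [
    {"Drive #2": True,  "Minimize": True},   # every objective met
    {"Drive #2": True,  "Minimize": True},   # every objective met
    {"Drive #2": True,  "Minimize": True},   # every objective met
    {"Drive #2": False, "Minimize": True},   # Drive #2 missed
    {"Drive #2": True,  "Minimize": False},  # Minimize missed
]

def success_rate(episodes, objective):
    """Fraction of episodes in which a single objective succeeded."""
    return sum(ep[objective] for ep in episodes) / len(episodes)

def overall_success_rate(episodes):
    """Fraction of episodes in which *every* objective succeeded."""
    return sum(all(ep.values()) for ep in episodes) / len(episodes)

print(success_rate(episodes, "Drive #2"))  # 0.8
print(success_rate(episodes, "Minimize"))  # 0.8
print(overall_success_rate(episodes))      # 0.6 -- lower than either one
```

Because the failures of the two objectives land in different episodes, the overall rate can sit below every individual rate, which is exactly what happened with your 60%.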

 

When we were designing the "goals" feature, we realized that the binary nature of success rates (either success or failure) failed to provide a clear picture of the policy's progress as it learned. For example, if a "drive" objective manages to get very close to the target region, that's much better than if it fails to get close. Yet, both cases represent a "failure" when calculating the success rate. For that reason, we came up with another metric called "goal satisfaction". Rather than being binary (succeed or fail), the goal satisfaction metric is a score from 0% to 100% that represents how well the policy did on that objective during the episode. If it got really close to achieving the objective, it might get a goal satisfaction of 95% even if it ultimately failed.

 

The exact formula for goal satisfaction varies by objective type (reach, drive, etc.), but you can expect goal satisfaction numbers to trend toward 100% as the policy improves during training. If you see these numbers plateau short of 100%, you know the policy is struggling to learn. If you see a broad distribution of goal satisfaction scores, the policy is still learning how to handle certain parts of the state space.
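Just to illustrate the idea (this is a hypothetical closeness score I'm making up for the example, not the platform's actual formula), a drive-style satisfaction metric could look something like this:

```python
# Hypothetical illustration only -- NOT the platform's actual formula.
# One natural 0-to-1 "closeness" score for a "drive X within [lo, hi]"
# style objective: full credit inside the target band, decaying with
# distance outside it.
def drive_satisfaction(value, lo, hi, tolerance):
    """Return 1.0 inside [lo, hi], falling linearly toward 0.0 outside."""
    if lo <= value <= hi:
        return 1.0
    distance = (lo - value) if value < lo else (value - hi)
    return max(0.0, 1.0 - distance / tolerance)

print(drive_satisfaction(5.2, 4.0, 5.0, tolerance=2.0))  # 0.9: a near miss
print(drive_satisfaction(9.0, 4.0, 5.0, tolerance=2.0))  # 0.0: a far miss
```

A near miss scores close to 100% even though the episode still counts as a failure for the success rate, which is why the two metrics can diverge the way they do in your table.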

 

The "Satisfy min" and "Satisfy max" columns indicate the range of the goal satisfaction scores that were seen over 30 episodes. If you want to see a full histogram of goal satisfaction scores, you can click on the assessment results in the UI.
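In other words, the min/max columns are simply the extremes of those per-episode scores. With invented numbers:

```python
# Invented per-episode satisfaction scores for one objective.
satisfaction_scores = [1.0, 0.713, 1.0, 0.85, 1.0, 0.92]

print(f"Satisfy min: {min(satisfaction_scores):.1%}")  # Satisfy min: 71.3%
print(f"Satisfy max: {max(satisfaction_scores):.1%}")  # Satisfy max: 100.0%
```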

 

I hope that clarifies what you're seeing.

 

Please keep the questions and feedback coming!

 

 -Eric

Hi Eric,


Please excuse my long delay in responding, but thank you so much for your thorough response! It is now entirely clear what each of these is reporting and why both are significant in their own way. Your explanation has really helped with the experimenting I'm doing and has vastly improved the usefulness of these metrics!

 

If it's of any use, here's some feedback on what I think played into my confusion: I initially assumed the two terms were being used casually (and thus interchangeably). What may add to this is their visual proximity to one another - the success card is grouped in with the satisfaction-related cards (it's 'touching' both of them), so you can't mentally draw a straight line separating the two. Oddly, I also feel like both terms starting with "s" adds to my brain's intermingling of them.

 

I realize, though, that refactoring the UI or changing the terminology is no small feat. The only (relatively) simple fixes I can think of are improving the explanation in the relevant help article (I find yours to be much clearer), or adding an information icon with a short blurb in a hover popup.

 

Anyway, thank you again!