SOLVED

Bonsai Cost Overrun Issue

I recently created a CartPole demo project with Bonsai. I ran the code, trained the model, evaluated the model, and then closed my browser window.

I was under the impression that Bonsai shuts down its servers in the background and suspends any further billing. However, a few days later, I received an email from Azure saying that I've exceeded my spending limit and that my Azure services have been disabled.

According to the Cost Analysis tab, Bonsai generated $110 in charges over the 6 days after I created and evaluated my model. To the best of my knowledge, nothing should have been running during those days, yet the costs still accumulated.

I filtered the costs by the Bonsai resource group to isolate the specific charges. It appears that $84.17 was generated by log analytics and $44.63 was generated by container instances. I've included a screenshot of the cost analysis (filtered by the Bonsai resource group) for your reference.
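As a quick sanity check on the figures above, the two per-service amounts can be summed and averaged over the billing window. This is only a sketch: the dollar amounts and the 6-day window come from the post, while the variable names are illustrative.

```python
from decimal import Decimal

# Per-service charges quoted in the post (USD), filtered to the Bonsai resource group.
charges = {
    "log analytics": Decimal("84.17"),
    "container instances": Decimal("44.63"),
}

days = 6  # billing window mentioned in the post

# Decimal avoids binary floating-point rounding on currency values.
subtotal = sum(charges.values(), Decimal("0"))
per_day = (subtotal / days).quantize(Decimal("0.01"))

print(f"Subtotal for the two services: ${subtotal}")
print(f"Average per day over {days} days: ${per_day}")
```

The subtotal comes to $128.80, somewhat above the $110 quoted earlier, so the two figures may cover slightly different date ranges or filters.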

Finally, I deleted the Bonsai resource group on Mar 3 to prevent additional charges from accumulating. Once it was deleted, the charges stopped, as you can see from the chart in the screenshot.

Any thoughts on what would have caused this cost overrun?

8 Replies

Hi Matthew, thanks for contacting us.

We would like to dig into this issue and understand better what happened. You mentioned that you deleted the resource group already. By any chance, do you still have the brain ID or the bonsai workspace ID? These would help us access our logs so we can dig into the problem.

I'm guessing that you were using a free personal Azure account? Or was this a student or corporate account? I'm going to look into the possibility of crediting you back the charges you incurred.

-Eric

Eric,

Thanks for the quick reply.

Unfortunately, I no longer have the brain ID or the Bonsai workspace ID. I do, however, have the last simulator instance ID from one of my last training sessions. It was "1546590347_10.244.92.47". Hopefully, that ID provides you with a way to determine the brain ID or workspace ID.

I'm a Microsoft MVP, so I get an allotment of free Azure credits each month. So, no need to credit my account back for the charges -- I'm just reporting the issue to help you guys identify the issue and fix the problem going forward.

Please let me know if you need any additional information to help resolve this issue.

Thanks!

Matthew

The sim IDs are ephemeral, so that unfortunately doesn't help us out much. We could probably identify the relevant logs based on your subscription ID, if you'd be willing to DM that to me.

We are motivated to understand this problem because training should have auto-terminated after some time, but it sounds like it did not stop for some reason.

You mentioned that you were using the CartPole demo. Do you happen to remember if you modified the demo in some manner (e.g. tweaked the goals)? If it's the unmodified sample, that's even more unexpected because we use that in our test suite, which runs many times every day.

Thanks again for your bug report and the extra information you've provided.

-Eric

Eric,

Just by sheer luck, I discovered today that I had a URL for my CartPole project saved in a document. The URL appears to contain the workspace ID. So it looks like we're in luck!

The workspace ID is: 201f1096-3d61-41af-af84-f8c5ce02d16f

Hope this helps!

Oh, that's great news. Thanks! We'll dig into it further and let you know what we learn.

@erictr 

In response to your other question, yes, I did modify the CartPole demo code. I created about 8 different versions of the code for various presentations. However, here's the last version that I ran:

# Reinforcement Learning with Microsoft Bonsai

# Specify the language version
inkling "2.0"

# Import libraries
using Math
using Goal

# Set the maximum angle of pole (in radians)
const MaxPoleAngle = (12 * Math.Pi) / 180

# Set the length of the cartpole track (in meters)
const TrackLength = 0.5

# Create a type that represents the agent's state
type AgentState {
    cart_position: number,
    cart_velocity: number,
    pole_angle: number,
    pole_angular_velocity: number,
}

# Create a type that represents the agent's action
type AgentAction {
    command: number<-1 .. 1>
}

# Create a concept graph with a single concept to be learned
graph (input: AgentState): AgentAction {
    concept BalancePole(input): AgentAction {
        curriculum {

            # Specify the training source is the cartpole simulator
            source simulator (Action: AgentAction): AgentState {
                package "Cartpole"
            }

            # Set the number of iterations per training episode
            training {
                EpisodeIterationLimit: 1000
            }

            # Specify the goal state as two subgoals
            goal (State: AgentState) {
                avoid `Fall Over`: Math.Abs(State.pole_angle) in Goal.RangeAbove(MaxPoleAngle)
                avoid `Out of Range`: Math.Abs(State.cart_position) in Goal.RangeAbove(TrackLength / 2)
            }
        }
    }
}

# Connect the simulation visualizer to the web interface.
const SimulatorVisualizer = "/cartpoleviz/"

best response confirmed by Matthew Renze (MVP)
Solution
Matthew, we've investigated further and confirmed that you hit a bug in the logic that is supposed to auto-terminate training when the model converges (meets its objectives) or fails to make forward progress within a sufficiently long period of time.
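To make the two stop conditions concrete, here is a minimal sketch of that kind of rule. It is not Bonsai's actual implementation: `TrainingMonitor`, `goal_threshold`, `no_progress_limit`, and `run` are hypothetical names chosen for illustration, and "progress" is assumed to mean a new best evaluation score.

```python
from dataclasses import dataclass


@dataclass
class TrainingMonitor:
    """Hypothetical stop rule: end training on convergence or on a long stall."""
    goal_threshold: float      # score at which objectives count as met (assumed)
    no_progress_limit: int     # iterations allowed without a new best score (assumed)
    best_score: float = float("-inf")
    iterations_since_improvement: int = 0

    def update(self, score: float) -> str:
        """Record one score and return 'continue', 'converged', or 'stalled'."""
        if score > self.best_score:
            self.best_score = score
            self.iterations_since_improvement = 0
        else:
            self.iterations_since_improvement += 1

        if self.best_score >= self.goal_threshold:
            return "converged"
        if self.iterations_since_improvement >= self.no_progress_limit:
            return "stalled"
        return "continue"


def run(scores, monitor):
    """Feed scores to the monitor; return (status, number of scores consumed)."""
    for step, score in enumerate(scores, start=1):
        status = monitor.update(score)
        if status != "continue":
            return status, step
    return "continue", len(scores)


if __name__ == "__main__":
    print(run([0.1, 0.5, 0.95, 0.2], TrainingMonitor(0.9, 3)))
    print(run([0.2, 0.2, 0.1, 0.2, 0.3], TrainingMonitor(1.0, 3)))
```

In a rule like this, either non-"continue" status should end the session and release the simulator resources so that billing stops. If neither condition ever fires, training keeps running after the browser is closed, which matches the symptom reported here.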

When you hit the problem, we already had a fix for this bug and were testing it for deployment. The fix was deployed on March 5, which is apparently shortly after you were working on your CartPole experiments.

Thanks again for reporting the problem. Apologies for any inconvenience this caused you.

Eric,

Thanks for the update.

No worries at all. I'm just happy I was able to help you identify and resolve the issue.

Thanks,

Matthew