SOLVED

how is the reward function calculated internally?

I am using an AnyLogic model... and I want to know how to define the reward function correctly. To do this I need to understand the underlying technique, since there are two possible approaches.
 
If I write in Inkling:
reward = obs.something;
 
what does Bonsai do in the background? There are two options.

OPTION 1:
Bonsai takes the difference between the last reward and the current reward. In other words, if reward = obs.something, it actually computes the difference between the current obs.something and the previous one, and accumulates that difference internally.
 
OPTION 2:
Bonsai ignores any previous reward; when I write reward = obs.something, it simply takes obs.something as this iteration's reward and adds it to the total internally.

Option 1 requires me to compute accumulated (cumulative) rewards in AnyLogic; Option 2 requires me to compute rewards that do NOT accumulate in AnyLogic.

What is the right way to do it?
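
To make the two options concrete, here is a minimal sketch of what each interpretation would require on the simulator side. This is illustrative Python-style pseudocode with made-up names (step_value, observation), not AnyLogic or Bonsai API code:

```python
# Sketch of what each interpretation would require the model to report.
# Names are illustrative only.

cumulative_value = 0.0

def report_option_1(step_value, observation):
    """OPTION 1: the model exposes a running total; Bonsai would have to
    diff consecutive values to recover the per-iteration reward."""
    global cumulative_value
    cumulative_value += step_value
    observation["something"] = cumulative_value

def report_option_2(step_value, observation):
    """OPTION 2: the model exposes only this iteration's value; Bonsai would
    sum those values itself over the episode."""
    observation["something"] = step_value
```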
1 Reply
best response confirmed by felipeharo100 (Occasional Contributor)
Solution

The reward that you calculate for each iteration is assumed by the platform to be specific to that iteration, not a cumulative value. As the policy is trained, it will attempt to maximize the cumulative rewards for each episode. In other words, you should assume OPTION 2.
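
As a conceptual sketch (not Bonsai internals), this is how the per-iteration reward feeds into what training optimizes, assuming the episode return is simply the sum of the iteration rewards:

```python
def episode_return(per_iteration_rewards):
    """The quantity the trained policy tries to maximize for one episode:
    the sum of the rewards reported at each iteration."""
    return sum(per_iteration_rewards)

# Example: if obs.something is 2, 3, 5 on three iterations, the episode
# return is 2 + 3 + 5 = 10. If the model instead reported cumulative values
# (2, 5, 10), the platform would over-count, because it does not take
# differences between consecutive rewards.
```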