A reward of 1000 is given for finding the goal (the black location), and a penalty of -1000 for attempting to move off the map. Rewards arrive immediately after the action. A random action is taken 10% of the time (an epsilon-greedy strategy with epsilon = 0.1). The discount factor is 0.9 and the learning rate is 0.1. The mover teleports back to the starting (white) location after reaching the goal. Try changing the start and goal locations after the agent has learned a solution and see how well it adjusts!
State | N | E | S | W |
---|---|---|---|---|
A | 0 | 0 | 0 | 0 |
B | 0 | 0 | 0 | 0 |
C | 0 | 0 | 0 | 0 |
D | 0 | 0 | 0 | 0 |
E | 0 | 0 | 0 | 0 |
F | 0 | 0 | 0 | 0 |
G | 0 | 0 | 0 | 0 |
H | 0 | 0 | 0 | 0 |
I | 0 | 0 | 0 | 0 |