- A maze-like problem
- The agent lives in a grid
- Walls block the agent’s path
- Noisy movement: actions do not always go as planned (see the transition sketch after this list)
- 80% of the time, the action North takes the agent North (if there is no wall there)
- 10% of the time, North takes the agent West; 10% of the time, East
- If there is a wall in the direction the agent would have been taken, the agent stays put
- The agent receives rewards each time step
- Small punishment each step (negative reward)
- Big rewards come at the end states
- Goal: maximize sum of rewards
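
To make the movement model concrete, here is a minimal sketch of the noisy transition function described above. The 80/10/10 split and the stay-put rule follow the bullets; the coordinate convention, the wall set, and the helper names are assumptions for illustration.

```python
# Noisy Grid World movement: intended direction with prob. 0.8,
# the two perpendicular directions with prob. 0.1 each.
NOISE = {"N": [("N", 0.8), ("W", 0.1), ("E", 0.1)],
         "S": [("S", 0.8), ("E", 0.1), ("W", 0.1)],
         "E": [("E", 0.8), ("N", 0.1), ("S", 0.1)],
         "W": [("W", 0.8), ("S", 0.1), ("N", 0.1)]}

# Assumed coordinate convention: y grows toward the North.
STEP = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def transition(state, action, walls, width, height):
    """Return a list of (next_state, probability) pairs, i.e. P(s' | s, a)."""
    outcomes = {}
    x, y = state
    for direction, prob in NOISE[action]:
        dx, dy = STEP[direction]
        nxt = (x + dx, y + dy)
        # If a wall (or the grid edge) blocks the move, the agent stays put.
        if nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
            nxt = state
        outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
    return list(outcomes.items())
```

For example, `transition((1, 1), "N", walls={(1, 2)}, width=4, height=3)` puts probability 0.8 on staying at (1, 1), since North is blocked, and 0.1 on each of the west and east neighbours.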
----------------------------------------------------------------------------------------------
- An MDP is defined by (see the sketch after this list):
- A set of states s ∈ S
- A set of actions a ∈ A
- A transition function T(s, a, s’)
- Probability that a from s leads to s’, i.e., P(s’| s, a)
- Also called the model or the dynamics
- A reward function R(s, a, s’)
- Sometimes just R(s) or R(s’)
- A start state
- A terminal state (optional)
- MDPs are non-deterministic search problems
- One way to solve them is with expectimax search
- We’ll have a new tool soon
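
One way to carry these pieces around in code is a small container. This is a minimal sketch, assuming a Python dataclass and a list-of-outcomes representation of the transition function; the field names and types are illustrative, not a fixed interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Hashable, List, Set, Tuple

State = Hashable
Action = Hashable


@dataclass
class MDP:
    states: List[State]                                    # S
    actions: List[Action]                                  # A
    # T(s, a) -> list of (s', P(s' | s, a)) pairs, i.e. the model / dynamics
    transition: Callable[[State, Action], List[Tuple[State, float]]]
    # R(s, a, s'); some formulations use just R(s) or R(s')
    reward: Callable[[State, Action, State], float]
    start: State
    terminals: Set[State] = field(default_factory=set)     # optional
```

The Grid World above fits this shape: states are cells, actions are the four compass directions, the transition function is the noisy one sketched earlier, and the reward function adds the small per-step penalty plus the big rewards at the terminal cells.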
----------------------------------------------------------------------------------------------
Value Iteration