Thursday, December 19, 2019

Markov Decision Process (MDP)

Example: Grid World

  • A maze-like problem
    • The agent lives in a grid
    • Walls block the agent’s path
  • Noisy movement: actions do not always go as planned (see the transition sketch after this list)
    • 80% of the time, the action North takes the agent North
      (if there is no wall there)
    • 10% of the time, North takes the agent West; 10% East
    • If there is a wall in the direction the agent would have been taken, the agent stays put
  • The agent receives a reward each time step
    • A small penalty (negative reward) for each step taken
    • Big rewards come at the terminal states
  • Goal: maximize sum of rewards
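
A minimal sketch of this noisy transition model in Python. The coordinate convention (y grows to the north), the wall set, and the bounds handling are illustrative assumptions, not details from the notes:

NOISY_OUTCOMES = {                     # intended action -> possible headings
    "N": ("N", "W", "E"), "S": ("S", "E", "W"),
    "E": ("E", "N", "S"), "W": ("W", "S", "N"),
}
PROBS = (0.8, 0.1, 0.1)                # 80% as planned, 10% each side
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def transitions(state, action, walls, width, height):
    """Return {next_state: probability} for one (state, action) pair."""
    result = {}
    for heading, p in zip(NOISY_OUTCOMES[action], PROBS):
        dx, dy = MOVES[heading]
        nxt = (state[0] + dx, state[1] + dy)
        # A wall or the grid edge blocks the move: the agent stays put.
        if nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
            nxt = state
        result[nxt] = result.get(nxt, 0.0) + p
    return result
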
----------------------------------------------------------------------------------------------
  • An MDP is defined by:
    • A set of states s ∈ S
    • A set of actions a ∈ A
    • A transition function T(s, a, s’)
      • The probability that a from s leads to s’, i.e., P(s’ | s, a)
      • Also called the model or the dynamics
    • A reward function R(s, a, s’)
      • Sometimes just R(s) or R(s’)
    • A start state
    • A terminal state (optional)
  • MDPs are non-deterministic search problems
    • One way to solve them is with expectimax search (a depth-limited sketch follows this list)
    • We’ll have a new tool soon
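
The (S, A, T, R) tuple above maps naturally onto a small container, and expectimax search can then evaluate it to a fixed depth. This is a hedged sketch: the Mdp class and its field names are my own illustration, not an API from the course.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

State = Tuple[int, int]

@dataclass
class Mdp:
    states: Set[State]
    actions: Callable[[State], List[str]]            # A(s): legal actions in s
    T: Callable[[State, str], Dict[State, float]]    # (s, a) -> {s': P(s'|s,a)}
    R: Callable[[State, str, State], float]          # reward for (s, a, s')
    start: State
    terminals: Set[State] = field(default_factory=set)

def expectimax_value(mdp: Mdp, s: State, depth: int) -> float:
    """Expected total reward of acting optimally for `depth` more steps."""
    if depth == 0 or s in mdp.terminals:
        return 0.0
    # Max over actions of the expectation over noisy outcomes.
    return max(
        sum(p * (mdp.R(s, a, s2) + expectimax_value(mdp, s2, depth - 1))
            for s2, p in mdp.T(s, a).items())
        for a in mdp.actions(s)
    )

Plain expectimax re-expands the same states over and over; the "new tool" teased above avoids that by keeping one value per state, as the next section shows.
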
----------------------------------------------------------------------------------------------
Value Iteration
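
Value iteration computes optimal state values by repeatedly applying the Bellman update V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ·V_k(s') ] until the values stop changing. A minimal sketch, reusing the hypothetical Mdp container above; the discount factor gamma and the convergence threshold are illustrative assumptions:

def value_iteration(mdp: Mdp, gamma: float = 0.9, eps: float = 1e-6):
    """Run Bellman updates to convergence; returns {state: optimal value}."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        V_next = {}
        for s in mdp.states:
            if s in mdp.terminals or not mdp.actions(s):
                V_next[s] = 0.0
                continue
            # One Bellman backup: best action under the current estimate V_k.
            V_next[s] = max(
                sum(p * (mdp.R(s, a, s2) + gamma * V[s2])
                    for s2, p in mdp.T(s, a).items())
                for a in mdp.actions(s)
            )
        if max(abs(V_next[s] - V[s]) for s in mdp.states) < eps:
            return V_next
        V = V_next
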