Thursday, December 19, 2019

Markov Decision Process (MDP)

Example: Grid World

  • A maze-like problem
    • The agent lives in a grid
    • Walls block the agent’s path
  • Noisy movement: actions do not always go as planned (see the transition sketch after this list)
    • 80% of the time, the action North takes the agent North
      (if there is no wall there)
    • 10% of the time, North takes the agent West; 10% East
    • If there is a wall in the direction the agent would have been taken, the agent stays put
  • The agent receives a reward each time step
    • A small penalty (negative reward) for each step taken
    • Big rewards come at the terminal states
  • Goal: maximize sum of rewards
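
A minimal sketch of this noisy transition model in Python. The coordinate convention (y grows to the north), the wall set, and the bounds handling are illustrative assumptions, not details from the notes:

NOISY_OUTCOMES = {                     # intended action -> possible headings
    "N": ("N", "W", "E"), "S": ("S", "E", "W"),
    "E": ("E", "N", "S"), "W": ("W", "S", "N"),
}
PROBS = (0.8, 0.1, 0.1)                # 80% as planned, 10% each side
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def transitions(state, action, walls, width, height):
    """Return {next_state: probability} for one (state, action) pair."""
    result = {}
    for heading, p in zip(NOISY_OUTCOMES[action], PROBS):
        dx, dy = MOVES[heading]
        nxt = (state[0] + dx, state[1] + dy)
        # A wall or the grid edge blocks the move: the agent stays put.
        if nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
            nxt = state
        result[nxt] = result.get(nxt, 0.0) + p
    return result
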
----------------------------------------------------------------------------------------------
  • An MDP is defined by:
    • A set of states s ∈ S
    • A set of actions a ∈ A
    • A transition function T(s, a, s’)
      • The probability that a from s leads to s’, i.e., P(s’ | s, a)
      • Also called the model or the dynamics
    • A reward function R(s, a, s’)
      • Sometimes just R(s) or R(s’)
    • A start state
    • A terminal state (optional)
  • MDPs are non-deterministic search problems
    • One way to solve them is with expectimax search (a depth-limited sketch follows this list)
    • We’ll have a new tool soon
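
The (S, A, T, R) tuple above maps naturally onto a small container, and expectimax search can then evaluate it to a fixed depth. This is a hedged sketch: the Mdp class and its field names are my own illustration, not an API from the course.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

State = Tuple[int, int]

@dataclass
class Mdp:
    states: Set[State]
    actions: Callable[[State], List[str]]            # A(s): legal actions in s
    T: Callable[[State, str], Dict[State, float]]    # (s, a) -> {s': P(s'|s,a)}
    R: Callable[[State, str, State], float]          # reward for (s, a, s')
    start: State
    terminals: Set[State] = field(default_factory=set)

def expectimax_value(mdp: Mdp, s: State, depth: int) -> float:
    """Expected total reward of acting optimally for `depth` more steps."""
    if depth == 0 or s in mdp.terminals:
        return 0.0
    # Max over actions of the expectation over noisy outcomes.
    return max(
        sum(p * (mdp.R(s, a, s2) + expectimax_value(mdp, s2, depth - 1))
            for s2, p in mdp.T(s, a).items())
        for a in mdp.actions(s)
    )

Plain expectimax re-expands the same states over and over; the "new tool" teased above avoids that by keeping one value per state, as the next section shows.
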
----------------------------------------------------------------------------------------------
Value Iteration
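
Value iteration computes optimal state values by repeatedly applying the Bellman update V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ·V_k(s') ] until the values stop changing. A minimal sketch, reusing the hypothetical Mdp container above; the discount factor gamma and the convergence threshold are illustrative assumptions:

def value_iteration(mdp: Mdp, gamma: float = 0.9, eps: float = 1e-6):
    """Run Bellman updates to convergence; returns {state: optimal value}."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        V_next = {}
        for s in mdp.states:
            if s in mdp.terminals or not mdp.actions(s):
                V_next[s] = 0.0
                continue
            # One Bellman backup: best action under the current estimate V_k.
            V_next[s] = max(
                sum(p * (mdp.R(s, a, s2) + gamma * V[s2])
                    for s2, p in mdp.T(s, a).items())
                for a in mdp.actions(s)
            )
        if max(abs(V_next[s] - V[s]) for s in mdp.states) < eps:
            return V_next
        V = V_next
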