Based on 2008 AAAI: Probabilistic Planning via Determinization in Hindsight
MDP formulation

An MDP is a tuple \(M = \{S, s_0, A, P, r\}\), with:
\(S\): finite set of states
\(s_0\): initial state, \(s_0 \in S\)
\(A\): finite set of actions
\(P\): transition function \(P(s' \mid s, a)\), giving the probability of reaching \(s'\) by executing \(a\) in state \(s\)
\(r\): reward function \(r(s' \mid s, a)\), or \(r(s)\) if the reward depends only on the state

MDP determinization

A T-horizon future is a mapping \(F : A \times S \times \{1, \dots, T\} \to [0, 1]\). Given \(M\) and \(F(a, s, t)\), \(M[F]\) denotes a determinization of \(M\). This can be constructed by partitioning the interval \([0, 1]\) into sub-intervals whose lengths equal the probabilities of each possible outcome of action \(a\), then selecting the determinized outcome as the one whose sub-interval contains \(F(a, s, t)\).

Value function and Q-value function

The value function maps a state to its optimal value for a T-horizon MDP problem: $$V^*(s, T) = \underset{\pi}{\operatorname{max}}\, E[R(s, F, \pi)]$$ The Q-value function maps a (state, action) pair to its optimal value for a T-horizon MDP problem: $$Q^*(s, a, T) = r(s) + E_{s'}[V^*(s', T-1)]$$ where the expectation is over next states \(s' \sim P(\cdot \mid s, a)\). At a given state \(s\), choosing the action \(a\) with the maximum Q-value is also known as receding-horizon optimal control: $$\pi^*(s, T) = \underset{a}{\operatorname{argmax}}\, Q^*(s, a, T)$$

Hindsight optimization

In theory, the true value and Q-value functions can be computed with value or policy (Q-value) iteration algorithms, but these are often intractable for practical problems. The hindsight optimization (HOP) algorithm approximates the true value function by swapping the maximization and the expectation: $$V_{hs}(s, T) = E[\underset{\pi}{\operatorname{max}}\, R(s, F, \pi)]$$ Since the policy inside the expectation may be chosen after the future \(F\) is revealed, \(V_{hs}\) is an upper bound on \(V^*\). Correspondingly, the Q-value is $$Q_{hs}(s, a, T) = r(s) + E[V_{hs}(s', T-1)]$$ with HOP policy: $$\pi_{hs}(s, T) = \underset{a}{\operatorname{argmax}}\, Q_{hs}(s, a, T)$$
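The partitioning step of determinization can be sketched in code. This is a minimal illustration, not the paper's implementation: the outcome labels and probabilities below are made up for the example, and the draw `u` plays the role of \(F(a, s, t)\).

```python
import bisect

def determinize(outcomes, probs, u):
    """Map a draw u in [0, 1) (the value F(a, s, t)) to a fixed outcome by
    partitioning [0, 1) into sub-intervals sized by each outcome's probability."""
    cum = []
    total = 0.0
    for p in probs:
        total += p
        cum.append(total)  # cumulative partition boundaries
    # the sub-interval containing u selects the determinized outcome
    return outcomes[bisect.bisect_right(cum, u)]

# Hypothetical stochastic action with three outcomes; partitions are
# [0, 0.2), [0.2, 0.7), [0.7, 1.0)
outcomes = ["s_left", "s_stay", "s_right"]
probs = [0.2, 0.5, 0.3]
print(determinize(outcomes, probs, 0.65))  # 0.65 falls in [0.2, 0.7) -> s_stay
```

With `u` fixed for every (action, state, step) triple, the MDP \(M[F]\) becomes fully deterministic and can be solved with a classical planner.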
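The effect of swapping max and expectation can be seen on a toy problem (my own illustrative example, not one from the paper): a single coin-guess step where a sampled future decides which of two actions pays reward 1. The best fixed policy earns 0.5 in expectation, but the hindsight estimate picks the winning action after seeing each future, so it returns 1.0, demonstrating that \(V_{hs} \ge V^*\).

```python
import random

ACTIONS = ["heads", "tails"]

def v_hindsight(num_samples=10000, seed=0):
    """Monte Carlo estimate of V_hs: sample futures, maximize inside the expectation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        f = rng.choice(ACTIONS)  # determinized future: which action pays off
        # knowing the future, the best policy always picks the paying action
        total += max(1.0 if a == f else 0.0 for a in ACTIONS)
    return total / num_samples

def v_true():
    """V*: maximize over fixed policies of the expected reward (each wins half the time)."""
    return max(0.5, 0.5)

print(v_hindsight(), v_true())  # 1.0 vs 0.5: the hindsight value is an upper bound
```

Despite this bias, ranking actions by \(Q_{hs}\) and replanning at every step (as in the receding-horizon policy above) works well in practice, which is the core of the HOP approach.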
