References:
 Course ppt by UIUC
 Udacity classes: TD(1) Rule, TD(1) Examples Parts 1 and 2, TD(0) Rule, TD(lambda), Summary of MDP Formulation
An agent interacts with its environment via perception and action. In each cycle:

- The agent receives its current state \(s\).
- The agent chooses an action \(a\) to apply.
- The action leads nondeterministically to a new state \(s'\) via interaction with the environment.
- The state transition is typically associated with a scalar reward \(r\).

The job of RL is to find a policy \(\pi\) that maps state \(s\) to action \(a\) so as to maximize a certain long-term reward. From these characterizations of the agent, it is clear that the MDP formulation is a suitable mathematical model:

- \(S\): finite set of states
- \(s_0\): initial state, \(s_0 \in S\)
- \(A\): finite set of actions
- \(P\): transition function \(P(s' \mid s, a)\) giving the probability of reaching \(s'\) by executing \(a\) in state \(s\)
- \(r\): instantaneous reward function \(r(s' \mid s, a)\), or \(r(s)\) if the reward is only state dependent

Long-term Rewards
Unlike the instantaneous reward \(r\), which only captures the reward linked to a specific state or state transition, a long-term reward \(R\) is defined over a certain temporal horizon \(T\). It characterizes how the agent should take the future into account when deciding which action to take now:
Finite-horizon Rewards:
\(R = E(\sum_{t=0}^{T} r_t) \)
Interpretation 1: effectively receding-horizon control; the agent always takes the T-step optimal action. Interpretation 2: on the first step the agent takes the T-step optimal action, on the next step the (T-1)-step optimal action, and so on, until it takes the 1-step optimal action. This is effectively a formulation of differential dynamic programming (DDP).
Average Reward:
\(R = \lim_{T \to \infty} E(\frac{1}{T} \sum_{t=0}^{T-1} r_t)\)
A policy that optimizes this reward is known as a gain-optimal policy. Since the reward early in an agent's life is overshadowed by the long-run average performance, a useful refinement is the bias-optimal policy, which prefers policies that maximize the long-run average reward and breaks ties by the initial extra reward. The average-reward model can be seen as the limiting case of the infinite-horizon discounted model as the discount factor approaches 1.
Delayed Reward:
Sometimes the agent has to take a long sequence of actions, receiving insignificant reward until it finally arrives at a state with high reward, as in goal-oriented probabilistic planning. Under this setup, learning which actions are desirable based on reward that can occur arbitrarily far in the future is a challenging task.
Discounted Infinite-horizon Reward:
\(R = E(\sum_{t=0}^{\infty} {\gamma}^{t} \cdot r_t)\)
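As a quick sanity check on the formula above, the discounted return of a reward sequence can be computed directly. The reward sequence and discount factor below are made-up values for illustration:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Constant reward 1 with gamma = 0.5: the partial sums of the geometric
# series 1 + 0.5 + 0.25 + ... approach 1 / (1 - gamma) = 2.
print(discounted_return([1.0] * 50, 0.5))
```

Because \(\gamma < 1\) geometrically down-weights the tail, the infinite sum is finite whenever the per-step rewards are bounded, which is exactly what makes this model tractable.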
This reward model plays a role similar to the finite-horizon reward but is mathematically more tractable. In the field of optimal control, for example, computing the finite-horizon discrete LQR policy involves solving the:
Finite-horizon Dynamic Riccati Equation: $$A^T P_k A - (A^T P_k B + N) (R + B^T P_k B)^{-1}(B^T P_k A + N^T) + Q = P_{k-1}$$
while a simpler stationary Riccati equation is solved when computing the infinite-horizon discrete LQR policy,
Infinite-horizon Stationary Riccati Equation: $$A^T P A - A^T P B (R + B^T P B)^{-1}B^T P A + Q = P$$
In fact, it can be shown that for the discounted infinitehorizon model, there exists an optimal deterministic stationary policy [1].
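To illustrate, the stationary Riccati equation can be solved by simple fixed-point iteration. The scalar system values \(A, B, Q, R\) below are made-up for illustration (here \(P\) denotes the Riccati solution, not the MDP transition function):

```python
# Fixed-point iteration on the scalar stationary Riccati equation:
# P = A^T P A - A^T P B (R + B^T P B)^{-1} B^T P A + Q
A, B, Q, R = 0.9, 1.0, 1.0, 1.0   # illustrative scalar system

P = Q                              # initial guess
for _ in range(1000):
    P_next = A * P * A - (A * P * B) * (B * P * A) / (R + B * P * B) + Q
    if abs(P_next - P) < 1e-12:    # converged to the fixed point
        break
    P = P_next

# At the fixed point, the stationary Riccati equation holds
residual = A * P * A - (A * P * B) * (B * P * A) / (R + B * P * B) + Q - P
print(abs(residual) < 1e-9)
```

In practice one would use a dedicated solver (e.g. SciPy's `solve_discrete_are`), but the iteration above mirrors how the stationary equation arises as the limit of the finite-horizon recursion.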
Definition of Optimal Policy w/ Known Model

By "known model", we mean one has complete knowledge of the state transition function \(P(s' \mid s, a)\) and the instantaneous reward function \(r(s' \mid s, a)\). Then dynamic programming techniques [1] can be used to lay the foundation of the mathematical representation of optimality. Without loss of generality, we use the discounted infinite-horizon reward model. Given any policy \(\pi\), we can calculate the value function governed by this policy; the optimal value function maximizes the expected discounted return over policies: $$V^*(s) = \underset{\pi}{\operatorname{max}} E(\sum_{t=0}^{\infty} \gamma^t r_t) $$ This unique optimal value function can equivalently be defined (as a value associated with state \(s\)) as the maximum over actions of the instantaneous reward plus the expected discounted value of the next state: $$V^*(s) = \underset{a}{\operatorname{max}} [r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s,a) V^*(s')] $$ The optimal policy is tightly linked to the optimal value function, obtained by taking the "argmax" instead of the "max": $$\pi^*(s) = \underset{a}{\operatorname{argmax}} [r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s,a) V^*(s')] $$ Sometimes we are interested not only in the optimal value function or the optimal policy (which map from a given state \(s\)), but also in the mapping from a state-action pair to an optimal value, known as the Q value. The Q value is simply the body of the maximization in the optimal value/policy definitions above: $$Q^*(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s,a) V^*(s') $$

Algorithms w/ Known Model
Policy Iteration Algorithms (Model Known):
The policy iteration algorithm starts with an arbitrary policy \(\pi\). Instead of interweaving the per-state update of policy and value (as done by the value iteration algorithm), it first solves for the value function \(V_{\pi}\) over the entire state space \(S\) under the current policy, then checks each state \(s\) to see whether an improvement can be made by choosing a different action. If so, it updates to a new policy \(\pi'\) and repeats with \(\pi = \pi'\):

Compute the value function of policy \(\pi\) by solving the equations $$V_{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} [P(s' \mid s, \pi(s)) V_{\pi}(s')]$$ Improve the policy at each state: $$\pi'(s) = \underset{a}{\operatorname{argmax}}\{ r(s, a) + \gamma \sum_{s' \in S} [P(s' \mid s, a) V_{\pi}(s')] \}$$

There are at most \(|A|^{|S|}\) distinct policies, and the sequence of policies is guaranteed to improve (or terminate) at each step. Therefore, the worst-case complexity of the policy iteration algorithm is exponential in the number of states.
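The evaluate-then-improve loop can be sketched in Python. The 2-state, 2-action MDP below (action 0 = stay, action 1 = switch; only staying in state 1 pays) is a made-up example, and policy evaluation is done by iterative sweeps rather than a linear solve for simplicity:

```python
# P[s][a][s2] = P(s2 | s, a); r[s][a] = expected instantaneous reward
P = [[[1.0, 0.0], [0.0, 1.0]],   # from state 0: stay -> 0, switch -> 1
     [[0.0, 1.0], [1.0, 0.0]]]   # from state 1: stay -> 1, switch -> 0
r = [[0.0, 0.0], [1.0, 0.0]]     # only staying in state 1 yields reward
gamma = 0.9
n_states, n_actions = 2, 2

pi = [0, 0]                      # start from an arbitrary policy
while True:
    # Policy evaluation: sweep V_pi to (near) convergence
    V = [0.0] * n_states
    for _ in range(2000):
        V = [r[s][pi[s]] + gamma * sum(P[s][pi[s]][s2] * V[s2]
                                       for s2 in range(n_states))
             for s in range(n_states)]
    # Policy improvement: greedy with respect to V_pi
    new_pi = [max(range(n_actions),
                  key=lambda a: r[s][a] + gamma * sum(
                      P[s][a][s2] * V[s2] for s2 in range(n_states)))
              for s in range(n_states)]
    if new_pi == pi:             # policy stable: optimal
        break
    pi = new_pi

print(pi)   # optimal policy: switch in state 0, stay in state 1
```

For this example the loop stabilizes at \(\pi^* = [1, 0]\) after one improvement step, consistent with the exponential worst-case bound being very pessimistic in practice.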
Value Iteration Algorithms (Model Known):
Greedy Full Backup
Given a finite state set \(S\) and a finite action set \(A\), update the value at the (\(k+1\))-th iteration:

- Loop over \(s \in S\)
- Loop over \(a \in A\)
- Calculate \(Q^{k}(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s,a) V^{k}(s')\) for each \(a\)
- Update the value by taking the maximum over all \(a\): \(V^{k+1}(s) = \underset{a}{\operatorname{max}} Q^{k}(s, a)\)

The value iteration algorithm is guaranteed to converge to the correct value \(V^*\). There is also a Bellman residual bound, which states that if the maximum difference between two successive value functions is less than \(\epsilon\), then the value of the greedy policy differs from the value of the optimal policy by no more than \(\frac{2\epsilon \gamma}{1-\gamma}\) at any state. This type of value iteration is also called a "greedy full backup" algorithm since: (1) the policy is obtained by choosing, in every state, the action that maximizes the estimated discounted reward, using the current estimate of the value function; and (2) it makes use of information from all possible successor states.
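The greedy full backup loop can be sketched on a made-up 2-state MDP (`P[s][a][s2]` is the transition probability \(P(s' \mid s, a)\), `r[s][a]` the expected reward; both are illustrative), stopping on the Bellman residual:

```python
P = [[[1.0, 0.0], [0.0, 1.0]],   # from state 0: stay -> 0, switch -> 1
     [[0.0, 1.0], [1.0, 0.0]]]   # from state 1: stay -> 1, switch -> 0
r = [[0.0, 0.0], [1.0, 0.0]]     # only staying in state 1 yields reward
gamma, eps = 0.9, 1e-10
n_states, n_actions = 2, 2

V = [0.0] * n_states
while True:
    # Full backup: each Q uses every possible successor state s2
    Q = [[r[s][a] + gamma * sum(P[s][a][s2] * V[s2]
                                for s2 in range(n_states))
          for a in range(n_actions)]
         for s in range(n_states)]
    V_new = [max(Q[s]) for s in range(n_states)]   # greedy over actions
    # Bellman residual stopping criterion
    if max(abs(V_new[s] - V[s]) for s in range(n_states)) < eps:
        V = V_new
        break
    V = V_new

print(V)   # converges toward V* = [9, 10] for this example
```

Here \(V^*(1) = 1/(1-\gamma) = 10\) (stay forever) and \(V^*(0) = \gamma V^*(1) = 9\) (switch once), so the printed values agree with the hand-computed optimum.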
Q-Learning
Other than the "full-backup" value iteration algorithm, there exists another class of "sample-backup" algorithms, notably Q-learning, which is critical to the operation of model-free methods. Rather than using all successor states, Q-learning evaluates the greedy policy while following a more exploratory scheme, updating the Q value as follows: $$Q^{k+1}(s, a) = Q^{k}(s, a) + \alpha \{ r + \gamma \underset{a'}{\operatorname{max}} Q^{k}(s', a') - Q^{k}(s, a) \}$$ Convergence holds with the following conditions met: (1) every pair \((s, a)\) is updated infinitely often; (2) \(s'\) is sampled from the distribution \(P(s' \mid s,a)\) when the model is known, or, more often, by executing the action and observing the environment in model-free approaches; (3) \(r\) is sampled with mean \(r(s,a)\) and bounded variance; (4) the learning rate \(\alpha\) decreases suitably slowly.
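The sample-backup update above can be sketched as tabular Q-learning with epsilon-greedy exploration. The tiny deterministic MDP and all hyperparameters below are illustrative choices, and the constant learning rate only approximates condition (4):

```python
import random

random.seed(0)
P = [[[1.0, 0.0], [0.0, 1.0]],   # from state 0: stay -> 0, switch -> 1
     [[0.0, 1.0], [1.0, 0.0]]]   # from state 1: stay -> 1, switch -> 0
r = [[0.0, 0.0], [1.0, 0.0]]     # only staying in state 1 yields reward
gamma, alpha, eps = 0.9, 0.1, 0.2
Q = [[0.0, 0.0], [0.0, 0.0]]

s = 0
for _ in range(200000):
    # Epsilon-greedy behavior policy: all pairs tried often (condition 1)
    if random.random() < eps:
        a = random.randrange(2)
    else:
        a = 0 if Q[s][0] >= Q[s][1] else 1
    # Sample s' from P(s' | s, a) and observe the reward (conditions 2, 3)
    s2 = 0 if random.random() < P[s][a][0] else 1
    # Sample-backup update toward r + gamma * max_a' Q(s', a')
    Q[s][a] += alpha * (r[s][a] + gamma * max(Q[s2]) - Q[s][a])
    s = s2

greedy = [0 if Q[s][0] >= Q[s][1] else 1 for s in range(2)]
print(greedy)   # recovers the optimal policy: switch in 0, stay in 1
```

Note that the greedy policy extracted at the end matches the optimal policy even though the behavior during learning was exploratory, illustrating the exploration insensitivity discussed next.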
"Q-learning is exploration insensitive: that is, the Q values will converge to the optimal values, independent of how the agent behaves while the data is being collected (as long as all state-action pairs are tried often enough).
This means that, although the exploration-exploitation issue must be addressed in Q-learning, the details of the exploration strategy will not affect the convergence of the learning algorithm. For these reasons, Q-learning is the most popular and seems to be the most effective model-free algorithm for learning from delayed reinforcement. It does not, however, address any of the issues involved in generalizing over large state and/or action spaces. In addition, it may converge quite slowly to a good policy." [2]
On-Policy vs. Off-Policy Value Iteration Algorithms
Q-learning Value Update: $$Q^{k+1}(s, a) = Q^{k}(s, a) + \alpha \{ r + \gamma \underset{a'}{\operatorname{max}} Q^{k}(s', a') - Q^{k}(s, a) \}$$
Q-learning is an off-policy learner:

- It learns the value of the optimal policy independently of the agent's actions.
- It estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy is followed, despite the fact that it is not following a greedy policy.

State-action-reward-state-action (SARSA) Algorithm: $$Q^{k+1}(s, a) = Q^{k}(s, a) + \alpha \{ r + \gamma Q^{k}(s', a') - Q^{k}(s, a) \}$$
SARSA is an on-policy learner:

- It learns the value of the policy being carried out by the agent, including the exploration steps.
- It estimates the return for state-action pairs assuming the current policy continues to be followed.

Algorithms w/ Unknown Model
Model-free Algorithms:
Reinforcement learning is primarily concerned with how to obtain the optimal policy when the model is not known in advance. Model-free approaches learn a controller without learning a model: they interact with the environment directly to obtain information which, by means of an appropriate algorithm, can be processed to produce an optimal policy.
Temporal credit assignment: "how do we know whether the action just taken is a good one, when it might have far-reaching effects? One strategy is to wait until the end and reward the actions taken if the result was good and punish them if the result was bad. In ongoing tasks, it is difficult to know what the end is, and this might require a great deal of memory. Instead, temporal difference and (in its model-free version) Q-learning algorithms use insights from value iteration to adjust the estimated value of a state based on the immediate reward and the estimated value of the next state." [2]
TD(1)
Episode \(T\): for all states \(s\), set eligibility \(e(s) = 0\) and, at the start of the episode, \(V_T(s) = V_{T-1}(s)\). In this episode, from a starting state, perform a forward simulation to obtain a sequence of state transitions; at each time \(t\), \(s_{t-1} \to s_{t}\) with reward \(r_t\): $$e(s_{t-1}) = e(s_{t-1}) + 1$$ For all states \(s\): $$V_T(s) = V_T(s) + \alpha_{T} [r_t + \gamma V_{T-1}(s_t) - V_{T-1}(s_{t-1})]e(s)$$ $$e(s) = \gamma e(s)$$
TD(0)
Episode \(T\): for all states \(s\), at the start of the episode, \(V_T(s) = V_{T-1}(s)\). In this episode, from a starting state, perform a forward simulation to obtain a sequence of state transitions; at each time \(t\), \(s_{t-1} \to s_{t}\) with reward \(r_t\). For the single state \(s = s_{t-1}\): $$V_T(s) = V_T(s) + \alpha_T [r_t + \gamma V_{T}(s_t) - V_{T}(s_{t-1})]$$
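A sketch of the TD(0) update in its common online form (values are updated in place during the episode rather than via the per-episode copies \(V_T\), \(V_{T-1}\)). The 3-state chain \(0 \to 1 \to 2\) (terminal), with reward 1 on entering state 2 and \(\gamma = 1\), is a made-up example whose true values are \(V(0) = V(1) = 1\):

```python
gamma, alpha = 1.0, 0.05
V = [0.0, 0.0, 0.0]              # states 0, 1, and terminal state 2

for episode in range(2000):
    s = 0
    while s != 2:
        s2 = s + 1               # deterministic forward simulation step
        reward = 1.0 if s2 == 2 else 0.0
        # Update only the state just left, toward r + gamma * V(s')
        V[s] += alpha * (reward + gamma * V[s2] - V[s])
        s = s2

print([round(v, 2) for v in V])  # approaches [1.0, 1.0, 0.0]
```

Note how the reward at the end of the chain propagates backward one state per episode at first, which is exactly the slow credit assignment that eligibility traces in TD(lambda) are designed to speed up.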
TD(Lambda)
Episode \(T\): for all states \(s\), set eligibility \(e(s) = 0\) and, at the start of the episode, \(V_T(s) = V_{T-1}(s)\). In this episode, from a starting state, perform a forward simulation to obtain a sequence of state transitions; at each time \(t\), \(s_{t-1} \to s_{t}\) with reward \(r_t\): $$e(s_{t-1}) = e(s_{t-1}) + 1$$ For all states \(s\): $$V_T(s) = V_T(s) + \alpha_{T} [r_t + \gamma V_{T-1}(s_t) - V_{T-1}(s_{t-1})]e(s)$$ $$e(s) = \lambda \gamma e(s)$$
Everything is identical to TD(1), except for the additional factor \(\lambda\) in the final eligibility-update step.
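The same made-up 3-state chain used for TD(0) can illustrate TD(lambda) with accumulating eligibility traces, again in online form with in-place value updates; \(\lambda\) and the other parameters are illustrative choices (\(\lambda = 1\) recovers TD(1), \(\lambda = 0\) recovers TD(0)):

```python
gamma, lam, alpha = 1.0, 0.9, 0.05
V = [0.0, 0.0, 0.0]                  # chain 0 -> 1 -> 2 (terminal)

for episode in range(2000):
    e = [0.0, 0.0, 0.0]              # eligibility reset each episode
    s = 0
    while s != 2:
        s2 = s + 1                   # deterministic forward simulation
        reward = 1.0 if s2 == 2 else 0.0
        e[s] += 1.0                  # mark the state just left
        delta = reward + gamma * V[s2] - V[s]
        for i in range(3):
            V[i] += alpha * delta * e[i]   # all eligible states share delta
            e[i] *= lam * gamma            # decay the traces
        s = s2

print([round(v, 2) for v in V])      # approaches [1.0, 1.0, 0.0]
```

Because the trace on state 0 is still nonzero when the final reward arrives, state 0 receives credit within the same episode instead of waiting for it to propagate back one state per episode as in TD(0).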
Model-based Algorithms:
Model-based methods aim to remedy the main disadvantages of the model-free approaches: inefficiency in gathering and using data, and slow convergence. But this class of methods is arguably less important here, since many of the problems we face have very complex models. It may be practically easier to use a model-free approach supported by deep learning algorithms, such as actor-critic.
Citations
[1] Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
[2] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.