- Course ppt by UIUC

- Udacity Classes: TD(1) Rule, TD(1) Examples Part 1, 2, TD(0) Rule, TD(lambda), Summary

An agent interacts with its environment via perception and action. In each cycle,

- The agent receives its current state (*s*),

- The agent chooses an action (*a*) to be applied to itself,

- The action leads non-deterministically to a new state (*s'*) via interaction with the environment,

- The state transition is typically associated with a scalar reward (*r*).

The job of RL is to find a policy (*π*) that maps state (*s*) to action (*a*) and maximizes a certain long-term reward. From these characterizations of the agent, it is clear that the MDP formulation is a suitable mathematical model:


- \(S\): finite set of states

- \(s_0\): initial state \( s_0 \in S \)

- \(A\): finite set of actions

- \(P\): transition function \( P(s′|s, a) \) giving the probability of reaching \(s′\) by executing \(a\) in state \(s\)

- \(r\): instantaneous reward function \(r(s' | s, a) \), or \(r(s)\) if reward is only state dependent.
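For concreteness, the components above can be written down directly as plain data structures. The following is a hypothetical two-state example; the states, actions, transition probabilities, and rewards are made up for illustration:

```python
# The MDP components as plain data structures, for a hypothetical
# two-state example (all numbers are made up).
S = ["s0", "s1"]                          # finite state set
s0 = "s0"                                 # initial state
A = ["stay", "move"]                      # finite action set

# P[(s, a)] maps s' -> P(s'|s, a)
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# r[(s, a)]: instantaneous reward for executing a in s
r = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}

# each transition distribution must sum to 1
for dist in P.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```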

Unlike the instantaneous reward (*r*), which only captures the reward linked to a specific state or state transition, a long-term reward (*R*) aggregates reward over a temporal horizon (*T*). This is useful for characterizing how the agent should take the future into account when deciding which action to take now:

\(R = E(\sum_{t=0}^{T} r_t) \)

Interpretation 1: effectively receding-horizon control; the agent always takes the T-step optimal action. Interpretation 2: the first step takes the T-step optimal action, the next step takes the (T-1)-step optimal action, and so on, until the last step takes the 1-step optimal action. This is effectively a formulation of differential dynamic programming (DDP).

\(R = \lim_{T \to \infty} E(\frac{1}{T} \sum_{t=0}^{T} r_t)\)

A policy that optimizes this reward is known as a *gain optimal* policy. One drawback is that an agent's initial rewards are overshadowed by its long-run average performance; refining the criterion gives the *bias optimal* policy, which still maximizes the long-run average reward but breaks ties in favor of the larger initial extra reward. The average-reward model can be seen as the limiting case of the infinite-horizon discounted model as the discount factor approaches 1.

Sometimes the agent has to take a long sequence of actions, receiving insignificant reward until finally arriving at a state with high reward, as in *goal-oriented probabilistic planning*. Under this setup, learning which actions are desirable based on reward that can occur arbitrarily far in the future is challenging.

\(R = E(\sum_{t=0}^{\infty} {\gamma}^{t} \cdot r_t)\)
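A quick numerical sanity check of this sum: with a constant reward \(r_t = 1\) and \(\gamma = 0.9\), the geometric series gives \(R = 1/(1-\gamma) = 10\):

```python
# Sanity check of the discounted sum: with constant reward r_t = 1 and
# gamma = 0.9, the geometric series gives R = 1 / (1 - gamma) = 10.
gamma = 0.9
R = sum(gamma**t * 1.0 for t in range(1000))   # 1000 terms approximate infinity
assert abs(R - 1.0 / (1.0 - gamma)) < 1e-6
```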

This reward is similar to the finite-horizon reward but mathematically more tractable. In the field of optimal control, for example, computing the finite-horizon discrete LQR policy involves solving:

Finite-horizon Dynamic Riccati Equation: $$A^T P_k A -(A^T P_k B + N) (R+B^T P_k B)^{-1}(B^T P_k A + N^T) + Q = P_{k-1}$$

while a simpler stationary Riccati equation is solved when computing the infinite-horizon discrete LQR policy,

Infinite-horizon Stationary Riccati Equation: $$A^T P A - A^T P B (R + B^T P B)^{-1}B^T P A + Q = P$$
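To see the connection between the two equations, here is a scalar (1-D) sketch with \(N = 0\) and made-up values for \(A\), \(B\), \(Q\), \(R\): iterating the finite-horizon recursion backwards in time converges to the fixed point of the stationary equation.

```python
# Scalar (1-D) illustration of the two Riccati equations, with N = 0 and
# hypothetical values for A, B, Q, R.
A, B, Q, R = 1.1, 0.5, 1.0, 1.0          # made-up example values

def riccati_step(P):
    # scalar finite-horizon step: P_{k-1} = A P_k A - (A P_k B)^2/(R + B P_k B) + Q
    return A * P * A - (A * P * B) ** 2 / (R + B * P * B) + Q

P = Q                                     # terminal condition P_T = Q
for _ in range(200):                      # a long horizon approximates T -> infinity
    P = riccati_step(P)

# P now satisfies the stationary equation P = riccati_step(P) (to tolerance)
assert abs(riccati_step(P) - P) < 1e-9
```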

In fact, it can be shown that for the discounted infinite-horizon model, there exists an __optimal deterministic stationary policy__ [1].

By "known model", we mean one has complete knowledge of the state transition function \( P(s′|s, a) \) and the instantaneous reward function \(r(s' | s, a) \). Then dynamic programming techniques [1] can be used to lay the foundation of the mathematical representation of optimality.

Without loss of generality, we use the discounted infinite-horizon reward model. Given any policy \(\pi\), we can calculate the value function governed by this policy: $$V_{\pi}(s) = E(\sum_{t=0}^{\infty} \gamma^t r_t \mid \pi, s_0 = s) $$

There is a unique optimal value function, defined (as a value associated with state \(s\)) as the maximum of the instantaneous reward plus the expected discounted value of the next state: $$V^*(s) = \underset{a}{\operatorname{max}} [r(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V^*(s')] $$ The definition of the optimal policy is tightly linked to that of the optimal value function; it simply takes the "argmax" instead of the "max": $$\pi^*(s) = \underset{a}{\operatorname{argmax}} [r(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V^*(s')] $$ Sometimes we are interested not only in the optimal value function or the optimal policy (mapped from a given state \(s\)), but also in the mapping from a state-action pair to an optimal value, known as the Q value. The Q value is just the quantity being optimized in the definitions above: $$Q^*(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V^*(s') $$

The policy iteration algorithm starts with a policy \(\pi\). But instead of interweaving the updates of policy and value per state (as the value iteration algorithm does), it first solves for the value function \(V_{\pi}\) of the entire state space \(S\) associated with the policy, then evaluates each state \(s\) to see if an improvement can be made by choosing a different action. If so, it updates to a new policy \(\pi'\):

Let \(\pi = \pi'\)

Compute the value function of policy \(\pi\) by solving the equation $$V_{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} [P(s'|s, \pi(s)) V_{\pi}(s')]$$

Improve the policy at each state: $$\pi'(s) = \underset{a}{argmax}\{ r(s, a) + \gamma \sum_{s' \in S} [P(s' | s, a) V_{\pi}(s')] \}$$ There are at most \(|A|^{|S|}\) distinct policies, and the sequence of policies improves (or terminates) at each step. Therefore, the number of iterations of the policy iteration algorithm is bounded by a number exponential in the number of states.
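The loop above can be sketched as follows on a hypothetical two-state MDP (all numbers are made up for illustration; policy evaluation is done by sweeping the fixed-policy Bellman equation rather than solving the linear system exactly):

```python
# A policy-iteration sketch on a hypothetical two-state MDP.
gamma = 0.9
S, A = ["s0", "s1"], ["stay", "move"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}}
r = {("s0", "stay"): 0.0, ("s0", "move"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "move"): 0.0}

def evaluate(pi):
    # iterate V(s) = r(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')
    V = {s: 0.0 for s in S}
    for _ in range(500):
        V = {s: r[(s, pi[s])] + gamma * sum(p * V[s2]
             for s2, p in P[(s, pi[s])].items()) for s in S}
    return V

pi = {s: "stay" for s in S}
while True:
    V = evaluate(pi)
    # improvement step: argmax over actions at every state
    new_pi = {s: max(A, key=lambda a: r[(s, a)] + gamma * sum(
        p * V[s2] for s2, p in P[(s, a)].items())) for s in S}
    if new_pi == pi:              # no state can be improved: pi is optimal
        break
    pi = new_pi
```

On this toy MDP the loop terminates with `pi = {"s0": "move", "s1": "stay"}` after two improvement rounds.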

Given a finite state set \(S\) and a finite action set \(A\), update the value at the (\(k+1\))-th iteration:

- Loop for \(s \in S\)

- Loop for \(a \in A\)

- Calculate \(Q^{k}(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V^{k}(s')\) for each \(a\)

- Update value by taking the maximum over all \(a\): \(V^{k+1}(s) = \underset{a}{\operatorname{max}} Q^{k}(s, a)\)

The value iteration algorithm guarantees convergence to the correct \(V^*\) value. There is also a Bellman residual bound, which states that if the maximum difference between two successive value functions is less than \(\epsilon\), then the value of the greedy policy differs from the value function of the optimal policy by no more than \(\frac{2\epsilon \gamma}{(1-\gamma)}\) at any state.
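A minimal sketch of the value iteration loop on a hypothetical two-state MDP (all numbers are made up), using the successive-difference test as the stopping criterion:

```python
# A value-iteration sketch on a hypothetical two-state MDP. Each sweep is a
# greedy full backup: compute Q from the current V over all successor
# states, then take the max over actions.
gamma = 0.9
S, A = ["s0", "s1"], ["stay", "move"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}}
r = {("s0", "stay"): 0.0, ("s0", "move"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "move"): 0.0}

V = {s: 0.0 for s in S}
while True:
    Q = {(s, a): r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
         for s in S for a in A}
    newV = {s: max(Q[(s, a)] for a in A) for s in S}
    # stop when successive value functions are close (Bellman residual test)
    if max(abs(newV[s] - V[s]) for s in S) < 1e-8:
        V = newV
        break
    V = newV
```

Here the iteration converges to \(V^*(s_1) = 1/(1-\gamma) = 10\) and \(V^*(s_0) = \gamma \cdot 10 = 9\).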

This type of value iteration is also called the "greedy full backup" algorithm since:

(1) greedy: the policy is obtained by choosing, in every state, the action that maximizes the estimated discounted reward, using the current estimate of the value function;

(2) full backup: the update makes use of information from all possible successor states.

Other than the "full-backup" value iteration algorithm, there exists another class of "sample-backup" algorithms, e.g., Q-learning, which is critical to the operation of model-free methods. Rather than using all successor states, Q-learning evaluates the greedy policy while following a more exploratory scheme, updating the Q value as follows: $$Q^{k+1}(s, a) = Q^{k}(s, a) + \alpha [ r + \gamma \underset{a'}{\operatorname{max}} Q^{k}(s', a') - Q^{k}(s, a) ]$$ Convergence is guaranteed with the following conditions met:

(1) Every state-action pair \((s, a)\) is updated infinitely often.

(2) \(s'\) is sampled from the distribution \(P(s'|s,a)\) when the model is known, or, more often in model-free approaches, obtained by executing \(a\) and observing the environment.

(3) \(r\) is sampled with mean \(r(s,a)\) and bounded variance.

(4) The learning rate \(\alpha\) decreases slowly.

"Q-learning is *exploration insensitive*: that is, that the Q values will __converge to the optimal values, independent of how the agent behaves while the data is being collected__ (as long as all state-action pairs are tried often enough).

This means that, although the exploration-exploitation issue must be addressed in Q-learning, __the details of the exploration strategy will not affect the convergence of the learning algorithm.__

For these reasons, Q-learning is the most popular and seems to be the most effective model-free algorithm for learning from delayed reinforcement. It does not, however, address any of the issues involved in generalizing over large state and/or action spaces. In addition, it may converge quite slowly to a good policy." [2]


Q-learning Value Update: $$Q^{k+1}(s, a) = Q^{k}(s, a) + \alpha [ r + \gamma \underset{a'}{\operatorname{max}} Q^{k}(s', a') - Q^{k}(s, a) ]$$

Q-learning is an **off-policy learner**:

- It learns the value of the optimal policy *independently of the agent's actions*.

- It estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.
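The sample-backup update and the exploratory behavior it tolerates can be sketched as follows, on a hypothetical deterministic two-state environment (the dynamics and all constants are made up for illustration):

```python
import random

# A tabular Q-learning sketch on a hypothetical deterministic two-state
# environment. The agent behaves epsilon-greedily, while the update
# bootstraps from the greedy (max) action, i.e., off-policy.
random.seed(0)
gamma, alpha, eps = 0.9, 0.1, 0.2
S, A = ["s0", "s1"], ["stay", "move"]

def step(s, a):
    # "move" toggles the state; reward 1 only for staying in s1
    s2 = s if a == "stay" else ("s1" if s == "s0" else "s0")
    return s2, (1.0 if (s, a) == ("s1", "stay") else 0.0)

Q = {(s, a): 0.0 for s in S for a in A}
s = "s0"
for _ in range(20000):
    # epsilon-greedy behavior policy: explore with probability eps
    a = random.choice(A) if random.random() < eps else max(A, key=lambda x: Q[(s, x)])
    s2, rew = step(s, a)
    # sample backup: only the sampled successor s2 is used, with a max over a'
    Q[(s, a)] += alpha * (rew + gamma * max(Q[(s2, x)] for x in A) - Q[(s, a)])
    s = s2
```

With enough exploration, the greedy policy read off from \(Q\) (move in `s0`, stay in `s1`) matches the optimal one regardless of how the exploratory behavior was arranged, illustrating the exploration insensitivity quoted above.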


State-action-reward-state-action (SARSA) Algorithm: $$Q^{k+1}(s, a) = Q^{k}(s, a) + \alpha \{ r + \gamma Q^{k}(s', a') - Q^{k}(s, a) \}$$

SARSA is an **on-policy learner**:

- It learns the value of the policy being carried out by the agent including the exploration steps.

- It estimates the return for state-action pairs assuming the current policy continues to be followed.
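The contrast with Q-learning is a one-line change in the update target, sketched below on the same kind of hypothetical toy environment (dynamics and constants are made up for illustration):

```python
import random

# A tabular SARSA sketch on a hypothetical deterministic two-state
# environment. The only change from Q-learning is the target: the action a'
# actually selected by the behavior policy replaces the max over a'.
random.seed(0)
gamma, alpha, eps = 0.9, 0.1, 0.2
S, A = ["s0", "s1"], ["stay", "move"]

def step(s, a):
    s2 = s if a == "stay" else ("s1" if s == "s0" else "s0")
    return s2, (1.0 if (s, a) == ("s1", "stay") else 0.0)

def choose(Q, s):
    # epsilon-greedy behavior policy (SARSA learns the value of this policy)
    return random.choice(A) if random.random() < eps else max(A, key=lambda x: Q[(s, x)])

Q = {(s, a): 0.0 for s in S for a in A}
s = "s0"
a = choose(Q, s)
for _ in range(20000):
    s2, rew = step(s, a)
    a2 = choose(Q, s2)                       # next action from the same policy
    # on-policy update: bootstrap from Q(s', a'), not max over a'
    Q[(s, a)] += alpha * (rew + gamma * Q[(s2, a2)] - Q[(s, a)])
    s, a = s2, a2
```

Because SARSA evaluates the exploring policy itself, its Q values here settle below the optimal ones (exploration occasionally forfeits the reward in `s1`).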


Reinforcement learning is primarily concerned with how to obtain the optimal policy when the model is not known in advance. **Model-free** approaches learn a controller without learning a model: *the agent interacts with its environment directly to obtain information which, by means of an appropriate algorithm, can be processed to produce an optimal policy.*

Temporal credit assignment: "how do we know whether the action just taken is a good one, when it might have far-reaching effects? One strategy is to __wait until the end and reward the actions taken if the result was good and punish them if the result was bad__. In ongoing tasks, it is __difficult to know what the end is__, and this might require a great deal of memory. Instead, **temporal difference & (model-free version) Q-learning algorithms** use insights from value iteration to __adjust the estimated value of a state based on the immediate reward and the estimated value of the next state.__" [2]


TD(1) rule. Episode \(T\):

For all states \(s\), set eligibility \(e(s) = 0\); at the start of the episode, \(V_T(s) = V_{T-1}(s)\).

In this episode, starting from some initial state, perform forward simulation to obtain a sequence of state transitions; at each time \(t\), \(s_{t-1} \to s_{t}\) with reward \(r_t\).

$$e(s_{t-1}) = e(s_{t-1}) + 1$$

For all states \(s\), $$V_T(s) = V_T(s) + \alpha_{T} [r_t + \gamma V_{T-1}(s_t) - V_{T-1}(s_{t-1})]e(s)$$ $$e(s) = \gamma e(s)$$

TD(0) rule. Episode \(T\):

For all states \(s\), at the start of the episode, \(V_T(s) = V_{T-1}(s)\).

In this episode, starting from some initial state, perform forward simulation to obtain a sequence of state transitions; at each time \(t\), \(s_{t-1} \to s_{t}\) with reward \(r_t\).

For the single state \(s = s_{t-1}\) $$V_T(s) = V_T(s) + \alpha_T [r_t + \gamma V_{T}(s_t) - V_{T}(s_{t-1})]$$

TD(\(\lambda\)) rule. Episode \(T\):

For all states \(s\), set eligibility \(e(s) = 0\); at the start of the episode, \(V_T(s) = V_{T-1}(s)\).

In this episode, starting from some initial state, perform forward simulation to obtain a sequence of state transitions; at each time \(t\), \(s_{t-1} \to s_{t}\) with reward \(r_t\).

$$e(s_{t-1}) = e(s_{t-1}) + 1$$

For all states \(s\), $$V_T(s) = V_T(s) + \alpha_{T} [r_t + \gamma V_{T-1}(s_t) - V_{T-1}(s_{t-1})]e(s)$$ $$e(s) = \lambda \gamma e(s)$$

Everything is identical to TD(1), except for the extra factor \(\lambda\) in the final eligibility-decay step.
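The episode loop above can be sketched as follows on a hypothetical three-state chain (the states, rewards, and the single fixed episode are made up; this is the online variant, which bootstraps from the current value estimates). Setting \(\lambda = 1\) recovers the TD(1) decay, while \(\lambda = 0\) zeroes the trace after one step, as in TD(0):

```python
# A TD(lambda) sketch on a hypothetical three-state chain with one fixed
# illustrative episode.
gamma, lam, alpha = 0.9, 0.5, 0.1
states = ["s0", "s1", "s2"]
V = {s: 0.0 for s in states}

# one episode: transitions (s_{t-1} -> s_t with reward r_t)
episode = [("s0", "s1", 0.0), ("s1", "s2", 1.0)]

e = {s: 0.0 for s in states}                 # eligibility, zeroed at episode start
for s_prev, s_next, rew in episode:
    e[s_prev] += 1.0                         # bump eligibility of the state just left
    delta = rew + gamma * V[s_next] - V[s_prev]   # TD error
    for s in states:
        V[s] += alpha * delta * e[s]         # all eligible states share the error
        e[s] *= lam * gamma                  # decay by lambda * gamma
```

After this episode the reward of 1 has been credited to `s1` directly and, through its decayed eligibility, partially to the earlier state `s0`.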

Model-based methods aim to remedy the following disadvantages of model-free approaches: inefficiency in gathering and using data, and slow convergence. But this class of methods is arguably less important, since many problems we face have very complex models; it may be practically easier to use a model-free approach supported by deep learning algorithms, such as actor-critic.

[1] Richard Bellman. *Dynamic Programming*. Princeton University Press, Princeton, NJ, 1957.

[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore. *Reinforcement Learning: A Survey*. Journal of Artificial Intelligence Research, 4:237–285, 1996.
