14 Sep 2025
Theory and Methods for Reinforcement Learning — Lec.2
Keypoints:
- Markov Chains
- Markov Decision Processes (MDPs)
- Policies
- Performance Criteria
- Value Functions
- Occupancy Measure
- Bellman Equations
- Solving MDPs: VI & PI
- From Planning to RL
- Monte Carlo vs TD Learning
Markov Chains Refresher
Definition: Markov Chain
- A time-homogeneous Markov chain is a stochastic process ${X_0, X_1, \dots}$ on a countable state space satisfying the Markov property:
\[\mathbb{P}(X_{t+1}=j\mid X_t=i, X_{t-1},\dots,X_0) = \mathbb{P}(X_{t+1}=j\mid X_t=i) = P_{ij}.\]
Markov Process (Tuple)
- $\langle \mathcal{S}, P, \mu \rangle$:
- State set: $\mathcal{S}$
- Transition model: $P_{ss'} = \mathbb{P}(s'\mid s)$, with $P: \mathcal{S} \to \Delta(\mathcal{S})$
- Initial distribution: $s_0 \sim \mu \in \Delta(\mathcal{S})$
Stationary Distribution (Ergodic Case)
- If the chain is irreducible and aperiodic with finite states (ergodic), there exists a unique stationary distribution $d^*$ and ${X_t}$ converges to it:
\[\lim_{t\to\infty} \big(P^t\big)_{ij} = d^*_j,\quad \forall i,j.\]
- Vector form: with $P$ the transition matrix and $d^*$ a row vector,
\[d^* = d^* P,\quad \text{i.e., } d^* \text{ is the left principal eigenvector of } P.\]
Difference from the Markov property: the Markov property only says that the next state depends on the current state; chains started from different initial states still have different distributions in the short term. Under ergodicity, however, as $t \to \infty$ these initial-state differences wash out, and the state distribution converges to the same stationary distribution regardless of the starting point.
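A quick numerical sanity check may help: a minimal NumPy sketch (the two-state chain below is an assumed toy example) verifying that the rows of $P^t$ converge to $d^*$ and that $d^* = d^* P$:

```python
import numpy as np

# Assumed toy example: a two-state ergodic chain with row-stochastic P.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Every row of P^t converges to the stationary distribution d*.
Pt = np.linalg.matrix_power(P, 100)
d_star = Pt[0]                       # any row works in the ergodic case
print(d_star)                        # ~ [0.8333, 0.1667]

# Fixed-point check: d* = d* P (d* is the left principal eigenvector of P).
assert np.allclose(d_star, d_star @ P)
```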
Markov Decision Processes (MDPs)
Controlled Markov Property
\(\mathbb{P}(s_{t+1}=s'\mid s_t=s, a_t=a, \dots, s_0, a_0) = \mathbb{P}(s_{t+1}=s'\mid s_t=s, a_t=a) = P(s'\mid s,a).\)
Definition: MDP
- Tuple $(\mathcal{S}, \mathcal{A}, P, r, \mu, \gamma)$:
- States: $\mathcal{S}$
- Actions: $\mathcal{A}$
- Transition: $P(s'\mid s,a): \mathcal{S}\times \mathcal{A} \to \Delta(\mathcal{S})$
- Reward: $r(s,a): \mathcal{S}\times \mathcal{A} \to \mathbb{R}$
- Initial distribution: $s_0\sim \mu\in\Delta(\mathcal{S})$
- Discount factor: $\gamma\in(0,1)$
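For the code sketches in the rest of these notes, a minimal tabular representation of this tuple is convenient. This layout (dense arrays, plus the `random_mdp` test helper) is an assumption of the sketches, not a prescribed interface:

```python
import numpy as np
from typing import NamedTuple

class TabularMDP(NamedTuple):
    """Finite MDP (S, A, P, r, mu, gamma) stored as dense arrays (assumed layout)."""
    P: np.ndarray      # shape (S, A, S): P[s, a, s'] = P(s' | s, a)
    r: np.ndarray      # shape (S, A):    r[s, a]
    mu: np.ndarray     # shape (S,):      initial state distribution
    gamma: float       # discount factor in (0, 1)

def random_mdp(S: int = 5, A: int = 3, gamma: float = 0.9, seed: int = 0) -> TabularMDP:
    """Hypothetical helper: a random MDP for exercising the sketches below."""
    rng = np.random.default_rng(seed)
    P = rng.random((S, A, S))
    P /= P.sum(axis=-1, keepdims=True)           # normalize rows to distributions
    return TabularMDP(P, rng.random((S, A)), np.ones(S) / S, gamma)
```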
Policies
Definition: Policy
- A policy maps histories $h_t=(s_0,a_0,\dots,s_{t-1},a_{t-1},s_t)$ to actions.
- Deterministic policy:
- Stationary policy: $ \pi:\mathcal{S}\to\mathcal{A}, \ a_t=\pi(s_t)$
- Markov policy: $ \pi_t:\mathcal{S}\to\mathcal{A}, \ a_t=\pi_t(s_t)$
- History-dependent policy: $ \pi_t:\mathcal{H}_t\to\mathcal{A}, \ a_t=\pi_t(h_t)$
- Stochastic policy:
- Stationary policy: $ \pi:\mathcal{S}\to\Delta(\mathcal{A}), \ a_t\sim\pi(\cdot\mid s_t)$
- Markov policy: $ \pi_t:\mathcal{S}\to\Delta(\mathcal{A}), \ a_t\sim\pi_t(\cdot\mid s_t)$
- History-dependent policy: $ \pi_t:\mathcal{H}_t\to\Delta(\mathcal{A}), \ a_t\sim\pi_t(\cdot\mid h_t)$
- Infinite-horizon discounted objectives admit an optimal stationary deterministic policy.
- Finite-horizon objectives generally require a (nonstationary) deterministic Markov policy.
Reminder: Objective Choices
- Finite horizon: cumulative reward; average reward.
- Infinite horizon: discounted reward; average reward.
Discounted Return
\(J(\pi) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\Big|\; s_0\sim\mu,\; a_t\sim\pi(\cdot\mid s_t),\; s_{t+1}\sim P(\cdot\mid s_t,a_t) \Big].\)
Observations
- If $\gamma=1$, total reward can diverge (e.g., cyclic processes).
- With $\gamma\in(0,1)$ and bounded rewards ($\lvert r\rvert<\infty$), the return is finite.
Value Functions Definitions
State-Value Function
\(V^{\pi}(s) := \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t,a_t) \;\middle|\; s_0=s,\, \pi \right].\)
Action-Value Function (Q-function)
\(Q^{\pi}(s,a) := \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t,a_t) \;\middle|\; s_0=s,\, a_0=a,\, \pi \right].\)
Relations between $V^{\pi}$ and $Q^{\pi}$
- For any policy $\pi: \mathcal{S}\to\Delta(\mathcal{A})$:
\[Q^{\pi}(s,a) = r(s,a) + \gamma \sum_{s'\in\mathcal{S}} P(s'\mid s,a)\, V^{\pi}(s'), \tag{1}\] \[V^{\pi}(s) = \sum_{a\in\mathcal{A}} \pi(a\mid s)\, Q^{\pi}(s,a). \tag{2}\]
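Relations (1)–(2) translate directly into two array one-liners under the `TabularMDP` layout assumed above (`pi` here is a stochastic policy stored as an $|\mathcal S|\times|\mathcal A|$ matrix):

```python
def q_from_v(mdp, V):
    """Eq. (1): Q(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')."""
    return mdp.r + mdp.gamma * mdp.P @ V          # shape (S, A)

def v_from_q(pi, Q):
    """Eq. (2): V(s) = sum_a pi(a|s) Q(s,a)."""
    return (pi * Q).sum(axis=1)                   # shape (S,)
```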
Occupancy Measure Definition
For initial distribution $\mu$ and policy $\pi$, the (discounted) occupancy measure is \(\lambda^{\pi}_{\mu}(s,a) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}[\, s_t=s,\ a_t=a\mid s_0\sim\mu,\, \pi\,].\)
Intuition: $\lambda^{\pi}_{\mu}(s,a)$ represents the (normalized) discounted frequency of visiting state-action pair $(s,a)$ when starting from initial distribution $\mu$ and following policy $\pi$.
Relation to Value
\(V^{\pi}(\mu) = \frac{1}{1-\gamma} \sum_{s,a} \lambda^{\pi}_{\mu}(s,a)\, r(s,a) = \frac{\langle \lambda^{\pi}_{\mu}, r\rangle}{1-\gamma}.\)
In essence, $V^{\pi}(\mu)$, the expected cumulative reward when starting from $s \sim \mu$ and following policy $\pi$, is the sum of the rewards $r(s,a)$ of all state-action pairs $(s,a)$, each weighted by how often that pair is visited along trajectories (as given by the occupancy $\lambda^{\pi}_{\mu}(s,a)$).
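A sketch (same assumed tabular layout) that computes $\lambda^{\pi}_{\mu}$ by propagating the state distribution forward with a truncated discounted sum, then checks that it normalizes to a distribution and recovers the value identity above:

```python
import numpy as np

def occupancy(mdp, pi, T: int = 1000):
    """Discounted occupancy lambda(s,a), truncated at horizon T."""
    S, A = mdp.r.shape
    P_pi = np.einsum('sa,sap->sp', pi, mdp.P)       # induced state-to-state kernel
    d, lam = mdp.mu.copy(), np.zeros((S, A))
    for t in range(T):
        lam += (mdp.gamma ** t) * d[:, None] * pi   # adds gamma^t * P[s_t=s, a_t=a]
        d = d @ P_pi                                # distribution of s_{t+1}
    return (1 - mdp.gamma) * lam

mdp = random_mdp()                                  # helper from the sketch above
pi = np.ones_like(mdp.r) / mdp.r.shape[1]           # uniform policy (assumed example)
lam = occupancy(mdp, pi)
assert np.isclose(lam.sum(), 1.0)                   # lambda is a distribution
V_mu = (lam * mdp.r).sum() / (1 - mdp.gamma)        # V(mu) = <lambda, r> / (1 - gamma)
```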
Optimal Value Functions
- Optimal state-value function: \(V^*(s) := \max_{\pi \in \Pi} V^{\pi}(s)\)
- Optimal action-value function: \(Q^*(s,a) := \max_{\pi \in \Pi} Q^{\pi}(s,a)\)
- $V^*$ and $Q^*$ satisfy the Bellman optimality equations: \(V^*(s) = \max_{a\in\mathcal{A}} Q^*(s,a), \qquad Q^*(s,a) = r(s,a) + \gamma \sum_{s'\in\mathcal{S}} P(s'\mid s,a)\, V^*(s').\)
Solving MDPs: find the optimal policy
- Goal: find an optimal policy $\pi^\star$ such that \[ V^{\pi^\star}(s)=V^\star(s):=\max_{\pi} V^\pi(s),\quad \forall s\in\mathcal S. \] Note: the optimal policy need not be unique, but $V^\star$ is unique.
Bellman Optimality Conditions
\[ V^\star(s)=\max_{a\in\mathcal A}\Big[r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V^\star(s')\Big]. \]
Existence of an Optimal Policy
- In infinite-horizon discounted MDPs there exists a deterministic, stationary optimal policy $\pi^\star$ such that \[ V^{\pi^\star}(s)=V^\star(s),\quad Q^{\pi^\star}(s,a)=Q^\star(s,a). \] Moreover, \[ \pi^\star(s)=\arg\max_{a\in\mathcal A} Q^\star(s,a). \]
Bellman Consistency
- \[ V^\pi(s)=\mathbb E_{a\sim\pi(\cdot\mid s)}\Big[r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V^\pi(s')\Big]. \]
- Matrix form: \[ V_\pi = R_\pi + \gamma P_\pi V_\pi, \] where:
  - $V_\pi\in\mathbb R^{n}$ with $(V_\pi)_s = V^\pi(s)$;
  - $R_\pi\in\mathbb R^{n}$ with $(R_\pi)_s = \sum_{a}\pi(a\mid s)\, r(s,a)$;
  - $P_\pi\in\mathbb R^{n\times n}$ with $(P_\pi)_{s,s'} = \sum_{a}\pi(a\mid s)\, P(s'\mid s,a)$.
- Closed form: \[ V_\pi = (I-\gamma P_\pi)^{-1} R_\pi. \] The computational cost is $\mathcal O(|\mathcal S|^3+|\mathcal S|^2|\mathcal A|)$.
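A minimal policy-evaluation sketch using the closed form (same assumed tabular layout; it solves the linear system rather than forming the inverse, which is the standard numerical choice):

```python
import numpy as np

def policy_evaluation_exact(mdp, pi):
    """V_pi = (I - gamma P_pi)^{-1} R_pi via a linear solve."""
    P_pi = np.einsum('sa,sap->sp', pi, mdp.P)    # (P_pi)_{s,s'}
    R_pi = (pi * mdp.r).sum(axis=1)              # (R_pi)_s
    S = len(R_pi)
    return np.linalg.solve(np.eye(S) - mdp.gamma * P_pi, R_pi)
```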
Bellman Expectation Operator and Fixed-point Perspective
- Let $\Gamma^\pi:\mathbb R^{|\mathcal S|}\to\mathbb R^{|\mathcal S|}$, \[ \Gamma^\pi V := R_\pi + \gamma P_\pi V. \]
- $V_\pi$ is a fixed point of $\Gamma^\pi$, i.e., $V_\pi=\Gamma^\pi V_\pi$.
- $\Gamma^\pi$ is the iteration operator for policy evaluation.
Bellman Optimality Equations & Operator
- \[ \begin{aligned} V^\star(s)&=\max_{a}\Big[r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V^\star(s')\Big],\\ Q^\star(s,a)&=r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\max_{a'}Q^\star(s',a'). \end{aligned} \]
- Bellman optimality operator $\Gamma:\mathbb R^{|\mathcal S|}\to\mathbb R^{|\mathcal S|}$: \[ (\Gamma V)(s)=\max_{a}\Big[r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V(s')\Big]. \] Here $\Gamma$ denotes an operation (an operator), and $V^\star$ is the fixed point of $\Gamma$.
Contraction of $\Gamma$
- For any $V,V'\in\mathbb R^{|\mathcal S|}$, \[ \|\Gamma V' - \Gamma V\|_\infty \le \gamma\, \|V' - V\|_\infty. \] By the Banach fixed-point theorem, $\Gamma$ therefore has a unique fixed point ($V^\star$), and fixed-point iteration converges to it from any initialization.
Solving MDPs: Value Iteration and Policy Iteration
Value Iteration (VI)
- Algorithm (fixed-point iteration):
  1) Choose an initial value $V_0$ (e.g., $V_0(s)=0$).
  2) Iterate: \(V_{t+1} = \Gamma\,V_t.\)
  3) Repeat until convergence or until a stopping criterion holds (e.g., $\|V_{t+1}-V_t\|_\infty \le \varepsilon$).
- Extract $\pi^*$ from the converged $V$: \(\pi^*(s) = \arg\max_{a\in\mathcal A} \Big[ r(s,a) + \gamma \sum_{s'} P(s'\mid s,a) V(s') \Big].\)
- Alternatively, iterate directly in $Q$-space: \(Q_{t+1}(s,a) = r(s,a) + \gamma \sum_{s'\in\mathcal S} P(s'\mid s,a) \max_{a'\in\mathcal A} Q_t(s',a'),\quad \pi(s) \in \arg\max_{a} Q_t(s,a).\)
Why use $Q$ here? Because $Q^{\pi}(s,a)$ directly gives the value of taking action $a$ in state $s$, so the greedy policy can be read off without a further lookahead through the model.
- Recall: \(Q^{\pi}(s,a) = r(s,a) + \gamma \sum_{s'} P(s'\mid s,a) V^{\pi}(s'),\tag{1}\) \(V^{\pi}(s) = \sum_{a\in\mathcal A} \pi(a\mid s)\, Q^{\pi}(s,a).\tag{2}\)
- Linear convergence: \(\|V_t - V^*\|_\infty \le \gamma^t\, \|V_0 - V^*\|_\infty.\)
Value iteration first finds the optimal value function $V^*$ (or $Q^*$), then derives the optimal policy $\pi^*$ by greedy selection.
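A minimal VI sketch under the assumed tabular layout, returning both the value and the greedy policy:

```python
import numpy as np

def value_iteration(mdp, eps: float = 1e-8):
    """Fixed-point iteration V <- Gamma V, then greedy policy extraction."""
    V = np.zeros(mdp.r.shape[0])
    while True:
        Q = mdp.r + mdp.gamma * mdp.P @ V        # one-step lookahead, shape (S, A)
        V_new = Q.max(axis=1)                    # (Gamma V)(s)
        if np.max(np.abs(V_new - V)) <= eps:     # sup-norm stopping criterion
            return V_new, Q.argmax(axis=1)       # approx. V*, greedy pi*
        V = V_new
```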
Policy Improvement
- Intuition: start from an initial policy $\pi$ and iteratively perform:
- Evaluate $V^{\pi}$ (or $Q^{\pi}$) for current policy $\pi$.
- Improve policy $\pi$ by making it greedy w.r.t. $V^{\pi}$ (or $Q^{\pi}$).
- (Policy improvement theorem) If there exists a policy $\pi'$ such that for all $s\in\mathcal S$, \(Q^{\pi}(s,\pi'(s)) \ge V^{\pi}(s),\) then $V^{\pi'}(s) \ge V^{\pi}(s)$ for every $s$; moreover, if $\pi$ is not optimal, the inequality is strict ($>$) at some state.
- Corollary: the policy can be improved by acting greedily with respect to $V^{\pi}$ (via $Q^{\pi}$): \(\pi_{t+1}(s) \in \arg\max_{a\in\mathcal A} Q^{\pi_t}(s,a).\)
Policy Iteration (PI)
- Algorithm (iterate directly in policy space):
  1) Initialize a policy $\pi_0$.
  2) For $t=0,1,2,\dots$ repeat:
  - Policy evaluation (compute $V^{\pi_t}$):
    - Iterative method: $V \leftarrow \Gamma^{\pi_t} V$ until convergence;
    - Closed form: $V^{\pi_t} = (I - \gamma P^{\pi_t})^{-1} R^{\pi_t}$.
  - Policy improvement (greedy update): \(\pi_{t+1}(s) \in \arg\max_{a\in\mathcal A} \Big[ r(s,a) + \gamma \sum_{s'} P(s'\mid s,a) V^{\pi_t}(s') \Big].\)
- Linear convergence: \(\|V^{\pi_t} - V^*\|_\infty \le \gamma^t\, \|V^{\pi_0} - V^*\|_\infty.\)
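A minimal PI sketch (assumed tabular layout), using the closed-form evaluation step and stopping when the greedy policy stops changing:

```python
import numpy as np

def policy_iteration(mdp):
    """Alternate exact policy evaluation and greedy improvement."""
    S, A = mdp.r.shape
    pi = np.zeros(S, dtype=int)                  # arbitrary initial policy
    while True:
        # Policy evaluation (closed form): V = (I - gamma P_pi)^{-1} R_pi.
        P_pi = mdp.P[np.arange(S), pi]           # row s is P(. | s, pi(s))
        R_pi = mdp.r[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - mdp.gamma * P_pi, R_pi)
        # Policy improvement: greedy w.r.t. the one-step lookahead.
        pi_new = (mdp.r + mdp.gamma * mdp.P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):           # converged (finitely many policies)
            return V, pi
        pi = pi_new
```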
VI vs. PI
| Algorithm | Value Update | Policy Update |
| --- | --- | --- |
| Value Iteration (VI) | $V_{t+1} = \Gamma V_t$ | None |
| Policy Iteration (PI) | $V_{t+1} = \mathbb{E}_{\pi_t}\!\left[ r(s,a) + \gamma \sum_{s' \in \mathcal S} P(s' \mid s,a)\, V_t(s') \right]$ | Greedy policy |

| Algorithm | Per-iteration cost | Number of iterations | Output |
| --- | --- | --- | --- |
| Value Iteration (VI) | $\mathcal{O}(\lvert \mathcal S\rvert^2 \lvert \mathcal A\rvert)$ | $T = \mathcal{O}\!\left(\frac{\log(\epsilon(1-\gamma))}{\log \gamma}\right)$ | $V_T$ such that $\lVert V_T - V^*\rVert_\infty \le \epsilon$ |
| Policy Iteration (PI) | $\mathcal{O}(\lvert \mathcal S\rvert^3 + \lvert \mathcal S\rvert^2 \lvert \mathcal A\rvert)$ | $T = \mathcal{O}\!\left(\frac{\lvert \mathcal S\rvert(\lvert \mathcal A\rvert-1)}{1-\gamma}\right)$ | $V^*$ and $\pi^*$ |
- PI converges in a finite number of iterations, whereas VI does not.
- These solution methodologies are broadly known as model-based RL.
From Planning to RL
Fundamental Challenges:
- VI and PI require knowledge of the transition model $P$ and reward function $r$.
- This calls for sampling-based approaches.
- The computation and memory costs grow rapidly with the size of the state and action spaces (the “curse of dimensionality”).
- This calls for new, compact representations (e.g., function approximation).
Categories of RL Algorithms:
- Model-based RL: Learn the model $P$ and $r$, then plan using VI or PI.
- Experience $\{(s,a,s',r)\}$ → learn transition $P(s' \mid s,a)$ and reward $r(s,a)$ → derive policy $\pi(a \mid s)$ by planning (VI/PI).
- Model-free RL: Directly learn the policy $\pi$ or value functions $V^\pi$, $Q^\pi$ from experience without explicitly modeling $P$ and $r$.
- Experience $\{(s,a,s',r)\}$ → learn policy $\pi(a \mid s)$ or value functions $V^\pi(s)$, $Q^\pi(s,a)$ → action selection via $\pi(a \mid s)$.
- On-policy vs. Off-policy:
- On-policy: learn from data collected by the policy currently being improved, while interacting with the environment; this raises the exploration–exploitation trade-off.
- Off-policy: learn about the target policy from data generated by a different behavior policy (e.g., a fixed dataset of experiences).
Representation Learning:
- Use function approximators (e.g., neural networks) to represent policies and value functions.
Mean Estimation
- Goal: given a sequence of samples $X_1, X_2, \dots, X_n$, estimate the mean $\mu=\mathbb E[X]$.
- Sample average (with incremental form): \(\hat\mu_n = \frac{1}{n}\sum_{i=1}^n X_i, \qquad \hat\mu_n = \hat\mu_{n-1} + \frac{1}{n}\big(X_n - \hat\mu_{n-1}\big).\)
- Stochastic Approximation: \(\mu_{n+1} = \mu_n + \alpha_n\big(X_{n+1} - \mu_n\big),\quad n=0,1,2,\dots\) where $\alpha_n$ is the step size.
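A sketch contrasting the two update rules on synthetic data (the Gaussian samples are an assumed example): the $\alpha_n = 1/n$ schedule reproduces the sample average exactly, while a constant step size keeps adapting but stays noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(loc=2.0, scale=1.0, size=10_000)   # samples with true mean 2.0

mu_avg, mu_sa = 0.0, 0.0
for n, x in enumerate(xs, start=1):
    mu_avg += (x - mu_avg) / n        # sample average: alpha_n = 1/n
    mu_sa += 0.01 * (x - mu_sa)       # constant step: can track nonstationary means

print(mu_avg, mu_sa)                  # both close to 2.0
```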
Model-free Prediction
- Goal: Under a given policy $\pi$, estimate $V^{\pi}(s)$ or $Q^{\pi}(s,a)$ only from trajectories (episodes): \(V^{\pi}(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t,a_t)\,\Big|\, s_0=s\Big].\)
Monte Carlo (MC) Method
- Idea: Use the mean of returns $G_t = r_t + \gamma r_{t+1} + \dots$ observed after visiting $s$ to estimate $V^{\pi}(s)$.
- Algorithm (pseudo-code):
1) For each episode $\tau=\{s_0,a_0,r_0,s_1,\dots\}$ generated by $\pi$:
2) For each visit to state $s_t$ in the episode:
   - Compute the return $G_t = r_t + \gamma r_{t+1} + \cdots$
   - Update the visit count: $n_{s_t} \leftarrow n_{s_t}+1$
   - Update the value: \(V(s_t) \leftarrow V(s_t) + \frac{1}{n_{s_t}}\big(G_t - V(s_t)\big).\)
- Key points:
- No model required; estimates are independent across states; slow for long episodes.
- Convergence: if each state is visited infinitely often, then $V \to V^{\pi}$.
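A minimal every-visit MC prediction sketch; `sample_episode(pi)` is an assumed helper that returns one episode as a list of $(s_t, r_t)$ pairs generated under $\pi$:

```python
from collections import defaultdict

def mc_prediction(sample_episode, pi, gamma: float, num_episodes: int = 1000):
    """Every-visit Monte Carlo estimate of V^pi (tabular)."""
    V, n = defaultdict(float), defaultdict(int)
    for _ in range(num_episodes):
        episode = sample_episode(pi)          # [(s_0, r_0), (s_1, r_1), ...]
        G = 0.0
        for s, r in reversed(episode):        # backward pass: G_t = r_t + gamma * G_{t+1}
            G = r + gamma * G
            n[s] += 1
            V[s] += (G - V[s]) / n[s]         # incremental mean of observed returns
    return V
```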
Temporal Difference (TD) Learning
- From Bellman consistency: \(V^{\pi}(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}\Big[r(s,a) + \gamma\, \mathbb{E}_{s'\sim P(\cdot|s,a)} V^{\pi}(s')\Big].\)
- Online incremental update: \(V(s_t) \leftarrow V(s_t) + \alpha_t \underbrace{(r_t + \gamma V(s_{t+1}) - V(s_t))}_{\text{TD error:=} \delta_t}\)
- Algorithm (pseudo-code):
1) For each step $t$ in episode $\tau=\{s_0,a_0,r_0,s_1,\dots\}$:
   1) Observe $(s_t,a_t,r_t,s_{t+1})$ sampled by $\pi$.
   2) Update $V$ by the rule above.
- Uses a one-step bootstrap: low variance but a biased target.
- Model-free; works with incomplete episodes; suitable for continuing tasks.
- Convergence: if each state is visited infinitely often and $\alpha_t$ satisfies Robbins–Monro conditions, then $V \to V^{\pi}$.
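A TD(0) sketch; the `env.reset()`/`env.step(a)` interface and the callable policy `pi(s)` are assumptions in the style of a Gym-like environment:

```python
from collections import defaultdict

def td0_prediction(env, pi, gamma: float, alpha: float = 0.05,
                   num_episodes: int = 1000):
    """TD(0) estimate of V^pi via one-step bootstrapped updates."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(pi(s))            # assumed interface
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])              # alpha * TD error
            s = s_next
    return V
```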
Comparison: DP vs MC vs TD
- DP: no sampling; exploits the Markov property.
- MC: sampling, model-free, does not exploit the Markov property. Unbiased, but higher variance since each return depends on many random steps.
- TD: sampling, model-free, exploits the Markov property, online. Biased, but lower variance since the target depends only on the next step.
MC error can be written as the sum of discounted TD errors over the episode (assuming $V$ is held fixed within the episode): \(\begin{aligned} G_t - V(s_t) &= r_{t} + \gamma G_{t+1} - V(s_t) \\ &= r_{t} + \gamma G_{t+1} - V(s_t) + \gamma V(s_{t+1}) - \gamma V(s_{t+1}) \\ &= r_{t} + \gamma V(s_{t+1}) - V(s_t) + \gamma (G_{t+1} - V(s_{t+1})) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 (G_{t+2} - V(s_{t+2})) \\ &= \cdots \\ &= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k. \end{aligned}\)
Multi-step TD: n-step Returns and Updates
- Definition (given episode termination time $T$): \(\begin{aligned} G_t^{(1)} &= r_{t+1} + \gamma V(s_{t+1}) \quad &\text{(TD(0) one-step)}\\ G_t^{(2)} &= r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})\\ G_t^{(n)} &= r_{t+1} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}),\\ G_t^{(\infty)} &= r_{t+1} + \cdots + \gamma^{T-t-1} r_T \quad &\text{(MC, when } t+n\ge T) \end{aligned}\)
- n-step TD update: \(V(s_t) \leftarrow V(s_t) + \alpha_t \underbrace{\big(G_t^{(n)} - V(s_t)\big)}_{\text{n-step TD error}}.\)
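A sketch of the n-step target computed from a stored trajectory (lists `states` and `rewards`, with `rewards[t]` playing the role of $r_{t+1}$ in the convention above); it falls back to the MC return when $t+n$ reaches the episode end:

```python
def n_step_return(rewards, states, V, t: int, n: int, gamma: float):
    """G_t^{(n)} = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n})."""
    T = len(rewards)
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]          # rewards[k] stands for r_{k+1}
        discount *= gamma
    if t + n < T:                           # bootstrap only if the episode continues
        G += discount * V[states[t + n]]
    return G
```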
TD(λ) and Eligibility Traces
- λ-return (geometric average of all n-step returns): \(G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)}.\)
- Update (forward view): \(V(s_t) \leftarrow V(s_t) + \alpha\,\big(G_t^{\lambda} - V(s_t)\big).\)
- Backward view (efficient implementation):
- TD error: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$; eligibility trace: \(e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf 1\{s_t=s\}.\)
- Parameter update: \(V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s),\quad \forall s.\)
- Properties: $\lambda=0$ reduces to TD(0); $\lambda=1$ approximates MC; intermediate $\lambda$ often improves convergence.
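A backward-view TD(λ) sketch with accumulating traces (same assumed Gym-style interface as in the TD(0) sketch):

```python
from collections import defaultdict

def td_lambda(env, pi, gamma: float, lam: float, alpha: float = 0.05,
              num_episodes: int = 1000):
    """Backward-view TD(lambda) with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        e = defaultdict(float)               # traces reset at episode start
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(pi(s))            # assumed interface
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e[s] += 1.0                      # accumulate trace for current state
            for state in list(e):            # update all traced states, decay traces
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam
            s = s_next
    return V
```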
SARSA (On-policy TD Control for Q-estimation)
- Goal: estimate $Q^{\pi}$ from interaction samples, for use in policy evaluation/improvement.
- Update: \(Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\, \Big[r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t,a_t)\Big].\)
- Pseudo-code:
1) Observe $(s_t,a_t,r_t,s_{t+1},a_{t+1})$ sampled by the current policy (on-policy).
2) Update $Q$ by the above rule.
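A SARSA sketch; the ε-greedy behavior policy and the Gym-style environment interface are assumptions:

```python
import random
from collections import defaultdict

def sarsa(env, n_actions: int, gamma: float, alpha: float = 0.1,
          eps: float = 0.1, num_episodes: int = 1000):
    """On-policy TD control: update Q(s,a) toward r + gamma * Q(s',a')."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:                        # explore
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])  # exploit

    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, r, done = env.step(a)                # assumed interface
            a_next = eps_greedy(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```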
Summary: DP vs MC vs TD (Key Points)
- Model requirement: DP requires a model; MC/TD do not.
- Bootstrap: DP/TD yes; MC no.
- Update timing: DP/TD can update immediately; MC requires complete episodes.
- Use of Markov property: DP/TD yes; MC no.
- Bias/variance: MC is unbiased but high variance; TD is biased but low variance.