Goal: structured Markdown notes for the two slide decks (Lecture 4: PG-I and Lecture 5: PG-II) that comprehensively summarize the content, explain the math in depth, and clearly show how the ideas connect.
MDP & Value Functions
→ define $V^\pi, Q^\pi$, advantage $A^\pi = Q^\pi - V^\pi$; discounted state visitation $\lambda_\mu^\pi$ (a.k.a. discounted occupancy measure)
→ Performance Difference Lemma (PDL): compare two policies’ returns via advantages
→ Policy Gradient Theorem: multiple equivalent forms (REINFORCE/Q-form, advantage with baselines, state-visitation form)
→ Mirror Descent view: KL-regularized policy mirror ascent (PMA), exponential-weights update
→ Natural Policy Gradient (NPG): Fisher information, KL quadratic approx, trust region equivalence; least-squares solution
→ Convergence & complexity: vanilla PG vs NPG rates; exploration hardness constant $\kappa$, statistical error $\epsilon_{\text{stat}}$
→ Trust-region & proximal methods: TRPO, PPO (clipped), OPPO (optimistic PPO for finite horizon)
Value functions: \(V^\pi(s)=\mathbb{E}\!\left[\sum_{t\ge0}\gamma^t r_t \,\middle|\, s_0=s,\pi\right],\quad Q^\pi(s,a)=\mathbb{E}\!\left[\sum_{t\ge0}\gamma^t r_t \,\middle|\, s_0=s,a_0=a,\pi\right],\quad A^\pi=Q^\pi-V^\pi.\)
Discounted state visitation (from start distribution $\mu$): \(\lambda_\mu^\pi(s)=(1-\gamma)\,\mathbb{E}\!\left[\sum_{t\ge0}\gamma^t \mathbf{1}\{s_t=s\}\,\middle|\, s_0\sim\mu,\,\pi\right].\)
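To ground these definitions, here is a minimal numpy sketch (not from the slides) that computes $V^\pi$, $Q^\pi$, $A^\pi$, and $\lambda_\mu^\pi$ for a small tabular MDP by solving the associated linear systems; the arrays `P`, `R`, `pi`, `mu` are illustrative placeholders.

```python
import numpy as np

def policy_eval(P, R, pi, gamma):
    """Tabular policy evaluation: V^pi = (I - gamma * P_pi)^{-1} r_pi.

    P:  (S, A, S) transition probabilities, R: (S, A) expected rewards,
    pi: (S, A) policy with rows summing to 1.
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, R)          # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum("sat,t->sa", P, V)
    return V, Q, Q - V[:, None]                  # V^pi, Q^pi, A^pi = Q^pi - V^pi

def discounted_visitation(P, pi, mu, gamma):
    """lambda_mu^pi = (1 - gamma) * mu^T (I - gamma * P_pi)^{-1}, as a vector over states."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)

# Tiny random MDP just to exercise the formulas (illustrative only).
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))       # shape (S, A, S)
R = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)           # shape (S, A)
mu = np.ones(S) / S

V, Q, Adv = policy_eval(P, R, pi, gamma)
lam = discounted_visitation(P, pi, mu, gamma)
assert np.isclose(lam.sum(), 1.0)                # the occupancy measure sums to 1
```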
Statement. For a differentiable stochastic policy $\pi_\theta$, \(\nabla_\theta J(\pi_\theta) =\mathbb{E}_{\tau\sim p_\theta}\!\left[\sum_{t=0}^{\infty}\gamma^t\, Q^{\pi_\theta}(s_t,a_t)\,\nabla_\theta \log \pi_\theta(a_t\mid s_t)\right].\)
Sketch of derivation (key ideas). Write $J(\theta)=\mathbb{E}_{\tau\sim p_\theta}[R(\tau)]$ and apply the likelihood-ratio (score-function) identity $\nabla_\theta p_\theta(\tau)=p_\theta(\tau)\,\nabla_\theta\log p_\theta(\tau)$; the transition terms in $\log p_\theta(\tau)$ do not depend on $\theta$, so only the $\nabla_\theta\log\pi_\theta(a_t\mid s_t)$ terms survive. Rewards earned before time $t$ are independent of $a_t$, so the return multiplying each score can be replaced by the reward-to-go, whose conditional expectation is $Q^{\pi_\theta}(s_t,a_t)$.
Meaning. Weight the score function $\nabla\log\pi$ by the long-term payoff $Q^\pi$ at the visited state-action.
Replace $Q^\pi$ by the advantage $A^\pi$ without bias: \(\nabla_\theta J(\pi_\theta) =\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim \lambda_\mu^{\pi_\theta},\,a\sim\pi_\theta} \!\left[A^{\pi_\theta}(s,a)\,\nabla_\theta\log\pi_\theta(a\mid s)\right].\)
Baselines. For any function $b(s)$, \(\mathbb{E}_{a\sim\pi(\cdot|s)}\!\big[\nabla_\theta\log\pi_\theta(a|s)\,b(s)\big]=0.\) Thus replacing $Q^\pi$ by $A^\pi=Q^\pi-b$ reduces variance while keeping the gradient unbiased.
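A minimal Monte Carlo sketch of the resulting estimator with a state baseline, assuming a tabular softmax policy with logits `theta[s, a]` and a hypothetical environment exposing `reset()`/`step(a)` returning `(next_state, reward, done)`; reward-to-go serves as an unbiased stand-in for $Q^\pi$.

```python
import numpy as np

def softmax_policy(theta, s):
    z = theta[s] - theta[s].max()                 # stabilized softmax over actions
    p = np.exp(z)
    return p / p.sum()

def reinforce_gradient(env, theta, gamma, num_episodes=100, baseline=None):
    """Estimate grad J ≈ E[ sum_t gamma^t (G_t - b(s_t)) * grad log pi(a_t|s_t) ]."""
    grad = np.zeros_like(theta)
    for _ in range(num_episodes):
        s, traj, done = env.reset(), [], False    # hypothetical env interface
        while not done:
            p = softmax_policy(theta, s)
            a = np.random.choice(len(p), p=p)
            s_next, r, done = env.step(a)
            traj.append((s, a, r))
            s = s_next
        # reward-to-go G_t: unbiased stand-in for Q^pi(s_t, a_t)
        G, returns = 0.0, []
        for (_, _, r) in reversed(traj):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for t, ((s, a, _), G_t) in enumerate(zip(traj, returns)):
            b = baseline(s) if baseline is not None else 0.0
            score = -softmax_policy(theta, s)     # d log pi(a|s) / d theta[s, :]
            score[a] += 1.0                       # = e_a - pi(.|s)
            grad[s] += (gamma ** t) * (G_t - b) * score
    return grad / num_episodes
```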
For tabular direct parameterization, \(\frac{\partial J(\pi)}{\partial \pi(a|s)}=\frac{1}{1-\gamma}\,\lambda_\mu^\pi(s)\,Q^\pi(s,a).\) For softmax parameterization $\pi_\theta(a|s)\propto \exp(\theta_{s,a})$, \(\frac{\partial J(\theta)}{\partial \theta_{s,a}} =\frac{1}{1-\gamma}\,\lambda_\mu^{\pi_\theta}(s)\,\pi_\theta(a|s)\,A^{\pi_\theta}(s,a).\)
Meaning. Improvement comes from (i) visiting helpful states and (ii) increasing probability of high-advantage actions there.
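In the tabular-softmax case the exact gradient is a single array expression; a short sketch assuming `lam`, `pi`, and `adv` hold $\lambda_\mu^\pi$, $\pi$, and $A^\pi$ as arrays (e.g., produced by the first sketch above).

```python
import numpy as np

def softmax_policy_gradient(lam, pi, adv, gamma):
    """dJ/dtheta[s, a] = lam(s) * pi(a|s) * A^pi(s, a) / (1 - gamma)."""
    return lam[:, None] * pi * adv / (1 - gamma)
```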
For any two policies $\pi,\pi'$, \(J(\pi)-J(\pi') =\frac{1}{1-\gamma}\, \mathbb{E}_{s\sim \lambda_\mu^\pi,\,a\sim\pi(\cdot|s)}\!\left[A^{\pi'}(s,a)\right].\)
Interpretation. Relative performance is the new policy’s advantage under the old critic, reweighted by the new occupancy.
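A quick numerical check of the PDL on the toy MDP above (this deliberately reuses `policy_eval`, `discounted_visitation`, and the arrays defined in the first sketch, so it is not standalone).

```python
pi2 = rng.dirichlet(np.ones(A), size=S)            # a second, arbitrary policy pi'

V1, _, _ = policy_eval(P, R, pi, gamma)
V2, _, Adv2 = policy_eval(P, R, pi2, gamma)
J1, J2 = mu @ V1, mu @ V2                          # J(pi) = E_{s ~ mu}[V^pi(s)]

lam1 = discounted_visitation(P, pi, mu, gamma)
rhs = (lam1[:, None] * pi * Adv2).sum() / (1 - gamma)
assert np.isclose(J1 - J2, rhs)                    # PDL holds up to numerical error
```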
Mirror Descent with a strongly convex potential $\omega$ and Bregman $D_\omega$: \(x_{t+1}=\arg\min_{x\in\mathcal{X}} \left\{\langle\nabla f(x_t),x-x_t\rangle+\frac{1}{\eta_t}D_\omega(x,x_t)\right\}.\)
On the policy simplex with negative entropy (so $D_\omega$ is KL), we get Policy Mirror Ascent: \(\pi_{t+1}=\arg\max_{\pi}\Big\{ \langle\nabla J(\pi_t),\pi\rangle -\frac{1}{\eta}\sum_s \lambda_{\mu}^{\pi_t}(s)\,\mathrm{KL}\big(\pi(\cdot|s)\,\|\,\pi_t(\cdot|s)\big)\Big\}.\)
This yields a per-state exponential-weights update: \(\pi_{t+1}(a|s)\;\propto\;\pi_t(a|s)\, \exp\!\left(\frac{\eta}{1-\gamma}\,Q_t(s,a)\right).\)
Takeaway. Mirror ascent with a KL proximity term naturally leads to logit/softmax-style updates and connects to NPG.
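A minimal per-state exponential-weights step, assuming the tabular arrays `pi` (current policy, strictly positive) and `Q` (its action-values) are given; names are illustrative.

```python
import numpy as np

def pma_step(pi, Q, eta, gamma):
    """pi_{t+1}(a|s) ∝ pi_t(a|s) * exp(eta * Q_t(s, a) / (1 - gamma)), per state."""
    logits = np.log(pi) + (eta / (1 - gamma)) * Q
    logits -= logits.max(axis=1, keepdims=True)    # stabilize before exponentiating
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```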
Update. \(\theta_{t+1}=\theta_t+\eta\,\nabla_\theta J(\pi_{\theta_t}).\)
Gradient component. \(\frac{\partial J}{\partial \theta_{s,a}}=\frac{1}{1-\gamma}\,\lambda^\pi_\mu(s)\,\pi(a|s)\,A^\pi(s,a).\)
Convergence (clean setting). With exact gradients and a small enough step size, vanilla PG attains an $O(1/T)$ suboptimality bound in tabular softmax, but with large constants and sensitivity to exploration and logit saturation.
For a distribution family $p_\theta$, the Fisher information at $\theta_0$ is \(F_{\theta_0}=\mathbb{E}_{z\sim p_{\theta_0}}\!\left[\nabla_\theta\log p_{\theta_0}(z)\,\nabla_\theta\log p_{\theta_0}(z)^\top\right] =\nabla_\theta^2\,\mathrm{KL}\!\left(p_{\theta_0}\,\|\,p_\theta\right)\Big|_{\theta=\theta_0}.\)
Second-order expansion near $\theta_0$: \(\mathrm{KL}(p_{\theta_0}\,\|\,p_\theta)\approx \tfrac12(\theta-\theta_0)^\top F_{\theta_0}(\theta-\theta_0).\)
Meaning. Fisher defines the local geometry; equal-KL moves correspond to equal “statistical distance.”
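A small numerical sanity check of the quadratic approximation for a softmax (categorical) family, whose Fisher matrix has the closed form $\operatorname{diag}(p)-pp^\top$; all names are illustrative.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

rng = np.random.default_rng(1)
theta0 = rng.normal(size=5)
p0 = softmax(theta0)

F = np.diag(p0) - np.outer(p0, p0)                 # Fisher of the softmax family
d = 1e-2 * rng.normal(size=5)                      # small perturbation of the logits

p1 = softmax(theta0 + d)
kl = np.sum(p0 * (np.log(p0) - np.log(p1)))        # KL(p_theta0 || p_theta0+d)
quad = 0.5 * d @ F @ d
print(kl, quad)                                    # agree up to O(||d||^3)
```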
Natural gradient update for minimizing $f$: \(\theta_{t+1}=\theta_t-\eta\,F_{\theta_t}^{\dagger}\nabla_\theta f(\theta_t).\)
Equivalent characterizations: (i) steepest descent when step length is measured by the KL divergence (equivalently, by the local norm $\|\Delta\|_{F_{\theta_t}}$) rather than the Euclidean norm; (ii) the direction solving the trust-region problem $\min_\Delta \langle\nabla f(\theta_t),\Delta\rangle$ s.t. $\tfrac12\Delta^\top F_{\theta_t}\Delta\le\delta$, which is proportional to $-F_{\theta_t}^{\dagger}\nabla f(\theta_t)$; (iii) a gradient update that is invariant to smooth reparameterizations of $\theta$.
Takeaway. Natural gradient is steepest descent under the KL metric.
In RL with policy $\pi_\theta$, \(\theta_{t+1}=\theta_t+\eta\,F_{\theta_t}^{\dagger}\,\nabla_\theta J(\pi_{\theta_t}),\) with \(F_\theta=\mathbb{E}_{s\sim \lambda_\mu^{\pi_\theta},\,a\sim\pi_\theta} \!\left[\nabla_\theta\log\pi_\theta(a|s)\nabla_\theta\log\pi_\theta(a|s)^\top\right], \quad \nabla_\theta J(\pi_\theta)=\frac{1}{1-\gamma}\, \mathbb{E}\!\left[A^{\pi_\theta}(s,a)\,\nabla_\theta\log\pi_\theta(a|s)\right].\)
Interpretation. Scale the vanilla gradient by the inverse Fisher to respect policy geometry and control step size in KL.
Define $x(s,a)=\nabla_\theta\log\pi_\theta(a|s)$ and solve \(w^\star=\arg\min_{w}\; \mathbb{E}_{s\sim\lambda_\mu^{\pi_\theta},\,a\sim\pi_\theta} \big[\big(w^\top x(s,a)-A^{\pi_\theta}(s,a)\big)^2\big].\) Then \((1-\gamma)\,F_\theta^{\dagger}\nabla_\theta J(\pi_\theta)=w^\star(\theta), \qquad \theta_{t+1}=\theta_t+\frac{\eta}{1-\gamma}\,w^\star(\theta_t).\)
Meaning. NPG direction = least-squares fit of advantages by policy scores; compute via regression without explicit matrix inversion.
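A sketch of this regression view for a tabular softmax policy: build score features $x(s,a)$, least-squares fit the sampled advantages, and step along the fitted weights; the `(s, a, adv)` sample format is an assumption for illustration.

```python
import numpy as np

def npg_direction(samples, pi):
    """samples: iterable of (s, a, advantage); pi: (S, A) softmax policy.

    Returns w minimizing E[(w^T x(s,a) - A(s,a))^2], so the NPG update is
    theta <- theta + eta / (1 - gamma) * w.
    """
    S, A = pi.shape
    X, y = [], []
    for s, a, adv in samples:
        x = np.zeros(S * A)
        # score of tabular softmax: d/dtheta[s, a'] log pi(a|s) = 1{a'=a} - pi(a'|s)
        x[s * A:(s + 1) * A] = -pi[s]
        x[s * A + a] += 1.0
        X.append(x)
        y.append(adv)
    w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)  # pseudoinverse solve
    return w
```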
Under tabular softmax logits, \(\pi_{t+1}(a|s)\;\propto\;\pi_t(a|s)\, \exp\!\left(\frac{\eta}{1-\gamma}\,A^{\pi_t}(s,a)\right),\) which matches the policy mirror ascent exponential-weights update.
Takeaway. NPG and KL-regularized PMA are equivalent in softmax/tabular settings.
| Algorithm | Typical (tabular) rate | Comments |
|---|---|---|
| Vanilla PG (softmax) | $O(1/T)$, with large constants depending on $\lvert S\rvert$, $(1-\gamma)^{-1}$, and the inverse minimal optimal-action mass $c^{-1}$ | Sensitive to exploration and logit saturation |
| Tabular NPG | $O\!\big(1/((1-\gamma)^2 T)\big)$ with better constants | Geometry-aware steps; monotonic improvement under trust-region constraints |
| Sample-based NPG | $\tilde O(1/\sqrt{T})+\sqrt{\kappa}\,\epsilon_{\text{stat}}$ | Regression view; quality depends on advantage estimation |
| OPPO | $\tilde O\!\big(\lvert S\rvert\lvert A\rvert\big/\sqrt{(1-\gamma)^3 T}\big)$ (finite-horizon analogs) | Uses optimism (bonus) + exponential-weights updates |

Symbols: $c=\min_{s,t}\pi_t(a^\star(s)\mid s)$ is the smallest probability the iterates place on the optimal action, so constants scale with $c^{-1}$; $\kappa$ is an exploration hardness constant; $\epsilon_{\text{stat}}$ is the statistical error of the estimated advantages.
Idea. Improve a surrogate objective while constraining the mean KL from the old policy: \(\max_\theta\; \mathbb{E}_{s\sim\lambda^{\pi_{\theta_t}}_\mu,\,a\sim\pi_{\theta_t}} \!\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}\,A^{\pi_{\theta_t}}(s,a)\right] \quad\text{s.t.}\quad \mathbb{E}_{s\sim\lambda^{\pi_{\theta_t}}_\mu}\!\left[\mathrm{KL}\!\big(\pi_\theta(\cdot|s)\,\|\,\pi_{\theta_t}(\cdot|s)\big)\right]\le\delta.\)
Implementation sketch. Estimate the policy gradient $g$ from rollouts; approximately solve $F_{\theta_t}x=g$ with conjugate gradients using only Fisher-vector products (no explicit Fisher matrix); scale the resulting direction so the quadratic KL model equals $\delta$; backtrack with a line search until the surrogate improves and the sampled KL constraint is satisfied (a rough code sketch follows below).
Relation. TRPO is a trust-region realization of NPG; both embody the KL geometry.
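A rough code sketch of such a step, under the stated assumptions: conjugate gradients with Fisher-vector products approximate $F^{\dagger}g$, the step is scaled to the KL radius, and a simple backtracking line search accepts on surrogate improvement; `fvp`, `surrogate`, and `delta` are hypothetical callables/parameters.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only a Fisher-vector product fvp(v) ≈ F v."""
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp + 1e-12)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta, g, fvp, delta, surrogate, max_backtracks=10):
    """g: estimated policy gradient; delta: KL radius; surrogate(theta) evaluates
    the importance-weighted advantage objective (all illustrative placeholders)."""
    x = conjugate_gradient(fvp, g)
    full_step = np.sqrt(2 * delta / (x @ fvp(x) + 1e-12)) * x   # lands on the KL boundary
    old = surrogate(theta)
    for k in range(max_backtracks):
        candidate = theta + (0.5 ** k) * full_step
        if surrogate(candidate) > old:   # a full version would also check the sampled KL
            return candidate
    return theta                         # reject the step if nothing improves
```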
Clipped surrogate avoids solving a constrained problem each iteration: \(\max_\theta\; \mathbb{E}\,\min\!\Big\{ \rho_t(\theta)\,A_t,\; \mathrm{clip}\!\big(\rho_t(\theta),1-\epsilon,1+\epsilon\big)\,A_t \Big\}, \quad \rho_t(\theta)=\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}.\)
Intuition. The clip forms an implicit trust region on the likelihood ratio; easy to optimize with SGD; strong empirical performance.
Tuning. $\epsilon\in[0.1,0.3]$ common; combine with value loss and entropy bonus.
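The clipped objective itself is only a few lines; a minimal numpy sketch, assuming per-sample log-probabilities under the new and old policies plus advantage estimates from rollouts.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, adv, eps=0.2):
    """Per-batch PPO surrogate (to be maximized)."""
    ratio = np.exp(log_prob_new - log_prob_old)              # rho_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))   # pessimistic of the two
```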
Setup. Non-stationary policies $\{\pi_h\}_{h=1}^H$ and values $V_h,Q_h$ per stage.
Key idea. Add optimistic bonuses (e.g., count-based) to build optimistic $Q_h$ estimates, then do exponential-weights/NPG-style policy updates: \(\pi_{t+1,h}(a|s)\;\propto\;\pi_{t,h}(a|s)\,\exp\!\big(\eta\,Q_{t,h}^{\text{optimistic}}(s,a)\big).\) Benefit. Encourages exploration with finite-horizon structure and yields improved theoretical sample complexity.
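A tabular, finite-horizon sketch of this update for a single stage $h$, assuming visit counts `N_h`, an estimated `Q_hat_h`, and a bonus scale `beta` are available (all hypothetical names).

```python
import numpy as np

def oppo_policy_update(pi_h, Q_hat_h, N_h, eta, beta):
    """pi_{t+1,h}(a|s) ∝ pi_{t,h}(a|s) * exp(eta * (Q_hat_h + bonus)), per state."""
    bonus = beta / np.sqrt(np.maximum(N_h, 1))     # count-based optimism bonus
    logits = np.log(pi_h) + eta * (Q_hat_h + bonus)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stabilization
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```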
PG Theorem
→ switch to advantage for variance reduction
→ view updates as Mirror Descent with KL (policy mirror ascent)
→ understand the geometry via Fisher/KL → Natural Gradient
→ implement as NPG (steepest ascent under KL)
⇄ realize as TRPO (explicit trust region)
→ simplify to PPO (clipped surrogate)
→ extend to finite-horizon OPPO (optimism + exponential weights)
PDL ties surrogate improvement to true return improvement, justifying trust-region constraints and step-size selection.
For any $b(s)$, \(\mathbb{E}_{a\sim\pi(\cdot|s)}\!\left[\nabla_\theta\log\pi_\theta(a|s)\,b(s)\right] =b(s)\,\nabla_\theta \sum_a \pi_\theta(a|s)=b(s)\,\nabla_\theta 1=0.\) Hence replacing $Q$ by $A=Q-b(s)$ keeps $\nabla_\theta J$ unchanged (unbiasedness).
Using Bellman equations and telescoping sums, \(J(\pi)-J(\pi') =\frac{1}{1-\gamma}\, \mathbb{E}_{s\sim \lambda_\mu^\pi,\,a\sim \pi}\!\left[ Q^{\pi'}(s,a)-V^{\pi'}(s)\right] =\frac{1}{1-\gamma}\,\mathbb{E}_{\lambda_\mu^\pi,\pi}\!\left[A^{\pi'}(s,a)\right].\) This motivates evaluating a new policy with advantages under a reference policy.
Let $x(s,a)=\nabla_\theta\log\pi_\theta(a|s)$. The regression normal equations give \(\mathbb{E}[x\,x^\top]\,w=\mathbb{E}[x\,A] \;\Rightarrow\; w=F_\theta^{-1}\,\mathbb{E}[x\,A]=(1-\gamma)\,F_\theta^{-1}\nabla_\theta J.\) Thus $w$ solves the least-squares fit of $A$ by score features and equals the NPG direction up to $(1-\gamma)$.
PG (softmax): $\partial J/\partial \theta_{s,a}=\tfrac{1}{1-\gamma}\,\lambda^\pi_\mu(s)\,\pi(a\mid s)\,A^\pi(s,a)$.