The slides: https://cs224r.stanford.edu/slides/03_cs224r_policy_gradients_2025.pdf
The objective of RL is to find the best $\theta$ to maximize the expected total reward:
$J(\theta) = E_{\tau \sim p_{\theta} (\tau)}[\sum_t r(s_t, a_t)]$.
A simple Monte Carlo approximation of $J(\theta)$ is $\frac{1}{N} \sum_i \sum_t r(s_{i, t}, a_{i, t})$, where $N$ trajectories are sampled by running the policy.
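The sample average above can be sketched in a few lines. Everything here is a hypothetical stand-in: a one-state toy environment where the policy is Bernoulli with parameter $\sigma(\theta)$ and the reward is 1 whenever action 1 is taken.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, T=10):
    """Sample one trajectory of length T from a toy 1-state MDP (hypothetical
    stand-in for a real environment); action 1 yields reward 1, action 0 yields 0."""
    rewards = []
    for _ in range(T):
        p = 1.0 / (1.0 + np.exp(-theta))   # pi_theta(a=1 | s), a sigmoid policy
        a = rng.random() < p
        rewards.append(1.0 if a else 0.0)
    return rewards

def estimate_J(theta, N=1000):
    """Monte Carlo estimate: J(theta) ~ (1/N) sum_i sum_t r(s_{i,t}, a_{i,t})."""
    return float(np.mean([sum(rollout(theta)) for _ in range(N)]))
```

With $\theta = 5$ the policy almost always picks action 1, so the estimate should be close to the horizon $T = 10$; with $\theta = -5$ it should be near 0.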
Now the goal is to compute the gradient of $J(\theta)$.
Let $r(\tau) = \sum_t r(s_t, a_t)$, so $J(\theta) = E_{\tau \sim p_{\theta} (\tau)}[r(\tau)] = \int p_{\theta} (\tau)\, r(\tau)\, d\tau$.
The gradient is: $\nabla J(\theta) = \int \nabla p_{\theta} (\tau) r(\tau) d\tau$.
There is a trick here (the log-derivative identity): $p_\theta(\tau) \nabla \log p_\theta(\tau) = p_\theta(\tau) \frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} = \nabla p_\theta(\tau)$. Substituting this back:
$\nabla J(\theta) = \int p_\theta(\tau) \nabla \log p_\theta(\tau)\, r(\tau)\, d\tau = E_{\tau \sim p_\theta(\tau)}[\nabla \log p_\theta(\tau)\, r(\tau)]$
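The log-derivative trick can be checked numerically on a simple distribution. Below is a sketch with a 1-D Gaussian $x \sim N(\theta, 1)$ and $r(x) = x$ (my choice of example, not from the slides): the score is $\nabla_\theta \log p(x) = x - \theta$, and since $E[x] = \theta$, the true gradient is exactly 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Score-function estimator on x ~ N(theta, 1) with r(x) = x:
#   grad_theta E[r(x)] = E[ grad_theta log p(x) * r(x) ] = E[(x - theta) * x] = 1
theta = 2.0
x = rng.normal(theta, 1.0, size=200_000)
grad_est = float(np.mean((x - theta) * x))   # should be close to 1.0
```

No gradient of the density itself is needed, only samples and the score, which is exactly why the trick makes $\nabla J(\theta)$ estimable from rollouts.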
We can now expand $p_\theta(\tau)$ in terms of the policy and the dynamics:
$p_\theta (\tau) = p(s_1)\prod_t \pi_{\theta}(a_t | s_t)\, p(s_{t+1}|s_t, a_t)$.
$\log p_{\theta}(\tau) = \log p(s_1) + \sum_t \left[\log \pi_{\theta}(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)\right]$
Because the initial-state distribution and the dynamics $p$ do not depend on $\theta$, those terms vanish under the $\nabla$ operator:
$\nabla J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\left(\sum_t \nabla \log \pi_{\theta}(a_t|s_t)\right)\left(\sum_t r(s_t, a_t)\right)\right]$.
This expectation can again be approximated with sampled trajectories: $\nabla J(\theta) \approx \frac{1}{N}\sum_i \left(\sum_t \nabla \log \pi_{\theta}(a_{i,t}|s_{i,t})\right)\left(\sum_t r(s_{i,t}, a_{i,t})\right)$.
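Putting the pieces together gives the basic REINFORCE estimator. Here is a minimal sketch on a hypothetical 2-armed bandit (arm 1 pays reward 1, arm 0 pays 0) with a softmax policy; the bandit, step size, and batch size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(2)                 # policy logits for the two arms
for _ in range(200):                # gradient-ascent iterations
    probs = np.exp(theta) / np.exp(theta).sum()
    grad = np.zeros(2)
    N = 64                          # trajectories per gradient estimate
    for _ in range(N):
        a = rng.choice(2, p=probs)
        r = 1.0 if a == 1 else 0.0  # arm 1 is the rewarding arm
        # grad log softmax at action a: one-hot(a) - probs
        g = -probs.copy()
        g[a] += 1.0
        grad += g * r               # score * reward, summed over samples
    theta += 0.1 * grad / N         # ascend the estimated gradient
probs = np.exp(theta) / np.exp(theta).sum()
```

After training, the policy should concentrate almost all probability on the rewarding arm, i.e. `probs[1]` close to 1.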