Optimization & gradients

evergreen#foundations#optimization

Up: 10_Foundations-MOC

Bar: take one SGD step by hand, and derive the log-derivative (score-function) trick that underlies policy gradients — the idea, not the framework.

Gradient descent

Minimize a loss L(θ)L(\theta) by stepping downhill:

θθηθL(θ)\theta \leftarrow \theta - \eta\,\nabla_\theta L(\theta)
η\eta = learning rate. For convex LL this finds the global minimum; for the non-convex losses of real models it finds a good basin. Saddle points (gradient zero, not a minimum) are the typical obstacle in high dimensions, not local minima.

Worked: one SGD step (by hand)

Linear model y^=wx\hat y = w x, squared loss L=12(wxy)2L=\tfrac12(wx-y)^2. Gradient Lw=(wxy)x\dfrac{\partial L}{\partial w}=(wx-y)x. Take w=0, x=2, y=1, η=0.1w=0,\ x=2,\ y=1,\ \eta=0.1: prediction 00, error (021)=1(0\cdot2-1)=-1, gradient =12=2=-1\cdot2=-2.

w00.1(2)=0.2.w \leftarrow 0 - 0.1(-2) = 0.2.
One nudge toward fitting the point. SGD just computes this gradient on a small random minibatch — a noisy but unbiased estimate of the full gradient. The noise is a feature (escapes saddles, regularizes); the cost is variance, tamed by learning-rate schedules and momentum (an EMA of past gradients). Adam etc. are momentum + per-parameter step sizes — useful, but they're conveniences over this one idea.

The durable idea behind policy gradients

Sometimes you must optimize an expectation of something you can't differentiate through (a reward from a sampled action, a non-differentiable metric):

J(θ)=Expθ[f(x)].J(\theta)=\mathbb{E}_{x\sim p_\theta}[f(x)].
You can't push \nabla inside the sampling — unless you use the log-derivative / score-function trick:
θJ=f(x)θpθ(x)dx=f(x)pθ(x)θlogpθ(x)dx=Expθ ⁣[f(x)θlogpθ(x)].\nabla_\theta J = \int f(x)\,\nabla_\theta p_\theta(x)\,dx = \int f(x)\,p_\theta(x)\,\nabla_\theta\log p_\theta(x)\,dx = \mathbb{E}_{x\sim p_\theta}\!\big[f(x)\,\nabla_\theta\log p_\theta(x)\big].
The single step that makes it work: p=plogp\nabla p = p\,\nabla\log p. Now the gradient is itself an expectation you can estimate by sampling — sample xx, weight logpθ(x)\nabla\log p_\theta(x) by the reward f(x)f(x). That estimator is REINFORCE; subtracting a baseline bb from ff reduces its variance without bias. Every fancy policy- gradient method (advantage estimates, PPO clipping, …) is variance reduction on top of this. Learn this; the frameworks are disposable (Beware coding-agent dragons).

Why this matters here

By-hand exercise (meets the bar)

  1. Do one SGD step for logistic regression: p^=σ(wx)\hat p=\sigma(wx), log-loss gradient =(p^y)x=(\hat p-y)x. Use w=0,x=1,y=1,η=0.5w=0,x=1,y=1,\eta=0.5. (Answer: p^=0.5\hat p=0.5, grad =0.5=-0.5, w0.25w\leftarrow0.25.)
  2. Derive θJ\nabla_\theta J above yourself, justifying each equality, then explain why a constant baseline leaves it unbiased (E[logpθ]=0\mathbb{E}[\nabla\log p_\theta]=0).