Optimization & gradients

evergreen#foundations#optimization

Bar: take one SGD step by hand, and derive the log-derivative (score-function) trick that underlies policy gradients — the idea, not the framework.

Gradient descent

Minimize a loss $L(\theta)$ by stepping downhill:

$$\theta \leftarrow \theta - \eta\,\nabla_\theta L(\theta)$$

where $\eta$ is the learning rate. For convex $L$ (and a small enough $\eta$) this converges to the global minimum; for the non-convex losses of real models it typically settles into a good basin, with no such guarantee. In high dimensions the typical obstacle is saddle points (gradient zero, but not a minimum), not local minima.
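To make the update rule concrete, here is a minimal sketch on a toy convex loss. The loss $L(\theta)=\tfrac12(\theta-3)^2$, the starting point, and the two learning rates are all assumptions picked for illustration; they are not from any particular model.

```python
def grad(theta):
    """Gradient of the toy convex loss L(theta) = 0.5 * (theta - 3)**2."""
    return theta - 3.0


def gradient_descent(theta0, lr, steps):
    """Repeat theta <- theta - lr * grad(theta); return every iterate."""
    theta = theta0
    history = [theta]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        history.append(theta)
    return history


# A small step size: the distance to the minimum at 3 shrinks by a factor
# of (1 - lr) = 0.9 on every step.
history = gradient_descent(theta0=0.0, lr=0.1, steps=100)
final_theta = history[-1]

# A step size that is too large: the factor is (1 - 2.5) = -1.5, so the
# iterates overshoot and move further away on each step.
diverged = gradient_descent(theta0=0.0, lr=2.5, steps=20)

print("small lr, final theta:", final_theta)
print("large lr, final theta:", diverged[-1])
```

The second run is the reason the convexity guarantee needs a qualifier about the step size: even on a perfectly convex bowl, too large an $\eta$ diverges.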

Worked: one SGD step (by hand)

Linear model $\hat y = w x$, squared loss $L=\tfrac12(wx-y)^2$. Gradient $\dfrac{\partial L}{\partial w}=(wx-y)x$. Take $w=0,\ x=2,\ y=1,\ \eta=0.1$: prediction $0$, error $(0\cdot2-1)=-1$, gradient $=-1\cdot2=-2$.

$$w \leftarrow 0 - 0.1\,(-2) = 0.2.$$

One nudge toward fitting the point. SGD computes this same gradient on a small random minibatch, which gives a noisy but unbiased estimate of the full gradient. The noise is partly a feature (it helps escape saddles and acts as a regularizer); the cost is variance, tamed by learning-rate schedules and momentum (an exponential moving average of past gradients). Adam and its relatives are momentum plus per-parameter step sizes: useful, but conveniences on top of this one idea.
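The hand calculation and the "noisy but unbiased" claim can both be checked in a few lines. This is a sketch under stated assumptions: the six-point dataset generated by $y = 0.5x$, the batch size of 2, and the step count are hypothetical choices, not part of the worked example above.

```python
import itertools
import random


def grad_w(w, x, y):
    """dL/dw for the squared loss L = 0.5 * (w*x - y)**2."""
    return (w * x - y) * x


# The hand-worked step: w = 0, x = 2, y = 1, eta = 0.1.
w_after = 0.0 - 0.1 * grad_w(0.0, x=2.0, y=1.0)

# Hypothetical noiseless dataset generated by y = 0.5 * x.
data = [(x, 0.5 * x) for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]


def batch_grad(w, batch):
    """Average per-example gradient over a batch."""
    return sum(grad_w(w, x, y) for x, y in batch) / len(batch)


# Unbiasedness: averaging the minibatch gradient over every possible
# size-2 minibatch recovers the full-data gradient.
full_grad = batch_grad(0.0, data)
all_batches = list(itertools.combinations(data, 2))
mean_minibatch_grad = sum(batch_grad(0.0, b) for b in all_batches) / len(all_batches)

# Plain minibatch SGD: each step uses one random size-2 batch.
rng = random.Random(0)
w_sgd = 0.0
for _ in range(200):
    batch = rng.sample(data, 2)
    w_sgd -= 0.1 * batch_grad(w_sgd, batch)

print("w after the hand-worked step:", w_after)
print("full gradient vs mean minibatch gradient:", full_grad, mean_minibatch_grad)
print("w after 200 SGD steps:", w_sgd)
```

Individual minibatch gradients differ from one another (that is the variance), but their average matches the full gradient, and the SGD iterate still ends up near the generating slope of 0.5.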

The durable idea behind policy gradients

Sometimes you must optimize an expectation of something you can't differentiate through (a reward from a sampled action, a non-differentiable metric):

$$J(\theta)=\mathbb{E}_{x\sim p_\theta}[f(x)].$$

You can't naively push $\nabla_\theta$ inside the expectation, because the sampling distribution itself depends on $\theta$. The log-derivative (score-function) trick gets around this:

$$\nabla_\theta J = \int f(x)\,\nabla_\theta p_\theta(x)\,dx = \int f(x)\,p_\theta(x)\,\nabla_\theta\log p_\theta(x)\,dx = \mathbb{E}_{x\sim p_\theta}\!\big[f(x)\,\nabla_\theta\log p_\theta(x)\big].$$

The single step that makes it work is the identity $\nabla p = p\,\nabla\log p$. Now the gradient is itself an expectation you can estimate by sampling: sample $x$, then weight $\nabla\log p_\theta(x)$ by the reward $f(x)$. That estimator is REINFORCE; subtracting a baseline $b$ (any value that does not depend on the sampled $x$) from $f$ can reduce its variance without introducing bias. Every fancy policy-gradient method (advantage estimates, PPO clipping, …) is, at its core, variance reduction or stabilization layered on this estimator. Learn this; the frameworks are disposable (Beware coding-agent dragons).
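A minimal sketch of the estimator on the smallest possible "policy" may help. Everything specific here is an assumption for illustration: a Bernoulli distribution with $p_\theta(x=1)=\sigma(\theta)$, a made-up reward with $f(1)=3$ and $f(0)=1$, and the baseline $b=\mathbb{E}[f]$. With only two outcomes, the estimator's mean and variance can be computed exactly by enumeration, alongside an ordinary sampled estimate.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reward(x):
    """Hypothetical reward f(x); treated as a black box (never differentiated)."""
    return 3.0 if x == 1 else 1.0


def score(theta, x):
    """d/dtheta log p_theta(x) for a Bernoulli with p_theta(x=1) = sigmoid(theta)."""
    return x - sigmoid(theta)


def reinforce_sample(theta, rng, baseline=0.0):
    """One-sample REINFORCE estimate: (f(x) - b) * score(x), with x ~ p_theta."""
    x = 1 if rng.random() < sigmoid(theta) else 0
    return (reward(x) - baseline) * score(theta, x)


def exact_mean_and_variance(theta, baseline=0.0):
    """Mean and variance of the one-sample estimator, by enumerating x in {0, 1}."""
    p = sigmoid(theta)
    outcomes = [(1, p), (0, 1.0 - p)]
    values = [(prob, (reward(x) - baseline) * score(theta, x)) for x, prob in outcomes]
    mean = sum(prob * g for prob, g in values)
    var = sum(prob * (g - mean) ** 2 for prob, g in values)
    return mean, var


theta = 0.5
p = sigmoid(theta)

# Closed form for this toy problem: J = p*f(1) + (1-p)*f(0), so dJ/dtheta = p(1-p)(f(1)-f(0)).
true_grad = p * (1.0 - p) * (reward(1) - reward(0))
expected_reward = p * reward(1) + (1.0 - p) * reward(0)

mean_plain, var_plain = exact_mean_and_variance(theta)
mean_base, var_base = exact_mean_and_variance(theta, baseline=expected_reward)

# Monte Carlo estimate with the baseline, as one would do when enumeration is impossible.
rng = random.Random(0)
n = 20000
mc_estimate = sum(reinforce_sample(theta, rng, expected_reward) for _ in range(n)) / n

print("true gradient:", true_grad)
print("no baseline   -> mean, variance:", mean_plain, var_plain)
print("with baseline -> mean, variance:", mean_base, var_base)
print("Monte Carlo estimate:", mc_estimate)
```

Both versions have the same mean (the true gradient); only the variance changes. Note that `reward` is only ever evaluated, never differentiated, which is the whole point of the trick.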

Why this matters here

It's how the Prediction models actually train (cross-entropy loss + SGD — see Cross-entropy and KL divergence).
The score-function estimator shows up well beyond RL (variational inference, black-box optimization, any "differentiate through a sampler" problem).
Understanding the optimizer's knobs (LR, batch size, schedule) prevents the silent misconfigurations that turn a small engineering slip into a wrong scientific conclusion.

By-hand exercise (meets the bar)

Do one SGD step for logistic regression: $\hat p=\sigma(wx)$, log-loss gradient $=(\hat p-y)x$. Use $w=0,\ x=1,\ y=1,\ \eta=0.5$. (Answer: $\hat p=0.5$, grad $=-0.5$, $w\leftarrow0.25$.)
Derive $\nabla_\theta J$ above yourself, justifying each equality, then explain why a constant baseline leaves it unbiased ($\mathbb{E}[\nabla\log p_\theta]=0$).
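After doing the first exercise by hand, a short script can confirm the numbers. This sketch also compares the stated gradient formula against a central finite difference of the log-loss; the step size `eps` is an arbitrary choice for the check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, x, y):
    """Binary cross-entropy for the prediction p_hat = sigmoid(w * x)."""
    p_hat = sigmoid(w * x)
    return -(y * math.log(p_hat) + (1.0 - y) * math.log(1.0 - p_hat))


w, x, y, lr = 0.0, 1.0, 1.0, 0.5

p_hat = sigmoid(w * x)
grad = (p_hat - y) * x   # analytic log-loss gradient from the exercise
w_new = w - lr * grad    # one SGD step

# Central finite difference as an independent check on the gradient formula.
eps = 1e-6
numeric_grad = (log_loss(w + eps, x, y) - log_loss(w - eps, x, y)) / (2.0 * eps)

print("p_hat:", p_hat, "grad:", grad, "w_new:", w_new)
print("finite-difference gradient:", numeric_grad)
```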