Optimization & gradients
#evergreen #foundations #optimization
Up: 10_Foundations-MOC
Bar: take one SGD step by hand, and derive the log-derivative (score-function) trick that underlies
policy gradients — the idea, not the framework.
Gradient descent
Minimize a loss L(θ) by stepping downhill:
$$\theta \leftarrow \theta - \eta\,\nabla_\theta L(\theta)$$
η = learning rate. For convex L and a small enough η this converges to the global minimum; for the
non-convex losses of real models it typically settles in a good basin, with no guarantee of the global
optimum. Saddle points (gradient zero, not a minimum) are the typical obstacle in high dimensions, not
local minima.
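A minimal sketch of the update rule as a loop, assuming a hypothetical one-dimensional convex loss L(θ) = (θ − 3)² chosen only for illustration (the names `gradient_descent` and `grad_quadratic` are not from any library):

```python
def gradient_descent(grad, theta0, lr, steps):
    """Repeat theta <- theta - lr * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


def grad_quadratic(theta):
    """Gradient of the illustrative convex loss L(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)


theta_final = gradient_descent(grad_quadratic, theta0=0.0, lr=0.1, steps=100)
print(theta_final)  # very close to the minimizer theta = 3
```

With η = 0.1 each step shrinks the distance to the minimizer by a factor of 0.8, so the iterate converges geometrically. On this particular loss, a step size above 1.0 would make the same loop diverge.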
Worked: one SGD step (by hand)
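Linear model $\hat{y} = wx$, squared loss $L = \tfrac{1}{2}(wx - y)^2$. Gradient $\frac{\partial L}{\partial w} = (wx - y)\,x$.
Take $w = 0$, $x = 2$, $y = 1$, $\eta = 0.1$: prediction $0$, error $(0 \cdot 2 - 1) = -1$, gradient $= -1 \cdot 2 = -2$.
$$w \leftarrow 0 - 0.1\,(-2) = 0.2.$$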
One nudge toward fitting the point. SGD computes this same gradient on a small random minibatch, which
gives a noisy but unbiased estimate of the full gradient. The noise is a feature (it helps escape saddles
and acts as a regularizer); the cost is variance, tamed by learning-rate schedules and momentum (an EMA of
past gradients). Adam and its relatives are momentum plus per-parameter step sizes: useful, but
conveniences over this one idea.
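The hand computation above can be replayed in a few lines, along with a rough check of the "noisy but unbiased" claim. This is a sketch only: the four-point dataset, the value w = 0.5, and the helper names are assumptions made for illustration.

```python
import random


def grad_squared_loss(w, x, y):
    """Gradient of L = 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x


def minibatch_grad(w, batch):
    """Mean gradient over a batch of (x, y) pairs."""
    return sum(grad_squared_loss(w, x, y) for x, y in batch) / len(batch)


def sgd_step(w, batch, lr):
    """One SGD step: w <- w - lr * (mean minibatch gradient)."""
    return w - lr * minibatch_grad(w, batch)


# The by-hand example: w = 0, x = 2, y = 1, learning rate 0.1.
w_new = sgd_step(0.0, [(2.0, 1.0)], lr=0.1)
print(w_new)  # 0.2

# Hypothetical toy dataset, used only to illustrate unbiasedness.
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0)]
w = 0.5
full_grad = minibatch_grad(w, data)  # gradient over the whole dataset

rng = random.Random(0)
n_draws = 20_000
# Average the gradient over many random minibatches of size 2.
avg_minibatch_grad = sum(
    minibatch_grad(w, rng.sample(data, 2)) for _ in range(n_draws)
) / n_draws
print(full_grad, avg_minibatch_grad)  # the average should sit near full_grad
```

Any single minibatch gradient can be far from the full gradient; only the average over many draws matches it. That gap is the variance that schedules and momentum are meant to tame.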
The durable idea behind policy gradients
Sometimes you must optimize an expectation of something you can't differentiate through (a reward from a
sampled action, a non-differentiable metric):
$$J(\theta) = \mathbb{E}_{x \sim p_\theta}[f(x)].$$
You can't simply push ∇ inside the expectation, because the sampling distribution itself depends on θ. The log-derivative / score-function trick gets around this:
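$$\nabla_\theta J = \int f(x)\,\nabla_\theta p_\theta(x)\,dx = \int f(x)\,p_\theta(x)\,\nabla_\theta \log p_\theta(x)\,dx = \mathbb{E}_{x \sim p_\theta}\big[f(x)\,\nabla_\theta \log p_\theta(x)\big].$$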
The single step that makes it work: $\nabla_\theta p_\theta = p_\theta\,\nabla_\theta \log p_\theta$ (the chain rule applied to
$\log p_\theta$). Now the gradient is itself an expectation you can estimate by sampling: sample $x$, then
weight $\nabla_\theta \log p_\theta(x)$ by the reward $f(x)$. That estimator is REINFORCE; subtracting a
baseline $b$ that does not depend on $x$ from $f$ can reduce its variance without adding bias. Every fancy
policy-gradient method (advantage estimates, PPO clipping, …) is largely variance reduction and
stabilization on top of this. Learn this; the frameworks are disposable (Beware coding-agent dragons).
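A small numerical sketch of the estimator, under assumptions chosen so the answer is known in closed form: x ~ N(θ, 1) and a toy "reward" f(x) = x². Then J(θ) = θ² + 1, the true gradient is 2θ, and the score is ∇θ log pθ(x) = x − θ. The function name and the sample count are illustrative, not from any framework.

```python
import random


def score_function_gradient(theta, f, n_samples, baseline=0.0, seed=0):
    """Monte Carlo estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)].

    For a unit-variance Gaussian, grad_theta log p_theta(x) = x - theta.
    Returns (mean, sample variance) of the per-sample terms.
    """
    rng = random.Random(seed)
    terms = []
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        terms.append((f(x) - baseline) * (x - theta))
    mean = sum(terms) / n_samples
    var = sum((t - mean) ** 2 for t in terms) / (n_samples - 1)
    return mean, var


def reward(x):
    """Toy reward; with x ~ N(theta, 1), E[reward] = theta**2 + 1."""
    return x * x


theta = 1.5
true_grad = 2.0 * theta  # d/dtheta (theta**2 + 1)

plain_mean, plain_var = score_function_gradient(theta, reward, 200_000)

# Constant baseline b = E[f(x)] = theta**2 + 1, known analytically in this toy case.
base_mean, base_var = score_function_gradient(
    theta, reward, 200_000, baseline=theta**2 + 1
)

print(true_grad, plain_mean, base_mean)  # both estimates should land near 3.0
print(plain_var, base_var)               # the baseline version should vary less
```

Both estimators target the same gradient; the baseline only changes the per-sample spread. In a real problem E[f] is not known, so the baseline would itself be estimated, for example with a running average or a learned value function.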
Why this matters here
- It's how the Prediction models actually train (cross-entropy loss + SGD — see Cross-entropy and KL divergence).
- The score-function estimator shows up well beyond RL (variational inference, black-box optimization,
any "differentiate through a sampler" problem).
- Understanding the optimizer's knobs (LR, batch size, schedule) prevents the silent misconfigurations that
turn a small engineering slip into a wrong scientific conclusion.
By-hand exercise (meets the bar)
- Do one SGD step for logistic regression: $\hat{p} = \sigma(wx)$, log-loss gradient $= (\hat{p} - y)\,x$. Use
$w = 0$, $x = 1$, $y = 1$, $\eta = 0.5$. (Answer: $\hat{p} = 0.5$, grad $= -0.5$, $w \leftarrow 0.25$.)
- Derive $\nabla_\theta J$ above yourself, justifying each equality, then explain why a constant baseline
leaves it unbiased ($\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0$).
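After working both exercises by hand, a short script can check the arithmetic. This is a sketch: the Bernoulli policy with p = σ(θ) is an assumed example for the zero-mean-score identity, not something the exercise prescribes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Exercise 1: one SGD step for logistic regression.
w, x, y, lr = 0.0, 1.0, 1.0, 0.5
p_hat = sigmoid(w * x)
grad = (p_hat - y) * x      # log-loss gradient with respect to w
w_new = w - lr * grad
print(p_hat, grad, w_new)   # 0.5 -0.5 0.25


# Exercise 2 (sanity check only): E[grad log p_theta(x)] = 0 for a
# Bernoulli distribution with P(x = 1) = sigmoid(theta).
def bernoulli_score_mean(theta):
    p = sigmoid(theta)
    # d/dtheta log p_theta(x) = x - p for x in {0, 1}
    score_if_one = 1.0 - p
    score_if_zero = 0.0 - p
    return p * score_if_one + (1.0 - p) * score_if_zero


print(bernoulli_score_mean(0.7))  # zero up to floating-point rounding
```

Because the score averages to zero, E[b ∇θ log pθ(x)] = b · 0 = 0 for any constant b, which is the fact the second exercise asks you to justify.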
Links