habib's rabbit hole

PPO for LLMs: Why We Clip, Why We Normalize, and Why It's Still Hard

Why are we here?

the wall of supervised learning

after spending thousands of gpu hours and an insane amount of dollars, we observe that although the model is technically correct at predicting the next word, it is kinda useless. in standard pre-training or sft (supervised fine-tuning) we teach llms to mimic patterns. the loss function is a simple cross entropy that penalizes the model for saying apple instead of orange. the problem is simple: just because a sentence has a high likelihood does not mean it is useful, meaning it should align with our preferences by being helpful or safe.

Ziegler et al. dropped a truth bomb in 2019: we do not need better models, we need a better way to reward the model. so instead of going for better next-token prediction, we want our models to optimize for a reward that in turn enhances the quality of their outputs.

the logic was simple:

the foundation:

to get the llm to generate the way humans want it to, or in other words follow human preference, we start with the grand-daddy of rl: the policy gradient (reinforce).

starting point : vanilla policy gradient

in llms, the policy (π) is the model itself. so for every prompt you give to the model, it will generate some response (sequence of tokens). all we want to do is to tweak the weights (θ) so that the high reward sequence becomes more likely.

the math:

$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$

where $\pi_\theta(a_t \mid s_t)$ is the probability the model assigns to token $a_t$ in context $s_t$, and $\hat{A}_t$ is the advantage: how much better the outcome was than expected.

if the advantage is positive -> the response was better than expected, so we want to increase the probability of the generated response. if it is negative -> we want to decrease it.
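to make this concrete, here is a minimal plain-python sketch of the reinforce objective for one generated response (the log-probs and advantage are made-up toy numbers, not from a real model):

```python
import math

def pg_loss(token_logprobs, advantage):
    # L^PG = E[ log pi(a_t|s_t) * A_t ], negated because optimizers minimize
    return -sum(lp * advantage for lp in token_logprobs)

# three generated tokens, each assigned probability 0.5,
# in a response that scored better than expected (positive advantage)
logprobs = [math.log(0.5)] * 3
loss = pg_loss(logprobs, advantage=2.0)
# minimizing this loss pushes the token log-probs (and hence the
# probability of the whole response) up
```

note the sign logic: with a positive advantage the loss is positive, and gradient descent on it raises the probability of exactly those tokens.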

the problem : sample efficiency

vanilla policy gradient has one flaw, and it is huge (that's what she said): it is a "one and done" (that's what he said) system. you generate data with the current model, use it for exactly one gradient update, and then have to throw it away.

a bit more of a deep dive into the above problem. first, we sample a batch of trajectories (responses) from the current model:

$\tau \sim \pi_{\theta_0}$

we compute the gradient (g0) to see how to improve the model.

$g_0 = \mathbb{E}_{\tau \sim \pi_{\theta_0}}\left[\sum_t \nabla_{\theta_0} \log \pi_{\theta_0}(a_t \mid s_t)\, \hat{A}_t\right]$

we apply the update to get a new, "smarter" model

$\theta_1 = \theta_0 + \alpha g_0$

here’s the catch: why the data becomes stale.

now you want to do another update to reach $\theta_2$. you look at the pile of data you collected in step 1. can you use it again? mathematically, no.

after the update, the correct gradient should be:

$\nabla_{\theta_1} J(\theta_1) = \mathbb{E}_{\tau \sim \pi_{\theta_1}}\left[\sum_t \nabla_{\theta_1} \log \pi_{\theta_1}(a_t \mid s_t)\, \hat{A}_t\right]$

notice the subscript. the expectation $\mathbb{E}$ must be taken over trajectories generated by $\pi_{\theta_1}$.

the problem: your data came from $\tau \sim \pi_{\theta_0}$, but your model is now $\theta_1$.

formally:

$\pi_{\theta_0}(\tau) \neq \pi_{\theta_1}(\tau)$

the solution? importance sampling

to save compute and not waste our data, we need a way to reuse it. we want to collect a bunch of responses using an old version of the model ($\pi_{\theta_{\text{old}}}$) and then use them to update the new model multiple times. but we also have to keep the updates in check: there is a distribution shift to account for, so the new model should not be much different from the old model. to do this we use a ratio to correct the math.

the logic: the probability ratio $r_t$:

$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$

imagine the old model gave the word "helpful" a 10% chance, and the new model gives it 12%. the ratio is $0.12 / 0.10 = 1.2$.

now, instead of just using $\log \pi$, we multiply the advantage by this ratio (1.2). this "importance sampling" trick lets us recycle data safely.
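in code, the ratio is almost always computed from log-probs for numerical stability; a sketch using the "helpful" numbers from above:

```python
import math

# probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed from
# log-probs (exp of a difference) rather than dividing raw probabilities
def prob_ratio(logp_new, logp_old):
    return math.exp(logp_new - logp_old)

# old model said 10%, new model says 12% (the example from the text)
r = prob_ratio(math.log(0.12), math.log(0.10))
# r is 1.2: this token counts 20% more than it did under the old policy
```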

the clip - ppo's central innovation

so we talked about advantage, right? a positive advantage means the model did something right and we want to increase its probability. but we do not want our model to drift too much, so we introduce a limit. the same goes for the other direction: when the advantage is negative we want to decrease the probability, but not so much that it changes the policy entirely.

$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(\underbrace{r_t(\theta)\hat{A}_t}_{\text{original}},\ \underbrace{\mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t}_{\text{clipped}}\right)\right]$

at first glance the above equation looks ugly but it is very simple

what does ppo want to do? (without clipping)

if we ignore clipping, the objective is simply to increase the probability of a response when the advantage is positive and decrease it when it is negative.

so far so good.

what ppo refuses to do

ppo refuses to let the policy change like a maniac in one step. this is where clipping comes in.

it enforces a hard rule which states:

“no matter what the gradient wants, don’t change action probabilities by more than ±ε.”

if ε = 0.2, the ratio $r_t$ is clipped to the range $[0.8, 1.2]$: in a single update, no action's probability ratio is allowed to move more than 20% in either direction.

this is a trust region, but enforced with algebra instead of constraints.
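the clipped objective fits in a few lines of plain python (a scalar sketch for one token; the ratios and advantages are toy numbers):

```python
def clipped_objective(ratio, advantage, eps=0.2):
    # L^CLIP = min( r * A, clip(r, 1 - eps, 1 + eps) * A )
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# positive advantage but the ratio already blew past 1 + eps:
# the gain is capped at 1.2 * 2.0 = 2.4 instead of 1.5 * 2.0 = 3.0
capped = clipped_objective(ratio=1.5, advantage=2.0)

# negative advantage with the ratio pushed below 1 - eps:
# min() keeps the more pessimistic value, 0.8 * -2.0 = -1.6
floored = clipped_objective(ratio=0.5, advantage=-2.0)
```

once the ratio is past the wall, the objective no longer depends on it, so the gradient through the policy is zero: that is the "trust region enforced with algebra".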

why the min? the safety

ppo takes: $\min(\text{unclipped objective},\ \text{clipped objective})$

why?

because the min always takes the lower (more pessimistic) of the two values. when a large ratio promises a big gain, the clipped version is smaller, so the gain gets capped; but when an update makes things worse, the unclipped version is smaller, so the full penalty flows through. improvements are bounded, mistakes are not. in other words:

“if a large update would make things look better, i don’t trust it. i’ll take the safer improvement.”

this makes ppo a conservative optimizer by design.

the ϵ choice: schulman et al. tested values around 0.1–0.3 and found ϵ = 0.2 to be a solid default; smaller values mean safer but slower updates.

kl penalty

why? even with clipping preventing the model from changing too much in one update, the model can still slowly drift into gibberish land over thousands of updates. thus we need a kl penalty.

the full objective:

$L^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right] - \beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$

what is what?

  1. $L_t^{CLIP}(\theta)$: The "Actor" (the policy we’ve been talking about).

  2. $c_1 L_t^{VF}(\theta)$: The "Critic" (Value Function). It helps predict rewards to reduce noise.

  3. $c_2 S[\pi_\theta]$: Entropy. It forces the model to keep exploring and not get stuck on one answer.

  4. $\beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$: The Penalty. This ensures the model doesn't become a "Reward Hacker."
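as a scalar sketch, the four pieces combine like this (the coefficient values are illustrative defaults i picked for the example, not canonical):

```python
def ppo_objective(l_clip, l_vf, entropy, kl, c1=0.5, c2=0.01, beta=0.1):
    # L^PPO = E[ L^CLIP - c1 * L^VF + c2 * S ] - beta * KL(pi || pi_ref)
    # note the signs: the value loss and the kl tax pull the objective down,
    # the entropy bonus pulls it up
    return l_clip - c1 * l_vf + c2 * entropy - beta * kl

# toy numbers: decent policy term, small critic error, healthy entropy,
# some drift from the reference model
obj = ppo_objective(l_clip=1.0, l_vf=0.2, entropy=3.0, kl=0.5)
```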

the kl penalty term - a closer look:

$\beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$

the term basically says

“every time you update, pay a tax for moving away from a trusted reference policy.”

what, or rather who is the reference model/policy?

there are two common choices for the reference model:

  1. episode-level reference: the old policy $\pi_{\theta_{\text{old}}}$ is treated as the reference model.
  2. fixed reference (llm style): the reference model is the sft model.

this is very critical for llms: without it, rlhf models lose fluency, hallucinate structure, and optimize reward at the cost of meaning.

how is clipping different from KL divergence?

well, it is very straightforward. clipping cares about whether the current step is too big, while the kl penalty cares about how far the new policy has wandered overall from the reference policy. one is local, the other is global, and for proper optimization we need both.
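in practice the kl term is usually estimated per token from the two models' log-probs on the sampled sequence; a minimal sketch (the log-prob values are made up):

```python
# monte-carlo estimate of KL(pi_theta || pi_ref) over tokens sampled
# from pi_theta: average of log pi_theta(a_t) - log pi_ref(a_t)
def sampled_kl(logp_theta, logp_ref):
    per_token = [lt - lr for lt, lr in zip(logp_theta, logp_ref)]
    return sum(per_token) / len(per_token)

# the current policy is more confident than the reference on both tokens,
# so the estimate is positive: we have drifted, and we pay the beta tax
kl = sampled_kl(logp_theta=[-1.0, -2.0], logp_ref=[-1.5, -2.5])
```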

the role of β:

it controls how strongly we penalize moving away from the reference.

penalty is:

$\beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$

if β is small, training is fast but there is a high risk of drift; if it is large, training is slow but safe.

fixed vs adaptive β

fixed values of β are simple and easy to implement, but they allow the kl to grow unpredictably. adaptive β is more principled: it is adjusted during training to hit a target kl.

in practice, most systems today use a combination of a fixed β and early stopping on kl, as it is simple, robust and operationally predictable.
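the adaptive variant from schulman et al.'s kl-penalty experiments can be sketched as follows (the 1.5 threshold and the doubling/halving are from that paper; `target_kl` is a tuning choice):

```python
def update_beta(beta, observed_kl, target_kl):
    # if the policy drifted more than we wanted, raise the tax;
    # if it barely moved, lower it so learning doesn't stall
    if observed_kl > 1.5 * target_kl:
        return beta * 2.0
    if observed_kl < target_kl / 1.5:
        return beta / 2.0
    return beta

tightened = update_beta(beta=0.1, observed_kl=0.30, target_kl=0.1)  # doubled
loosened = update_beta(beta=0.1, observed_kl=0.05, target_kl=0.1)   # halved
```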

normalization

by this point ppo looks solid, right? we reuse old data, clipping keeps the updates conservative, and the kl penalty prevents long-term drift. despite all this, without normalization it can all still fail.

why it matters?

see, the core update term of ppo is $r_t(\theta)\hat{A}_t$. this tiny multiplication decides 3 things:

  1. gradient magnitude : how much weights actually change
  2. clipping activation : whether we hit the 1±ϵ wall.
  3. dominance : which samples will get to teach the model

see, advantages are noisy: they depend on reward models (often imperfect) and value estimates (which can be wrong).

  1. failure mode 1 : suppose $\hat{A}_t$ is huge (say, +500). then $r_t\hat{A}_t$ becomes massive, the gradient explodes, a few high-reward samples dominate the entire update, and ppo turns back into unstable vanilla pg.

  2. failure mode 2 : if $\hat{A}_t$ is tiny (say, 0.0001), the gradients vanish. no learning happens: the model and its training look stable, but nothing is actually learned.
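the standard fix for both failure modes is to whiten the advantages per batch (zero mean, unit variance) before forming $r_t\hat{A}_t$; a plain-python sketch:

```python
def whiten(advantages, eps=1e-8):
    # subtract the batch mean, divide by the batch std; eps guards
    # against division by zero when all advantages are equal
    n = len(advantages)
    mean = sum(advantages) / n
    std = (sum((a - mean) ** 2 for a in advantages) / n) ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]

# a "+500" outlier no longer dominates: after whitening, every advantage
# lives on roughly the same scale
normed = whiten([500.0, 1.0, -2.0, 0.5])
```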

how to normalize:

step a : structured advantage via GAE

before normalizing we need "clean" advantages. for llms we use generalised advantage estimation (GAE):

the math:

$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^V$

where the TD (temporal difference) Residual is :

$\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$

so what just happened here? let us roll back a little to get the full understanding of the things.

we want to train a model that makes decisions and to improve this decision taking ability of the model we need to answer one question repeatedly:

"was this decision better or worse than expected?"

this number is called advantage.

first you find the value function $V(s_t)$. this is just a prediction of how good the future is going to be if you are in state $s_t$. for llms, the state is the prompt plus the tokens generated so far, and the value is the predicted future reward from here. this is learned by a critic network.

second, we find the immediate reward $r_t$. as the name suggests, this is the reward you get when you take the action.

in rlhf it is mostly 0 at most timesteps, since the reward model typically scores only the finished response.

then we move to the TD error :

$\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$

just look at the above terms, they basically translate to: what actually happened minus what i expected to happen.

$V(s_t)$ : what i thought would happen
$r_t + \gamma V(s_{t+1})$ : what actually happened (reward + future)

if this term is positive, things were better than expected; if it is negative, they were worse. simple.

suppose $r_t = 1$, $\gamma = 1$, $V(s_{t+1}) = 6$ and $V(s_t) = 5$. then:

$\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t) = 1 + (1 \times 6) - 5 = 2$

meaning : things turned out to be 2 units better than i expected them to be. this acts as a basic learning signal.
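the same worked example in code:

```python
# the td-error example from the text: reward 1, gamma 1,
# the critic predicted 5 for this state and predicts 6 for the next one
r_t, gamma = 1.0, 1.0
v_now, v_next = 5.0, 6.0
delta = r_t + gamma * v_next - v_now
# delta == 2.0: things went 2 units better than expected
```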

so the thing is, the TD error is very local (it takes only one time step into account) and therefore misses long-term effects. in long sequences (which llms obviously produce), single-step TD cannot do much justice. we need some way to take the future into account as well.

enter GAE. it preaches: look not only at the current surprise but also at future surprises, with a catch, which is to trust them less the further away they are.

coming back to that equation again that we mentioned at the top:

$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^V$

so in simple words, GAE is just a discounted sum of TD errors: instead of using only the current surprise, we add up the future ones too, each weighted down by $(\gamma\lambda)^l$.

but why two discount factors? this is kinda subtle but very important. gamma is the environment discount, which answers "how much do future rewards matter?", while lambda is a smoothing knob, which answers "how much do i trust the future TD errors?"
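gae is computed backwards over the sequence in one pass; a minimal sketch (toy rewards and values, not a real rollout):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    # values has one extra entry: the value of the state after the last action
    advantages, running = [], 0.0
    for t in reversed(range(len(rewards))):
        # td error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # fold in discounted future surprises: A_t = delta_t + gamma*lam*A_{t+1}
        running = delta + gamma * lam * running
        advantages.append(running)
    return advantages[::-1]

# with lam = 0, gae collapses to plain one-step td errors
adv = gae([1.0, 0.0], [5.0, 6.0, 0.0], gamma=1.0, lam=0.0)
```

cranking lambda up to 1 turns it into a full monte-carlo-style sum of surprises, so lambda really is the knob between "local and biased" and "far-sighted and noisy".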

why is it still hard?

by now ppo looks like a 10/10 thingy that's gonna solve all of our problems. you get the math, the clipping, the normalization. but there are still some challenges.

challenge 1 : reward model optimization (goodhart's law)

when a measure becomes a target, it ceases to be a good measure.

the reward model is just a mathematical guess at what humans like. if we train too hard, the llm finds its blind spots: it learns to output repetitive, perfectly grammatical nonsense that tricks the reward model into giving a high score even though humans would hate it.

challenge 2 : the length bias

reward models are often biased towards long responses: they act as if the longer the response, the better the quality.

size matters

because of this, ppo pushes the model to become a "yap-god", generating ever longer responses for a better reward.
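one common (and admittedly blunt) mitigation is to tax every generated token before the reward reaches ppo; the penalty size below is a made-up knob, not a recommended value:

```python
def length_penalized_reward(raw_reward, num_tokens, penalty_per_token=0.01):
    # subtract a small per-token cost so padding a response with extra
    # tokens only pays off if the reward model's score rises faster
    return raw_reward - penalty_per_token * num_tokens

# a 100-token response must beat a 50-token one by > 0.5 raw reward to win
short_r = length_penalized_reward(1.0, 50)
long_r = length_penalized_reward(1.2, 100)
```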

challenge 3 : training instability at scale

while training larger models (70b+ params) we hit NaN values and loss spikes, caused by issues like gradient explosion and mixed-precision numerics.

beyond ppo

ppo is still the gold standard but also very high maintenance. newer methods are trying to reach the same level of alignment without the headache of keeping 4 models in memory (policy, reference, reward model and critic).

why it still persists:

  1. flexibility: can incorporate any reward signal
  2. robustness: kl penalty provides guardrails
  3. proven: works at 100B+ scale (GPT-4, Claude, etc.)

references:

foundational papers

  1. ziegler et al., "fine-tuning language models from human preferences" (2019)
  2. christiano et al., "deep reinforcement learning from human preferences" (2017)
  3. ouyang et al., "training language models to follow instructions with human feedback" (2022)
  4. schulman et al., "proximal policy optimization algorithms" (2017)

further reading (if you want to go deeper)