PPO for LLMs: Why We Clip, Why We Normalize, and Why It's Still Hard
Why are we here?
the wall of supervised learning
after spending thousands of gpu hours and an absurd amount of money, we observe that although the model is technically good at predicting the next word, it is kind of useless. in standard pre-training or sft (supervised fine-tuning) we teach llms to mimic patterns. the loss function is a simple cross-entropy that penalizes the model for saying "apple" instead of "orange". the problem is simple: just because a sentence has a high likelihood does not mean that it is useful, i.e. aligned with our preferences by being helpful or safe.
ziegler et al. dropped a truth bomb in 2019 (fine-tuning language models from human preferences), suggesting that we do not need better models; we need a better way to reward the model. so instead of optimizing for better prediction, we want our models to optimize for a reward that will in turn enhance the quality of the model's outputs.
the logic was simple:
- let the llm generate a response
- give the response a score (a reward) depending on how good it is, judged by humans, so human preference is taken into account
- use PPO to nudge the model towards a higher score / reward.
the foundation:
to get the llm to generate the way humans want, or in other words to follow human preference, we start with the grand-daddy of rl: the policy gradient (reinforce).
starting point : vanilla policy gradient
in llms, the policy ($\pi_\theta$) is the model itself. so for every prompt you give the model, it generates some response (a sequence of tokens). all we want to do is tweak the weights ($\theta$) so that high-reward sequences become more likely.
the math:

$$ \nabla_\theta J(\theta) = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right] $$

where :
the goal: $J(\theta)$ is the objective function we want to maximize (calling it a "loss" is loose; the loss is its negative).
the context: $\hat{\mathbb{E}}_t$ stands for the empirical expectation at timestep $t$. we don't just look at one word; we look at a whole "batch" of generated text and take the average performance.
the action: $\log \pi_\theta(a_t \mid s_t)$ is the log-probability of the model taking a specific action ($a_t$) given the current state ($s_t$)
- state ($s_t$): the prompt and all the words generated so far
- action ($a_t$): the next word the model chooses
the feedback ($\hat{A}_t$): this is the advantage estimate, the secret sauce. it tells the model how much better the generated response was than the average expected response.
if it is positive -> the response was better than expected, and thus we increase the probability of the generated response.
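this surrogate can be sketched in a few lines of numpy. all numbers below are hypothetical, and a real implementation would backprop through the log-probabilities; this only shows the shape of the computation:

```python
import numpy as np

# toy batch: log-probabilities the policy assigned to the sampled tokens,
# and the advantage estimate for each one (all numbers are made up)
log_probs = np.array([-1.2, -0.5, -2.3, -0.9])
advantages = np.array([1.0, -0.5, 2.0, 0.3])

# vanilla policy gradient maximizes E[ log pi(a|s) * A ]; in an autodiff
# framework you would minimize the negative of this surrogate
objective = (log_probs * advantages).mean()
loss = -objective
```

note how a token with a negative advantage (the second one) contributes a push *away* from that choice.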
the problem : sample efficiency
vanilla policy gradient has one flaw, and it is huge (that's what she said): it is a "one and done" system.
- one trajectory, one gradient update: you generate a response, calculate the gradient, and update the model. once the model weights change, the old data becomes stale and you have to throw it away, because the model is like "i have never seen this man (data) in my life before!".
a deeper dive into the above problem:
- step 1: the sampling. first we let our current model ($\pi_{\theta_k}$) talk and collect a batch of responses.
- step 2: the calculation. we compute the gradient ($\nabla_\theta J(\theta_k)$) to see how to improve the model.
- step 3: the update. we apply the update to get a new, "smarter" model $\theta_{k+1}$.
here’s the catch: why the data becomes stale.
now you want to do another update to reach $\theta_{k+2}$. you look at the pile of data you collected in step 1. can you use it again? mathematically, no.
after the update, the correct gradient should be:

$$ \nabla_\theta J(\theta_{k+1}) = \mathbb{E}_{\tau \sim \pi_{\theta_{k+1}}} \left[ \nabla_\theta \log \pi_{\theta_{k+1}}(a_t \mid s_t)\, \hat{A}_t \right] $$

notice the subscript: the expectation must be taken over trajectories generated by $\pi_{\theta_{k+1}}$.
the problem: your data came from $\pi_{\theta_k}$, but your model is now $\pi_{\theta_{k+1}}$.
formally:

$$ \mathbb{E}_{\tau \sim \pi_{\theta_k}}[\,\cdot\,] \neq \mathbb{E}_{\tau \sim \pi_{\theta_{k+1}}}[\,\cdot\,] $$
the solution? importance sampling
to avoid wasting data (and compute), we need a way to reuse it. we want to collect a batch of responses using an old version of the model ($\pi_{\theta_{old}}$) and then use it to update the new model multiple times. we also need to keep the updates in check: we have to account for distribution shift, so our new model should not be too different from the old model. to do this we use a ratio to correct the math.
the logic: the probability ratio ($r_t(\theta)$):

$$ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} $$

imagine the old model gave the word "helpful" a 10% chance.
old probability ($\pi_{\theta_{old}}$): 0.10
new probability ($\pi_\theta$): 0.12 (after a small update)
ratio ($r_t$): $0.12 / 0.10 = 1.2$
now, instead of just using $\log \pi_\theta(a_t \mid s_t)$, we multiply the advantage by this ratio ($r_t(\theta)\,\hat{A}_t$). this "importance sampling" trick lets us recycle data safely.
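in practice the ratio is computed in log space from stored log-probs, which is numerically safer than dividing tiny probabilities. a tiny sketch using the hypothetical numbers above:

```python
import numpy as np

p_old = 0.10  # pi_theta_old("helpful" | prompt), stored during sampling
p_new = 0.12  # pi_theta("helpful" | prompt), after a small update

# exponentiate the log-prob difference instead of dividing probabilities
ratio = np.exp(np.log(p_new) - np.log(p_old))  # ~1.2

advantage = 2.0                # hypothetical advantage for this token
surrogate = ratio * advantage  # replaces log-prob * A in the objective
```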
the clip - ppo's central innovation
so we talked about advantage, right? a positive advantage means the model did something right and we want to increase its probability; but we do not want the model to drift too much, so we introduce a limit. same when we want to decrease the probability: the advantage is less than 0, so we decrease it, but not so much that it changes the policy entirely. ppo bakes this into the clipped objective:

$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right) \hat{A}_t \right) \right] $$

at first glance the equation looks ugly, but it is very simple.
what does ppo want to do? (without clipping)
if we ignore clipping, the objective simply increases the probability of a response when the advantage is positive and decreases it when it is negative.
so far so good.
what ppo refuses to do
ppo refuses to let the policy change like a maniac in one step. this is where clipping comes in.
it enforces a hard rule which states:
“no matter what the gradient wants, don’t change action probabilities by more than ±ε.”
if ε = 0.2:
max increase = 1.2×
max decrease = 0.8×
this is a trust region, but enforced with algebra instead of constraints.
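a minimal numpy sketch of the clipped surrogate, on toy numbers; the comments walk through both signs of the advantage:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """per-token clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# good token (A > 0): ratio 1.5 wants a big boost, but min takes the
# clipped value 1.2 * 2.0 = 2.4, capping the incentive
capped = ppo_clip_objective(1.5, 2.0)

# bad token (A < 0): ratio 0.5 has already dropped below 1 - eps; min takes
# the clipped value 0.8 * -2.0 = -1.6, which is constant in theta, so the
# gradient stops pushing this probability down any further
flat = ppo_clip_objective(0.5, -2.0)
```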
why the min? the safety
ppo takes: min(unclipped objective, clipped objective)
why?
because the min always keeps the more pessimistic (smaller) of the two values: if the unclipped term claims a bigger improvement than the clipped one, we don't trust it and take the clipped value instead, so improvements beyond the clip range simply don't count. in other words:
“if a large update would make things look better, i don’t trust it. i’ll take the safer improvement.”
this makes ppo a conservative optimizer by design.
the ϵ choice:
- ϵ=0.2 is standard (20% probability change max)
- smaller ϵ = safer but slower learning
- larger ϵ = faster but riskier
kl penalty
why? even with clipping, which prevents the model from changing too much in one update, the issue of eventually drifting into gibberish land still persists over thousands of updates. thus we need a kl penalty.
the full objective:

$$ L(\theta) = \hat{\mathbb{E}}_t \left[ L^{CLIP}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2 S[\pi_\theta](s_t) \right] - \beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right) $$

what is what?
$L^{CLIP}_t$ : the "actor" (the clipped policy objective we’ve been talking about).
$L^{VF}_t$ : the "critic" (value function loss). it helps predict rewards to reduce noise.
$S[\pi_\theta]$ : entropy. it forces the model to keep exploring and not get stuck on one answer.
$\beta\,\mathrm{KL}(\pi_\theta \| \pi_{ref})$ : the penalty. this ensures the model doesn't become a "reward hacker."
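putting the four pieces (actor, critic, entropy, kl penalty) together on hypothetical per-token numbers; the weighting coefficients are illustrative, not tuned values:

```python
import numpy as np

# hypothetical per-token quantities for a tiny batch (all made up)
clip_obj = np.array([0.4, -0.1, 0.2])      # actor: clipped surrogate
value_loss = np.array([0.30, 0.10, 0.20])  # critic: squared value error
entropy = np.array([2.1, 1.9, 2.0])        # exploration bonus
kl_to_ref = np.array([0.05, 0.02, 0.03])   # divergence from reference model

c1, c2, beta = 0.5, 0.01, 0.1  # illustrative coefficients

# maximize: clip objective - c1 * value loss + c2 * entropy - beta * KL
total_objective = (clip_obj - c1 * value_loss
                   + c2 * entropy - beta * kl_to_ref).mean()
```

the signs matter: the value loss and the kl term are subtracted (we want them small), while entropy is added (we want to keep it from collapsing).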
the kl penalty term - a closer look:
the term basically says
“every time you update, pay a tax for moving away from a trusted reference policy.”
what, or rather who is the reference model/policy?
there are two common choices for the reference model:
- episode level reference : where the old policy is treated as the reference model
reset at the start of each PPO rollout
prevents aggressive per-iteration drift
used in classic PPO
- fixed reference (llm style): here the reference model is the sft model.
never changes
represents “human-like language”
acts as a long-term anchor
this is very critical for llms: without it, rlhf models lose fluency, hallucinate structure, and optimize reward at the cost of meaning.
how is clipping different from KL divergence?
well, it is very straightforward: clipping cares about whether the current step is too big, while the kl penalty cares about how far the new policy has wandered overall from the reference policy. one is local, the other is global, and for proper optimization we need both.
the role of $\beta$:
it controls how strongly we penalize divergence from the reference model.
the penalty is:

$$ \beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right) $$

if $\beta$ is small then training is fast but there is a high risk of drift; if it is large then training is slow but safe.
fixed vs adaptive
fixed values of beta are simple and easy to implement, but they allow the kl to grow unpredictably. adaptive beta values are more principled: beta is adjusted to hit a target kl.
that said, most systems today use a combination of fixed beta and early stopping on kl, as it is simple, robust and operationally predictable.
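the adaptive scheme can be sketched as a proportional controller in the spirit of ziegler et al.: nudge beta up when the observed kl overshoots the target and down when it undershoots. the constants here are illustrative:

```python
def update_beta(beta, observed_kl, target_kl=0.01, horizon=10000, n_steps=100):
    """adaptive kl coefficient (sketch): move beta so the measured kl tracks
    target_kl. 'horizon' sets how gently it reacts; the proportional error
    is clipped to [-0.2, 0.2] for stability."""
    error = observed_kl / target_kl - 1.0
    error = max(-0.2, min(0.2, error))
    return beta * (1.0 + error * n_steps / horizon)

# kl came in at double the target -> beta is nudged up slightly
beta = update_beta(beta=0.1, observed_kl=0.02)
```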
normalization
by this point ppo looks solid, right? we reuse old data, use clipping to keep the updates conservative, and add a kl penalty to prevent long-term drift. despite all this, without normalization it can still fail.
why it matters?
see, the core update term of ppo is $r_t(\theta)\,\hat{A}_t$. this tiny multiplication decides 3 things:
- gradient magnitude : how much weights actually change
- clipping activation : whether we hit the wall.
- dominance : which samples will get to teach the model
see, advantages are noisy: they depend on reward models, which are often imperfect, and value estimates, which can be wrong.
failure mode 1 : say $\hat{A}_t$ is huge (say, +500). then $r_t(\theta)\,\hat{A}_t$ becomes massive, the gradient explodes, a few samples dominate the entire update, and ppo turns back into unstable vanilla pg.
failure mode 2 : if $\hat{A}_t$ is tiny (say, 0.0001), the gradients vanish. no learning happens: the model and its training look stable, but it does not learn anything.
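the standard fix is to standardize advantages per batch; a minimal sketch with made-up numbers:

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """standardize the batch: zero mean, unit std. huge outliers no longer
    dominate the update, and tiny advantages no longer vanish."""
    return (adv - adv.mean()) / (adv.std() + eps)

raw = np.array([500.0, -3.0, 2.0, 1.0])  # one outlier reward dominates
norm = normalize_advantages(raw)          # mean ~0, std ~1 after scaling
```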
how to normalize:
step a : structured advantages via GAE
before normalizing we need "clean" advantages. for llms we use generalised advantage estimation (GAE):
the math:

$$ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l} $$

where the TD (temporal difference) residual is:

$$ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$
so what just happened here? let us roll back a little to get the full understanding of the things.
we want to train a model that makes decisions, and to improve this decision-making ability we need to answer one question repeatedly:
"was this decision better or worse than expected?"
this number is called advantage.
first, the value function $V(s_t)$: this is just a prediction of how good the future is going to be if you are in state $s_t$. for llms the state is nothing but the prompt plus the tokens generated so far, and the value is the predicted future reward from here. all of this is learned by a critic network.
second, we find the immediate reward $r_t$. as the name suggests, this is the reward you get when you take the action.
in rlhf it is mostly 0 at most timesteps (the reward model's score typically arrives only at the final token).
then we move to the TD error:

$$ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$

the terms basically translate to: what actually happened minus what i expected to happen.
$r_t + \gamma V(s_{t+1})$ : what actually happened (reward + future)
$V(s_t)$ : what i thought would happen
if $\delta_t$ is positive then things were better than expected, and if it is negative then they were worse. simple.
suppose :
- $V(s_t) = 5$ --- i expect a reward of 5 from here
- i take an action
- i get $r_t = 1$
- the next state value is $V(s_{t+1}) = 6$
- let $\gamma = 1$ for simplicity
then :

$$ \delta_t = 1 + 1 \cdot 6 - 5 = 2 $$

meaning : things turned out to be 2 units better than i expected them to be. this acts as the basic learning signal.
the thing is, the TD error is very local (it only accounts for one time step), so it misses long-term effects. in the long sequences that llms obviously produce, single-step TD cannot do much justice. we need some way to take the future into account as well.
enter GAE. it preaches: don't only look at the current surprise, also look at the future surprises, but with a catch, which is to trust them less the further away they are.
coming back to the equation we mentioned at the top:
- $\delta_t$ : surprise now
- $\delta_{t+1}$ : surprise one step later
- $\delta_{t+2}$ : surprise two steps later...
- each subsequent surprise gets discounted by $(\gamma\lambda)^l$
so in simple words, GAE is just a discounted sum of TD errors, taken so that we don't rely on the current surprise alone.
but why two of these factors? this is kinda subtle but very important. gamma ($\gamma$) is the environment discount, which answers "how much do future rewards matter?", while lambda ($\lambda$) is the smoothing knob, which answers "how much do i trust the future TD errors?"
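the whole estimator fits in a short backward loop; this sketch also reproduces the single-step worked example from earlier (value 5, reward 1, next value 6, gamma 1):

```python
import numpy as np

def gae(rewards, values, gamma=1.0, lam=0.95):
    """generalised advantage estimation: a discounted sum of TD residuals,
    computed backwards in one pass. `values` carries one extra entry for
    the state after the final action."""
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: (reward + discounted future) - what we expected
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # fold in later surprises, discounted by gamma * lam per step
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# the worked example from the text: a single step, delta = 1 + 6 - 5 = 2
adv = gae(rewards=[1.0], values=[5.0, 6.0], gamma=1.0)
```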
why is it still hard?
by now ppo looks like a 10/10 thing that's gonna solve all of our problems. you get the math, the clipping, the normalization. but there are still some challenges.
challenge 1 : reward model optimization (goodhart's law)
when a measure becomes a target, it ceases to be a good measure.
the reward model is just a mathematical guess of what humans like. if we train too hard against it, the llm finds its blind spots: it learns to output repetitive, perfectly grammatical nonsense that tricks the reward model into giving a high score even though humans would hate it.
challenge 2 : the length bias
reward models are inherently biased towards long responses: they tend to think the longer the response, the better the quality.
size matters
because of this, ppo pushes the model to become a "yap-god", generating ever-longer responses for a better reward.
challenge 3 : training instability at scale
while training larger models (70b+ params) we hit NaN values and loss spikes, due to issues like gradient explosion and mixed-precision numerics.
beyond ppo
ppo is still the gold standard but also very high maintenance. newer methods try to reach the same level of alignment without the four-models-in-memory headache.
why it still persists:
- flexibility: can incorporate any reward signal
- robustness: kl penalty provides guardrails
- proven: works at 100B+ scale (GPT-4, Claude, etc.)
references:
foundational papers
christiano et al. (2017) — deep reinforcement learning from human preferences — the original rlhf idea.
schulman et al. (2017) — proximal policy optimization algorithms — the ppo paper itself.
ziegler et al. (2019) — fine-tuning language models from human preferences — the paper discussed at the start of this post.
further reading (if you want to go deeper)
schulman et al. (2015) — trust region policy optimization (trpo) ppo's predecessor — the constrained optimization approach that ppo simplified.
stiennon et al. (2020) — learning to summarize with human feedback the bridge between christiano 2017 and instructgpt — showed rlhf works at scale on real nlp tasks.
bai et al. (2022) — training a helpful and harmless assistant with rlhf (anthropic / constitutional ai) how anthropic applied rlhf + additional safety alignment, relevant to the "beyond ppo" section.