habib's rabbit hole

Bradley-Terry Model

Introduction

The Bradley-Terry model is the secret sauce behind how we rank things that don't have a clear "score." If you've ever wondered how sports teams are ranked when they haven't all played each other, or how Netflix knows which movie is "better" based on your clicks, you've met the Bradley-Terry model.

Visualization Tool Link

The Problem - Incomplete Universe!

Imagine you are running a massive ping-pong tournament with 1,000 players. To find out exactly who is the best using a traditional leaderboard, everyone would have to play everyone else. That is 499,500 matches.
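As a quick sanity check on that number, the match count for a full round robin is just "1,000 choose 2":

```python
from math import comb

# Number of unique pairings in a 1,000-player round robin: C(1000, 2)
players = 1000
matches = comb(players, 2)
print(matches)  # 499500
```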

If each player only plays 5 matches, your leaderboard is a lie. Player A might have 5 wins against newbies, while Player B has 2 wins against professionals. Who is actually better?

This is why we need the Bradley-Terry Model (BTM). It doesn't care about your win-loss record; it cares about the quality of your opposition. It turns a messy web of random encounters into a single, clean "Power Ranking."

Core Philosophy - "The Latent Worth"

It assumes that every team/player/option has a latent (hidden) worth, which we can think of as strength, called β (beta). And the thing about beta is that you can't measure it directly. You only see it when two things collide: whichever has the higher value of beta is more likely to win!

The winning probability

The BTM defines the probability of i beating j as:

P(i > j) = βᵢ / (βᵢ + βⱼ)

But in modern machine learning, β is a bit of a nightmare. It has to be a positive number, which is hard for a neural network to output consistently without a lot of extra math. Instead, we use a reward, r. Think of the reward as the "internal score" a model gives to an answer. It can be any number—positive, negative, or zero. We link them using the exponential function: βᵢ = eʳᵢ.

The Genius of "Sigmoid"

By substituting βᵢ = eʳᵢ into our original Bradley-Terry formula, we get a beautiful mathematical transformation. If we want the probability of i beating j:

P(i > j) = eʳᵢ / (eʳᵢ + eʳⱼ)

If we divide both the numerator and denominator by eʳᵢ, the equation collapses into the Sigmoid Function:

P(i > j) = 1 / (1 + e⁻⁽ʳᵢ − ʳⱼ⁾) = σ(rᵢ − rⱼ)

Why is this a breakthrough? Because the model no longer cares about the absolute value of the rewards. It only cares about the difference between them (rᵢ − rⱼ). This turns a comparison problem into a simple distance problem on a number line.
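We can verify both claims numerically: the β form and the sigmoid form agree, and shifting every reward by the same constant changes nothing. A small sketch (helper names `p_beta` and `sigmoid` are mine):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def p_beta(r_i: float, r_j: float) -> float:
    # Original Bradley-Terry form, with beta = e^r
    b_i, b_j = math.exp(r_i), math.exp(r_j)
    return b_i / (b_i + b_j)

r_i, r_j = 1.2, 0.4

# The two forms are the same function:
assert abs(p_beta(r_i, r_j) - sigmoid(r_i - r_j)) < 1e-12

# Only the difference matters: shift every reward by +100, nothing changes
assert abs(p_beta(r_i + 100, r_j + 100) - p_beta(r_i, r_j)) < 1e-9
```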

Training: "Scolding" the Model with Loss

This is where the Bradley-Terry model becomes an instructor. During training, we show the model a "Winner" (y_w) and a "Loser" (y_l). We want the probability of the winner beating the loser to be as close to 1.0 (100%) as possible. To do this, we use a Negative Log-Loss function:

Loss = −ln(σ(r_w − r_l))
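The loss is as small as this formula looks. A minimal sketch, using plain floats rather than an ML framework (`bt_loss` is my name for it):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_w: float, r_l: float) -> float:
    """Negative log-likelihood that the winner beats the loser."""
    return -math.log(sigmoid(r_w - r_l))

# A model that is confidently right (big positive gap) gets a tiny loss:
print(bt_loss(3.0, 0.0))  # ~0.049

# A model with no opinion (zero gap) pays -ln(0.5):
print(bt_loss(0.0, 0.0))  # ~0.693
```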

Let's look at the "scold" in action.

Imagine the model gets cocky and assigns these rewards: r_w = 2 for the winner and r_l = 3 for the loser.

The difference is (2 − 3) = −1.

The Sigmoid of −1 is roughly 0.27.

This means the model thinks the winner only had a 27% chance of winning. It's wrong. When we plug that into our loss function:

Loss = −ln(0.27) ≈ 1.31 (High Loss)

This high loss value acts like a "scold" from the optimizer. It tells the model: "You were way off! Increase the reward for the winner and drop the reward for the loser." The model adjusts its weights, the gap (r_w − r_l) grows positive, the Sigmoid moves toward 1, and the Loss drops toward 0. This is how AI learns to prefer "good" answers over "bad" ones.
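We can watch the scold work on this exact example with a few steps of hand-rolled gradient descent. The gradient of −ln σ(r_w − r_l) with respect to r_w is −(1 − σ); the learning rate and step count below are arbitrary choices for illustration:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# The "cocky" example: the winner scored 2, the loser scored 3
r_w, r_l = 2.0, 3.0
lr = 0.5  # learning rate (illustrative)

for step in range(200):
    p = sigmoid(r_w - r_l)   # current P(winner beats loser)
    grad_w = -(1.0 - p)      # dLoss/dr_w  (dLoss/dr_l is +(1 - p))
    r_w -= lr * grad_w       # push the winner's reward up
    r_l += lr * grad_w       # push the loser's reward down

loss = -math.log(sigmoid(r_w - r_l))
print(round(loss, 4))  # the loss has dropped close to 0
```

On a real dataset the gap can't grow forever, because each answer appears in many comparisons and the rewards come from a shared neural network; here the toy version just illustrates the direction of the update.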