The same 72-year-old equation ranks chess grandmasters and GPT-4.

I was deep in a rabbit hole on competitive gaming rating systems (Elo, TrueSkill, OpenSkill), trying to understand how to measure individual player skill in team games. The math underneath much of it is a model called Bradley-Terry from 1952: given two players A and B with ratings $\beta_A$ and $\beta_B$, the probability that A beats B is:

$$P(A > B) = \frac{e^{\beta_A}}{e^{\beta_A} + e^{\beta_B}}$$

Which, if you look closely, is just a sigmoid of their rating difference. Elegant. Stable. Only 72 years old. Yes, only 72.
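
To make the sigmoid claim explicit, divide the numerator and denominator by $e^{\beta_A}$:

$$P(A > B) = \frac{e^{\beta_A}}{e^{\beta_A} + e^{\beta_B}} = \frac{1}{1 + e^{-(\beta_A - \beta_B)}} = \sigma(\beta_A - \beta_B)$$

The familiar Elo expected-score formula is the same expression in different units. If you write an Elo rating as $R = \frac{400}{\ln 10}\,\beta$ (the symbol $R$ is introduced here for illustration), then:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}} = \sigma(\beta_A - \beta_B)$$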

Then I opened one of the most-cited AI alignment papers of 2023, Stanford’s DPO, and found the same equation on page 4.

Not a metaphor. Not “inspired by.” Literally the same formula.

The equation

Here’s the thing. When Chatbot Arena asks you “which response is better?” and you pick one, your vote feeds a Bradley-Terry model. The same model at the core of what chess federations have used since the 1960s. The same model at the core of FACEIT, Glicko-2, and every ranked ladder you’ve ever climbed. Or heard of. Or are hearing about for the very first time.
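
Here is a minimal sketch of what “a vote updates a rating” can look like, written as a classic online Elo update. The 400-point scale and K = 32 are conventional values assumed for illustration, and the function names are hypothetical; this is not the code of Chatbot Arena, FACEIT, or any federation.

```python
import math

def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Elo expected score for A: Bradley-Terry with ratings in base-10 units."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

def bradley_terry(beta_a: float, beta_b: float) -> float:
    """Bradley-Terry win probability for A, with ratings in natural-log units."""
    return math.exp(beta_a) / (math.exp(beta_a) + math.exp(beta_b))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Move both ratings by K times (actual result - expected result)."""
    surprise = (1.0 if a_won else 0.0) - expected_score(r_a, r_b)
    return r_a + k * surprise, r_b - k * surprise

if __name__ == "__main__":
    # Two equally rated models; the voter prefers A.
    print(elo_update(1500.0, 1500.0, a_won=True))
```

The term `actual - expected` is also the gradient of the Bradley-Terry log-likelihood with respect to A’s rating, so each Elo update is one small gradient step on that likelihood.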

When DPO (Direct Preference Optimization) trains a language model on human preferences, the loss function IS the Bradley-Terry likelihood. Given a prompt $x$, a preferred response $y_w$, and a rejected response $y_l$, the DPO objective is:

$$\mathcal{L}_\text{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\right]$$

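That $\sigma(\cdot)$ in the center is the sigmoid from Bradley-Terry. The “rating” of a response is its implicit reward, $\beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$, and the argument of the sigmoid is the rating difference between winner and loser. (This $\beta$ is DPO’s temperature hyperparameter, not a player rating.) The model learns by treating each training pair (“this response is better than that one”) exactly like a chess game where response A beat response B. The rating update and the gradient update are structurally the same operation.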
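
A minimal sketch of that loss for a single preference pair, assuming you already have the summed log-probabilities of each response under the policy and the reference model. The function names, the example numbers, and the default `beta=0.1` are illustrative assumptions, not taken from the DPO authors’ code.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """beta * log(pi_theta(y|x) / pi_ref(y|x)): the response's 'rating'."""
    return beta * (logp_policy - logp_ref)

def dpo_pair_loss(logp_w: float, logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """Negative Bradley-Terry log-likelihood that the preferred response wins."""
    r_w = implicit_reward(logp_w, ref_logp_w, beta)
    r_l = implicit_reward(logp_l, ref_logp_l, beta)
    return -log_sigmoid(r_w - r_l)

if __name__ == "__main__":
    # Policy identical to the reference: both ratings are 0, a coin flip.
    print(dpo_pair_loss(-11.0, -11.0, -11.0, -11.0))
    # Policy has shifted probability toward the preferred response.
    print(dpo_pair_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy equals the reference, the loss is $\log 2$, the cost of a coin flip. It falls as the policy raises the preferred response’s implicit reward relative to the rejected one.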

The paper’s title says it: “Your Language Model is Secretly a Reward Model.”

It could equally say: “Your Language Model is Secretly an Elo Rating System.”

It goes deeper than pairs

Bradley-Terry handles pairs. A vs B. But humans naturally rank things in lists: “this is best, this is second, this is worst.” In competitive gaming, this happens every match: 10 players ranked 1st through 10th by performance.

The generalization is Plackett-Luce (1975). The probability of a specific ranking order $(a_1 \succ a_2 \succ \cdots \succ a_k)$ is:

$$P(a_1 \succ a_2 \succ \cdots \succ a_k) = \prod_{j=1}^{k} \frac{e^{\beta_{a_j}}}{\sum_{l=j}^{k} e^{\beta_{a_l}}}$$

Each position in the ranking is a sequential choice: “from the remaining players, who was best?” You multiply those conditional probabilities across the whole order. With $k = 2$ the product collapses back to Bradley-Terry. It’s the same model that powers free-for-all rating in multiplayer games; OpenSkill uses it. Google DeepMind’s LiPO paper (NAACL 2025) formalized the listwise side of the connection by treating preference alignment as a ranking problem:

DPO = Bradley-Terry pairwise loss = Elo update.
LiPO = Plackett-Luce listwise loss = OpenSkill FFA update.

Same math. Different variable names. One side ranks players. The other side trains AI.
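
The Plackett-Luce product above can be sketched in a few lines. This is an illustrative implementation with a hypothetical function name, not OpenSkill’s or LiPO’s code; the input is the list of ratings already sorted from first place to last.

```python
import math
from typing import Sequence

def plackett_luce_prob(betas_in_rank_order: Sequence[float]) -> float:
    """Probability of observing exactly this finishing order."""
    prob = 1.0
    for j in range(len(betas_in_rank_order)):
        # Players still "in the pool" when position j is decided.
        remaining = betas_in_rank_order[j:]
        denom = sum(math.exp(b) for b in remaining)
        prob *= math.exp(betas_in_rank_order[j]) / denom
    return prob

if __name__ == "__main__":
    # Three players with ratings 1.3, 0.5, -0.2 finishing in that order.
    print(plackett_luce_prob([1.3, 0.5, -0.2]))
    # With two players this is the Bradley-Terry win probability.
    print(plackett_luce_prob([1.0, 0.0]))
```

The last factor is always 1, because only one player remains. Summed over every possible ordering of the same players, the probabilities add up to 1.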

The behavioral layer

It gets weirder.

KTO (Kahneman-Tversky Optimization, ICML 2024) brings prospect theory into AI alignment: the idea that losses hurt roughly 2× more than equivalent gains feel good. The alignment objective weights desirable and undesirable outputs asymmetrically, through separate loss-aversion coefficients for each.

HLTV’s Rating 2.0 for CS2, reverse-engineered by the community, has a death penalty coefficient of -0.5329 and a kill reward of +0.3591. Deaths are punished at 1.48× the rate kills are rewarded. The system independently discovered loss aversion through empirical calibration against expert rankings, without ever citing Kahneman.

Behavioral economics in AI alignment. Behavioral economics in esports metrics. Neither team read the other’s paper.

The blind spot they share

Here’s what’s important: every one of these systems (Elo, DPO, Chatbot Arena, HLTV, OpenSkill) is fit to outcomes that already happened. The Bradley-Terry likelihood is computed over past comparisons. The Plackett-Luce product is a retrospective probability over observed rankings. The gradient always flows backward through history.

This is baked into the math. A rating $\beta_i$ converges to the value that best explains the outcomes you’ve already seen. That’s what maximum likelihood estimation does. It can’t do otherwise.
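
To see that concretely, here is a minimal sketch of fitting Bradley-Terry ratings by gradient ascent on the log-likelihood of a list of past results. The function name, learning rate, and step count are illustrative assumptions; production systems typically use more careful optimizers and regularization.

```python
import math
from collections import defaultdict

def fit_bradley_terry(games, lr=0.1, steps=2000):
    """games: list of (winner, loser) pairs. Returns {player: beta}."""
    players = sorted({p for game in games for p in game})
    beta = {p: 0.0 for p in players}
    for _ in range(steps):
        grad = defaultdict(float)
        for winner, loser in games:
            # Model's current probability of the result that actually happened.
            p_win = 1.0 / (1.0 + math.exp(beta[loser] - beta[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for p in players:
            beta[p] += lr * grad[p]
        # Ratings are only defined up to a shift, so pin the mean at zero.
        mean = sum(beta.values()) / len(beta)
        for p in players:
            beta[p] -= mean
    return beta

if __name__ == "__main__":
    history = [("A", "B")] * 7 + [("B", "A")] * 3
    ratings = fit_bradley_terry(history)
    print(ratings["A"] - ratings["B"], math.log(7 / 3))
```

With A beating B in 7 of 10 games, the fitted gap converges to $\log(7/3) \approx 0.847$, the value at which the model predicts exactly the 70% win rate already observed. Nothing in the fit refers to games that have not been played. Note that if a player has never lost, the maximum-likelihood rating is unbounded and this sketch simply keeps climbing.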

Which is why something Strauss Zelnick said recently landed hard. We got me saying this before GTA VI (iykyk ;).

Zelnick runs Take-Two, the company behind GTA, NBA 2K, and Civilization. One of the few entertainment CEOs who’s consistently delivered decade-over-decade. He was being pressed on whether AI can replace creative work in games, and his response was:

“Data sets by their very nature are backward-looking. Creativity by its very nature is forward-looking.”

Then he took it up a notch with something sharper:

“Asset creation is a necessary but insufficient condition for hit creation.”

“All hits are by their very nature unexpected. Things that are data-driven in their entirety can’t be unexpected.”

He’s describing exactly the mathematical constraint above. AI optimizes the Bradley-Terry likelihood over past human preferences. It gets extremely good at generating outputs that resemble what was preferred before. But a hit, in games, in music, in models, is something that surprises the distribution it was trained on. You cannot maximize a likelihood function toward the unexpected. The math works against you.

Zelnick put it practically: “If I told you, with this technology, you can create something that looks exactly like GTA, it won’t be GTA. Maybe a clone of GTA. Clones don’t sell.”

The LLM analog: a model that scores #1 on today’s Arena leaderboard is a very good clone of what humans preferred yesterday. Tomorrow’s breakthrough will surprise that leaderboard. Always.

The unsolved problem

The backward-looking measurement is solved. Bradley-Terry works. Has worked for 72 years. Will keep working.

The unsolved problem is forward-looking: identifying who, or what, will be exceptional before they are. Not by extrapolating past performance (“derivative properties don’t work,” as Zelnick puts it), but by detecting something the rating can’t capture.

Learning velocity. Adaptability when conditions change. Decision-making under novel pressure. The ability to be great at something that doesn’t exist yet.

In gaming: the player who will dominate a meta that hasn’t been discovered.
In AI: the architecture that will outperform on tasks we haven’t imagined.
In your career: the skill that will matter in the job that doesn’t have a title yet.

The 72-year-old equation ranks everyone who’s already played. It just can’t tell you who’s next.

That’s the interesting problem.
