2026-05-24 · 7 min read

Reinforcement Learning, PPO, and GAE, the version my brain actually understands

“The illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn, and relearn.”

Alvin Toffler

Most explanations of Reinforcement Learning start with symbols.

You see:

$A(s,a)$

$V(s)$

$\delta_t$

Then ten minutes later you are staring at:

$A_t^{GAE} = \delta_t + (\gamma\lambda)\delta_{t+1} + \dots$

and pretending you understand.

The problem is usually not the math.

The problem is that nobody explains why these things exist.

So here is the version that finally made sense to me.

No assumptions.

No “you should already know actor-critic.”

Just intuition.


Start with a dog

You have a dog.

You want the dog to sit.

You try:

  • dog sits → give treat
  • dog jumps → no treat

After enough repetitions:

dog learns:

sitting gets rewards

That is Reinforcement Learning.

At its core:

An agent tries things and learns from rewards.


Now replace dog with game AI.

Every moment:

  1. observe world
  2. choose action
  3. receive reward
  4. repeat

Example:

Mario sees:

  • coin
  • pit

Action:

Jump right

Reward:

+1

Simple.
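The observe, act, reward, repeat loop above can be sketched in a few lines. This is a toy under loud assumptions: `CoinWorld`, its state name, its action names, and the three-step episode length are all made up for illustration and are not a real game API.

```python
import random


class CoinWorld:
    """Hypothetical toy environment: a coin always sits to the agent's right."""

    def reset(self):
        self.steps_left = 3          # assumed episode length
        return "coin_ahead"          # the state: what the agent sees right now

    def step(self, action):
        # +1 for jumping right (grabbing the coin), 0 for anything else.
        reward = 1 if action == "jump_right" else 0
        self.steps_left -= 1
        done = self.steps_left == 0
        return "coin_ahead", reward, done


def run_episode(env, choose_action):
    """Observe, choose an action, receive a reward, repeat until the episode ends."""
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = choose_action(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


def always_jump(state):
    return "jump_right"


def random_agent(state):
    return random.choice(["jump_right", "move_left"])


print(run_episode(CoinWorld(), always_jump))
```

Swapping `always_jump` for `random_agent` gives a lower total on most runs, which is the whole signal the agent has to learn from.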


Some words before we continue:

Agent

Thing learning.

Examples:

  • dog
  • robot
  • Mario
  • trading bot

Environment

The world.

Examples:

  • game
  • stock market
  • simulator

State

Everything the agent sees right now.

Coin.

Pit.

Enemy.

That snapshot:

state.


Action

What agent chooses.

Examples:

  • jump
  • move left
  • buy
  • accelerate

Reward

Feedback from world.

Examples:

+100 win

−50 crash

+1 coin

Reward says:

good

or

bad


So far life is easy:

state

action

reward

repeat


First problem: rewards are noisy

Suppose Mario jumps right.

Run 1:

+1 coin

+100 later

Total:

101

Good.

Now another run:

+1 coin

enemy attack −20

+100 later

Total:

81

Run 3:

+1 coin

bonus +40

+100 later

Total:

141

Same action.

Different outcomes.

Lots of randomness.

This is called:

Variance.

Meaning:

results jump around.
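To put a number on "jump around", here are the three runs above added up and measured. The reward lists are copied from the runs; using population variance (rather than sample variance) is just my choice for the sketch.

```python
from statistics import mean, pvariance

# Rewards collected after the same "jump right" action in three runs.
runs = [
    [1, 100],        # run 1: coin, then +100 later
    [1, -20, 100],   # run 2: coin, enemy attack, then +100 later
    [1, 40, 100],    # run 3: coin, bonus, then +100 later
]

totals = [sum(run) for run in runs]
spread = max(totals) - min(totals)

print(totals)              # [101, 81, 141]
print(spread)              # 60
print(mean(totals))        # average outcome of the same action
print(pvariance(totals))   # how far the outcomes sit from that average
```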


High variance creates a problem.

Suppose the agent asks:

Was jumping right good?

Answer:

maybe

That is terrible feedback.


We need expectations

Suppose from this exact situation:

Mario usually gets:

100 points

Today:

Mario got:

105

Question:

Should action receive praise?

Not:

105

Only:

5

Because:

100 was expected anyway.

Action only improved things by 5.

That difference matters.

That idea becomes:

Advantage.

Advantage asks:

Was this action better than normal?


Positive:

good action

Negative:

bad action

Zero:

action was exactly as good as expected


Example:

Expected:

100

Actual:

120

Advantage:

20

Meaning:

action did 20 better than expected.
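As code, the idea is a single subtraction. The function name is mine; the numbers are the two examples above.

```python
def advantage(actual_return, expected_return):
    """How much better (or worse) things went than expected."""
    return actual_return - expected_return


print(advantage(120, 100))  # 20
print(advantage(105, 100))  # 5
print(advantage(90, 100))   # -10
```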


But where does “expected” come from?

We train another network.

Not policy.

Different job.

This network predicts:

how much reward usually comes from here?

This is the Value Network.

Mario sees:

  • coin
  • pit
  • enemy

Value network says:

probably around 100 reward from here

That prediction is:

$V(s)$

Ignore notation.

Just read:

value of situation.
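A real value network is a neural net, which is too much for a sketch. As a stand-in, here is a lookup table that keeps a running average of the returns seen from each situation. That swap is my simplification, not how PPO does it. The state name is hypothetical, and the fourth return (77) is invented so the average lands on the 100 used in this post.

```python
class ValueTable:
    """Tabular stand-in for a value network: one running average per state."""

    def __init__(self):
        self.total = {}
        self.count = {}

    def update(self, state, observed_return):
        # Record one more return that was actually seen from this state.
        self.total[state] = self.total.get(state, 0.0) + observed_return
        self.count[state] = self.count.get(state, 0) + 1

    def predict(self, state):
        # V(s): the average return seen from this state so far (0 if never seen).
        if state not in self.count:
            return 0.0
        return self.total[state] / self.count[state]


value = ValueTable()
for observed_return in [101, 81, 141, 77]:
    value.update("coin_pit_enemy", observed_return)

print(value.predict("coin_pit_enemy"))  # 100.0
```

A network does the same job, except it can also guess for situations it has never seen exactly.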


Policy and Value are different things

People mix these constantly.

Policy asks:

What should I do?

Value asks:

How good is current situation?

Policy chooses.

Value judges.

Different jobs.


Now another problem appears

Suppose value network predicted:

100

Reality happens.

Reward now:

2

Future still looks worth:

110

But future should matter slightly less.

Because future is uncertain.

Maybe Mario dies.

Maybe something changes.

Maybe reward never comes.

So Reinforcement Learning discounts future rewards.

That discount is:

$\gamma$

Gamma.


Gamma is basically a patience knob.

Low gamma:

impatient agent

High gamma:

patient agent


Examples:

$\gamma = 0$

Agent says:

I only care about now

Goldfish mode.


$\gamma = 1$

Agent says:

Present and future matter equally

Infinite patience.


$\gamma = 0.99$

Agent says:

future matters a lot

Most PPO implementations use something around this.
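Here is the patience knob applied to a made-up stream of four +1 rewards. Each step further into the future gets multiplied by gamma one more time. The reward list is illustrative only.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is scaled by gamma**k."""
    total = 0.0
    # Walk backwards so each earlier step discounts everything after it once more.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1, 1, 1, 1]  # assumed toy rewards

goldfish = discounted_return(rewards, 0.0)    # only the first reward counts
patient = discounted_return(rewards, 1.0)     # every reward counts fully
typical = discounted_return(rewards, 0.99)    # later rewards count slightly less

print(goldfish, patient, typical)
```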


Suppose:

Reward now:

2

Future estimate:

110

Gamma:

0.9

Future becomes:

99

because:

110 × .9

New estimate:

101

Old prediction:

100

Difference:

1

Reality was slightly better.

That surprise becomes:

TD Error.

Temporal Difference Error.

Sounds terrifying.

Actually simple.

TD error asks:

How surprised was I?

Big positive:

Things better than expected.

Big negative:

Things worse than expected.

Small:

Prediction good.


Formula:

$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$

Ignore symbols.

Read:

reward now

plus discounted future

minus old belief

or even simpler:

new evidence

minus old prediction
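The same formula as a function, fed the numbers from the worked example (reward 2, future estimate 110, old prediction 100, gamma 0.9). Parameter names are mine.

```python
def td_error(reward, next_value, value, gamma):
    """New evidence (reward now + discounted future) minus the old prediction."""
    return reward + gamma * next_value - value


delta = td_error(reward=2, next_value=110, value=100, gamma=0.9)
print(delta)  # about 1: reality was slightly better than predicted
```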


But now TD becomes too short-sighted

Suppose reward arrives late.

Step:

0

reward:

0

Step:

1

reward:

0

Step:

2

reward:

100

Before the value network has learned anything, one-step TD at steps 0 and 1 sees:

0

0

Only step 2 sees the 100.

Not ideal.

Huge reward is coming.

But immediate estimates miss it.

Too short-sighted.


So we try looking farther.

Maybe:

2 steps

Maybe:

5 steps

Maybe:

10 steps

Longer look:

more truthful

more noisy

Shorter look:

more stable

less accurate

Same tradeoff again.
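A sketch of "looking farther" on the late-reward example. It sums n real rewards, then fills in the rest with the value estimate. The assumptions are mine: value estimates start at zero (an untrained network), the episode ends after step 2, and gamma is 0.9.

```python
def n_step_target(rewards, values, t, n, gamma):
    """Sum n discounted real rewards from step t, then bootstrap from the value estimate."""
    horizon = min(t + n, len(rewards))
    target = 0.0
    for k in range(t, horizon):
        target += gamma ** (k - t) * rewards[k]
    # Only lean on the value estimate if the look-ahead stops before the episode ends.
    if horizon < len(rewards):
        target += gamma ** (horizon - t) * values[horizon]
    return target


rewards = [0, 0, 100]        # the late-reward example
values = [0.0, 0.0, 0.0]     # assumed: untrained value estimates
gamma = 0.9

targets = [n_step_target(rewards, values, t=0, n=n, gamma=gamma) for n in (1, 2, 3)]
print(targets)  # looking 1 or 2 steps ahead sees nothing; 3 steps finally sees the reward
```

With n covering the whole episode this is the Monte Carlo return, and with n = 1 it is the one-step TD target.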


Bias vs Variance

RL fights this battle constantly.

Variance:

randomness

Bias:

consistently wrong

Imagine darts.

High variance:

shots everywhere.

Low variance:

shots clustered.

Bias:

all clustered left.

Wrong in the same direction every time.


Monte Carlo:

wait until episode ends

Very truthful

Very noisy

High variance


TD:

estimate immediately

Stable

Less truthful

Higher bias


Both fail differently.


This is why GAE exists

Generalized Advantage Estimation.

Ignore terrifying name.

Idea is tiny.

Instead of choosing:

1 step

or

5 steps

or

10 steps

Use all of them.

Mix them.

Take:

current surprise

plus some future surprise

plus a little less future surprise

plus even less future surprise

Closer information matters more.

Far future matters less.


Weighting controlled by:

$\lambda$

Lambda.

Another knob.


Low lambda:

trust nearby information

Low variance

High bias


High lambda:

trust farther future

Lower bias

Higher variance


Most PPO implementations use:

$\lambda = 0.95$

Middle ground.
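To see what the knob does, here are the weights GAE puts on each future surprise. The surprise k steps ahead is scaled by (gamma × lambda) to the power k. The function name and the choice of four terms are mine.

```python
def gae_weights(gamma, lam, n_terms):
    """Weight on the surprise k steps ahead: (gamma * lam) ** k."""
    return [(gamma * lam) ** k for k in range(n_terms)]


typical = gae_weights(0.99, 0.95, 4)   # weights fade slowly
nearby_only = gae_weights(0.99, 0.0, 4)  # lambda = 0: only the current surprise counts

print(typical)
print(nearby_only)  # [1.0, 0.0, 0.0, 0.0]
```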


Full GAE:

$A_t^{GAE} = \delta_t + (\gamma\lambda)\delta_{t+1} + (\gamma\lambda)^2\delta_{t+2} + \dots$

Ignore symbols.

Read:

Current surprise

plus future surprise

plus smaller future surprise

plus even smaller future surprise

That is all.

Seriously.

The giant scary formula is just weighted future corrections.
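A minimal sketch of that sum in code. It walks the episode backwards, because the advantage at step t equals the surprise at t plus (gamma × lambda) times the advantage at t + 1, which unrolls to the same weighted sum. Assumptions: a single episode with no resets in the middle, `values` holding one extra entry for the state after the last step, and zero value estimates in the example.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one episode. values needs len(rewards) + 1 entries."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step surprise (TD error) at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Current surprise plus the discounted pile of later surprises.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [0, 0, 100]
values = [0.0, 0.0, 0.0, 0.0]  # assumed: untrained estimates, terminal value 0

short_sighted = compute_gae(rewards, values, gamma=0.9, lam=0.0)
far_sighted = compute_gae(rewards, values, gamma=0.9, lam=1.0)

print(short_sighted)  # only the last step gets credit
print(far_sighted)    # credit flows back to the earlier steps
```

With lambda at 0 this collapses to plain one-step TD errors, and with lambda at 1 it becomes the full discounted return minus the value estimate.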


Entire PPO mental model

Policy:

what action should I take?

Value:

how good is state?

TD:

was prediction wrong?

GAE:

combine future corrections

Advantage:

how much better than expected was action?

PPO:

make actions that consistently outperform expectations more likely
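This post stays at the intuition level and never spells out PPO's update rule, so treat this as a side sketch rather than part of the explanation above. In the standard PPO formulation, the policy is nudged using the advantage, but the nudge is clipped so one lucky sample cannot change the policy too much. `ratio` is the new probability of the action divided by the old one, and `clip_eps=0.2` is a commonly used default that I am assuming here.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate: the smaller of the raw and clipped terms."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved a lot toward it: the gain is capped.
capped_gain = ppo_clipped_objective(ratio=1.5, advantage=20)

# Bad action, policy already moved a lot away from it: no extra credit for moving further.
capped_penalty = ppo_clipped_objective(ratio=0.5, advantage=-10)

# Policy unchanged: the objective is just the advantage.
unchanged = ppo_clipped_objective(ratio=1.0, advantage=20)

print(capped_gain, capped_penalty, unchanged)
```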


Entire Reinforcement Learning note in one sentence:

An agent tries actions, estimates how good situations are, checks whether reality was better or worse than expected, combines those corrections across time, and slowly shifts toward actions that consistently do better than expected.

That’s the whole picture.


A note on how this page is built

This post is formatted to keep cognitive load low. A few things going on:

  • Bionic Reading. The first part of each word is bolded so your eyes can fixate on the word shape and skim less consciously.
  • One idea per line. Short blocks mean you never hold five clauses in working memory at once.
  • Concrete before abstract. The dog and Mario come before $V(s)$ and $\delta_t$, so each symbol attaches to something you already pictured.
  • Why before notation. Every concept earns its symbol by first explaining why it needs to exist.
  • Real math, rendered. The equations are typeset with KaTeX instead of raw LaTeX, so the scary formula is the same idea you just read in words.