2026-05-24 · 7 min read
Reinforcement Learning, PPO, and GAE, the version my brain actually understands
“The illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn, and relearn.”
— Alvin Toffler
Most explanations of Reinforcement Learning start with symbols.
You see:
$$\pi_\theta(a \mid s)$$
Then ten minutes later you are staring at:
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$
and pretending you understand.
The problem is usually not the math.
The problem is that nobody explains why these things exist.
So here is the version that finally made sense to me.
No assumptions.
No “you should already know actor-critic.”
Just intuition.
Start with a dog
You have a dog.
You want the dog to sit.
You try:
- dog sits → give treat
- dog jumps → no treat
After enough repetitions:
dog learns:
sitting gets rewards
That is Reinforcement Learning.
At its core:
An agent tries things and learns from rewards.
Now replace dog with game AI.
Every moment:
- observe world
- choose action
- receive reward
- repeat
Example:
Mario sees:
- coin
- pit
Action:
Jump right
Reward:
+1
Simple.
Some words before we continue:
Agent
Thing learning.
Examples:
- dog
- robot
- Mario
- trading bot
Environment
The world.
Examples:
- game
- stock market
- simulator
State
Everything the agent sees right now.
Coin.
Pit.
Enemy.
That snapshot:
state.
Action
What agent chooses.
Examples:
- jump
- move left
- buy
- accelerate
Reward
Feedback from world.
Examples:
+100 win
−50 crash
+1 coin
Reward says:
good
or
bad
So far life is easy:
state
↓
action
↓
reward
↓
repeat
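The loop above can be sketched in a few lines of Python. This is a toy, not a real game: `toy_env_step`, the two action names, and the reward of +1 per coin are all made up for illustration.

```python
import random

ACTIONS = ["jump_right", "move_left"]  # hypothetical action names

def toy_env_step(position, action):
    """Hypothetical environment: jumping right collects a coin (+1), moving left earns nothing."""
    if action == "jump_right":
        return position + 1, 1.0
    return max(position - 1, 0), 0.0

def run_episode(policy, num_steps=5):
    """Observe state, choose action, receive reward, repeat."""
    position, total_reward = 0, 0.0
    for _ in range(num_steps):
        action = policy(position)                          # choose action from current state
        position, reward = toy_env_step(position, action)  # world responds with new state and reward
        total_reward += reward
    return total_reward

def random_policy(state):
    return random.choice(ACTIONS)

def always_right(state):
    return "jump_right"

print(run_episode(always_right))
print(run_episode(random_policy))
```

An agent that has learned something should score closer to `always_right` than to `random_policy`.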
First problem: rewards are noisy
Suppose Mario jumps right.
Run 1:
+1 coin
+100 later
Total:
101
Good.
Now another run:
+1 coin
enemy attack −20
+100 later
Total:
81
Run 3:
+1 coin
bonus +40
+100 later
Total:
141
Same action.
Different outcomes.
Lots of randomness.
This is called:
Variance.
Meaning:
results jump around.
High variance creates a problem.
Suppose the agent asks:
Was jumping right good?
Answer:
maybe
That is terrible feedback.
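To put a number on "results jump around", here are the three runs from above fed through Python's `statistics` module. The variable names are mine; the rewards are the ones listed in the runs.

```python
from statistics import mean, pstdev

# The three runs above: same action, different outcomes.
run_totals = [
    1 + 100,       # run 1: coin, then +100 later
    1 - 20 + 100,  # run 2: coin, enemy attack, then +100 later
    1 + 40 + 100,  # run 3: coin, bonus, then +100 later
]

average_return = mean(run_totals)  # what "usually" happens
spread = pstdev(run_totals)        # how far runs stray from the average

print(run_totals, round(average_return, 1), round(spread, 1))
```

The average sits near 108, but individual runs land roughly 25 points away from it. That spread is the variance problem.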
We need expectations
Suppose from this exact situation:
Mario usually gets:
100 points
Today:
Mario got:
105
Question:
How much praise should the action receive?
Not:
105
Only:
5
Because:
100 was expected anyway.
Action only improved things by 5.
That difference matters.
That idea becomes:
Advantage.
Advantage asks:
Was this action better than normal?
Positive:
good action
Negative:
bad action
Zero:
action changed nothing
Example:
Expected:
100
Actual:
120
Advantage:
20
Meaning:
action did 20 better than expected.
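In code, the idea is a single subtraction. A minimal sketch using the two examples above; the function name is just for illustration.

```python
def advantage(actual_return, expected_return):
    """How much better (positive) or worse (negative) than normal did things go?"""
    return actual_return - expected_return

print(advantage(105, 100))  # the first example: expected 100, got 105
print(advantage(120, 100))  # the second example: expected 100, got 120
print(advantage(90, 100))   # a made-up bad outcome gives a negative advantage
```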
But where does “expected” come from?
We train another network.
Not policy.
Different job.
This network predicts:
how much reward usually comes from here?
This is the Value Network.
Mario sees:
- coin
- pit
- enemy
Value network says:
probably around 100 reward from here
That prediction is:
$$V(s)$$
Ignore notation.
Just read:
value of situation.
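A real value network is a neural network. To show only the job it does, here is a stand-in that swaps the network for a lookup table: it remembers the average return seen from each situation. The class name, the state label, and the three returns are assumptions for illustration.

```python
from collections import defaultdict

class TabularValueEstimate:
    """Stand-in for a value network: average observed return per state."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, state, observed_return):
        self.total[state] += observed_return
        self.count[state] += 1

    def predict(self, state):
        if self.count[state] == 0:
            return 0.0  # never seen this situation: no opinion yet
        return self.total[state] / self.count[state]

value = TabularValueEstimate()
for observed_return in [95.0, 100.0, 105.0]:  # made-up returns from the same situation
    value.update("coin_pit_enemy", observed_return)

print(value.predict("coin_pit_enemy"))  # probably around 100 reward from here
```

A network does the same thing, except it can generalize to situations it has never seen exactly.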
Policy and Value are different things
People mix these constantly.
Policy asks:
What should I do?
Value asks:
How good is current situation?
Policy chooses.
Value judges.
Different jobs.
Now another problem appears
Suppose value network predicted:
100
Reality happens.
Reward now:
2
Future still looks worth:
110
But future should matter slightly less.
Because future is uncertain.
Maybe Mario dies.
Maybe something changes.
Maybe reward never comes.
So Reinforcement Learning discounts future rewards.
That discount is:
Gamma.
Gamma is basically a patience knob.
Low gamma:
impatient agent
High gamma:
patient agent
Examples:
$\gamma = 0$
Agent says:
I only care about now
Goldfish mode.
$\gamma = 1$
Agent says:
Present and future matter equally
Infinite patience.
$\gamma = 0.99$
Agent says:
future matters a lot
Most PPO implementations use something around this.
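To see the patience knob turn, here is a small sketch that adds up a reward sequence with different gammas. The reward list is made up, and the three gamma values are the goldfish setting (0), infinite patience (1), and a value close to 1.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, shrinking each one by gamma for every step of delay."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

rewards = [1.0, 0.0, 100.0]  # hypothetical: small coin now, big payoff two steps later

for gamma in (0.0, 1.0, 0.99):
    print(gamma, discounted_return(rewards, gamma))
```

With gamma at 0 the agent sees only the coin. With gamma at 1 the late payoff counts in full.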
Suppose:
Reward now:
2
Future estimate:
110
Gamma:
0.9
Future becomes:
99
because:
110 × 0.9
New estimate:
101
because:
2 + 99
Old prediction:
100
Difference:
1
Reality was slightly better.
That surprise becomes:
TD Error.
Temporal Difference Error.
Sounds terrifying.
Actually simple.
TD error asks:
How surprised was I?
Big positive:
Things better than expected.
Big negative:
Things worse than expected.
Small:
Prediction good.
Formula:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
Ignore symbols.
Read:
reward now
plus discounted future
minus old belief
or even simpler:
new evidence
minus old prediction
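The same arithmetic as a function, using the numbers from the example above (reward 2, future estimate 110, gamma 0.9, old prediction 100). The function and argument names are mine.

```python
def td_error(reward, next_value, current_value, gamma):
    """New evidence (reward now + discounted future) minus old prediction."""
    new_estimate = reward + gamma * next_value
    return new_estimate - current_value

surprise = td_error(reward=2.0, next_value=110.0, current_value=100.0, gamma=0.9)
print(surprise)  # about 1: reality was slightly better than predicted
```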
But now TD becomes too short-sighted
Suppose reward arrives late.
Step:
0
reward:
0
Step:
1
reward:
0
Step:
2
reward:
100
Before the value network has learned anything, one-step TD at steps 0 and 1 sees:
0
0
Not ideal.
Huge reward is coming.
But immediate estimates miss it.
Too short-sighted.
So we try looking farther.
Maybe:
2 steps
Maybe:
5 steps
Maybe:
10 steps
Longer look:
more truthful
more noisy
Shorter look:
more stable
less accurate
Same tradeoff again.
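Here is the late-reward example with different look-ahead lengths. It is a sketch under two assumptions: the value network is untrained (it predicts 0 everywhere), and gamma is 1 to keep the numbers readable. `n_step_return` is a name I made up.

```python
def n_step_return(rewards, values, start, n, gamma):
    """Add up n real rewards from `start`, then lean on the value estimate for the rest."""
    end = min(start + n, len(rewards))
    total = 0.0
    for k in range(start, end):
        total += gamma ** (k - start) * rewards[k]
    if end < len(rewards):  # episode not over yet: bootstrap from the value estimate
        total += gamma ** (end - start) * values[end]
    return total

rewards = [0.0, 0.0, 100.0]  # the late reward from the example above
values = [0.0, 0.0, 0.0]     # assumption: untrained value network, knows nothing yet

for n in (1, 2, 3):
    print(n, n_step_return(rewards, values, start=0, n=n, gamma=1.0))
```

Only the 3-step look reaches the reward. The shorter looks depend entirely on a value estimate that has not learned anything yet.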
Bias vs Variance
RL fights this battle constantly.
Variance:
randomness
Bias:
consistently wrong
Imagine darts.
High variance:
shots everywhere.
Low variance:
shots clustered.
Bias:
all clustered left.
Wrong in the same direction every time.
Monte Carlo:
wait until episode ends
Very truthful
Very noisy
High variance
TD:
estimate immediately
Stable
Less truthful
Higher bias
Both fail differently.
This is why GAE exists
Generalized Advantage Estimation.
Ignore terrifying name.
Idea is tiny.
Instead of choosing:
1 step
or
5 steps
or
10 steps
Use all of them.
Mix them.
Take:
current surprise
plus some future surprise
plus a little less future surprise
plus even less future surprise
Closer information matters more.
Far future matters less.
Weighting controlled by:
Lambda.
Another knob.
Low lambda:
trust nearby information
Low variance
High bias
High lambda:
trust farther future
Lower bias
Higher variance
Most PPO implementations use:
$$\lambda \approx 0.95$$
Middle ground.
Full GAE:
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$
Ignore symbols.
Read:
Current surprise
plus future surprise
plus smaller future surprise
plus even smaller future surprise
That is all.
Seriously.
The giant scary formula is just weighted future corrections.
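The weighted sum is usually computed backwards through the episode, so each step reuses the total from the step after it. A minimal sketch, not a reference implementation: `compute_gae` is my name, the defaults of 0.99 and 0.95 are commonly used values rather than anything required, and `values` is assumed to carry one extra entry for the state after the last step (0 if the episode ended).

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Advantage per step: current surprise plus decaying future surprises."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error at step t
        running = delta + gamma * lam * running                 # add shrunken future surprise
        advantages[t] = running
    return advantages

rewards = [0.0, 0.0, 100.0]
values = [0.0, 0.0, 0.0, 0.0]  # assumption: untrained value network; last entry is the terminal state

print(compute_gae(rewards, values, gamma=1.0, lam=0.0))  # one-step TD only
print(compute_gae(rewards, values, gamma=1.0, lam=1.0))  # full look-ahead
print(compute_gae(rewards, values, gamma=1.0, lam=0.5))  # something in between
```

With lambda at 0 only the final step gets credit. With lambda at 1 every step gets full credit. In between, credit fades the farther a step is from the reward.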
Entire PPO mental model
Policy:
what action should I take?
↓
Value:
how good is state?
↓
TD:
was prediction wrong?
↓
GAE:
combine future corrections
↓
Advantage:
how much better than expected was action?
↓
PPO:
repeat actions that repeatedly outperform expectations
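For the last box, one detail goes a step beyond the intuition here: PPO does not push toward good actions without limit. Its clipped objective caps how far one update can move the policy. A minimal sketch for a single sample; the function name is mine and the clip range of 0.2 is a commonly used value, assumed here.

```python
def ppo_clipped_objective(new_prob, old_prob, advantage, clip_range=0.2):
    """PPO's clipped surrogate for one (state, action) sample. Higher is better."""
    ratio = new_prob / old_prob  # how much more likely the action is under the new policy
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the two, so large policy jumps earn no extra credit.
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action (positive advantage): credit stops growing once the ratio passes 1.2.
print(ppo_clipped_objective(new_prob=0.6, old_prob=0.4, advantage=1.0))
# Bad action (negative advantage): the penalty is floored once the ratio drops below 0.8.
print(ppo_clipped_objective(new_prob=0.2, old_prob=0.4, advantage=-1.0))
```

Training nudges the policy to raise this number across many samples, which is the "slowly shifts" part.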
Entire Reinforcement Learning note in one sentence:
An agent tries actions, estimates how good situations are, checks whether reality was better or worse than expected, combines those corrections across time, and slowly shifts toward actions that consistently do better than expected.
That’s the whole picture.
A note on how this page is built
This post is formatted to keep cognitive load low. A few things going on:
- Bionic Reading. The first part of each word is bolded so your eyes can fixate on the word shape and skim less consciously.
- One idea per line. Short blocks mean you never hold five clauses in working memory at once.
- Concrete before abstract. The dog and Mario come before any notation, so each symbol attaches to something you already pictured.
- Why before notation. Every concept earns its symbol by first explaining why it needs to exist.
- Real math, rendered. The equations are typeset with KaTeX instead of raw LaTeX, so the scary formula is the same idea you just read in words.