Onkar Dahale

“The illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn, and relearn.”

— Alvin Toffler

Most explanations of Reinforcement Learning start with symbols.

You see:

$A(s,a)$

$V(s)$

$\delta_t$

Then ten minutes later you are staring at:

$A_t^{GAE} = \delta_t + (\gamma\lambda)\delta_{t+1} + \dots$

and pretending you understand.

The problem is usually not the math.

The problem is that nobody explains why these things exist.

So here is the version that finally made sense to me.

No assumptions.

No “you should already know actor-critic.”

Just intuition.

Start with a dog

You have a dog.

You want the dog to sit.

You try:

dog sits → give treat
dog jumps → no treat

After enough repetitions:

dog learns:

sitting gets rewards

That is Reinforcement Learning.

At its core:

An agent tries things and learns from rewards.

Now replace dog with game AI.

Every moment:

observe world
choose action
receive reward
repeat

Example:

Mario sees:

coin
pit

Action:

Jump right

Reward:

+1 for the coin

Simple.

Some words before we continue:

Agent

Thing learning.

Examples:

dog
robot
Mario
trading bot

Environment

The world.

Examples:

game
stock market
simulator

State

Everything the agent sees right now.

Coin.

Pit.

Enemy.

That snapshot:

state.

Action

What agent chooses.

Examples:

jump
move left
buy
accelerate

Reward

Feedback from world.

Examples:

+100 win

−50 crash

+1 coin

Reward says:

good

bad

So far life is easy:

state

↓

action

↓

reward

↓

repeat
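
If it helps to see that loop as code, here is a minimal sketch. Everything in it is made up for illustration: `CoinWorld`, its 50% coin chance, and the hard-coded `policy` rule are assumptions, not part of any real game or library.

```python
import random


class CoinWorld:
    """Hypothetical toy environment: each step may or may not show a coin."""

    def __init__(self, seed=0, length=5):
        self.rng = random.Random(seed)
        self.steps_left = length
        self.coin_ahead = False

    def observe(self):
        # State: what the agent sees right now.
        self.coin_ahead = self.rng.random() < 0.5
        return "coin" if self.coin_ahead else "empty"

    def step(self, action):
        # Reward: +1 for jumping at a coin, otherwise 0.
        self.steps_left -= 1
        reward = 1 if (action == "jump" and self.coin_ahead) else 0
        done = self.steps_left == 0
        return reward, done


def policy(state):
    # A fixed rule standing in for a learned policy.
    return "jump" if state == "coin" else "walk"


def run_episode(env):
    total, done = 0, False
    while not done:
        state = env.observe()            # observe world
        action = policy(state)           # choose action
        reward, done = env.step(action)  # receive reward
        total += reward                  # repeat
    return total


episode_total = run_episode(CoinWorld(seed=0))
```

Nothing is learned here yet. The point is only the shape of the loop: state in, action out, reward back.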

First problem: rewards are noisy

Suppose Mario jumps right.

Run 1:

+1 coin

+100 later

Total:

101

Good.

Now another run:

+1 coin

enemy attack −20

+100 later

Total:

81

Run 3:

+1 coin

bonus +40

+100 later

Total:

141

Same action.

Different outcomes.

Lots of randomness.

This is called:

Variance.

Meaning:

results jump around.

High variance creates a problem.

Suppose the agent asks:

Was jumping right good?

Answer:

maybe

That is terrible feedback.
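
To put a number on "results jump around", here is a small sketch that adds up the three runs above and measures their spread. It uses only the rewards listed in the runs; the variable names are my own.

```python
from statistics import mean, pstdev

# Rewards collected after the same action in each of the three runs.
runs = {
    "run 1": [1, 100],
    "run 2": [1, -20, 100],
    "run 3": [1, 40, 100],
}

totals = [sum(rewards) for rewards in runs.values()]  # [101, 81, 141]
average = mean(totals)   # what the action is worth on average
spread = pstdev(totals)  # how far single runs stray from that average
```

The average comes out near 108, but a single run can land about 25 points away from it. One run alone tells the agent very little.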

We need expectations

Suppose from this exact situation:

Mario usually gets:

100 points

Today:

Mario got:

105

Question:

Should action receive praise?

Not:

105

Only:

5

Because:

100 was expected anyway.

Action only improved things by 5.

That difference matters.

That idea becomes:

Advantage.

Advantage asks:

Was this action better than normal?

Positive:

good action

Negative:

bad action

Zero:

action changed nothing

Example:

Expected:

100

Actual:

120

Advantage:

20

Meaning:

action did 20 better than expected.
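
In code, the idea is a single subtraction. This is a sketch of the intuition only; the function name is my own, and the third example (a return of 90) is a hypothetical case added to show a negative result.

```python
def advantage(actual_return, expected_return):
    """How much better (or worse) than normal did things go?"""
    return actual_return - expected_return


praise_105 = advantage(105, 100)  # 5: slightly better than usual
praise_120 = advantage(120, 100)  # 20: clearly better than usual
praise_90 = advantage(90, 100)    # -10: hypothetical run that went worse
```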

But where does “expected” come from?

We train another network.

Not policy.

Different job.

This network predicts:

how much reward usually comes from here?

This is the Value Network.

Mario sees:

coin
pit
enemy

Value network says:

probably around 100 reward from here

That prediction is:

$V(s)$

Ignore notation.

Just read:

value of situation.
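
A real value network is a neural network, but the job it does can be sketched with a plain lookup table. The version below is an assumption-heavy stand-in: `TabularValue`, the `step_size` of 0.1, and the tuple used as a state are all hypothetical choices for illustration.

```python
from collections import defaultdict


class TabularValue:
    """Toy stand-in for a value network: one running estimate per state."""

    def __init__(self, step_size=0.1):
        self.step_size = step_size
        self.table = defaultdict(float)  # unseen states start at 0

    def predict(self, state):
        return self.table[state]

    def update(self, state, observed_return):
        # Move the estimate a small step toward what actually happened.
        error = observed_return - self.table[state]
        self.table[state] += self.step_size * error


value = TabularValue()
state = ("coin", "pit", "enemy")

# Replay the three noisy totals from earlier many times.
for _ in range(200):
    for observed in (101, 81, 141):
        value.update(state, observed)

estimate = value.predict(state)
```

After enough updates the estimate settles near the average of the noisy totals, which is exactly the "usually" that advantage needs.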

Policy and Value are different things

People mix these constantly.

Policy asks:

What should I do?

Value asks:

How good is current situation?

Policy chooses.

Value judges.

Different jobs.

Now another problem appears

Suppose value network predicted:

100

Reality happens.

Reward now:

2

Future still looks worth:

110

But future should matter slightly less.

Because future is uncertain.

Maybe Mario dies.

Maybe something changes.

Maybe reward never comes.

So Reinforcement Learning discounts future rewards.

That discount is:

$\gamma$

Gamma.

Gamma is basically a patience knob.

Low gamma:

impatient agent

High gamma:

patient agent

Examples:

$\gamma=0$

Agent says:

I only care about now

Goldfish mode.

$\gamma=1$

Agent says:

Present and future matter equally

Infinite patience.

$\gamma=0.99$

Agent says:

future matters a lot

Most PPO implementations use something around this.
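
Here is a small sketch of what the patience knob does to one reward sequence. The sequence `[0, 0, 100]` is a hypothetical example (nothing now, a big reward two steps later), and `discounted_return` is my own helper name.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, shrinking each one by gamma per step of delay."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


late_reward = [0, 0, 100]  # hypothetical: the payoff arrives two steps later

by_gamma = {
    gamma: discounted_return(late_reward, gamma)
    for gamma in (0.0, 0.9, 0.99, 1.0)
}
```

With gamma at 0 the late reward is worth nothing to the agent. With gamma at 1 it is worth the full 100. At 0.9 it is worth about 81, and at 0.99 about 98.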

Suppose:

Reward now:

2

Future estimate:

110

Gamma:

0.9

Future becomes:

99

because:

110 × 0.9

New estimate:

101

Old prediction:

100

Difference:

1

Reality was slightly better.

That surprise becomes:

TD Error.

Temporal Difference Error.

Sounds terrifying.

Actually simple.

TD error asks:

How surprised was I?

Big positive:

Things better than expected.

Big negative:

Things worse than expected.

Small:

Prediction good.

Formula:

$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$

Ignore symbols.

Read:

reward now

plus discounted future

minus old belief

or even simpler:

new evidence

minus old prediction
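
The formula is short enough to write as one line of code. The sketch below replays the worked example: old prediction 100, future estimate 110, gamma 0.9. The immediate reward of 2 is inferred from the new estimate of 101 (101 minus the discounted future of 99), and the function name is my own.

```python
def td_error(reward, next_value, current_value, gamma):
    """New evidence (reward now + discounted future) minus old prediction."""
    return reward + gamma * next_value - current_value


surprise = td_error(reward=2, next_value=110, current_value=100, gamma=0.9)
# 2 + 0.9 * 110 - 100, which is about 1: reality was slightly better.
```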

But now TD becomes too short-sighted

Suppose reward arrives late.

Step 1 reward:

0

Step 2 reward:

0

Step 3 reward:

100

One-step TD sees:

0

Not ideal.

Huge reward is coming.

But immediate estimates miss it.

Too short-sighted.

So we try looking farther.

Maybe:

2 steps

Maybe:

5 steps

Maybe:

10 steps

Longer look:

more truthful

more noisy

Shorter look:

more stable

less accurate

Same tradeoff again.
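
"Looking farther" has a concrete form: add up a few real rewards, then let the value estimate fill in the rest. The sketch below assumes a hypothetical three-step episode with a late reward of 100 and a value function that still predicts 0 everywhere; `n_step_return` is my own helper name.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Use n real rewards from step t, then trust the value estimate.

    values must have one more entry than rewards: the last entry is the
    value of the state after the final step (0 if the episode ended).
    """
    horizon = min(t + n, len(rewards))
    total = 0.0
    for k in range(t, horizon):
        total += gamma ** (k - t) * rewards[k]
    total += gamma ** (horizon - t) * values[horizon]
    return total


rewards = [0, 0, 100]  # hypothetical: the big reward arrives at the end
values = [0, 0, 0, 0]  # hypothetical: an untrained value estimate

one_step = n_step_return(rewards, values, t=0, n=1, gamma=0.9)
three_step = n_step_return(rewards, values, t=0, n=3, gamma=0.9)
```

The one-step look sees 0. The three-step look sees about 81, because it actually reaches the reward. It also depends on three steps of luck instead of one, which is where the extra noise comes from.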

Bias vs Variance

RL fights this battle constantly.

Variance:

randomness

Bias:

consistently wrong

Imagine darts.

High variance:

shots everywhere.

Low variance:

shots clustered.

Bias:

all clustered left.

Wrong in the same direction every time.

Monte Carlo:

wait until episode ends

Very truthful

Very noisy

High variance

TD:

estimate immediately

Stable

Less truthful

Higher bias

Both fail differently.

This is why GAE exists

Generalized Advantage Estimation.

Ignore terrifying name.

Idea is tiny.

Instead of choosing:

1 step

5 steps

10 steps

Use all of them.

Mix them.

Take:

current surprise

plus some future surprise

plus a little less future surprise

plus even less future surprise

Closer information matters more.

Far future matters less.

Weighting controlled by:

$\lambda$

Lambda.

Another knob.

Low lambda:

trust nearby information

Low variance

High bias

High lambda:

trust farther future

Lower bias

Higher variance

Most PPO implementations use:

$\lambda=0.95$

Middle ground.

Full GAE:

$A_t^{GAE} = \delta_t + (\gamma\lambda)\delta_{t+1} + (\gamma\lambda)^2\delta_{t+2} + \dots$

Ignore symbols.

Read:

Current surprise

plus future surprise

plus smaller future surprise

plus even smaller future surprise

That is all.

Seriously.

The giant scary formula is just weighted future corrections.
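
Here is a minimal sketch of that sum in code, assuming one finished episode and a list of value predictions. The function name and the example numbers are my own; real PPO code usually does the same thing on arrays and also handles episodes that are cut off mid-way.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Weighted sum of current and future TD errors for every step.

    values must have one more entry than rewards: the last entry is the
    value of the state after the final step (0 if the episode ended).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the sum already built for
    # the step after it.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [0, 0, 100]  # hypothetical late reward
values = [0, 0, 0, 0]  # hypothetical untrained value estimate

short_sighted = gae_advantages(rewards, values, gamma=0.9, lam=0.0)
far_sighted = gae_advantages(rewards, values, gamma=0.9, lam=1.0)
```

With lambda at 0 only the last step gets credit, which is one-step TD again. With lambda at 1 the credit flows all the way back to the first step, which is the wait-until-the-end view. Anything in between is the mix.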

Entire PPO mental model

Policy:

what action should I take?

↓

Value:

how good is state?

↓

TD:

was prediction wrong?

↓

GAE:

combine future corrections

↓

Advantage:

how much better than expected was action?

↓

PPO:

repeat actions that repeatedly outperform expectations
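
The last arrow can be sketched too. To be clear about what this is: a plain policy-gradient style nudge on a two-action softmax, not PPO's actual update rule. The action names, the `step_size`, and both helper functions are assumptions for illustration.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities that sum to 1."""
    peak = max(prefs.values())
    exps = {a: math.exp(p - peak) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def nudge(prefs, action, advantage, step_size=0.1):
    """Raise the taken action's preference if advantage is positive,
    lower it if negative. Bigger advantage, bigger nudge."""
    probs = softmax(prefs)
    updated = dict(prefs)
    for a in prefs:
        direction = (1.0 if a == action else 0.0) - probs[a]
        updated[a] += step_size * advantage * direction
    return updated


start = {"jump": 0.0, "walk": 0.0}  # no opinion yet: 50/50

after_good = softmax(nudge(start, "jump", advantage=20))
after_bad = softmax(nudge(start, "jump", advantage=-20))
```

After a jump that beat expectations by 20, jumping becomes more likely. After a jump that fell short by 20, it becomes less likely. Repeat that over many episodes and the policy drifts toward whatever keeps outperforming.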

Entire Reinforcement Learning note in one sentence:

An agent tries actions, estimates how good situations are, checks whether reality was better or worse than expected, combines those corrections across time, and slowly shifts toward actions that consistently do better than expected.

That’s the whole picture.

A note on how this page is built

This post is formatted to keep cognitive load low. A few things going on:

Bionic Reading. The first part of each word is bolded so your eyes can fixate on the word shape and skim less consciously.
One idea per line. Short blocks mean you never hold five clauses in working memory at once.
Concrete before abstract. The dog and Mario come before $V(s)$ and $\delta_t$, so each symbol attaches to something you already pictured.
Why before notation. Every concept earns its symbol by first explaining why it needs to exist.
Real math, rendered. The equations are typeset with KaTeX instead of raw LaTeX, so the scary formula is the same idea you just read in words.