Policy Gradient
1. Definitions
- Given an actor \(\pi_\theta(s)\) with network parameters \(\theta\)
- Use the actor to play the game and collect rewards
- Because of randomness in the environment (and in the actor's sampling of actions), the total reward differs from episode to episode, even with the same actor
- \(R_\theta\): the total reward of one episode, a random variable (written out below)
- \(\overline{R_\theta}\): the expected value of the total reward; it evaluates how good the actor is
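Concretely, playing one episode with the actor produces a trajectory, and the total reward is the sum of the rewards collected along it:

\[
\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T\}, \qquad R(\tau) = \sum_{t=1}^{T} r_t
\]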
2. Formulation of Average Reward
We use sampling to estimate average reward:
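Summing over all possible trajectories, weighted by how likely each one is under \(\pi_\theta\), and approximating that sum with \(N\) sampled trajectories \(\tau^1, \dots, \tau^N\) (i.e. playing the game \(N\) times):

\[
\overline{R_\theta} = \sum_\tau R(\tau)\, P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)
\]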
3. Optimization: Gradient Ascent
We need to optimize \(\theta\) to maximize the expected reward, using gradient ascent:

\[
\theta \leftarrow \theta + \eta\, \nabla \overline{R_\theta}
\]

where \(\eta\) is the learning rate.
In fact, the gradient of \(\overline{R_\theta}\) with respect to \(\theta\) only acts on \(P(\tau \mid \theta)\), because the reward \(R(\tau)\) is given by the environment and does not depend on \(\theta\). Using \(\nabla P = P\, \nabla \log P\):

\[
\nabla \overline{R_\theta} = \sum_\tau R(\tau)\, \nabla P(\tau \mid \theta) = \sum_\tau R(\tau)\, P(\tau \mid \theta)\, \nabla \log P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log P(\tau^n \mid \theta)
\]
In the trajectory, each step is a transition \((s_t, a_t) \rightarrow (r_t, s_{t+1})\), so the trajectory probability factorizes into policy terms and environment terms:

\[
P(\tau \mid \theta) = p(s_1) \prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(r_t, s_{t+1} \mid s_t, a_t)
\]
Finally, because only the policy terms depend on \(\theta\),

\[
\nabla \log P(\tau \mid \theta) = \sum_{t=1}^{T} \nabla \log p_\theta(a_t \mid s_t),
\]

so

\[
\nabla \overline{R_\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)
\]
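One common way to implement this (an implementation detail, not spelled out in the derivation above) is to define the surrogate objective

\[
J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \log p_\theta(a_t^n \mid s_t^n)
\]

and let automatic differentiation compute \(\nabla J(\theta)\); the sketch in the summary below writes its loss in this form.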
4. Summary
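In each iteration: use the current actor \(\pi_\theta\) to sample \(N\) trajectories, estimate \(\nabla \overline{R_\theta}\) with the formula above, and update \(\theta\) by gradient ascent; then repeat with the updated actor (the old trajectories must be thrown away, because they were sampled from the old policy).

Below is a minimal sketch of this loop (vanilla REINFORCE). The `gymnasium` CartPole environment, the small PyTorch network, and names such as `policy` and `n_episodes` are illustrative choices, not from the notes; every action is weighted by the raw \(R(\tau^n)\), without the variance-reduction tips below.

```python
# Minimal REINFORCE sketch; assumes the `gymnasium` and `torch` packages.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
n_episodes = 5          # N trajectories sampled per update
learning_rate = 1e-3    # eta

# The actor pi_theta: maps a state to a distribution over actions.
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=learning_rate)

for iteration in range(200):
    log_probs, weights = [], []
    # 1) Use the current actor to play the game N times.
    for _ in range(n_episodes):
        obs, _ = env.reset()
        episode_log_probs, total_reward, done = [], 0.0, False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            episode_log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            total_reward += float(reward)
            done = terminated or truncated
        # Every action in trajectory n is weighted by the episode reward R(tau^n).
        log_probs.extend(episode_log_probs)
        weights.extend([total_reward] * len(episode_log_probs))

    # 2) Gradient ascent on (1/N) sum_n sum_t R(tau^n) log p_theta(a_t^n | s_t^n),
    #    written as gradient descent on its negative.
    loss = -(torch.stack(log_probs) * torch.tensor(weights)).sum() / n_episodes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because every action in a trajectory shares the same weight \(R(\tau^n)\), this estimator is noisy; the tips below reduce its variance.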
5. Tips
5.1 Baseline
Use the average reward as a baseline \(b\): subtracting it lets the weight on an action be negative as well as positive, so actions that do worse than average have their probability pushed down (with all-positive rewards, every sampled action would otherwise be pushed up).
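With a baseline, the gradient estimate becomes

\[
\nabla \overline{R_\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \bigl(R(\tau^n) - b\bigr)\, \nabla \log p_\theta(a_t^n \mid s_t^n), \qquad b \approx \mathbb{E}\bigl[R(\tau)\bigr]
\]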
5.2 Assign Suitable Credit
It is not fair to assign the same credit \(R(\tau^n)\) to every action in a trajectory: an action should not be credited for rewards collected before it was taken. Two refinements (combined in the sketch after this list):

- Suffix sum of rewards: weight the action at step \(t\) only by the rewards obtained from step \(t\) onward
- Discount factor: multiply a reward obtained \(k\) steps later by \(\gamma^{k}\) with \(\gamma < 1\), so distant rewards contribute less
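Putting both refinements together, the weight on the action taken at step \(t\) of trajectory \(n\) becomes

\[
\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - b
\]

A small sketch of the discounted suffix sum (the return-to-go); the function name and the value of \(\gamma\) are illustrative:

```python
def returns_to_go(rewards, gamma=0.99):
    """Discounted suffix sums: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    out, running = [], 0.0
    for r in reversed(rewards):          # accumulate from the last step backwards
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

print(returns_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```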
5.3 Estimate by Network
We can use an advantage function \(A^\theta(s_t, a_t)\) to represent \(R(\tau^n) - b\): it measures how much better taking \(a_t\) at \(s_t\) is than the other actions, and it can be estimated by a network (the critic)!
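One common choice (an implementation assumption here, not spelled out in these notes) is to learn a state-value network \(V_\phi(s)\) and use it as a state-dependent baseline for the discounted return-to-go:

\[
A(s_t^n, a_t^n) \approx \sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - V_\phi(s_t^n)
\]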