
Policy Gradient


1. Definitions

  • Given an actor \(\pi_\theta(s)\) with network parameters \(\theta\)
  • Use the actor to play the game and collect rewards
  • Due to the randomness in the environment, the total reward differs from episode to episode, even with the same actor
  • \(R_\theta\): total reward of one episode (a random variable)
  • \(\overline{R_\theta}\): expected value of the total reward; it measures how good an actor is (see the sketch below)
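To make these definitions concrete, here is a minimal PyTorch-style sketch of a stochastic actor and of playing one episode to obtain \(R(\tau)\). The network sizes are placeholders, and the environment is assumed to follow the classic gym interface (`reset()` returns a state, `step(a)` returns `(state, reward, done, info)`):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """A stochastic actor: state in, distribution over actions out."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):   # placeholder sizes
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):                   # state: tensor of shape (state_dim,)
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

def play_one_episode(env, actor):
    """Let the actor play one episode; return the trajectory and R(tau)."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False                # assumed classic gym interface
    while not done:
        dist = actor(torch.as_tensor(s, dtype=torch.float32))
        a = dist.sample()                       # sample because the policy is stochastic
        s_next, r, done, _ = env.step(a.item())
        states.append(s); actions.append(a.item()); rewards.append(r)
        s = s_next
    return states, actions, rewards, sum(rewards)   # sum(rewards) is R(tau)
```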

2. Formulation of Average Reward

We use sampling to estimate the average reward: let \(\pi_\theta\) play the game \(N\) times to obtain trajectories \(\tau^1, \tau^2, \dots, \tau^N\). Then

\[
\overline{R_\theta} = \sum_\tau R(\tau)\, P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)
\]
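In code, this Monte-Carlo estimate is just the mean of the sampled returns. A short sketch, reusing the hypothetical `play_one_episode` helper from the previous section:

```python
def estimate_average_reward(env, actor, N=100):
    # R_bar(theta) ≈ (1/N) * sum_n R(tau^n)
    returns = [play_one_episode(env, actor)[-1] for _ in range(N)]
    return sum(returns) / N
```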

3. Optimization: Gradient Ascent

We need to optimize \(\theta\) to maximize the expected reward. Apply gradient ascent:

\[
\theta^{new} \leftarrow \theta^{old} + \eta\, \nabla \overline{R_{\theta^{old}}}
\]

Actually, the gradient of \(\overline{R_\theta}\) with respect to \(\theta\) only involves \(\nabla P(\tau \mid \theta)\); the reward \(R(\tau)\) itself does not need to be differentiable:

\[
\nabla \overline{R_\theta} = \sum_\tau R(\tau)\, \nabla P(\tau \mid \theta)
\]

In a trajectory, each step is \((s_t, a_t) \to (r_t, s_{t+1})\), so

\[
P(\tau \mid \theta) = p(s_1) \prod_{t=1}^{T} p(a_t \mid s_t, \theta)\, p(r_t, s_{t+1} \mid s_t, a_t)
\]

Only \(p(a_t \mid s_t, \theta)\) depends on \(\theta\), hence

\[
\nabla \log P(\tau \mid \theta) = \sum_{t=1}^{T} \nabla \log p(a_t \mid s_t, \theta)
\]

Finally,

\[
\nabla \overline{R_\theta} \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log P(\tau^n \mid \theta)
= \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p(a_t^n \mid s_t^n, \theta)
\]

Because

\[
\frac{d \log f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}
\]

we have \(\nabla P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla \log P(\tau \mid \theta)\), so

\[
\nabla \overline{R_\theta} = \sum_\tau R(\tau)\, \nabla P(\tau \mid \theta)
= \sum_\tau R(\tau)\, P(\tau \mid \theta)\, \nabla \log P(\tau \mid \theta)
\approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log P(\tau^n \mid \theta)
\]
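Putting the final formula into code: a common trick is to build the surrogate objective \(\frac{1}{N}\sum_n \sum_t R(\tau^n) \log p(a_t^n \mid s_t^n, \theta)\) and let automatic differentiation produce its gradient. A minimal sketch, assuming the hypothetical `Actor` from section 1 and trajectories stored as `(states, actions, rewards)` tuples:

```python
import torch

def policy_gradient_loss(actor, trajectories):
    """Negated surrogate objective: minimizing it performs gradient ascent on R_bar."""
    total = 0.0
    for states, actions, rewards in trajectories:
        R_tau = sum(rewards)                                 # R(tau^n): same weight for every step
        for s, a in zip(states, actions):
            dist = actor(torch.as_tensor(s, dtype=torch.float32))
            log_p = dist.log_prob(torch.as_tensor(a))        # log p(a_t^n | s_t^n, theta)
            total = total - R_tau * log_p                    # minus sign: ascent via a minimizer
    return total / len(trajectories)
```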

4. Summary

The algorithm alternates between data collection and parameter updates:

  • Use the current actor \(\pi_\theta\) to sample \(N\) trajectories \(\tau^1, \dots, \tau^N\)
  • Compute \(\nabla \overline{R_\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p(a_t^n \mid s_t^n, \theta)\)
  • Update \(\theta \leftarrow \theta + \eta\, \nabla \overline{R_\theta}\), then collect new data with the updated actor and repeat

Intuitively, if \(R(\tau^n)\) is positive, the update increases the probability of the actions taken in \(\tau^n\); if it is negative, it decreases them.
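A sketch of this loop, assuming the hypothetical `Actor`, `play_one_episode`, and `policy_gradient_loss` helpers from the earlier sketches and a gym-style environment:

```python
import torch

def train(env, iterations=1000, N=20, lr=1e-3):
    actor = Actor()
    opt = torch.optim.Adam(actor.parameters(), lr=lr)      # lr plays the role of eta
    for _ in range(iterations):
        # 1. sample N trajectories with the *current* actor
        trajs = []
        for _ in range(N):
            states, actions, rewards, _ = play_one_episode(env, actor)
            trajs.append((states, actions, rewards))
        # 2. build the surrogate objective; autograd supplies nabla R_bar
        opt.zero_grad()
        loss = policy_gradient_loss(actor, trajs)
        loss.backward()
        # 3. one gradient-ascent step on theta
        opt.step()
    return actor
```

Note that the sampled trajectories are used for a single update and then discarded; the next update needs fresh samples from the new actor.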

5. Tips

5.1 Baseline

In many tasks the reward \(R(\tau^n)\) is always positive, so every sampled action has its probability pushed up, and actions that happen not to be sampled are unfairly suppressed. Subtracting a baseline \(b\) makes the weight positive only for better-than-average trajectories; a simple choice is the average reward, \(b \approx E[R(\tau)]\):

\[
\nabla \overline{R_\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \bigl(R(\tau^n) - b\bigr)\, \nabla \log p(a_t^n \mid s_t^n, \theta)
\]
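A minimal sketch of this variant, again assuming the hypothetical `Actor` and the `(states, actions, rewards)` trajectory layout used above:

```python
import torch

def policy_gradient_loss_with_baseline(actor, trajectories):
    returns = [sum(rewards) for _, _, rewards in trajectories]
    b = sum(returns) / len(returns)                          # baseline: average sampled return
    total = 0.0
    for (states, actions, rewards), R_tau in zip(trajectories, returns):
        weight = R_tau - b                                   # can now be negative
        for s, a in zip(states, actions):
            dist = actor(torch.as_tensor(s, dtype=torch.float32))
            total = total - weight * dist.log_prob(torch.as_tensor(a))
    return total / len(trajectories)
```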

5.2 Assign Suitable Credit

It is not fair to assign the same credit \(R(\tau^n)\) to every action in a trajectory: an action can only influence the rewards that come after it. Two refinements (a small sketch follows the formula below):

  • Suffix sum of rewards: weight the action at time \(t\) by the rewards collected from \(t\) onward, \(\sum_{t'=t}^{T_n} r_{t'}^n\)
  • Discount factor: additionally discount future rewards by \(\gamma < 1\), giving \(\sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n\)

\[
\nabla \overline{R_\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \Bigl(\sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n - b\Bigr)\, \nabla \log p(a_t^n \mid s_t^n, \theta)
\]
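The discounted suffix sum is easy to compute backwards in one pass. A short sketch:

```python
def discounted_returns_to_go(rewards, gamma=0.99):
    """Credit for the action at step t: sum over t' >= t of gamma^(t'-t) * r_{t'}."""
    G, out = 0.0, []
    for r in reversed(rewards):        # walk the episode backwards
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))
```

In the loss, each \(\log p(a_t^n \mid s_t^n, \theta)\) is then weighted by its own return-to-go (minus the baseline) instead of the whole-trajectory reward \(R(\tau^n)\).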

5.3 Estimate by Network

We can use an advantage function \(A^\theta(s_t, a_t)\) to play the role of \(R(\tau^n) - b\): it measures how much better taking \(a_t\) at \(s_t\) is than average, and it can be estimated by a network!
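One simple way to realize this (an illustrative choice, not the only estimator): train a state-value network \(V_\phi(s)\) as a learned baseline and take the advantage of each step to be its discounted return-to-go minus \(V_\phi(s_t)\). A minimal sketch:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Critic: estimates V(s). Sizes are placeholders."""
    def __init__(self, state_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)

def estimate_advantages(value_net, states, rewards, gamma=0.99):
    # A_t ≈ (discounted return-to-go from t) - V(s_t)
    G, to_go = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        to_go.append(G)
    to_go = torch.tensor(list(reversed(to_go)), dtype=torch.float32)
    s = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in states])
    return to_go - value_net(s).detach()       # detach: these weights only train the actor
```

The value network itself is trained separately, for example by regressing \(V_\phi(s_t)\) onto the returns-to-go.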


Last update: May 18, 2021
Authors: Colin