Policy Gradient
1. Definitions
- Given an actor \(\pi_\theta(s)\) with network parameters \(\theta\)
- Use the actor to play the game and collect rewards
- Because of randomness in the environment (and in the actor's sampling of actions), the total reward differs from episode to episode, even with the same actor
- \(R_\theta\): the total reward of one episode, a random variable (written out below)
- \(\overline{R_\theta}\): the expected value of the total reward; it evaluates how good the actor is
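Concretely, playing one episode with the actor produces a trajectory, and the total reward is the sum of the rewards collected along it:

\[
\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T\}, \qquad R(\tau) = \sum_{t=1}^{T} r_t
\]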
2. Formulation of Average Reward
We use sampling to estimate average reward:
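Summing over all possible trajectories, weighted by how likely each one is under \(\pi_\theta\), and approximating that sum with \(N\) sampled trajectories \(\tau^1, \dots, \tau^N\) (i.e. playing the game \(N\) times):

\[
\overline{R_\theta} = \sum_\tau R(\tau)\, P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)
\]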
3. Optimization: Gradient Ascent
We need to optimize \(\theta\) to maximize the expected reward, using gradient ascent:

\[
\theta \leftarrow \theta + \eta\, \nabla \overline{R_\theta}
\]

where \(\eta\) is the learning rate.
In fact, the gradient of \(\overline{R_\theta}\) with respect to \(\theta\) only acts on \(P(\tau \mid \theta)\), because the reward \(R(\tau)\) is given by the environment and does not depend on \(\theta\). Using \(\nabla P = P\, \nabla \log P\):

\[
\nabla \overline{R_\theta} = \sum_\tau R(\tau)\, \nabla P(\tau \mid \theta) = \sum_\tau R(\tau)\, P(\tau \mid \theta)\, \nabla \log P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log P(\tau^n \mid \theta)
\]
In the trajectory, each step is a transition \((s_t, a_t) \rightarrow (r_t, s_{t+1})\), so the trajectory probability factorizes into policy terms and environment terms:

\[
P(\tau \mid \theta) = p(s_1) \prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(r_t, s_{t+1} \mid s_t, a_t)
\]
Finally, because only the policy terms depend on \(\theta\),

\[
\nabla \log P(\tau \mid \theta) = \sum_{t=1}^{T} \nabla \log p_\theta(a_t \mid s_t),
\]

so

\[
\nabla \overline{R_\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)
\]
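One common way to implement this (an implementation detail, not spelled out in the derivation above) is to define the surrogate objective

\[
J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \log p_\theta(a_t^n \mid s_t^n)
\]

and let automatic differentiation compute \(\nabla J(\theta)\); the sketch in the summary below writes its loss in this form.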
4. Summary
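In each iteration: use the current actor \(\pi_\theta\) to sample \(N\) trajectories, estimate \(\nabla \overline{R_\theta}\) with the formula above, and update \(\theta\) by gradient ascent; then repeat with the updated actor (the old trajectories must be thrown away, because they were sampled from the old policy).

Below is a minimal sketch of this loop (vanilla REINFORCE). The `gymnasium` CartPole environment, the small PyTorch network, and names such as `policy` and `n_episodes` are illustrative choices, not from the notes; every action is weighted by the raw \(R(\tau^n)\), without the variance-reduction tips below.

```python
# Minimal REINFORCE sketch; assumes the `gymnasium` and `torch` packages.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
n_episodes = 5          # N trajectories sampled per update
learning_rate = 1e-3    # eta

# The actor pi_theta: maps a state to a distribution over actions.
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=learning_rate)

for iteration in range(200):
    log_probs, weights = [], []
    # 1) Use the current actor to play the game N times.
    for _ in range(n_episodes):
        obs, _ = env.reset()
        episode_log_probs, total_reward, done = [], 0.0, False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            episode_log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            total_reward += float(reward)
            done = terminated or truncated
        # Every action in trajectory n is weighted by the episode reward R(tau^n).
        log_probs.extend(episode_log_probs)
        weights.extend([total_reward] * len(episode_log_probs))

    # 2) Gradient ascent on (1/N) sum_n sum_t R(tau^n) log p_theta(a_t^n | s_t^n),
    #    written as gradient descent on its negative.
    loss = -(torch.stack(log_probs) * torch.tensor(weights)).sum() / n_episodes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because every action in a trajectory shares the same weight \(R(\tau^n)\), this estimator is noisy; the tips below reduce its variance.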
5. Tips
5.1 Baseline
Use the average reward as a baseline \(b\): subtracting it lets the weight on an action be negative as well as positive, so actions that do worse than average have their probability pushed down (with all-positive rewards, every sampled action would otherwise be pushed up).
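With a baseline, the gradient estimate becomes

\[
\nabla \overline{R_\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \bigl(R(\tau^n) - b\bigr)\, \nabla \log p_\theta(a_t^n \mid s_t^n), \qquad b \approx \mathbb{E}\bigl[R(\tau)\bigr]
\]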
5.2 Assign Suitable Credit
It is not fair to assign the same credit \(R(\tau^n)\) to every action in a trajectory: an action should not be credited for rewards collected before it was taken. Two refinements (combined in the sketch after this list):

- Suffix sum of rewards: weight the action at step \(t\) only by the rewards obtained from step \(t\) onward
- Discount factor: multiply a reward obtained \(k\) steps later by \(\gamma^{k}\) with \(\gamma < 1\), so distant rewards contribute less
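Putting both refinements together, the weight on the action taken at step \(t\) of trajectory \(n\) becomes

\[
\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - b
\]

A small sketch of the discounted suffix sum (the return-to-go); the function name and the value of \(\gamma\) are illustrative:

```python
def returns_to_go(rewards, gamma=0.99):
    """Discounted suffix sums: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    out, running = [], 0.0
    for r in reversed(rewards):          # accumulate from the last step backwards
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

print(returns_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```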
5.3 Estimate by Network
We can use an advantage function \(A^\theta(s_t, a_t)\) to represent \(R(\tau^n) - b\): it measures how much better taking \(a_t\) at \(s_t\) is than the other actions, and it can be estimated by a network (the critic)!
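One common choice (an implementation assumption here, not spelled out in these notes) is to learn a state-value network \(V_\phi(s)\) and use it as a state-dependent baseline for the discounted return-to-go:

\[
A(s_t^n, a_t^n) \approx \sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - V_\phi(s_t^n)
\]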