Reinforcement Learning
1. Mathematical Formulation
1.1 Markov Decision Process
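In brief (the standard setup): an MDP is defined by the tuple

$$(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$$

where $\mathcal{S}$ is the set of possible states, $\mathcal{A}$ the set of possible actions, $\mathcal{R}$ the distribution of reward given a (state, action) pair, $\mathbb{P}$ the transition distribution over the next state given a (state, action) pair, and $\gamma$ the discount factor. The objective is to find a policy $\pi^*$ that maximizes the expected cumulative discounted reward $\mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t\right]$.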
1.2 Value function and Q-value function
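The standard definitions: the value function measures how good a state is under a policy $\pi$, and the Q-value function measures how good a (state, action) pair is:

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right], \qquad Q^{\pi}(s,a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$

The optimal Q-value function is $Q^*(s,a) = \max_{\pi} Q^{\pi}(s,a)$: the best achievable expected return starting from $(s,a)$.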
1.3 Q-learning
Value iteration algorithm: Use Bellman equation as an iterative update
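Concretely, the $i$-th update is

$$Q_{i+1}(s,a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q_i(s', a') \,\middle|\, s, a \right]$$

where the expectation is over the reward $r$ and the next state $s'$.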
$Q_i$ will converge to $Q^*$ as $i \to \infty$!
What’s the problem with this?
Not scalable. We must compute Q(s,a) for every state-action pair. If the state is, e.g., the current game screen in pixels, it is computationally infeasible to compute this for the entire state space!
Solution: use a function approximator to estimate Q(s,a), e.g. a neural network!
1.4 Deep Q-learning
1.4.1 Q-network
1.4.1.1 Architecture
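As one concrete possibility, a minimal PyTorch sketch of a DQN-style Q-network, assuming Atari-like input of 4 stacked 84×84 grayscale frames (the layer sizes follow the classic DQN paper; everything else here is illustrative):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of game frames to one Q-value per action."""
    def __init__(self, num_actions: int, in_channels: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 spatial size for 84x84 input
            nn.Linear(512, num_actions),            # one Q(s, a) estimate per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.conv(x))
```

A single forward pass returns Q-values for all actions at once, so an $\epsilon$-greedy agent just takes the argmax over the output.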
1.4.1.2 Experience Replay
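In short: consecutive transitions are highly correlated, so instead of training on them in order, store transitions $(s, a, r, s', \text{done})$ in a buffer and sample random minibatches from it. A minimal sketch, with illustrative capacity and batch size:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions; random sampling breaks temporal correlation."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        batch = random.sample(self.buffer, batch_size)  # uniform, without replacement
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```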
Example: a robot grasping an object has a very high-dimensional state, which makes it hard to learn the exact value of every (state, action) pair.
But the policy can be much simpler: just close your hand. Can we learn the policy directly, e.g. by finding the best policy from a collection of policies?
2. Policy Gradients & REINFORCE Algorithm
Find the optimal policy without estimating the Q-value.
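In outline (the standard REINFORCE setup): let $\tau = (s_0, a_0, r_0, s_1, \dots)$ be a trajectory sampled by following a parameterized policy $\pi_\theta$, and let $J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[r(\tau)\right]$ be the expected return. Using the log-derivative trick, the gradient of $J$ can be estimated from sampled trajectories without knowing the transition probabilities:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ r(\tau) \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

Intuition: if a trajectory earns a high return, push up the probability of the actions it took; if the return is low, push them down.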