
Reinforcement Learning


1. Mathematical Formulation

1.1 Markov Decision Process

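In standard notation, a Markov Decision Process is defined by a tuple, and the goal is to find the policy that maximizes expected discounted reward:

```latex
% An MDP is the tuple (S, A, R, P, gamma):
% S: set of states, A: set of actions, R: reward function,
% P: transition dynamics, gamma: discount factor
(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)

% The agent follows a policy pi mapping states to actions;
% the objective is the policy maximizing expected discounted return:
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, \pi\right]
```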

1.2 Value function and Q-value function

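The standard definitions: the value function scores how good a state is under a policy, and the Q-value function scores a state-action pair:

```latex
% Value function: expected cumulative discounted reward from state s under pi
V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0} = s, \pi\right]

% Q-value function: expected return from taking action a in state s, then following pi
Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0} = s, a_{0} = a, \pi\right]
```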

1.3 Q-learning

The optimal Q-value function Q*(s, a) is the maximum expected cumulative reward achievable from the pair (s, a). It satisfies the Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') ].

Value iteration algorithm: Use Bellman equation as an iterative update

Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') ]

Q_i will converge to Q* as i -> ∞.
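As a concrete sketch, here is the iterative Bellman update run to convergence on a toy 2-state, 2-action MDP (my own example, not from the notes):

```python
import numpy as np

# Toy deterministic MDP (hypothetical example):
# P[s, a] = next state, R[s, a] = immediate reward, gamma = discount factor.
P = np.array([[0, 1],
              [0, 1]])          # action 0 -> state 0, action 1 -> state 1
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])      # moving to / staying in state 1 pays more
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Bellman update: Q_{i+1}(s,a) = R(s,a) + gamma * max_a' Q_i(s',a')
    Q_next = R + gamma * Q.max(axis=1)[P]
    if np.abs(Q_next - Q).max() < 1e-10:
        break                   # converged to (numerically) Q*
    Q = Q_next

print(Q)
```

Even on this tiny problem the table has |S| x |A| entries; for pixel-valued states the table is astronomically large, which is the scalability problem described below.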

What’s the problem with this?

Not scalable: we must compute Q(s, a) for every state-action pair. If the state is, e.g., the raw pixels of the current game screen, it is computationally infeasible to cover the entire state space!

Solution: use a function approximator to estimate Q(s,a). E.g. a neural network!

1.4 Deep Q-learning

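Deep Q-learning fits Q(s, a; θ) by regressing onto the Bellman target; the standard loss at iteration i is:

```latex
% Forward pass: squared Bellman error
L_{i}(\theta_{i}) = \mathbb{E}_{s,a}\left[ \left( y_{i} - Q(s, a; \theta_{i}) \right)^{2} \right]

% Target (held fixed w.r.t. theta_i when backpropagating):
y_{i} = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \right]
```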

1.4.1 Q-network

1.4.1.1 Architecture

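A minimal sketch of the key architectural idea (a tiny fully connected Q-network in NumPy; the Atari DQN uses a convolutional network over stacked frames, but the output layout is the same): a single forward pass produces one Q-value per action, so acting greedily costs one argmax rather than one network evaluation per action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4-dimensional state, 16 hidden units, 2 actions.
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 2
W1 = rng.normal(0, 0.1, (STATE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    """One forward pass: Q(s, a; theta) for every action at once."""
    h = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                     # one scalar Q-value per action

s = rng.normal(size=STATE_DIM)
q = q_values(s)
greedy_action = int(np.argmax(q))          # an epsilon-greedy agent would sometimes explore instead
print(q.shape, greedy_action)
```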

1.4.1.2 Experience Replay

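A minimal replay-buffer sketch (class and method names are my own): transitions (s, a, r, s') are stored as the agent acts, and training draws random minibatches, which breaks the strong correlation between consecutive samples and lets each transition contribute to many weight updates:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random minibatch: decorrelates consecutive transitions.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(150):                    # overfill to exercise eviction
    buf.push(t, t % 2, 1.0, t + 1)
batch = buf.sample(32)
print(len(buf.buffer), len(batch))      # capacity-bounded store, 32-sample minibatch
```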

Example: a robot grasping an object has a very high-dimensional state => hard to learn exact value of every (state, action) pair

But the policy can be much simpler: just close your hand. Can we learn the policy directly, e.g., by finding the best policy from a collection of policies?

2. Policy Gradients & REINFORCE Algorithm

Find the optimal policy without estimating the Q-value.

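In standard form: define the objective J(θ) as the expected return of trajectories sampled from the policy π_θ, then use the log-derivative trick so the gradient needs no model of the dynamics:

```latex
% Objective: expected return of trajectories tau sampled from pi_theta
J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}\left[ r(\tau) \right]

% REINFORCE estimator: weight grad-log-probabilities by the trajectory return
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau}\left[ r(\tau) \sum_{t \ge 0} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) \right]
```

Intuitively: if a trajectory's return is high, push up the probability of the actions it took; if low, push them down.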


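A runnable toy (my own example, not from the notes): REINFORCE on a 2-armed bandit with a softmax policy. Arm 0 pays 1, arm 1 pays 0; sampled updates of the form r · ∇log π(a) push probability mass toward the rewarding arm:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                     # policy logits, one per arm
lr = 0.1
rewards = np.array([1.0, 0.0])          # arm 0 is the good arm (toy setup)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)          # sample an action from the policy
    r = rewards[a]
    # For a softmax policy, grad log pi(a) = onehot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi       # ascend r * grad log pi

print(softmax(theta))                   # probability mass concentrates on arm 0
```

Note the high variance the section warns about: updates only happen on sampled actions, so the estimator is noisy even in this one-state problem.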

3. Variance reduction

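The two standard tricks, in equation form: by causality, an action can only influence rewards that come after it, and subtracting a state-dependent baseline b(s) lowers variance without biasing the gradient:

```latex
% Causality: weight each action only by rewards from that time step onward
\nabla_{\theta} J(\theta) \approx \sum_{t \ge 0} \Big( \sum_{t' \ge t} \gamma^{t'-t} r_{t'} \Big) \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})

% Baseline: subtract b(s_t); the expectation is unchanged, the variance shrinks
\nabla_{\theta} J(\theta) \approx \sum_{t \ge 0} \Big( \sum_{t' \ge t} \gamma^{t'-t} r_{t'} - b(s_{t}) \Big) \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})
```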

4. Actor-Critic Algorithm

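Combining both ideas: the actor π_θ chooses actions and the critic scores them. A natural measure of "how much better an action was than expected" is the advantage, which also acts as a learned baseline:

```latex
% Advantage: how much better action a was than the policy's average at s
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)

% Actor update: policy gradient weighted by the critic's advantage estimate
\nabla_{\theta} J(\theta) \approx \sum_{t \ge 0} A^{\pi}(s_{t}, a_{t}) \, \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})
```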

5. Recurrent Attention Model

Hard attention for image classification: a recurrent network takes a sequence of glimpses (crops of the image) and classifies after the final glimpse. Choosing where to glimpse is a non-differentiable decision, so the glimpse policy is trained with REINFORCE, with the classification outcome as the reward.

6. AlphaGo

AlphaGo combines supervised learning from human expert games with policy-gradient reinforcement learning from self-play, plus a learned value network and Monte Carlo tree search to select moves.

7. Summary

- Policy gradients: very general, but suffer from high variance, so they require many samples. Challenge: sample efficiency.
- Q-learning: does not always work, but when it does it is usually more sample-efficient. Challenge: exploration.
- Guarantees: policy gradients converge to a local optimum of J(θ); Q-learning offers no guarantees, since it approximates the Bellman equation with a complicated function approximator.


Last update: June 16, 2023
Authors: Colin