RNN
Process sequences and build models with various inputs or outputs:
- 1 to many
- many to 1
- many to many (e.g. translation; seq2seq)
1. Basics
- Process entries of vector x as a sequence
- Add a dense layer that takes all of the hidden states \(h_t\) as input to yield an output
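A minimal NumPy sketch of this recurrence with a dense readout over all hidden states (the sizes, weight names, and tanh update are illustrative assumptions, not taken from the notes):

```python
import numpy as np

# Illustrative sizes (assumptions): sequence length, input dim, hidden dim, output dim
T, D, H, C = 4, 1, 32, 5
Wxh = np.random.randn(H, D) * 0.01
Whh = np.random.randn(H, H) * 0.01
Wy  = np.random.randn(C, T * H) * 0.01   # dense layer over all hidden states

def rnn_forward(xs):
    """Process the entries of x as a sequence, then read out from all of the h_t."""
    h, hs = np.zeros(H), []
    for x in xs:                             # each entry x_t is a D-dim vector
        h = np.tanh(Wxh @ x + Whh @ h)       # h_t = tanh(W_xh x_t + W_hh h_{t-1})
        hs.append(h)
    return Wy @ np.concatenate(hs)           # dense layer taking all of the h as input

x = np.random.randn(T, D)                    # entries of a vector, processed as a sequence
y = rnn_forward(x)                           # y has shape (C,)
```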
2. Vanilla RNN
2.1 Character-level Language Model
At test time:
Question:
- Why do we sample according to the probability distribution given by the softmax, instead of always taking the letter with the highest score?
- Sampling from the softmax increases the diversity of the outputs.
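A minimal NumPy sketch of the two decoding choices (the character vocabulary and the scores are illustrative assumptions):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']                    # illustrative character vocabulary
scores = np.array([1.0, 2.5, 0.3, 0.8])         # unnormalized scores from the RNN output layer

probs = np.exp(scores - scores.max())
probs /= probs.sum()                            # softmax distribution over characters

greedy_char  = vocab[int(np.argmax(probs))]                  # always the same letter
sampled_char = vocab[np.random.choice(len(vocab), p=probs)]  # varies between runs -> more diverse outputs
```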
2.2 Truncated Backpropagation Through Time
This works like a sliding-window mechanism: in each iteration we run the forward and backward passes only over the data inside the current window, while carrying the hidden state forward to the next window.
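A minimal sketch of this idea, assuming a PyTorch-style setup (the model, the chunk length of 50, and the random data are illustrative assumptions):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=32, batch_first=True)
readout = nn.Linear(32, 10)
opt = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()))

x = torch.randn(1, 1000, 10)        # one long sequence (illustrative)
y = torch.randn(1, 1000, 10)        # targets (illustrative)
chunk = 50                          # size of the sliding window
h = None

for start in range(0, x.size(1), chunk):
    xs = x[:, start:start + chunk]
    ys = y[:, start:start + chunk]
    out, h = rnn(xs, h)
    loss = ((readout(out) - ys) ** 2).mean()
    opt.zero_grad()
    loss.backward()                 # gradients flow only within this window
    opt.step()
    h = h.detach()                  # carry the state forward, but cut the graph
```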
3. Case study
We train the model only to predict the next character, yet it also learns many things about the structural features of the input data.
3.1 Image to description
- Input: image
- Output: description of the image
- image -> CNN -> summary vector (the 4096-dimensional vector \(\vec{v}\) from the last FC layer, taken instead of the softmax output) -> RNN -> words
- Add the image information by feeding \(\vec{v}\) through a third weight matrix in the RNN update (see the sketch after this list).
- Get a distribution over every word in the vocabulary, and sample from it.
- The input of the 1st step is a START token.
- The sampled word serves as the input at the next step.
- Once an END token is sampled, stop generation.
- Available dataset: COCO from Microsoft
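A minimal NumPy sketch of the generation loop with the extra image term (all dimensions, weight names, and the START/END token ids are illustrative assumptions):

```python
import numpy as np

H, D, DV, VOCAB = 64, 32, 4096, 1000       # hidden, word-embedding, CNN-feature, vocab sizes (illustrative)
Wxh = np.random.randn(H, D) * 0.01
Whh = np.random.randn(H, H) * 0.01
Wih = np.random.randn(H, DV) * 0.01        # third weight matrix that injects the image vector v
Why = np.random.randn(VOCAB, H) * 0.01
embed = np.random.randn(VOCAB, D) * 0.01
START, END = 0, 1                          # special token ids (illustrative)

def caption(v, max_len=20):
    """Generate a caption from the CNN summary vector v by sampling word by word."""
    h, word, words = np.zeros(H), START, []
    for _ in range(max_len):
        x = embed[word]
        h = np.tanh(Wxh @ x + Whh @ h + Wih @ v)     # v enters through the third matrix
        scores = Why @ h
        p = np.exp(scores - scores.max()); p /= p.sum()
        word = np.random.choice(VOCAB, p=p)          # the sampled word is the next step's input
        if word == END:
            break
        words.append(word)
    return words

v = np.random.randn(DV)                    # stand-in for the 4096-dim CNN feature of an image
print(caption(v))
```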
3.2 Attention
The RNN focuses its attention on a different spatial location when generating each word.
- Input: Weighted features & sampled word
- Output: distribution over locations & distribution over vocab
- \(a_i\): attention vectors telling the model where to focus (used to generate the weighted features)
- Soft attention: weighted distribution over all locations
- Hard attention: force the model to select exactly one location
- Problem: NOT a differentiable function
- Solution: (see below)
- Pros: the model can focus on the meaningful part in the image
Notes on soft & hard attention, as a complement:
- The attention module has 2 inputs:
    - A context: we use the hidden state \(h_{t-1}\) from the previous time step.
    - Image features for each localized area: originally one of the fully connected layer outputs \(x\) from the CNN.
- Nevertheless, we need to keep the spatial information:
    - Use the feature maps of one of the convolutional layers, whose spatial information is still preserved.
Soft attention (see the sketch at the end of this section)
- Input: image features weighted by the attention distribution, \(z = \sum_i \alpha_i x_i\).
- Each score \(s_i\) (from which the weight \(\alpha_i\) is derived) is jointly decided by the context and the image.
- There is one \(s_i\) per feature vector \(x_i\).
- The accuracy is subject to the assumption that the weighted average is a good representation of the area of attention.
Hard attention
- Instead of a weighted average, hard attention uses \(\alpha_i\) as the sampling probability to pick one \(x_i\) as the input to the LSTM.
- So finally we only choose (by sampling!) one part of the image, \(x_i\), instead of a weighted average.
- How do we calculate the gradient correctly, given that the sampling is not differentiable?
- Perform multiple samplings and average the results (Monte Carlo estimation).
- The accuracy is subject to how many samples are drawn and how representative they are.
- Related to RL: the gradient is estimated by a Monte Carlo method.
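A minimal NumPy sketch contrasting soft and hard attention (the 7x7 grid of conv features, the dot-product scoring of \(s_i\), and all shapes are illustrative assumptions):

```python
import numpy as np

L, F = 49, 512                      # 7x7 spatial locations, feature depth (illustrative)
x = np.random.randn(L, F)           # conv feature map: one feature vector x_i per location
h_prev = np.random.randn(F)         # context: previous hidden state h_{t-1}

# One score s_i per location, jointly decided by the context and the image
s = x @ h_prev                      # illustrative scoring: dot product of each x_i with the context
alpha = np.exp(s - s.max()); alpha /= alpha.sum()    # attention distribution over locations

# Soft attention: weighted average over all locations (differentiable)
z_soft = alpha @ x                  # z = sum_i alpha_i * x_i

# Hard attention: sample exactly one location (not differentiable;
# the gradient is estimated by Monte Carlo / RL-style methods)
i = np.random.choice(L, p=alpha)
z_hard = x[i]
```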
3.3 Visual Question Answering: RNN with attention
Visual7W: Grounded Question Answering in Images
- How do we combine the encoded image vector with the encoded question vector? (See the sketch at the end of this subsection.)
    - Most common: concatenate them directly and feed the result into fully connected layers
    - Sometimes: element-wise multiplication of the two vectors
- Input and encoding: image \(I\) and question sequence \(Q\), consisting of word tokens \(t_i\)
- F(·) transforms an image \(I\) from pixel space to a 4096-dimensional feature representation (we extract the activations of the last fully connected layer, fc7, of a pre-trained VGG-16 CNN)
- OH(·) transforms a word token to its one-hot representation
- \(W_i\) and \(W_w\) transform the representations of image and words into embedding spaces with the same dimension. (Finally they are vectors in the same space: \(v_0, \dots, v_m\).)
- Feed the vectors into the LSTM model one by one.
- LSTM with attention
- Output and decoding:
- At the decoding stage (the encoding stage is where we feed the question into the model), the log-likelihood of a candidate answer is computed as a dot product (which reflects the similarity of two vectors) between its transformed visual feature (the fc7 feature from the CNN) and the last LSTM hidden state.
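A minimal NumPy sketch of this encode/decode flow (the dimensions, token ids, the `lstm_step` stand-in, and the random stand-ins for F(·)'s output are illustrative assumptions, not the Visual7W implementation):

```python
import numpy as np

V, D4, E, H = 5000, 4096, 512, 512      # vocab size, fc7 dim, embedding dim, hidden dim (illustrative)
W_i = np.random.randn(E, D4) * 0.01     # transforms image features into the embedding space
W_w = np.random.randn(E, V) * 0.01      # transforms one-hot words into the same embedding space
W_a = np.random.randn(H, D4) * 0.01     # illustrative transform for candidate-answer features
W_h = np.random.randn(H, E + H) * 0.01  # stand-in for the real LSTM parameters

def lstm_step(v, h):
    # Stand-in for an LSTM update; a real model uses the gated equations of section 5.2
    return np.tanh(W_h @ np.concatenate([v, h]))

def one_hot(t):
    e = np.zeros(V); e[t] = 1.0          # OH(t_i)
    return e

img_fc7 = np.random.randn(D4)            # F(I): fc7 activations of a pre-trained VGG-16 (faked here)
question = [12, 7, 431, 9]               # token ids t_i of the question (illustrative)

# Encoding: the image embedding, then the word embeddings, fed into the LSTM one by one
h = np.zeros(H)
for v in [W_i @ img_fc7] + [W_w @ one_hot(t) for t in question]:
    h = lstm_step(v, h)

# Decoding: score each candidate answer by a dot product with the last hidden state
answers_fc7 = np.random.randn(3, D4)     # fc7 features of 3 candidate answers (illustrative)
scores = (answers_fc7 @ W_a.T) @ h       # dot products reflect similarity; higher = more likely

# Alternative fusion from the list above: combine separately encoded image and question vectors
q_enc, i_enc = np.random.randn(H), np.random.randn(H)   # illustrative encodings
fused_concat = np.concatenate([i_enc, q_enc])           # most common: concatenate, then FC layers
fused_mult   = i_enc * q_enc                            # sometimes: element-wise multiplication
```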
4. Multilayer RNN
Usually 2-4 layers are enough for an RNN.
5. Gradient flow
Problem: the gradient contains many repeated factors of \(W\), one per time step (worst for the gradient w.r.t. \(h_0\), which passes through every step).
5.1 Gradient explosion and vanishing gradient
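To see where this comes from, assume the vanilla update \(h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)\); backpropagating from \(h_T\) to \(h_0\) chains one Jacobian per time step:

\[
\frac{\partial h_T}{\partial h_0} \;=\; \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}
\;=\; \prod_{t=1}^{T} \operatorname{diag}\!\big(\tanh'(\cdot)\big)\, W_{hh}
\]

The same \(W_{hh}\) appears \(T\) times, so the gradient tends to explode when its largest singular value is greater than 1 (commonly handled by gradient clipping) and to vanish when it is less than 1, which motivates the LSTM below.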
5.2 LSTM (Long Short-Term Memory)
Four gates: input \(i\), forget \(f\), output \(o\), and the candidate ("gate") gate \(g\).
\(c_t\): Cell state
\(h_t\): Hidden state
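One common formulation of the gates and the two states (a sketch: bias terms are omitted and the exact parameterization varies between references):

\[
\begin{aligned}
i_t &= \sigma\!\left(W_i\,[h_{t-1},\, x_t]\right) \\
f_t &= \sigma\!\left(W_f\,[h_{t-1},\, x_t]\right) \\
o_t &= \sigma\!\left(W_o\,[h_{t-1},\, x_t]\right) \\
g_t &= \tanh\!\left(W_g\,[h_{t-1},\, x_t]\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]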
Pros compared with vanilla RNN:
- The forget gate can vary at each time step, unlike the vanilla RNN, where the gradient is multiplied by the same \(W\) at every step; the path from \(c_t\) back to \(c_{t-1}\) is only an element-wise multiplication by \(f\), with no matrix multiply and no tanh squashing. So the model is much better at avoiding exploding or vanishing gradients.
- The forget gate \(f\) uses a sigmoid, so its values fall in \((0, 1)\).
Why LSTMs Stop Your Gradients From Vanishing: A View from the Backwards Pass
5.3 GRU
A single update gate balances the history \(h_{t-1}\) against the new candidate state (a reset gate controls how much history enters the candidate).
Performs similarly to LSTM but is computationally cheaper.
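For comparison, one common GRU formulation (a sketch: bias terms are omitted, and some references swap the roles of \(z_t\) and \(1 - z_t\)):

\[
\begin{aligned}
z_t &= \sigma\!\left(W_z\,[h_{t-1},\, x_t]\right) \\
r_t &= \sigma\!\left(W_r\,[h_{t-1},\, x_t]\right) \\
\tilde{h}_t &= \tanh\!\left(W_h\,[r_t \odot h_{t-1},\, x_t]\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
\]

The update gate \(z_t\) balances the history \(h_{t-1}\) against the new candidate \(\tilde{h}_t\); there is no separate cell state, which is part of why the GRU is cheaper than the LSTM.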