RNN
Process sequences and build models with various inputs or outputs:
- 1 to many
- many to 1
- many to many (e.g. translation; seq2seq)
1. Basics
- Process entries of vector x as a sequence
- Add a dense layer that takes all of the hidden states \(h_t\) as input to yield an output
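A minimal NumPy sketch of this recurrence with a dense readout over all hidden states (the sizes, weight names, and tanh update are illustrative assumptions, not taken from the notes):

```python
import numpy as np

# Illustrative sizes (assumptions): sequence length, input dim, hidden dim, output dim
T, D, H, C = 4, 1, 32, 5
Wxh = np.random.randn(H, D) * 0.01
Whh = np.random.randn(H, H) * 0.01
Wy  = np.random.randn(C, T * H) * 0.01   # dense layer over all hidden states

def rnn_forward(xs):
    """Process the entries of x as a sequence, then read out from all of the h_t."""
    h, hs = np.zeros(H), []
    for x in xs:                             # each entry x_t is a D-dim vector
        h = np.tanh(Wxh @ x + Whh @ h)       # h_t = tanh(W_xh x_t + W_hh h_{t-1})
        hs.append(h)
    return Wy @ np.concatenate(hs)           # dense layer taking all of the h as input

x = np.random.randn(T, D)                    # entries of a vector, processed as a sequence
y = rnn_forward(x)                           # y has shape (C,)
```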
2. Vanilla RNN
2.1 Character-level Language Model
At test time:
Question:
- Why do we sample according to the probability distribution given by the softmax, instead of always taking the letter with the highest score?
- Sampling from the softmax increases the diversity of the outputs.
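A minimal NumPy sketch of the two decoding choices (the character vocabulary and the scores are illustrative assumptions):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']                    # illustrative character vocabulary
scores = np.array([1.0, 2.5, 0.3, 0.8])         # unnormalized scores from the RNN output layer

probs = np.exp(scores - scores.max())
probs /= probs.sum()                            # softmax distribution over characters

greedy_char  = vocab[int(np.argmax(probs))]                  # always the same letter
sampled_char = vocab[np.random.choice(len(vocab), p=probs)]  # varies between runs -> more diverse outputs
```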
2.2 Truncated Backpropagation Through Time
This works like a sliding-window mechanism: in each iteration we run the forward and backward passes only over the data inside the current window, while carrying the hidden state forward to the next window.
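A minimal sketch of this idea, assuming a PyTorch-style setup (the model, the chunk length of 50, and the random data are illustrative assumptions):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=32, batch_first=True)
readout = nn.Linear(32, 10)
opt = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()))

x = torch.randn(1, 1000, 10)        # one long sequence (illustrative)
y = torch.randn(1, 1000, 10)        # targets (illustrative)
chunk = 50                          # size of the sliding window
h = None

for start in range(0, x.size(1), chunk):
    xs = x[:, start:start + chunk]
    ys = y[:, start:start + chunk]
    out, h = rnn(xs, h)
    loss = ((readout(out) - ys) ** 2).mean()
    opt.zero_grad()
    loss.backward()                 # gradients flow only within this window
    opt.step()
    h = h.detach()                  # carry the state forward, but cut the graph
```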
3. Case study
We train the model only to predict the next character, yet it also learns many things about the structural features of the input data.
3.1 Image to description
- Input: image
- Output: description of the image
- image -> CNN -> summary vector (the 4096-dimensional vector \(\vec{v}\) from the last FC layer, taken instead of the softmax output) -> RNN -> words
- Add the image information by feeding \(\vec{v}\) through a third weight matrix in the RNN update (see the sketch after this list).
- Get a distribution over every word in the vocabulary, and sample from it.
- The input of the 1st step is a START token.
- The sampled word serves as the input at the next step.
- Once an END token is sampled, stop generation.
- Available dataset: COCO from Microsoft
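A minimal NumPy sketch of the generation loop with the extra image term (all dimensions, weight names, and the START/END token ids are illustrative assumptions):

```python
import numpy as np

H, D, DV, VOCAB = 64, 32, 4096, 1000       # hidden, word-embedding, CNN-feature, vocab sizes (illustrative)
Wxh = np.random.randn(H, D) * 0.01
Whh = np.random.randn(H, H) * 0.01
Wih = np.random.randn(H, DV) * 0.01        # third weight matrix that injects the image vector v
Why = np.random.randn(VOCAB, H) * 0.01
embed = np.random.randn(VOCAB, D) * 0.01
START, END = 0, 1                          # special token ids (illustrative)

def caption(v, max_len=20):
    """Generate a caption from the CNN summary vector v by sampling word by word."""
    h, word, words = np.zeros(H), START, []
    for _ in range(max_len):
        x = embed[word]
        h = np.tanh(Wxh @ x + Whh @ h + Wih @ v)     # v enters through the third matrix
        scores = Why @ h
        p = np.exp(scores - scores.max()); p /= p.sum()
        word = np.random.choice(VOCAB, p=p)          # the sampled word is the next step's input
        if word == END:
            break
        words.append(word)
    return words

v = np.random.randn(DV)                    # stand-in for the 4096-dim CNN feature of an image
print(caption(v))
```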
3.2 Attention
The RNN focuses its attention on a different spatial location when generating each word.
- Input: Weighted features & sampled word
- Output: distribution over locations & distribution over vocab
- \(a_i\): attention vectors telling the model where to focus (used to generate the weighted features)
- Soft attention: weighted distribution over all locations
- Hard attention: force the model to select exactly one location
- Problem: NOT a differentiable function
- Solution: (see below)
- Pros: the model can focus on the meaningful part in the image
Notes on soft & hard attention, as a complement:
- The attention module has 2 inputs:
    - A context: we use the hidden state \(h_{t-1}\) from the previous time step.
    - Image features for each localized area: originally one of the fully connected layer outputs \(x\) from the CNN.
- Nevertheless, we need to keep the spatial information:
    - Use the feature maps of one of the convolutional layers, whose spatial information is still preserved.
Soft attention (see the sketch at the end of this section)
- Input: image features weighted by the attention distribution, \(z = \sum_i \alpha_i x_i\).
- Each score \(s_i\) (from which the weight \(\alpha_i\) is derived) is jointly decided by the context and the image.
- There is one \(s_i\) per feature vector \(x_i\).
- The accuracy is subject to the assumption that the weighted average is a good representation of the area of attention.
Hard attention
- Instead of a weighted average, hard attention uses \(\alpha_i\) as the sampling probability to pick one \(x_i\) as the input to the LSTM.
- So finally we only choose (by sampling!) one part of the image, \(x_i\), instead of a weighted average.
- How do we calculate the gradient correctly, given that the sampling is not differentiable?
- Perform multiple samplings and average the results (Monte Carlo estimation).
- The accuracy is subject to how many samples are drawn and how representative they are.
- Related to RL: the gradient is estimated by a Monte Carlo method.
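A minimal NumPy sketch contrasting soft and hard attention (the 7x7 grid of conv features, the dot-product scoring of \(s_i\), and all shapes are illustrative assumptions):

```python
import numpy as np

L, F = 49, 512                      # 7x7 spatial locations, feature depth (illustrative)
x = np.random.randn(L, F)           # conv feature map: one feature vector x_i per location
h_prev = np.random.randn(F)         # context: previous hidden state h_{t-1}

# One score s_i per location, jointly decided by the context and the image
s = x @ h_prev                      # illustrative scoring: dot product of each x_i with the context
alpha = np.exp(s - s.max()); alpha /= alpha.sum()    # attention distribution over locations

# Soft attention: weighted average over all locations (differentiable)
z_soft = alpha @ x                  # z = sum_i alpha_i * x_i

# Hard attention: sample exactly one location (not differentiable;
# the gradient is estimated by Monte Carlo / RL-style methods)
i = np.random.choice(L, p=alpha)
z_hard = x[i]
```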
3.3 Visual Question Answering: RNN with attention
Visual7W: Grounded Question Answering in Images
- How do we combine the encoded image vector with the encoded question vector? (See the sketch at the end of this subsection.)
    - Most common: concatenate them directly and feed the result into fully connected layers
    - Sometimes: element-wise multiplication of the two vectors
- Input and encoding: image \(I\) and question sequence \(Q\), consisting of word tokens \(t_i\)
- F(·) transforms an image \(I\) from pixel space to a 4096-dimensional feature representation (we extract the activations of the last fully connected layer, fc7, of a pre-trained VGG-16 CNN)
- OH(·) transforms a word token to its one-hot representation
- \(W_i\) and \(W_w\) transform the representations of image and words into embedding spaces with the same dimension. (Finally they are vectors in the same space: \(v_0, \dots, v_m\).)
- Feed the vectors into the LSTM model one by one.
- LSTM with attention
- Output and decoding:
- At the decoding stage (the encoding stage is where we feed the question into the model), the log-likelihood of a candidate answer is computed as a dot product (which reflects the similarity of two vectors) between its transformed visual feature (the fc7 feature from the CNN) and the last LSTM hidden state.
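A minimal NumPy sketch of this encode/decode flow (the dimensions, token ids, the `lstm_step` stand-in, and the random stand-ins for F(·)'s output are illustrative assumptions, not the Visual7W implementation):

```python
import numpy as np

V, D4, E, H = 5000, 4096, 512, 512      # vocab size, fc7 dim, embedding dim, hidden dim (illustrative)
W_i = np.random.randn(E, D4) * 0.01     # transforms image features into the embedding space
W_w = np.random.randn(E, V) * 0.01      # transforms one-hot words into the same embedding space
W_a = np.random.randn(H, D4) * 0.01     # illustrative transform for candidate-answer features
W_h = np.random.randn(H, E + H) * 0.01  # stand-in for the real LSTM parameters

def lstm_step(v, h):
    # Stand-in for an LSTM update; a real model uses the gated equations of section 5.2
    return np.tanh(W_h @ np.concatenate([v, h]))

def one_hot(t):
    e = np.zeros(V); e[t] = 1.0          # OH(t_i)
    return e

img_fc7 = np.random.randn(D4)            # F(I): fc7 activations of a pre-trained VGG-16 (faked here)
question = [12, 7, 431, 9]               # token ids t_i of the question (illustrative)

# Encoding: the image embedding, then the word embeddings, fed into the LSTM one by one
h = np.zeros(H)
for v in [W_i @ img_fc7] + [W_w @ one_hot(t) for t in question]:
    h = lstm_step(v, h)

# Decoding: score each candidate answer by a dot product with the last hidden state
answers_fc7 = np.random.randn(3, D4)     # fc7 features of 3 candidate answers (illustrative)
scores = (answers_fc7 @ W_a.T) @ h       # dot products reflect similarity; higher = more likely

# Alternative fusion from the list above: combine separately encoded image and question vectors
q_enc, i_enc = np.random.randn(H), np.random.randn(H)   # illustrative encodings
fused_concat = np.concatenate([i_enc, q_enc])           # most common: concatenate, then FC layers
fused_mult   = i_enc * q_enc                            # sometimes: element-wise multiplication
```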
4. Multilayer RNN
Usually 2-4 layers are enough for an RNN.
5. Gradient flow
Problem: the gradient contains many repeated factors of \(W\), one per time step (worst for the gradient w.r.t. \(h_0\), which passes through every step).
5.1 Gradient explosion and vanishing gradient
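To see where this comes from, assume the vanilla update \(h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)\); backpropagating from \(h_T\) to \(h_0\) chains one Jacobian per time step:

\[
\frac{\partial h_T}{\partial h_0} \;=\; \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}
\;=\; \prod_{t=1}^{T} \operatorname{diag}\!\big(\tanh'(\cdot)\big)\, W_{hh}
\]

The same \(W_{hh}\) appears \(T\) times, so the gradient tends to explode when its largest singular value is greater than 1 (commonly handled by gradient clipping) and to vanish when it is less than 1, which motivates the LSTM below.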
5.2 LSTM (Long Short-Term Memory)
Four gates: input \(i\), forget \(f\), output \(o\), and the candidate ("gate") gate \(g\).
\(c_t\): Cell state
\(h_t\): Hidden state
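One common formulation of the gates and the two states (a sketch: bias terms are omitted and the exact parameterization varies between references):

\[
\begin{aligned}
i_t &= \sigma\!\left(W_i\,[h_{t-1},\, x_t]\right) \\
f_t &= \sigma\!\left(W_f\,[h_{t-1},\, x_t]\right) \\
o_t &= \sigma\!\left(W_o\,[h_{t-1},\, x_t]\right) \\
g_t &= \tanh\!\left(W_g\,[h_{t-1},\, x_t]\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]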
Pros compared with vanilla RNN:
- The forget gate can vary at each time step, unlike the vanilla RNN, where the gradient is multiplied by the same \(W\) at every step; the path from \(c_t\) back to \(c_{t-1}\) is only an element-wise multiplication by \(f\), with no matrix multiply and no tanh squashing. So the model is much better at avoiding exploding or vanishing gradients.
- The forget gate \(f\) uses a sigmoid, so its values fall in \((0, 1)\).
Why LSTMs Stop Your Gradients From Vanishing: A View from the Backwards Pass
5.3 GRU
A single update gate balances the history \(h_{t-1}\) against the new candidate state (a reset gate controls how much history enters the candidate).
Performs similarly to LSTM but is computationally cheaper.
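For comparison, one common GRU formulation (a sketch: bias terms are omitted, and some references swap the roles of \(z_t\) and \(1 - z_t\)):

\[
\begin{aligned}
z_t &= \sigma\!\left(W_z\,[h_{t-1},\, x_t]\right) \\
r_t &= \sigma\!\left(W_r\,[h_{t-1},\, x_t]\right) \\
\tilde{h}_t &= \tanh\!\left(W_h\,[r_t \odot h_{t-1},\, x_t]\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
\]

The update gate \(z_t\) balances the history \(h_{t-1}\) against the new candidate \(\tilde{h}_t\); there is no separate cell state, which is part of why the GRU is cheaper than the LSTM.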