Self-Attention
Objective:
1. Structure
1.1 Attention score
Compute the attention score \(\alpha\):
- Each input vector is projected into a query, a key and a value.
- The query of one vector is matched against the key of every vector, e.g. \(\alpha_{1,i} = q^1 \cdot k^i\); after normalization (typically a softmax) the resulting weights \(\alpha'\) are used to "weigh" the values (see the sketch below).
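A minimal sketch of this step, assuming dot-product scoring and softmax normalization (the exact normalization is not spelled out above), with toy shapes:

```python
import numpy as np

def attention_scores(q, keys):
    """Dot-product score of one query against every key, normalized with softmax."""
    alpha = keys @ q                      # raw relevance q . k_i, shape (seq_len,)
    alpha = np.exp(alpha - alpha.max())   # numerically stable softmax
    return alpha / alpha.sum()            # weights alpha' that sum to 1

rng = np.random.default_rng(0)
q = rng.normal(size=3)                    # query of the first token
keys = rng.normal(size=(4, 3))            # keys of all four tokens
print(attention_scores(q, keys))          # four weights summing to 1
```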
1.2 Extract info
Weight each value \(v^i\) by its attention weight \(\alpha'_{1,i}\) and sum them to obtain the output \(b^1 = \sum_i \alpha'_{1,i} v^i\).
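A toy example of that weighted sum (the specific numbers are illustrative only):

```python
import numpy as np

def extract_info(alpha_prime, values):
    """b = sum_i alpha'_i * v^i : mix the values according to the attention weights."""
    return alpha_prime @ values

alpha_prime = np.array([0.1, 0.2, 0.3, 0.4])   # weights from 1.1 (sum to 1)
values = np.arange(12.0).reshape(4, 3)         # toy values v^1 ... v^4
print(extract_info(alpha_prime, values))       # output vector b^1
```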
1.3 Represented by matrix multiplication
(The whole layer maps the input matrix \(I\) to the output matrix \(O\) through a few matrix multiplications.)
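A minimal sketch of the matrix form, assuming learned projection matrices `W_q`, `W_k`, `W_v` and the common \(1/\sqrt{d_k}\) scaling (neither is spelled out in this note):

```python
import numpy as np

def self_attention(I, W_q, W_k, W_v):
    """Matrix form of the layer: from the input matrix I to the output matrix O."""
    Q, K, V = I @ W_q, I @ W_k, I @ W_v           # queries, keys, values
    A = Q @ K.T / np.sqrt(K.shape[-1])            # all pairwise scores at once
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)         # row-wise softmax -> A'
    return A @ V                                  # O = A' V

rng = np.random.default_rng(0)
I = rng.normal(size=(4, 8))                       # 4 tokens, model dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(I, W_q, W_k, W_v).shape)     # (4, 8)
```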
2. Multi-head Self-attention
Concept: each head attends with its own projections, so each head can capture a different type of relevance between tokens.
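A minimal two-head sketch, assuming the usual recipe of splitting the model dimension across heads and mixing the concatenated head outputs with an output matrix `W_o` (both are assumptions, not given above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(I, W_q, W_k, W_v, W_o, num_heads=2):
    """Each head attends with its own slice of the projections, i.e. its own 'type of relevance'."""
    n, d = I.shape
    d_head = d // num_heads
    Q, K, V = I @ W_q, I @ W_k, I @ W_v
    # split the model dimension into heads: (num_heads, n, d_head)
    split = lambda M: M.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))   # per-head scores
    O = A @ V                                                  # per-head outputs
    O = O.transpose(1, 0, 2).reshape(n, d)                     # concatenate heads
    return O @ W_o                                             # mix the heads together

rng = np.random.default_rng(0)
I = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(I, W_q, W_k, W_v, W_o).shape)       # (4, 8)
```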
2.1 Positional Encoding
Add a positional vector to each input embedding (self-attention alone has no notion of order), which helps the model to (see the sketch after this list):
- determine the position of each word
- determine the distance between different words in the sequence
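One common choice, sketched below, is the sinusoidal encoding from the original Transformer paper; the 10000 base and the sine/cosine interleaving are that paper's convention, not something specified in this note:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: one fixed d_model-dimensional vector per position."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

embeddings = np.zeros((4, 8))                         # toy input embeddings
embeddings = embeddings + positional_encoding(4, 8)   # add position information
print(embeddings.shape)                               # (4, 8)
```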
3. Relationship with Other models
3.1 CNN
A CNN can be viewed as a restricted form of self-attention whose attention is limited to a fixed local receptive field. The more flexible model is not always the better one: with a small dataset the more constrained CNN tends to generalize better, while with enough data self-attention can outperform it, so choose according to the dataset size.
3.2 RNN
Self-attention has largely replaced RNNs thanks to two advantages:
- An RNN (even an LSTM) struggles to carry information across a long sequence, whereas self-attention builds a direct connection between every pair of input vectors.
- Although self-attention looks more computationally expensive, its operations can run in parallel, so on GPUs it is in practice faster than an RNN, which must process the sequence step by step.
3.3 GNN
3.4 More