Self-Attention

Objective:

1. Structure
1.1 Attention score
Compute attention score \(\alpha\):

- query, key, and value: each input vector is projected into a query, a key, and a value vector.
- the query of one vector is matched (by dot product) against the keys of all the vectors, producing weights \(\alpha\) that are then used to "weigh" the values, as in the sketch below.
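
A minimal numpy sketch of this step for a single query; the projection matrices `W_q` and `W_k` are illustrative assumptions, not names from these notes:

```python
import numpy as np

def attention_scores(a, W_q, W_k):
    """Dot-product attention scores for the first input vector.

    a:   (n, d_in) input vectors
    W_q: (d_in, d_k) query projection (assumed name)
    W_k: (d_in, d_k) key projection (assumed name)
    """
    q = a[0] @ W_q            # query of the first vector
    K = a @ W_k               # keys of every vector (including itself)
    alpha = K @ q             # raw attention scores alpha
    alpha_prime = np.exp(alpha) / np.exp(alpha).sum()  # softmax -> alpha'
    return alpha_prime

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))           # 4 input vectors of dimension 8
W_q = rng.normal(size=(8, 3))
W_k = rng.normal(size=(8, 3))
print(attention_scores(a, W_q, W_k))  # 4 weights that sum to 1
```
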
1.2 Extract info
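The normalized weights \(\alpha'\) (after a softmax) are used to take a weighted sum of the value vectors, so the output for position \(i\) is \(b^{i} = \sum_j \alpha'_{i,j}\, v^{j}\): vectors with larger attention scores contribute more information to \(b^{i}\).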

1.3 Expressed as matrix multiplication

The whole layer, from the input matrix \(I\) to the output matrix \(O\), can be written as a few matrix multiplications, as sketched below.
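
A minimal numpy sketch of the matrix form; the names \(W^q, W^k, W^v\) and the scaling by \(\sqrt{d_k}\) follow the original Transformer paper and are assumptions on top of these notes:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(I, W_q, W_k, W_v):
    """Self-attention from input matrix I (n, d_in) to output matrix O (n, d_v)."""
    Q = I @ W_q                          # queries, (n, d_k)
    K = I @ W_k                          # keys,    (n, d_k)
    V = I @ W_v                          # values,  (n, d_v)
    A = Q @ K.T / np.sqrt(K.shape[-1])   # attention matrix, (n, n)
    A_prime = softmax(A, axis=-1)        # normalize each row
    O = A_prime @ V                      # weighted sums of the values
    return O

rng = np.random.default_rng(0)
I = rng.normal(size=(5, 8))              # 5 input vectors of dimension 8
O = self_attention(I, rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))
print(O.shape)                           # (5, 4)
```
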
2. Multi-head Self-attention
Concept: a single set of queries and keys captures only one kind of relevance; using several heads lets the model attend to different types of relevance at the same time.
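
A minimal sketch with one projection triple per head; this formulation is illustrative (real implementations usually split one large projection into heads by reshaping):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(I, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head; W_o mixes the concatenated outputs."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = I @ W_q, I @ W_k, I @ W_v
        A_prime = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(A_prime @ V)                 # each head produces its own attention pattern
    return np.concatenate(outputs, axis=-1) @ W_o   # concatenate heads, then project back

rng = np.random.default_rng(0)
I = rng.normal(size=(5, 8))                                                  # 5 vectors, dim 8
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)] # 2 heads
W_o = rng.normal(size=(2 * 4, 8))
print(multi_head_self_attention(I, heads, W_o).shape)                        # (5, 8)
```
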


2.1 Positional Encoding

Add a vector to each input embedding, which helps the model to:
- determine the position of each word
- determine the distance between different words in the sequence
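
These notes do not fix a particular encoding; one common hand-crafted choice is the sinusoidal encoding from the original Transformer paper, sketched here under that assumption:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cos
    return pe

embeddings = np.random.default_rng(0).normal(size=(10, 16))   # 10 tokens, d_model = 16
inputs = embeddings + sinusoidal_positional_encoding(10, 16)  # add a position vector to each embedding
```
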

3. Relationship with Other models
3.1 CNN
A CNN can be viewed as a restricted special case of self-attention (its receptive field is fixed, while self-attention learns which positions to attend to). The more flexible model needs more data, so CNNs tend to work better on smaller datasets and self-attention on larger ones; we can choose between them according to the size of the dataset.

3.2 RNN
Self-attention can largely replace RNNs nowadays because of its advantages:
- An RNN (even an LSTM) cannot retain information effectively when the sequence is long, whereas self-attention builds a direct connection between every pair of input vectors.
- Although self-attention looks more computationally expensive, it is parallelizable, so with GPUs it is in practice faster than an RNN, which must process the sequence step by step.

3.3 GNN

3.4 More
