
Self-Attention

Objective: given a sequence of input vectors, produce one output vector per position, where each output is computed by taking the whole input sequence into account.

(Figure: the self-attention layer reads all inputs \(a^1, \dots, a^N\) and emits outputs \(b^1, \dots, b^N\).)

1. Structure

1.1 Attention score

Compute attention score \(\alpha\):

(Figure: dot-product attention — each input is projected to a query and a key, and the query is compared with every key to obtain the scores \(\alpha\).)

  • Each input vector is projected into a query, a key, and a value.
  • The query is matched against every key (by a dot product) to produce a weight \(\alpha\), which after normalization is used to weigh the corresponding values (written out below).
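
A minimal write-up in the usual notation (the \(a^i\) are the input vectors and \(W^q\), \(W^k\) are learned matrices; the symbols are the standard ones, assumed rather than read off the figure):

\[
q^i = W^q a^i, \qquad k^j = W^k a^j, \qquad \alpha_{i,j} = q^i \cdot k^j, \qquad
\alpha'_{i,j} = \frac{\exp(\alpha_{i,j})}{\sum_{j'} \exp(\alpha_{i,j'})}
\]

(The Transformer additionally divides the scores by \(\sqrt{d_k}\) before the softmax.)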

1.2 Extract info

The normalized scores are used to extract information from the value vectors: positions with high scores contribute more to the output.

(Figure: the output \(b^i\) is a weighted sum of the value vectors.)
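
In the same notation, with value vectors \(v^j = W^v a^j\), the information extracted at position \(i\) is the score-weighted sum of the values:

\[
b^i = \sum_j \alpha'_{i,j} \, v^j
\]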

1.3 Expressed as matrix multiplication

(Figure: the whole computation written as a chain of matrix multiplications.)

(The only learned parameters are \(W^q\), \(W^k\), \(W^v\); everything else is fixed computation from the input matrix I to the output matrix O.)
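
A minimal NumPy sketch of this matrix form, following the convention that the input vectors are stacked as the columns of \(I\) (the shapes and toy data below are illustrative assumptions, not taken from the original):

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(I, W_q, W_k, W_v):
    """Single-head self-attention; inputs are the columns of I (shape d x N)."""
    Q = W_q @ I                   # queries, shape d_k x N
    K = W_k @ I                   # keys,    shape d_k x N
    V = W_v @ I                   # values,  shape d_v x N
    A = K.T @ Q                   # attention scores, shape N x N (column i holds query i's scores)
    A_prime = softmax(A, axis=0)  # normalize over the keys for each query
    O = V @ A_prime               # outputs as columns, shape d_v x N
    return O

# Toy usage with random weights
d, N = 4, 3
rng = np.random.default_rng(0)
I = rng.normal(size=(d, N))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(I, W_q, W_k, W_v).shape)  # (4, 3)
```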

2. Multi-head Self-attention

Concept: use several attention heads so that different heads can capture different types of relevance between positions.

(Figures: each head has its own query/key/value projections; the per-head outputs are concatenated and mixed by a further matrix.)
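
A sketch with two heads, in one common formulation where head-specific matrices are applied to the single-head \(q^i\), \(k^i\), \(v^i\) (other formulations project the input directly; the exact convention here is assumed):

\[
q^{i,1} = W^{q,1} q^i, \qquad q^{i,2} = W^{q,2} q^i \qquad (\text{and similarly for } k^{i,1}, k^{i,2}, v^{i,1}, v^{i,2})
\]

Each head runs the single-head computation from Section 1 on its own queries, keys and values, giving \(b^{i,1}\) and \(b^{i,2}\), which are then combined by an output matrix:

\[
b^i = W^O \begin{bmatrix} b^{i,1} \\ b^{i,2} \end{bmatrix}
\]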

2.1 Positional Encoding

(Figure: a positional vector \(e^i\) is added to each input vector \(a^i\).)

Self-attention by itself has no notion of order, so a positional vector is added to each input embedding, which helps the model to:

  • determine the position of each word
  • determine the distance between different words in the sequence

(Figure: examples of positional encoding vectors.)
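
The positional vector can be hand-crafted or learned end-to-end. One common hand-crafted choice (the sinusoidal encoding from "Attention Is All You Need"; not necessarily the one shown in the figure) is:

\[
e^{pos}_{2k} = \sin\!\left(\frac{pos}{10000^{2k/d}}\right), \qquad
e^{pos}_{2k+1} = \cos\!\left(\frac{pos}{10000^{2k/d}}\right)
\]

where \(pos\) is the position in the sequence and \(d\) is the embedding dimension.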

3. Relationship with Other models

3.1 CNN

A CNN can be viewed as a restricted form of self-attention in which each position only attends to a local receptive field, so CNN is a subset of self-attention. Which one works better depends mainly on the amount of training data: CNNs tend to do better on smaller datasets, while self-attention pulls ahead as the data grows.

(Figure: comparison of CNN and self-attention as the amount of training data grows.)

3.2 RNN

Self-attention has largely replaced RNNs because of two advantages:

  • An RNN (even with LSTM cells) struggles to carry information across long sequences, whereas self-attention directly connects every pair of input vectors, however far apart they are.
  • Although self-attention looks more computationally expensive, it is parallel across positions, so with GPUs it is usually faster in practice than an RNN, which must process the sequence step by step (see the rough cost comparison below).
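
For a rough sense of the trade-off, the per-layer costs quoted in the Transformer paper (sequence length \(N\), representation dimension \(d\); these figures are general knowledge, not from the original notes) are:

  • self-attention: \(O(N^2 d)\) operations but \(O(1)\) sequential steps, so it parallelizes over positions;
  • recurrent layer: \(O(N d^2)\) operations with \(O(N)\) sequential steps, so it must run position by position.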

(Figure: an RNN processes the sequence step by step, while self-attention relates all positions at once.)

3.3 GNN

Self-attention can also be applied to graphs: the attention matrix is restricted to the edges, i.e. attention scores are only computed between connected nodes. Seen this way, self-attention is a kind of GNN.

(Figure: attention computed only over the edges of a graph.)

3.4 More

Plain self-attention has quadratic cost in the sequence length, and many variants have been proposed that trade a little accuracy for much better speed.

(Figure: pointers to further reading on faster self-attention variants.)


Last update: June 16, 2023
Authors: Colin