Self-Attention

Objective:

1. Structure
1.1 Attention score
Compute attention score \(\alpha\):

- query, key, and value: each input vector is projected into a query, a key, and a value vector.
- the query of one vector is matched (by dot product) against the keys of all the vectors, producing weights \(\alpha\) that are then used to "weigh" the values, as in the sketch below.
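
A minimal numpy sketch of this step for a single query; the projection matrices `W_q` and `W_k` are illustrative assumptions, not names from these notes:

```python
import numpy as np

def attention_scores(a, W_q, W_k):
    """Dot-product attention scores for the first input vector.

    a:   (n, d_in) input vectors
    W_q: (d_in, d_k) query projection (assumed name)
    W_k: (d_in, d_k) key projection (assumed name)
    """
    q = a[0] @ W_q            # query of the first vector
    K = a @ W_k               # keys of every vector (including itself)
    alpha = K @ q             # raw attention scores alpha
    alpha_prime = np.exp(alpha) / np.exp(alpha).sum()  # softmax -> alpha'
    return alpha_prime

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))           # 4 input vectors of dimension 8
W_q = rng.normal(size=(8, 3))
W_k = rng.normal(size=(8, 3))
print(attention_scores(a, W_q, W_k))  # 4 weights that sum to 1
```
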
1.2 Extract info
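The normalized weights \(\alpha'\) (after a softmax) are used to take a weighted sum of the value vectors, so the output for position \(i\) is \(b^{i} = \sum_j \alpha'_{i,j}\, v^{j}\): vectors with larger attention scores contribute more information to \(b^{i}\).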

1.3 Expressed as matrix multiplication

The whole layer, from the input matrix \(I\) to the output matrix \(O\), can be written as a few matrix multiplications, as sketched below.
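
A minimal numpy sketch of the matrix form; the names \(W^q, W^k, W^v\) and the scaling by \(\sqrt{d_k}\) follow the original Transformer paper and are assumptions on top of these notes:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(I, W_q, W_k, W_v):
    """Self-attention from input matrix I (n, d_in) to output matrix O (n, d_v)."""
    Q = I @ W_q                          # queries, (n, d_k)
    K = I @ W_k                          # keys,    (n, d_k)
    V = I @ W_v                          # values,  (n, d_v)
    A = Q @ K.T / np.sqrt(K.shape[-1])   # attention matrix, (n, n)
    A_prime = softmax(A, axis=-1)        # normalize each row
    O = A_prime @ V                      # weighted sums of the values
    return O

rng = np.random.default_rng(0)
I = rng.normal(size=(5, 8))              # 5 input vectors of dimension 8
O = self_attention(I, rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))
print(O.shape)                           # (5, 4)
```
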
2. Multi-head Self-attention
Concept: a single set of queries and keys captures only one kind of relevance; using several heads lets the model attend to different types of relevance at the same time.
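
A minimal sketch with one projection triple per head; this formulation is illustrative (real implementations usually split one large projection into heads by reshaping):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(I, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head; W_o mixes the concatenated outputs."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = I @ W_q, I @ W_k, I @ W_v
        A_prime = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(A_prime @ V)                 # each head produces its own attention pattern
    return np.concatenate(outputs, axis=-1) @ W_o   # concatenate heads, then project back

rng = np.random.default_rng(0)
I = rng.normal(size=(5, 8))                                                  # 5 vectors, dim 8
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)] # 2 heads
W_o = rng.normal(size=(2 * 4, 8))
print(multi_head_self_attention(I, heads, W_o).shape)                        # (5, 8)
```
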


2.1 Positional Encoding

Add a vector to each input embedding, which helps the model to:
- determine the position of each word
- determine the distance between different words in the sequence
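
These notes do not fix a particular encoding; one common hand-crafted choice is the sinusoidal encoding from the original Transformer paper, sketched here under that assumption:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cos
    return pe

embeddings = np.random.default_rng(0).normal(size=(10, 16))   # 10 tokens, d_model = 16
inputs = embeddings + sinusoidal_positional_encoding(10, 16)  # add a position vector to each embedding
```
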

3. Relationship with Other models
3.1 CNN
A CNN can be viewed as a restricted special case of self-attention (its receptive field is fixed, while self-attention learns which positions to attend to). The more flexible model needs more data, so CNNs tend to work better on smaller datasets and self-attention on larger ones; we can choose between them according to the size of the dataset.

3.2 RNN
Self-attention can largely replace RNNs nowadays because of its advantages:
- An RNN (even an LSTM) cannot retain information effectively when the sequence is long, whereas self-attention builds a direct connection between every pair of input vectors.
- Although self-attention looks more computationally expensive, it is parallelizable, so with GPUs it is in practice faster than an RNN, which must process the sequence step by step.

3.3 GNN

3.4 More
