
Transformer


(Ref: The Illustrated Transformer)

1. Seq2seq

Input a sequence, output a sequence.

The output length is determined by the model.

Applications: too many to list. Selected examples: syntactic parsing, multi-label classification, and even object detection.

2. Encoder

Summary:

  • The input first goes through positional encoding;
  • then through several identically structured feature-extraction blocks, each consisting of:
    • multi-head self-attention
    • residual addition & norm
    • feed forward
    • residual addition & norm
  • finally, the output is obtained.


Structure of encoder:

(figure omitted)

Detail of a block in an encoder:

(figure omitted)

  • Self-attention: builds connections among all positions of the input sequence
  • FC: first increases the dimension and then decreases (recovers) it, to increase expressive power
  • Better design: change the position of Layer Normalization (apply it before each sub-layer rather than after; see the sketch below)

(figure omitted)
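A minimal PyTorch sketch of one encoder block as summarized above (the pre-norm variant from the last bullet; all sizes are illustrative, and `nn.MultiheadAttention` stands in for a hand-rolled implementation):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block (pre-norm: LayerNorm before each sub-layer)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # FC: expand the dimension, then recover it.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention with a residual connection.
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        # Feed-forward with a residual connection.
        return x + self.ffn(self.norm2(x))

x = torch.randn(2, 10, 512)     # (batch, seq_len, d_model)
print(EncoderBlock()(x).shape)  # torch.Size([2, 10, 512])
```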

3. Decoder

Summary:

  • The input is fed in sequentially: decoding starts by feeding \<BOS> at the bottom, the model outputs one token, and that token becomes the second input (see the decoding sketch after this list).
  • The input first goes through positional encoding.
  • Then come several identically structured blocks, each consisting of:
    • masked multi-head attention (with residual addition & norm)
    • multi-head cross-attention (with residual addition & norm)
    • FFN (with residual addition & norm)
  • Finally, a softmax outputs a predicted probability for every token in the vocabulary.
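A sketch of this sequential decoding loop (greedy decoding; `model.encode` / `model.decode`, `BOS`, and `EOS` are hypothetical names for a trained seq2seq model and its special token ids):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, BOS: int, EOS: int, max_len: int = 50):
    memory = model.encode(src)    # run the encoder once
    out = [BOS]                   # decoding starts from <BOS>
    for _ in range(max_len):
        tgt = torch.tensor([out])                   # tokens produced so far
        logits = model.decode(tgt, memory)          # (1, len(out), vocab)
        next_token = logits[0, -1].argmax().item()  # most likely next token
        out.append(next_token)                      # feed it back in as input
        if next_token == EOS:     # stop once <EOS> is produced
            break
    return out
```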

3.1 Autoregressive

3.1.1 Overall

(figure omitted)

3.1.2 Masked Self-attention

(figure omitted)
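Masked self-attention lets each position attend only to itself and earlier positions: the scores for future positions are set to −∞ before the softmax, so their weights become 0. A minimal sketch:

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)            # raw attention scores q·k
# Mask: position i may attend only to positions j <= i.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # future positions get -inf
attn = scores.softmax(dim=-1)                     # rows sum to 1 over the past only
print(attn)  # the upper triangle is exactly 0
```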

3.2 Non-autoregressive

(figure omitted)

NAT decodes all output positions in parallel and can explicitly control the output length.

Multi-modality: the same input can have several valid outputs; since NAT predicts every token independently, it may mix fragments of different valid outputs, which hurts quality.
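A sketch of NAT decoding under these assumptions (`encoder`, `length_predictor`, and `decoder` are hypothetical trained modules): the length is predicted up front, and all output positions are decoded in one parallel pass.

```python
import torch

@torch.no_grad()
def nat_decode(encoder, length_predictor, decoder, src):
    # encoder / length_predictor / decoder are placeholders for trained modules.
    memory = encoder(src)               # encode the input once
    out_len = length_predictor(memory)  # predicted output length -> controllable
    positions = torch.arange(out_len)   # one positional query per output slot
    logits = decoder(positions, memory) # a SINGLE parallel pass: (out_len, vocab)
    return logits.argmax(dim=-1)        # every token is predicted independently
```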

4. Encoder-Decoder

(figure omitted)

4.1 Cross Attention

(figure omitted)
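In cross attention the queries come from the decoder, while the keys and values come from the encoder output. A minimal sketch using `nn.MultiheadAttention` (shapes are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

enc_out = torch.randn(1, 10, d_model)  # encoder output: keys and values
dec_h = torch.randn(1, 3, d_model)     # decoder hidden states: queries
out, weights = cross_attn(query=dec_h, key=enc_out, value=enc_out)
print(out.shape)      # torch.Size([1, 3, 512]) -- one vector per decoder position
print(weights.shape)  # torch.Size([1, 3, 10])  -- attention over encoder positions
```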

5. Training

5.1 Teacher Forcing

Using the ground truth as the decoder input during training.

(figure omitted)

  • loss: the sum of the per-token cross-entropy losses (see the sketch below)
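A sketch of one teacher-forcing training step (`model` and the tensors are placeholders): the ground truth, shifted right behind \<BOS>, is the decoder input, and the loss sums cross entropy over all output positions.

```python
import torch
import torch.nn.functional as F

def training_step(model, src, tgt, BOS: int):
    # Teacher forcing: the decoder input is the ground truth shifted right.
    bos = torch.full((tgt.size(0), 1), BOS, dtype=tgt.dtype)
    dec_in = torch.cat([bos, tgt[:, :-1]], dim=1)
    logits = model(src, dec_in)  # (batch, tgt_len, vocab)
    # Loss: sum of cross entropy over all output positions.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt.reshape(-1), reduction="sum"
    )
    return loss
```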

5.2 Copy Mechanism

Copy something from the input to the output.

Applications: chatbots, article summarization, ...
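One common realization is the pointer-generator mixture (a sketch of the idea, not necessarily the lecture's exact formulation): the final distribution mixes generating from the vocabulary with copying from the attention over the input.

```python
import torch

def copy_mixture(p_vocab, attn, src_ids, p_gen):
    """
    p_vocab: (vocab,) generation distribution from the decoder
    attn:    (src_len,) attention weights over the input tokens
    src_ids: (src_len,) vocabulary ids of the input tokens
    p_gen:   scalar in [0, 1], learned gate: generate vs. copy
    """
    p_final = p_gen * p_vocab
    # Copy: route attention mass to the vocabulary entries of the input tokens.
    p_final = p_final.index_add(0, src_ids, (1 - p_gen) * attn)
    return p_final  # still sums to 1
```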

5.3 Guided Attention

Guide the attention pattern to avoid obvious mistakes (e.g., skipping or repeating parts of the input).

  • Monotonic attention
  • Location-aware attention

(figure omitted)
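One concrete instance (a hedged sketch borrowed from TTS, not necessarily the lecture's formulation): a guided-attention loss penalizes attention mass that falls far from the diagonal, pushing attention to move monotonically through the input.

```python
import torch

def guided_attention_loss(attn, g: float = 0.2):
    """attn: (T_out, T_in) attention weights; penalize off-diagonal mass."""
    T_out, T_in = attn.shape
    t = torch.arange(T_out).unsqueeze(1) / T_out  # output progress in [0, 1)
    n = torch.arange(T_in).unsqueeze(0) / T_in    # input progress in [0, 1)
    w = 1.0 - torch.exp(-((n - t) ** 2) / (2 * g * g))  # 0 on diagonal, ~1 far away
    return (attn * w).mean()
```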

Note: the decoder needs randomness (e.g., sampling instead of always taking the most likely token) when generating sequences in some (creative) tasks.

5.5 Optimizing Evaluation Metrics

5.5.1 BLEU score

$$p_n = \frac{\sum_{\text{n-gram} \in \text{candidate}} \min\bigl(\mathrm{Count}_{\text{cand}}(\text{n-gram}),\ \mathrm{Count}_{\text{ref}}(\text{n-gram})\bigr)}{\sum_{\text{n-gram} \in \text{candidate}} \mathrm{Count}_{\text{cand}}(\text{n-gram})}$$

  • Numerator: the number of n-grams in the given candidate that also appear in the reference.
  • Denominator: the total number of n-grams in the candidate.
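A minimal sketch of this n-gram precision (clipping each candidate n-gram count by its count in the reference; the helper name is ours):

```python
from collections import Counter

def ngram_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Modified n-gram precision: matched candidate n-grams / total candidate n-grams."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Clip each candidate n-gram count by its count in the reference.
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

print(ngram_precision("the cat sat on the mat".split(),
                      "the cat is on the mat".split(), 2))  # 0.6
```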

We can use the BLEU score in the validation stage to select the best model.

However, it cannot be used for training, because it is not differentiable and therefore cannot be optimized directly as a loss function.

Rule: When you don’t know how to optimize, just use reinforcement learning (RL)!

5.6 Scheduled Sampling

5.6.1 Exposure Bias

There is a train/test mismatch due to the teacher forcing mechanism: during training the decoder only ever sees ground-truth tokens, but at test time it sees its own (possibly wrong) outputs.

Scheduled Sampling: during training, occasionally feed the decoder its own predicted tokens instead of the ground truth, so it learns to recover from its own mistakes.
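A sketch of the per-position mixing step (`p_teacher` is typically annealed from 1.0 downward as training progresses; names are illustrative):

```python
import torch

def decoder_inputs(model_preds, ground_truth, p_teacher: float):
    """Mix ground-truth tokens and the model's own predictions, per position."""
    return torch.where(
        torch.rand_like(ground_truth, dtype=torch.float) < p_teacher,
        ground_truth,  # teacher forcing: use the true previous token
        model_preds,   # exposure: use the model's own previous prediction
    )
```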

