colab

计算 weighted sum 的方式：

# using matrix multiply
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
out = wei @ x

# using softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

Notes:

Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
Each example across batch dimension is of course processed completely independently and never “talk” to each other
In an “encoder” attention block just delete the single line that does masking with tril, allowing all tokens to communicate. This block here is called a “decoder” attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
“self-attention” just means that the keys and values are produced from the same source as queries. In “cross-attention”, the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
“Scaled” attention additional divides wei by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much.

attention是多节点communication，feedforward是节点自己computation

pre-norm formulation：现今transformer比起论文的改变是把LayerNorm移到attention和feedforward之前进行。

参考

Let’s Build GPT

🧠 Brain

Explorer

let's build gpt

参考

Graph View

Backlinks

🧠 Brain

Explorer

let's build gpt

参考 §

Graph View

Backlinks

参考