What's the difference between a "self-attention mechanism" and a "fully-connected" layer?
I am confused by these two structures. In theory, the outputs of both are connected to all of their inputs. What magic makes the self-attention mechanism more powerful than a fully-connected layer?
Solution 1:[1]
Ignoring details like normalization, biases, and so on, a fully-connected layer has fixed weights:
f(x) = Wx
where W is learned during training and fixed at inference.
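As a concrete (hypothetical) illustration, here is a minimal NumPy sketch of such a fixed-weight layer; the dimensions and random initialization are placeholders, not anything from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))  # learned during training, then frozen

def fully_connected(x):
    # The same fixed W is applied to every input x.
    return W @ x

x = rng.normal(size=d_in)
print(fully_connected(x).shape)  # (3,)
```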
Self-attention layers are dynamic, recomputing their weights from the input as it arrives:
attn(x) = Wx
f(x) = attn(x) * x
In other words, the effective weights attn(x) are themselves a function of the input, so the mixing they apply changes from one input to the next. Again, this ignores many details, and there are many different implementations for different applications, so you should really check a paper for the specifics.
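To make the contrast concrete, here is a minimal NumPy sketch of one common form, single-head scaled dot-product self-attention, which is more specific than the simplified notation above; the dimensions are placeholders, and masking, multiple heads, biases, and normalization are all omitted. The point is that the mixing matrix A is recomputed from the input X on every forward pass, whereas W in the fully-connected case never changes:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 5, 4  # sequence length and embedding size (hypothetical)

# Learned projections: fixed after training, like W above.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # X has shape (n, d). The mixing matrix A is computed from X itself,
    # so the "weights" that combine the inputs differ for every X.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d))  # (n, n), input-dependent weights
    return A @ V                       # each output row is a data-dependent mix of the inputs

X = rng.normal(size=(n, d))
print(self_attention(X).shape)  # (5, 4)
```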
Sources
[1] Stack Overflow. This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.