Attention & Transformers — Interview Q&A
Self-Attention
What is self-attention?
Self-attention allows each token in a sequence to weigh the relevance of all other tokens when computing its representation.
One-line: Self-attention lets each token dynamically focus on other tokens to model global context.
Common pitfall: Confusing self-attention (queries, keys, and values all come from the same sequence) with cross-attention (queries come from one sequence, keys and values from another). A minimal sketch follows.
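The sketch below is a minimal, single-head self-attention in plain NumPy, assuming illustrative projection matrices W_q, W_k, W_v (hypothetical names, not from any particular library). It shows the core steps: project the same input into queries, keys, and values, score every token against every other token, normalize with softmax, and mix the value vectors.

```python
# Minimal sketch of single-head self-attention (assumes NumPy).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model). Queries, keys, and values all come from X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # token-to-token relevance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # each token is a weighted mix of values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)   # shape (5, 8)
```

In cross-attention, the only change is that Q would be computed from one sequence (e.g. the decoder states) while K and V come from another (e.g. the encoder output).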
Why is scaling used in dot-product attention?
The dot product of two dₖ-dimensional vectors grows in variance with dₖ, so raw scores can become very large. Dividing by √dₖ keeps their scale roughly constant, which prevents the softmax from saturating into a near one-hot distribution with vanishing gradients and stabilizes training.
One-line: Scaling avoids softmax saturation and stabilizes training.
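This is the 1/√dₖ factor in the standard formula Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V. The small demo below (assumes NumPy; the numbers are illustrative and depend on the random seed) compares softmax over raw versus scaled scores to show the saturation effect.

```python
# Illustrative check: without 1/sqrt(d_k), raw dot-product scores grow with d_k
# and the softmax typically collapses onto a single token.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)
keys = rng.normal(size=(10, d_k))

raw = keys @ q                   # variance grows roughly like d_k
scaled = raw / np.sqrt(d_k)      # variance stays around 1

print(softmax(raw).max())        # typically close to 1.0: nearly one-hot, tiny gradients
print(softmax(scaled).max())     # spread across tokens: informative gradients
```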
Multi-Head Attention
Why use multiple attention heads?
Multiple heads allow the model to attend to different representation subspaces and capture diverse relationships in parallel.
One-line: Multi-head attention learns different relationships simultaneously.
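A minimal sketch of the multi-head mechanism, again in plain NumPy with hypothetical per-head weight matrices: each head projects the input into its own smaller subspace (dₖ = d_model / n_heads), attends there independently, and the head outputs are concatenated and mixed by an output projection W_o.

```python
# Minimal sketch of multi-head attention (assumes NumPy).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head.
    W_o: (n_heads * d_k, d_model) output projection."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot-product per head
        outputs.append(softmax(scores, axis=-1) @ V)
    return np.concatenate(outputs, axis=-1) @ W_o  # concat heads, then project back

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_k = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))
out = multi_head_attention(X, heads, W_o)          # shape (5, 16)
```

Because each head has its own Q/K/V projections, one head can learn, say, syntactic dependencies while another tracks positional or semantic relations, all computed in parallel.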