Attention & Transformers — Interview Q&A

Self-Attention

What is self-attention?

Self-attention allows each token in a sequence to weigh the relevance of every token in that same sequence, including itself, when computing its updated representation.

One-line:
Self-attention lets each token dynamically focus on other tokens to model global context.

Trap:
Confusing self-attention with cross-attention: in self-attention the queries, keys, and values all come from the same sequence, whereas in cross-attention the queries come from one sequence and the keys/values from another (e.g., decoder queries attending to encoder outputs).
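
As a minimal sketch of the mechanism described above (single head, NumPy only, with illustrative projection matrices Wq, Wk, Wv rather than learned weights), scaled dot-product self-attention can be written as:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention for X of shape (seq_len, d_model)."""
    # Queries, keys, and values are all projections of the same sequence.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Every token scores every token, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key axis turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token's new representation is a weighted sum of all value vectors.
    return weights @ V
```
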
Why is scaling used in dot-product attention?

Scaling by √dₖ prevents large dot-product values from pushing the softmax into saturation, which stabilizes gradients. With unit-variance query and key components, the dot product q·k has variance dₖ, so unscaled logits grow with the key dimension; dividing by √dₖ keeps them at unit scale.

One-line:
Scaling avoids softmax saturation and stabilizes training.
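
As a quick, hedged illustration (random unit-variance vectors and only two keys), the snippet below shows how unscaled dot products in high dimensions push the softmax toward a one-hot distribution, while dividing by √dₖ keeps the weights well-spread:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q, k1, k2 = rng.standard_normal((3, d_k))

# Unscaled logits have variance ~d_k (~512 here), so one logit dominates.
raw = np.array([q @ k1, q @ k2])
print(softmax(raw))                 # typically near one-hot, e.g. [1., 0.]

# Dividing by sqrt(d_k) brings the logits back to unit scale.
print(softmax(raw / np.sqrt(d_k)))  # typically much more balanced
```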

Multi-Head Attention

Why use multiple attention heads?

Multiple heads allow the model to attend to different representation subspaces and capture diverse relationships in parallel.

One-line:
Multi-head attention learns different relationships simultaneously.
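
As a minimal sketch (NumPy only, d_model assumed divisible by num_heads, and the projection matrices Wq, Wk, Wv, Wo treated as illustrative placeholders rather than learned parameters), multi-head attention splits the projections into per-head subspaces and concatenates the results:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head scaled dot-product self-attention for X of shape (seq_len, d_model)."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Each head projects into its own d_head-dimensional subspace.
        sl = slice(h * d_head, (h + 1) * d_head)
        Q, K, V = X @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]
        scores = Q @ K.T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        head_outputs.append(weights @ V)
    # Concatenate the heads and mix them with the output projection.
    return np.concatenate(head_outputs, axis=-1) @ Wo

# Example: 4 heads over a toy sequence with random placeholder weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = rng.standard_normal((4, d_model, d_model))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads).shape)  # (5, 16)
```

Splitting d_model across the heads keeps the total cost close to that of one full-width head while letting each head specialize in a different relationship.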