
Attention Mechanism

Basic Seq2Seq models compress the entire input sequence into a single fixed-size context vector, so for long sequences information gets lost. The attention mechanism lets the decoder look back at all encoder hidden states and focus on the relevant parts at each step.

Attention: at each decoder step, compute a weighted sum of encoder hidden states. Weights (attention scores) indicate which input words are most relevant.

How Attention Works

For each decoder step t:
1. Compute a score between decoder hidden state s_t and each encoder hidden state h_i.
2. Softmax the scores to get attention weights α_i.
3. Compute context vector c_t = Σ α_i * h_i.
4. Concatenate c_t with s_t, then predict the next token.
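The four steps above can be sketched in NumPy for a single decoder step. The dimensions and the simple dot-product score are illustrative assumptions, not fixed choices:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(s_t, H):
    """One decoder step of attention.

    s_t : (d,)   decoder hidden state at step t
    H   : (n, d) encoder hidden states h_1..h_n (one per input token)
    """
    scores = H @ s_t             # 1. dot-product score between s_t and each h_i
    alpha = softmax(scores)      # 2. attention weights α_i, sum to 1
    c_t = alpha @ H              # 3. context vector c_t = Σ α_i * h_i
    return np.concatenate([c_t, s_t]), alpha  # 4. [c_t; s_t] feeds the output layer

# Toy example: 4 encoder states, hidden size 3
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
s_t = rng.normal(size=3)
out, alpha = attention_step(s_t, H)
print(alpha)  # one weight per input token
```

In a real model the next-token prediction would apply a learned layer to the concatenated vector; here we stop at the concatenation to keep the sketch minimal.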

Common score functions: dot product, additive (Bahdanau), scaled dot product (Transformer).
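The three score functions differ only in how s_t and h_i are compared. A sketch of each (W_a and v_a are learned parameters in a real model; they are random here purely for illustration):

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
s, h = rng.normal(size=d), rng.normal(size=d)  # decoder state, one encoder state

# Dot product: score = s · h
dot = s @ h

# Scaled dot product (Transformer): divide by sqrt(d) to keep score variance stable
scaled = (s @ h) / np.sqrt(d)

# Additive (Bahdanau): score = v_aᵀ tanh(W_a [s; h]), with learned W_a, v_a
W_a = rng.normal(size=(d, 2 * d))
v_a = rng.normal(size=d)
additive = v_a @ np.tanh(W_a @ np.concatenate([s, h]))

print(dot, scaled, additive)
```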

Benefits of Attention

  • Handles long sequences (no information bottleneck).
  • Provides interpretability: see which input words the model focuses on.
  • Enables alignment (e.g., aligning English and French words in translation).

Attention in Transformers

The Transformer architecture (next module) uses self‑attention, where each word attends to all other words in the sequence. This is the foundation of modern LLMs like GPT and BERT.
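Self-attention can be sketched in NumPy as well: each token is projected into a query, key, and value, and then attends to every token (including itself) with scaled dot-product scores. The sizes here are illustrative assumptions:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # project tokens to queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # every token scores every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum of values per token

# Toy example: 5 tokens, model width 8, head width 4
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # one output vector per token
```

Real Transformers run several such heads in parallel (multi-head attention) and add masking, residual connections, and layer normalization on top of this core operation.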


Two Minute Drill
  • Attention lets decoder focus on different input parts at each step.
  • Computes weighted sum of encoder hidden states.
  • Improves long‑sequence handling and interpretability.
  • Foundation of Transformers and modern LLMs.
