
Vanishing and Exploding Gradients

During backpropagation through time (BPTT), gradients flowing backward are multiplied repeatedly by the same recurrent weight matrix. If the weights are small (roughly, the matrix's largest singular value is below 1), the repeated products shrink toward zero and gradients vanish; if they are large (above 1), the products grow without bound and gradients explode. Either way, simple RNNs struggle to learn long‑range dependencies.

Vanishing gradients mean early time steps have negligible influence on the loss; exploding gradients make updates so large that training becomes unstable.

Why It Happens

In a long sequence, the gradient at time step t depends on a product of many Jacobians, one per step back in time. If the eigenvalues (more precisely, the singular values) of the recurrent weight matrix are below 1, this product decays exponentially (vanishing); if they are above 1, it grows exponentially (exploding).
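The repeated Jacobian product can be simulated directly. The sketch below is a toy illustration (not a real network): it uses a scaled identity matrix as the "recurrent weights", so its eigenvalues are exactly the chosen scale, and multiplies a stand‑in gradient vector through 50 time steps.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=4)  # a stand-in "gradient" at the last time step

norms = {}
for scale in (0.5, 1.5):        # eigenvalues below vs above 1
    W = scale * np.eye(4)       # toy recurrent weight matrix
    v = g.copy()
    for _ in range(50):         # 50 steps of backprop through time
        v = W @ v               # one Jacobian multiplication per step
    norms[scale] = np.linalg.norm(v)

print(norms)  # scale 0.5 -> vanishingly small norm, scale 1.5 -> enormous norm
```

With eigenvalues of 0.5 the gradient norm collapses by a factor of about 0.5^50 ≈ 1e−15; with 1.5 it blows up by about 1.5^50 ≈ 6e8 — the vanishing and exploding regimes in miniature.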

Consequences

  • Vanishing: Model cannot learn dependencies more than ~10 steps away (e.g., subject‑verb agreement across long sentences).
  • Exploding: Gradients overflow to NaN/Inf; loss spikes and training diverges.

Solutions for Exploding Gradients

Gradient clipping: rescale the gradients if their global norm exceeds a threshold (e.g., 1.0). In PyTorch, call this after loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
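To make the mechanism concrete, here is a minimal NumPy sketch of what norm‑based clipping does (the function name clip_grad_norm is my own; only the PyTorch call above is the real API): compute the global L2 norm over all gradient arrays, and rescale them if it exceeds the threshold.

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    is at most max_norm (a sketch of PyTorch's clip_grad_norm_)."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0])]               # norm = 5.0, over the threshold
clipped, norm = clip_grad_norm(grads, max_norm=1.0)
print(norm, np.linalg.norm(clipped[0]))      # 5.0 1.0
```

Note the direction of each gradient is preserved; only the magnitude is capped, which is why clipping stabilizes training without changing which way the parameters move.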

Solutions for Vanishing Gradients

  • Use LSTM or GRU (next chapters) – designed with gating mechanisms to preserve long‑term information.
  • Use residual connections (skip connections).
  • Use ReLU activation instead of tanh/sigmoid.
  • Proper weight initialization (He, Xavier).
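A rough back‑of‑the‑envelope calculation (a toy illustration, not a training run) shows why swapping tanh for ReLU helps: tanh's derivative is at most 1 and shrinks quickly away from zero, so products of it decay, while ReLU's derivative is exactly 1 for any positive input.

```python
import numpy as np

# Derivative of each activation at a typical pre-activation value x = 1.0
x = 1.0
tanh_grad = 1 - np.tanh(x) ** 2   # ~0.42, and always <= 1
relu_grad = 1.0                   # exactly 1 for any positive input

# Backprop through 20 time steps multiplies 20 such factors together
print(tanh_grad ** 20, relu_grad ** 20)  # tanh factor is ~1e-8; ReLU stays 1.0
```

The tanh product is already on the order of 1e−8 after only 20 steps, while the ReLU product is unchanged — though ReLU alone does not solve the problem when the weight matrix itself shrinks or amplifies the signal, which is why gating (LSTM/GRU) and careful initialization matter too.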


Two Minute Drill
  • Vanishing gradients: long‑term dependencies lost.
  • Exploding gradients: training diverges.
  • Clip gradients to fix exploding.
  • Use LSTM/GRU to fix vanishing.

Need more clarification?

Drop us an email at career@quipoinfotech.com