
Weight Initialization

Weight initialization is the starting point of training. Poor initialization can cause vanishing/exploding gradients or slow convergence. Good initialization helps networks train faster and achieve better accuracy.

Why Initialization Matters

If weights are too small, signals shrink as they propagate (vanishing gradients). If too large, signals explode. Proper initialization keeps signal variance stable across layers.
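This effect is easy to demonstrate numerically. The sketch below (an illustrative helper, not from any framework) pushes a random signal through a stack of linear layers and measures its standard deviation: with a fixed small weight scale the signal collapses, while a variance-preserving scale of 1/sqrt(fan_in) keeps it stable.

```python
import numpy as np

rng = np.random.default_rng(0)

def signal_std_after_layers(weight_std, n_layers=20, width=256):
    """Propagate a random signal through n_layers linear layers
    with weights drawn from N(0, weight_std^2); return final std."""
    x = rng.standard_normal(width)
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_std
        x = W @ x
    return x.std()

# Fixed small scale: each layer shrinks the signal, so it vanishes.
small = signal_std_after_layers(weight_std=0.01)

# Variance-preserving scale 1/sqrt(fan_in): the signal stays O(1).
stable = signal_std_after_layers(weight_std=1 / np.sqrt(256))
```

Running this, `small` is vanishingly close to zero after 20 layers, while `stable` remains on the order of 1 — exactly the gap that proper initialization closes.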

Common Initialization Methods

  • Random Normal: Sample from a zero-mean normal distribution with a small fixed standard deviation (e.g., 0.01). Simple, but the fixed scale ignores layer width and is often suboptimal.
  • Xavier/Glorot Initialization: Variance = 2 / (fan_in + fan_out). Works well for tanh and sigmoid activations.
  • He Initialization: Variance = 2 / fan_in. Designed for ReLU and its variants – most common for modern networks.
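The two variance formulas above translate directly into standard deviations for sampling. A minimal sketch (helper names are illustrative):

```python
import math

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    # He/Kaiming: Var(W) = 2 / fan_in
    return math.sqrt(2.0 / fan_in)

# For a 256 -> 128 layer:
print(xavier_std(256, 128))  # ~0.0722
print(he_std(256))           # ~0.0884
```

Note that He gives a slightly larger scale than Xavier for the same layer, compensating for ReLU zeroing out roughly half of its inputs.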

In Practice

Modern deep learning frameworks provide these initializers out of the box. For ReLU networks, including CNNs, use He (Kaiming) initialization.

# PyTorch
import torch.nn as nn

layer = nn.Linear(128, 64)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

# TensorFlow/Keras
import tensorflow as tf

initializer = tf.keras.initializers.HeNormal()
layer = tf.keras.layers.Dense(units=64, kernel_initializer=initializer)

Biases

Biases can usually be initialized to zero. For some layers (e.g., the final layer), a small positive bias can help if the output should start out positive.
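In PyTorch, both conventions are a one-liner each; a sketch (the 0.1 constant is an illustrative choice, not a prescribed value):

```python
import torch
import torch.nn as nn

layer = nn.Linear(128, 64)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

# Common default: zero biases.
nn.init.zeros_(layer.bias)

# For a final layer whose outputs should start positive, a small
# positive constant is sometimes used instead (value is a judgment call):
nn.init.constant_(layer.bias, 0.1)
```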


Two Minute Drill
  • Proper initialization prevents vanishing/exploding gradients.
  • Xavier for tanh/sigmoid, He (Kaiming) for ReLU.
  • Frameworks provide built‑in initializers.
  • He initialization is the default for modern CNNs.
