Optimizers
Optimizers update network weights to minimize the loss. While standard SGD works, advanced optimizers adapt the learning rate per parameter, leading to faster and more stable convergence.
SGD with Momentum
SGD with momentum adds a fraction of the previous update to the current update, smoothing the path and accelerating convergence in consistent directions.
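A single momentum step can be sketched in plain Python; the learning rate, β, and gradient values here are illustrative, not prescribed by the text:

```python
# One SGD-with-momentum update on a single scalar weight.
def momentum_step(w, v, gradient, learning_rate=0.01, beta=0.9):
    v = beta * v + learning_rate * gradient  # accumulate velocity from past updates
    w = w - v                                # apply the smoothed update
    return w, v

w, v = 1.0, 0.0
for _ in range(3):                # repeated steps in a consistent direction
    w, v = momentum_step(w, v, gradient=0.5)
# The velocity v grows each step, so progress accelerates.
```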
v = βv + learning_rate * gradient
w = w - v
β is typically 0.9.
AdaGrad
Adapts learning rate per parameter: larger updates for infrequent parameters, smaller for frequent ones. Good for sparse data but learning rate can become too small over time.
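The shrinking step size is visible in a minimal sketch (learning rate and gradient values are illustrative):

```python
import math

# One AdaGrad update: divide by the root of the running sum of squared gradients.
def adagrad_step(w, g_sq, gradient, learning_rate=0.1, eps=1e-8):
    g_sq = g_sq + gradient ** 2  # sum of squared gradients only ever grows
    w = w - learning_rate * gradient / (math.sqrt(g_sq) + eps)
    return w, g_sq

w, g_sq = 1.0, 0.0
steps = []
for _ in range(3):               # same gradient each time, yet smaller steps
    w, g_sq = adagrad_step(w, g_sq, gradient=1.0)
    steps.append(w)
```

Because `g_sq` is a sum rather than an average, the effective learning rate keeps shrinking, which is the problem RMSProp addresses next.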
RMSProp
Fixes AdaGrad’s decaying learning rate problem by using an exponentially decaying average of squared gradients. Works well for non‑stationary objectives (e.g., RNNs).
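The fix is a one-line change to the AdaGrad idea: replace the growing sum with a decaying average (decay rate ρ and the values below are illustrative):

```python
import math

# One RMSProp update: exponentially decaying average of squared gradients.
def rmsprop_step(w, s, gradient, learning_rate=0.01, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * gradient ** 2  # moving average, not a sum
    w = w - learning_rate * gradient / (math.sqrt(s) + eps)
    return w, s

w, s = 1.0, 0.0
w, s = rmsprop_step(w, s, gradient=1.0)
```

Because old squared gradients decay away, the effective step size can recover when gradients shrink, instead of decaying monotonically as in AdaGrad.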
Adam (Adaptive Moment Estimation)
Adam combines momentum and RMSProp. It maintains both an exponentially decaying average of past gradients (first moment) and past squared gradients (second moment). Adam is the default choice for most deep learning tasks.
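Before the framework one-liners below, the two moment estimates can be sketched from scratch in plain Python; the gradient value is illustrative, the hyperparameters are Adam's usual defaults:

```python
import math

# One Adam update at step t (t starts at 1 for the bias correction).
def adam_step(w, m, v, gradient, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * gradient       # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * gradient ** 2  # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, m, v, gradient=0.5, t=1)
# After bias correction, the first step has magnitude close to lr itself.
```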
# PyTorch
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# TensorFlow
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
Hyperparameters: β1=0.9, β2=0.999, ε=1e-8.
AdamW (Adam with Weight Decay)
AdamW decouples weight decay from the adaptive updates, leading to better generalization. Often used with transformers and large models.
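The decoupling is easiest to see next to the Adam update: the decay term is subtracted from the weight directly, outside the adaptive scaling. A sketch with illustrative values:

```python
import math

# One AdamW update: Adam's moment updates plus decoupled weight decay.
def adamw_step(w, m, v, gradient, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step, as in Adam
    w = w - lr * weight_decay * w                  # decay applied separately
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, m, v, gradient=0.5, t=1)
```

In PyTorch this corresponds to `torch.optim.AdamW`, which takes a `weight_decay` argument.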
Choosing an Optimizer
Start with Adam (lr=0.001). For computer vision, SGD with momentum (lr=0.01, momentum=0.9) can sometimes achieve better final accuracy. For transformers and other large models, use AdamW.
Two Minute Drill
- SGD with momentum: faster convergence.
- RMSProp: adapts learning rates, good for RNNs.
- Adam: combines momentum + RMSProp – default choice.
- AdamW: Adam with decoupled weight decay.
