Gradient Descent
Gradient descent is an optimization algorithm used to update the weights and biases of a neural network so as to minimize the loss. It moves each parameter in the direction opposite to the gradient of the loss, since the gradient points in the direction of steepest increase.
Weight update: w ← w – learning_rate * ∇L(w), where ∇L(w) is the gradient of the loss with respect to the weights.
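The update rule above can be sketched in a few lines. The loss L(w) = (w − 3)² and its gradient are assumptions chosen purely for illustration:

```python
def grad(w):
    # dL/dw for the illustrative loss L(w) = (w - 3)^2
    return 2 * (w - 3)

w = 0.0
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * grad(w)  # w ← w − learning_rate * ∇L(w)

print(round(w, 4))  # approaches the minimum at w = 3
```

Each step subtracts a fraction of the gradient, so w slides downhill toward the minimizer.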
Three Variants
- Batch Gradient Descent: Uses entire dataset to compute gradient. Accurate but slow and memory‑intensive.
- Stochastic Gradient Descent (SGD): Uses one random sample per update. Fast but noisy.
- Mini‑Batch Gradient Descent: Uses a small batch (e.g., 32 or 64). Best of both worlds – most common in deep learning.
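All three variants differ only in how many samples feed each update. A minimal NumPy sketch on an assumed least-squares problem (the data, learning rate, and epoch count are illustrative choices, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))      # 1000 samples, 5 features (assumed data)
true_w = rng.normal(size=5)
y = X @ true_w                      # noise-free targets for a clean comparison

def gradient(w, Xb, yb):
    # Gradient of mean squared error over the batch (Xb, yb)
    return Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, lr=0.1, epochs=50):
    w = np.zeros(5)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            w -= lr * gradient(w, X[b], y[b])   # one update per batch
    return w

w_batch = train(batch_size=1000)  # batch GD: one update per epoch
w_sgd   = train(batch_size=1)     # SGD: one sample per update
w_mini  = train(batch_size=32)    # mini-batch: the common default
```

The only knob that changes is `batch_size`; the update rule itself is identical in all three cases.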
Learning Rate
The learning rate controls step size. Too high: overshoot, diverge. Too low: slow convergence. Typical starting values: 0.001, 0.01, 0.1.
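The three regimes are easy to see on a toy quadratic. Here L(w) = w² (gradient 2w), and the specific learning-rate values are assumptions for illustration:

```python
def run(lr, steps=20, w0=1.0):
    # Gradient descent on L(w) = w^2, whose gradient is 2w
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(run(0.01))  # too low: after 20 steps w is still far from the minimum at 0
print(run(0.1))   # reasonable: w shrinks toward 0
print(run(1.5))   # too high: each step multiplies w by (1 - 2*lr) = -2, so it diverges
```

With lr = 1.5 every update overshoots the minimum and lands farther away on the other side, which is exactly the divergence behavior described above.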
Challenges with Standard GD
- Getting stuck in local minima or saddle points.
- Sensitive to learning rate choice.
- Same learning rate for all parameters.
Epochs and Iterations
One epoch = one pass through the entire training dataset. In mini‑batch GD, one iteration = one batch update, so iterations per epoch = ceil(num_samples / batch_size). For example, 1000 samples with batch size 32 → 31 full batches plus one partial batch of 8 samples = 32 iterations per epoch.
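The iterations-per-epoch arithmetic above is just a ceiling division:

```python
import math

# Iterations per epoch = ceil(num_samples / batch_size)
num_samples, batch_size = 1000, 32
iterations = math.ceil(num_samples / batch_size)
print(iterations)  # 32: 31 full batches plus one partial batch of 8 samples
```

Frameworks typically either process that final partial batch as-is or drop it, depending on configuration.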
Two Minute Drill
- Gradient descent minimizes loss by moving opposite to gradient.
- Batch GD uses all data; SGD uses one sample; mini‑batch uses a batch.
- Learning rate controls step size.
- Mini‑batch GD is the standard in deep learning.
Need more clarification?
Drop us an email at career@quipoinfotech.com
