Q1. Scenario: You are training a model and the loss is decreasing slowly. You try increasing the learning rate, but then the loss starts oscillating wildly. What is happening, and how would you fix it?
The increased learning rate is too large: each update overshoots the minimum, so the iterates bounce back and forth across it and can diverge. Fixes: reduce the learning rate to an intermediate value, add momentum to smooth the updates, or use learning rate decay (reduce the LR over time). Adaptive methods like Adam, which scale the step per parameter, can also help.
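A minimal sketch of the effect on a toy quadratic f(w) = ½w² (an illustrative example, not part of the question): its gradient is w, so plain gradient descent is stable only for η < 2; above that, every step overshoots and the iterates oscillate with growing amplitude.

```python
# Toy example (assumed): gradient descent on f(w) = 0.5 * w**2, whose gradient is w.
# Stability requires eta < 2 here; larger steps flip sign and grow in magnitude.
def gd(eta, w0=5.0, steps=10):
    w = w0
    trajectory = [round(w, 3)]
    for _ in range(steps):
        w = w - eta * w          # update: w <- w - eta * f'(w)
        trajectory.append(round(w, 3))
    return trajectory

print(gd(eta=2.5))  # oscillates with growing amplitude: 5, -7.5, 11.25, ...
print(gd(eta=0.5))  # shrinks steadily toward the minimum at w = 0
```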
Q2. Scenario: In gradient descent, you can use a learning rate schedule. Why would you start with a higher LR and then decrease? Give an example schedule.
Starting with a high LR allows fast initial progress; lowering it later lets the optimizer fine-tune without oscillating around the minimum. Examples: exponential decay η_t = η_0 * exp(-k·t), or step decay (e.g., halve the LR every 10 epochs). This improves both training speed and final convergence.
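A short sketch of the two schedules mentioned above; η_0 = 0.1, k = 0.05, and the halve-every-10-epochs rule are illustrative values, not prescribed ones.

```python
import math

def exponential_decay(eta0, k, t):
    # eta_t = eta_0 * exp(-k * t)
    return eta0 * math.exp(-k * t)

def step_decay(eta0, epoch, drop=0.5, epochs_per_drop=10):
    # multiply the learning rate by `drop` every `epochs_per_drop` epochs
    return eta0 * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 10, 20, 30):
    print(epoch, exponential_decay(0.1, 0.05, epoch), step_decay(0.1, epoch))
```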
Q3. Scenario: For a convex quadratic function, the optimal learning rate is known. For f(w) = ½(w-2)², what learning rate reaches the optimum in one step from any starting point?
f'(w) = w - 2. Gradient descent: w_new = w - η(w-2). We want w_new = 2 for any w, so 2 = w - η(w-2) => -(w-2) = -η(w-2) => η = 1. So η = 1 reaches the exact solution in one step. More generally, for f(w) = ½λ(w-w*)² one-step convergence needs η = 1/λ; for a multidimensional quadratic a single learning rate reaches the optimum in one step only when all Hessian eigenvalues are equal, and stability requires η < 2/λ_max (λ_max = largest Hessian eigenvalue).
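A quick numerical check of the η = 1 result (a sketch; the starting points are arbitrary):

```python
def one_step(w, eta=1.0):
    # f(w) = 0.5 * (w - 2)**2, so f'(w) = w - 2
    return w - eta * (w - 2)

for w0 in (-10.0, 0.0, 7.3):
    print(w0, "->", one_step(w0))   # every starting point maps to 2.0 in one step
```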
Q4. Scenario: In non-convex optimization, you observe that training loss plateaus for many iterations, then suddenly drops. What might have happened?
The optimizer was likely stuck near a saddle point or in a flat region where the gradient is close to zero; momentum or stochastic gradient noise eventually pushed it out, after which progress resumed. This is common in deep learning; momentum and minibatch noise help the escape, and learning rate warm-up, residual connections, and batch normalization make such plateaus shorter and shallower.
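A toy sketch of the saddle-point story, assuming the illustrative function f(x, y) = x² − y², which has a saddle at the origin (it is unbounded below, so "escaping" here just means moving along the negative-curvature y-direction). Starting exactly on the flat direction, plain gradient descent stays pinned at the saddle, while a little injected noise, standing in for minibatch noise, eventually kicks it off the plateau:

```python
import random

def grad(x, y):
    # f(x, y) = x**2 - y**2 has a saddle point at the origin
    return 2 * x, -2 * y

def run(noise, steps=100, eta=0.1, seed=0):
    random.seed(seed)
    x, y = 1.0, 0.0               # y = 0 sits exactly on the saddle's flat direction
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= eta * gx + noise * random.gauss(0, 1)
        y -= eta * gy + noise * random.gauss(0, 1)
    return x, y

print(run(noise=0.0))    # ends near (0, 0): stuck at the saddle, gradient ~ 0
print(run(noise=0.001))  # noise kicks y off zero; |y| then grows and f(x, y) drops
```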
Q5. Scenario: You compare training loss curves with two different learning rates: 0.01 and 0.1. The 0.1 curve shows loss decreasing quickly but then bouncing. The 0.01 curve steadily decreases but takes longer. Which might generalize better? Why?
The 0.01 curve is more likely to generalize better here: the bouncing 0.1 curve indicates that the optimizer never settles into a minimum, so its final parameters are noisy and its training loss stays higher. The flat-minima literature suggests that flat minima generalize better than sharp ones, and a learning rate small enough to actually converge (often reached by decaying a larger initial LR) is what lets the optimizer settle into such a region. However, a learning rate that is too small may under-train the model within a fixed budget.
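A toy reproduction of the two curves, assuming SGD on f(w) = ½w² with additive Gaussian gradient noise standing in for minibatch noise (all values illustrative): the η = 0.1 run drops fast but keeps bouncing around a noise floor roughly proportional to η, while the η = 0.01 run descends more slowly to a lower, steadier level.

```python
import random

def sgd_losses(eta, steps=1000, lam=1.0, sigma=1.0, seed=0):
    # noisy gradient of f(w) = 0.5 * lam * w**2; the Gaussian term mimics minibatch noise
    random.seed(seed)
    w, losses = 10.0, []
    for _ in range(steps):
        g = lam * w + sigma * random.gauss(0, 1)
        w -= eta * g
        losses.append(0.5 * lam * w * w)
    return losses

fast = sgd_losses(0.1)    # drops quickly, then bounces around a higher noise floor
slow = sgd_losses(0.01)   # descends slowly but settles at a lower, steadier loss
print(sum(fast[-100:]) / 100, sum(slow[-100:]) / 100)
```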
