Q1. Scenario: You have a convex cost function J(w) = (w-3)². Start at w=10. Perform one step of gradient descent with learning rate η=0.1. Compute new w.
J'(w) = 2(w-3). At w=10, J'(10) = 2(10-3) = 14. Update: w_new = w - η·J'(w) = 10 - 0.1·14 = 10 - 1.4 = 8.6. The next step would move closer to the optimum at w = 3; this illustrates iterative minimization.
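A minimal sketch of this single update in Python (the function names J and dJ are just illustrative):

```python
def J(w):
    return (w - 3) ** 2      # cost function J(w) = (w - 3)^2

def dJ(w):
    return 2 * (w - 3)       # derivative J'(w)

eta = 0.1                    # learning rate
w = 10.0                     # starting point

w = w - eta * dJ(w)          # one gradient-descent step
print(w)                     # 8.6, moving toward the optimum w = 3
```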
Q2. Scenario: For a cost function with many local minima, why does gradient descent often converge to a local minimum? How to escape saddle points?
Gradient descent follows the negative gradient, which points downhill, so it stops wherever the gradient becomes zero, including at local minima. Saddle points also have zero gradient but are not minima; they have directions of negative curvature. Escaping them requires curvature information (Newton-type second-order methods), momentum, or stochastic noise (e.g., SGD with learning-rate annealing).
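As a rough illustration, here is a sketch of the momentum variant on the toy saddle surface J(x, y) = x² - y² (the surface, starting point, and constants are assumptions chosen for the example):

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])   # gradient of J(x, y) = x^2 - y^2

eta, beta = 0.05, 0.9                  # learning rate and momentum coefficient
p = np.array([1e-3, 1e-3])             # start just off the saddle point at the origin
v = np.zeros(2)                        # accumulated velocity

for _ in range(100):
    v = beta * v - eta * grad(p)       # momentum accumulates past gradients
    p = p + v

print(p)                               # x shrinks toward 0, y grows: the iterate has left the saddle
```

The accumulated velocity keeps the iterate moving through the nearly flat region around the saddle, where the raw gradient alone would take very small steps.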
Q3. Scenario: In a neural network training, you use mini-batch gradient descent. What are advantages over batch (full) gradient descent?
Mini-batch gradient descent computes the gradient on a subset of the data (e.g., 32 samples). It reduces gradient variance compared with single-sample SGD while remaining computationally cheaper per update than full batch. It allows more frequent parameter updates, can escape shallow local minima thanks to gradient noise, and exploits vectorized hardware. In practice, convergence is often faster.
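A sketch of a mini-batch loop for least-squares linear regression (the synthetic dataset, batch size of 32, and learning rate are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # 1000 samples, 5 features
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)      # noisy targets

w = np.zeros(5)
eta, batch_size = 0.05, 32

for epoch in range(20):
    perm = rng.permutation(len(X))                # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        g = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # MSE gradient on the mini-batch only
        w -= eta * g                              # one cheap, frequent update
print(w)                                          # close to true_w after a few epochs
```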
Q4. Scenario: If the learning rate η is too small, what happens? If η is too large, what happens?
If η is too small, convergence is very slow and the optimizer may stall on plateaus or in shallow local minima. If η is too large, updates overshoot the minimum, causing the cost to oscillate around the optimum or diverge (increase). A good learning rate produces a steady decrease in cost. Adaptive methods (Adam, RMSprop) adjust per-parameter learning rates automatically.
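The effect is easy to see on the quadratic from Q1, where the update w ← w - η·2(w-3) shrinks the error by a factor |1 - 2η| per step and diverges once η > 1 (a small sketch; the specific η values are illustrative):

```python
def run(eta, steps=20, w=10.0):
    # repeated gradient-descent steps on J(w) = (w - 3)^2
    for _ in range(steps):
        w -= eta * 2 * (w - 3)
    return w

print(run(0.001))   # too small: still near 9.7 after 20 steps
print(run(0.1))     # reasonable: about 3.08, close to the optimum
print(run(1.05))    # too large: roughly 50, the iterate has diverged
```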
Q5. Scenario: You have a cost surface J(w1,w2) = w1² + 10*w2². At point (10,1), compute the gradient. Show how steepest descent direction differs from the direction that points directly toward the minimum.
∇J = (2w1, 20w2); at (10,1) this is (20, 20), so the steepest-descent direction is (-20, -20), i.e., proportional to (-1, -1). The minimum is at (0, 0), so the direction pointing straight at it from (10, 1) is (-10, -1). The two differ because the surface is much steeper along w2 (coefficient 10), which inflates that component of the gradient. This mismatch causes zigzagging; preconditioning with curvature information (e.g., Newton's method, which rescales by the inverse Hessian) would point updates toward the minimum.
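A short sketch of the computation and of the resulting zigzag (the learning rate 0.09 is an assumption, chosen just below the stability limit of 0.1 for the steep w2 axis):

```python
import numpy as np

def grad(w):
    return np.array([2 * w[0], 20 * w[1]])   # gradient of J = w1^2 + 10*w2^2

w = np.array([10.0, 1.0])
print(grad(w))                    # [20. 20.]: steepest descent points along (-1, -1)
print(-w / np.linalg.norm(w))     # unit direction straight toward the minimum: ~(-0.995, -0.100)

eta = 0.09
for i in range(5):
    w = w - eta * grad(w)
    print(i, w)                   # w2 flips sign every step (zigzag), w1 shrinks slowly
```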
