
math-for-ai / The Chain Rule
interview

Q1. Scenario: In neural networks, the final output y depends on hidden layer h, which depends on weights w. The loss L = (y - t)². How do you get ∂L/∂w using the chain rule?
∂L/∂w = (∂L/∂y)·(∂y/∂h)·(∂h/∂w). Each factor is a simpler, local derivative: ∂L/∂y = 2(y - t), ∂y/∂h is the derivative of the activation, and ∂h/∂w is the input to the layer. This multiplicative chaining is the essence of backpropagation, allowing the gradient to flow through multiple layers.
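
A minimal numeric sketch of this chain, assuming a single scalar weight with h = w·x and y = tanh(h) (the input x, target t, and the tanh activation are illustrative assumptions), checked against a finite difference:

```python
import numpy as np

x, t = 1.5, 0.2          # input and target (arbitrary example values)
w = 0.8                  # weight

def forward(w):
    h = w * x            # hidden value: depends on w
    y = np.tanh(h)       # output: depends on h
    L = (y - t) ** 2     # squared-error loss
    return h, y, L

h, y, L = forward(w)

# Chain rule: dL/dw = dL/dy * dy/dh * dh/dw
dL_dy = 2 * (y - t)          # derivative of the loss w.r.t. the output
dy_dh = 1 - np.tanh(h) ** 2  # derivative of the tanh activation
dh_dw = x                    # derivative of h = w*x w.r.t. w is the input
grad = dL_dy * dy_dh * dh_dw

# Numerical check with a central finite difference
eps = 1e-6
num = (forward(w + eps)[2] - forward(w - eps)[2]) / (2 * eps)
print(grad, num)             # the two values should agree closely
```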

Q2. Scenario: A compound function f(x) = sin(3x²+2). Using the chain rule, find f'(x) and evaluate it at x=1.
Let u = 3x²+2, so f = sin(u). Then du/dx = 6x and df/du = cos(u), giving f'(x) = 6x·cos(3x²+2). At x=1: f'(1) = 6·cos(5) ≈ 6 · 0.2837 ≈ 1.702. The same rule handles nested functions in machine learning, where a model is a composition of many layers.
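
A quick sketch that reproduces this value and checks it against a finite difference:

```python
import math

def f(x):
    return math.sin(3 * x**2 + 2)

def f_prime(x):
    # chain rule: outer derivative cos(u) times inner derivative du/dx = 6x
    return math.cos(3 * x**2 + 2) * 6 * x

x = 1.0
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)   # finite-difference estimate
print(f_prime(x), numeric)  # both approximately 6*cos(5) ≈ 1.702
```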

Q3. Scenario: In backpropagation, you have a 3-layer network: input -> hidden1 -> hidden2 -> output. Derive ∂L/∂W1 using the chain rule.
Name the intermediate quantities explicitly: z1 = W1·x, h1 = σ(z1), z2 = W2·h1, h2 = σ(z2), ŷ = W3·h2. Then ∂L/∂W1 = (∂L/∂ŷ)·(∂ŷ/∂h2)·(∂h2/∂z2)·(∂z2/∂h1)·(∂h1/∂z1)·(∂z1/∂W1). This product of Jacobians (or of element-wise terms when the output is scalar) propagates the error from the output back to the first layer; the chain rule makes gradient computation tractable even for very deep networks.
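
A sketch of this layer-by-layer chain with concrete (assumed) shapes: x in R³, two tanh hidden layers of width 4, a scalar linear output, and squared-error loss, with a finite-difference check on one entry of ∂L/∂W1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
t = 0.5
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(1, 4))

def forward(W1):
    z1 = W1 @ x                        # input -> hidden1 pre-activation
    h1 = np.tanh(z1)
    z2 = W2 @ h1                       # hidden1 -> hidden2 pre-activation
    h2 = np.tanh(z2)
    yhat = (W3 @ h2)[0]                # hidden2 -> scalar output
    L = (yhat - t) ** 2
    return z1, h1, z2, h2, yhat, L

z1, h1, z2, h2, yhat, L = forward(W1)

# Chain rule, applied layer by layer (backpropagation):
dL_dy  = 2 * (yhat - t)                # dL/d(output)
dL_dh2 = W3[0] * dL_dy                 # through the output layer
dL_dz2 = dL_dh2 * (1 - h2 ** 2)        # through the tanh of layer 2
dL_dh1 = W2.T @ dL_dz2                 # through W2
dL_dz1 = dL_dh1 * (1 - h1 ** 2)        # through the tanh of layer 1
dL_dW1 = np.outer(dL_dz1, x)           # through z1 = W1 @ x

# Finite-difference check of one entry of dL/dW1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
num = (forward(W1p)[-1] - forward(W1m)[-1]) / (2 * eps)
print(dL_dW1[0, 0], num)               # the two values should match closely
```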

Q4. Scenario: In time series prediction (RNN), the loss depends on the hidden state at time t, which depends on the previous state. How does the chain rule account for this recurrence?
Backpropagation Through Time (BPTT) unrolls the RNN over its time steps and applies the chain rule across them: ∂L/∂W = sum over k of (∂L/∂h_t)·(∂h_t/∂h_{t-1})·...·(∂h_{k+1}/∂h_k)·(∂h_k/∂W), summed over every step k at which W is used. The chain rule accumulates these contributions, but the long products of ∂h_j/∂h_{j-1} factors are exactly what causes vanishing/exploding gradients (mitigated by LSTM/GRU gating).
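
A minimal BPTT sketch, assuming a scalar RNN h_t = tanh(w_h·h_{t-1} + w_x·x_t) with a squared-error loss on the final hidden state (the input sequence, weights, and target below are made-up values):

```python
import numpy as np

xs = np.array([0.5, -0.1, 0.3, 0.8])   # a short input sequence (assumed)
target = 0.2
w_h, w_x = 0.9, 0.4

def forward(w_h):
    hs = [0.0]                          # h_0 = 0
    for x in xs:
        hs.append(np.tanh(w_h * hs[-1] + w_x * x))
    L = (hs[-1] - target) ** 2          # loss on the final hidden state
    return hs, L

hs, L = forward(w_h)
T = len(xs)

# BPTT: sum over every timestep k at which w_h is applied.
dL_dhT = 2 * (hs[-1] - target)
grad = 0.0
for k in range(1, T + 1):
    # product of dh_j/dh_{j-1} for j = k+1 .. T (chain rule across time)
    prod = 1.0
    for j in range(k + 1, T + 1):
        prod *= (1 - hs[j] ** 2) * w_h
    # local derivative of h_k w.r.t. w_h (holding h_{k-1} fixed)
    local = (1 - hs[k] ** 2) * hs[k - 1]
    grad += dL_dhT * prod * local

# Finite-difference check
eps = 1e-6
num = (forward(w_h + eps)[1] - forward(w_h - eps)[1]) / (2 * eps)
print(grad, num)   # long products of per-step factors are what vanish or explode
```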

Q5. Scenario: You have a cost function J = f(g(h(x))). At x=2, f'(output)=3, g'(intermediate)=0.5, h'(2)=4. Find dJ/dx using the chain rule and interpret the result.
dJ/dx = f'(g(h(x)))·g'(h(x))·h'(x) = 3 · 0.5 · 4 = 6. A small change in x therefore changes J by roughly 6 times that amount. When such products of local derivatives are much smaller than 1 the gradient vanishes; when they are much larger than 1 it explodes. Repeated over many layers, this is the vanishing/exploding gradient problem in deep networks.
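
The arithmetic, plus a quick illustration of how repeatedly multiplying a per-layer factor vanishes or explodes with depth (the factors 0.5 and 1.5 and the depth of 50 are arbitrary choices):

```python
# Product of the three local derivatives given in the scenario
f_prime, g_prime, h_prime = 3.0, 0.5, 4.0
dJ_dx = f_prime * g_prime * h_prime
print(dJ_dx)   # 6.0: a small change in x scales into J by roughly this factor

# Repeated multiplication of a per-layer factor shows the two failure modes
for factor in (0.5, 1.5):
    grad = 1.0
    for _ in range(50):          # 50 "layers"
        grad *= factor
    print(factor, grad)          # 0.5 -> ~8.9e-16 (vanishes), 1.5 -> ~6.4e8 (explodes)
```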