Cost Functions in AI: Interview Questions

Q1. Scenario: You are building a linear regression model to predict house prices. The model predictions are ŷ = wx + b. Which cost function would you use to measure the average squared error? Why squared?
Mean Squared Error (MSE): J = (1/n) Σ (y_i - ŷ_i)². Squaring penalizes large errors more heavily, is differentiable everywhere (unlike absolute error, which is not differentiable at zero), and for a linear model leads to a convex optimization problem with a closed-form solution (the normal equations).
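
A minimal NumPy sketch of MSE and its gradients for ŷ = wx + b; the variable names, learning rate, and toy data are illustrative assumptions, not from the original:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: J = (1/n) * sum((y_i - yhat_i)^2)."""
    return np.mean((y_true - y_pred) ** 2)

def mse_gradients(x, y_true, w, b):
    """Gradients of MSE w.r.t. w and b for yhat = w*x + b."""
    y_pred = w * x + b
    error = y_pred - y_true
    n = len(x)
    dw = (2.0 / n) * np.sum(error * x)
    db = (2.0 / n) * np.sum(error)
    return dw, db

# One gradient step on toy data (illustrative numbers, not house prices)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, b, lr = 0.0, 0.0, 0.05
dw, db = mse_gradients(x, y, w, b)
w, b = w - lr * dw, b - lr * db
print(mse_loss(y, w * x + b))  # loss decreases after the step
```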

Q2. Scenario: In logistic regression for binary classification, why do we use cross-entropy loss instead of MSE?
Cross-entropy loss (log loss): J = -[y log(ŷ) + (1-y) log(1-ŷ)]. It is convex in the parameters of a logistic model, whereas MSE combined with the sigmoid is non-convex. Cross-entropy also has a probabilistic interpretation as the negative log-likelihood of a Bernoulli distribution, and its gradient with respect to the logit reduces to (ŷ - y), avoiding the vanishing σ'(z) factor that MSE would introduce, so it gives stronger learning signals for classification.
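
A small sketch of binary cross-entropy and the (ŷ - y) gradient it induces with respect to the logit; the clipping constant and toy values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """J = -[y*log(yhat) + (1-y)*log(1-yhat)], averaged over samples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# For logistic regression, the gradient of the loss w.r.t. the logit z
# simplifies to (yhat - y), which keeps gradients large for confident mistakes.
z = np.array([2.0, -1.0, 0.5])   # logits (illustrative)
y = np.array([1.0, 0.0, 1.0])    # labels
y_prob = sigmoid(z)
print(binary_cross_entropy(y, y_prob))
print(y_prob - y)                # per-sample gradient w.r.t. z
```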

Q3. Scenario: A cost function J(w) = w⁴ + 2w² + 1. Compute the gradient and Hessian. Is it convex? How many minima?
J'(w) = 4w³ + 4w = 4w(w² + 1); J''(w) = 12w² + 4. Since J''(w) > 0 for all w (its minimum value is 4), the function is strictly convex. The derivative is zero only at w = 0, so there is a single global minimum, J(0) = 1. For convex cost functions, gradient descent (with a suitable step size) is guaranteed to reach the global optimum.
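
A short numerical check of this analysis, running gradient descent on J(w) = w⁴ + 2w² + 1; the starting point, learning rate, and iteration count are illustrative assumptions:

```python
def J(w):
    return w**4 + 2 * w**2 + 1

def grad(w):
    return 4 * w**3 + 4 * w      # J'(w) = 4w(w^2 + 1)

def hess(w):
    return 12 * w**2 + 4         # J''(w) >= 4 > 0, so J is strictly convex

# Gradient descent from an arbitrary start converges to the unique minimum w = 0.
w, lr = 2.0, 0.01                # small step size chosen for stability (illustrative)
for _ in range(500):
    w -= lr * grad(w)
print(w, J(w))                   # w is approximately 0, J is approximately 1
```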

Q4. Scenario: For classification with imbalanced classes (e.g., 99% negative, 1% positive), why might accuracy be a misleading cost? What alternative cost functions exist?
A model that predicts every example as negative reaches 99% accuracy while detecting no positives, so accuracy is uninformative here. Better evaluation metrics include precision, recall, and F1-score; for training, a class-weighted cross-entropy can be used instead. In cost-sensitive learning, a higher cost is assigned to false negatives on the rare positive class, which effectively shifts the decision threshold toward predicting positives (a weighted example is sketched below).
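
A minimal sketch of class-weighted binary cross-entropy under the imbalanced setting described above; the pos_weight value and toy batch are illustrative assumptions:

```python
import numpy as np

def weighted_bce(y_true, y_prob, pos_weight=10.0, eps=1e-12):
    """Class-weighted binary cross-entropy.

    pos_weight scales the loss on positive examples so that false negatives
    on the rare class cost more. The value 10.0 is an illustrative assumption;
    it is often set near n_negative / n_positive.
    """
    y_prob = np.clip(y_prob, eps, 1 - eps)
    loss = -(pos_weight * y_true * np.log(y_prob)
             + (1 - y_true) * np.log(1 - y_prob))
    return np.mean(loss)

# Imbalanced toy batch: one positive among nine negatives (illustrative)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=float)
y_prob = np.full(10, 0.05)                            # model leaning "negative" everywhere
print(weighted_bce(y_true, y_prob))                   # the missed positive dominates the loss
print(weighted_bce(y_true, y_prob, pos_weight=1.0))   # unweighted, for comparison
```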

Q5. Scenario: In a neural network, you add an L2 regularization term λ Σ w² to the cost function. How does this affect the gradient and optimization?
With J(w) = J_original(w) + λ Σ w², the gradient becomes ∇J_original + 2λw. The extra term acts as weight decay: each gradient-descent step multiplies the weights by a factor of (1 - 2λη) before applying the data gradient, pulling them toward zero and discouraging overfitting. If the original cost is convex, adding the quadratic penalty makes it strongly convex, which improves the conditioning of the optimization and typically the generalization of the model.
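
A small sketch of one L2-regularized gradient step, verifying that it is equivalent to weight decay followed by the ordinary gradient step; the λ, η, and toy gradient values are illustrative assumptions:

```python
import numpy as np

def l2_regularized_step(w, grad_original, lam=0.01, lr=0.1):
    """One gradient-descent step on J(w) = J_original(w) + lam * ||w||^2.

    The update w - lr * (grad_original + 2*lam*w) can be rewritten as
    w * (1 - 2*lam*lr) - lr * grad_original, i.e. weight decay followed by
    the ordinary gradient step. lam and lr here are illustrative.
    """
    return w - lr * (grad_original + 2.0 * lam * w)

w = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, -0.1, 0.0])           # stand-in for the gradient of the data loss
updated = l2_regularized_step(w, g)
decay_form = w * (1 - 2 * 0.01 * 0.1) - 0.1 * g
print(np.allclose(updated, decay_form))  # True: the two forms are equivalent
```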