Q1. You are building a linear regression model to predict house prices. The model predictions are ŷ = wx + b. Which cost function would you use to measure the average squared error? Why squared?
Mean Squared Error (MSE): J = (1/n) Σ_i (y_i - ŷ_i)^2.
Squaring penalizes large errors more heavily, is differentiable everywhere (unlike the absolute error, which has a kink at zero), and for linear regression yields a convex optimization problem with a closed-form solution (the normal equations).
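As a minimal sketch of the MSE and the closed-form fit for the one-feature model ŷ = wx + b: the function names `mse` and `fit_line` and the four data points are made up for illustration, and the fit uses the standard single-variable least-squares formulas rather than a general normal-equations solver.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w*x + b for a single feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    b = y_mean - w * x_mean
    return w, b


# Hypothetical toy data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
preds = [w * x + b for x in xs]
error = mse(ys, preds)
print(w, b, error)
```

Because the toy data is noise-free, the fitted line reproduces it and the MSE is zero; with real house-price data the same formulas give the line that minimizes J.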
Q2. In logistic regression for binary classification, why do we use cross-entropy loss instead of MSE?
Cross-entropy loss (log loss): J = -[y log(ŷ) + (1-y) log(1-ŷ)].
For logistic regression, cross-entropy is convex in the parameters, whereas MSE composed with the sigmoid is non-convex.
Cross-entropy also has a probabilistic interpretation as the negative log-likelihood of a Bernoulli distribution, and its gradient with respect to the logit is simply ŷ - y, so it does not vanish when the sigmoid saturates on a wrong prediction.
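The gradient argument can be checked numerically. The sketch below compares the derivative of each loss with respect to the logit z (where ŷ = sigmoid(z)) for a confidently wrong prediction; the function names and the choice z = -10, y = 1 are illustrative assumptions, and the MSE here is the single-example squared error (ŷ - y)^2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ce_grad_wrt_logit(z, y):
    """d/dz of -[y log(p) + (1-y) log(1-p)] with p = sigmoid(z)."""
    return sigmoid(z) - y


def mse_grad_wrt_logit(z, y):
    """d/dz of (p - y)^2 with p = sigmoid(z); note the extra p(1-p) factor."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)


# A confidently wrong prediction: true label 1, logit strongly negative.
z, y = -10.0, 1.0
ce_grad = ce_grad_wrt_logit(z, y)
mse_grad = mse_grad_wrt_logit(z, y)
print(ce_grad, mse_grad)
```

The cross-entropy gradient stays close to -1, while the MSE gradient is scaled by ŷ(1 - ŷ) and nearly vanishes, so a badly wrong example would produce almost no learning signal under MSE.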
Q3. A cost function J(w) = w^4 + 2w^2 + 1. Compute the gradient and Hessian. Is it convex? How many minima?
J'(w) = 4w^3 + 4w = 4w(w^2 + 1); J''(w) = 12w^2 + 4.
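Since J''(w) ≥ 4 > 0 for all w (the minimum value 4 is attained at w = 0), the function is strictly convex.
The derivative is zero only at w = 0, so there is exactly one minimum: the global minimum J(0) = 1.
For a convex cost function every local minimum is global, so gradient descent with a suitable step size converges to the global optimum.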
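A short sketch of gradient descent on this cost, using the derivative worked out above. The starting point 1.5, learning rate 0.05, and 200 steps are arbitrary illustrative choices, not values from the question.

```python
def J(w):
    return w ** 4 + 2 * w ** 2 + 1


def dJ(w):
    return 4 * w ** 3 + 4 * w


def gradient_descent(w0, lr, steps):
    """Plain gradient descent on the scalar cost J."""
    w = w0
    for _ in range(steps):
        w = w - lr * dJ(w)
    return w


# Assumed settings: any start works in principle, given a small enough step.
w_final = gradient_descent(w0=1.5, lr=0.05, steps=200)
print(w_final, J(w_final))
```

With these settings the iterate contracts toward w = 0 and the cost approaches J(0) = 1. A learning rate that is too large for the chosen starting point can still diverge, since the w^4 term makes the gradient grow quickly.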
Q4. For classification with imbalanced classes (e.g., 99% negative, 1% positive), why might accuracy be a misleading cost? What alternative cost functions exist?
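A model that predicts negative for every example reaches 99% accuracy yet never detects a positive, so it is useless.
Alternatives: evaluate with precision, recall, or F1-score, and train with class-weighted cross-entropy.
In cost-sensitive learning, assign a higher cost to false negatives on the rare positive class.
This effectively shifts the decision threshold toward predicting the positive class more often.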
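The sketch below reproduces the 99%/1% scenario on a synthetic set of 1000 labels. The helper names, the convention of returning 0 when a ratio is undefined, the constant predicted probability 0.01, and the positive-class weight of 99 are all assumptions made for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def safe_div(a, b):
    # Convention assumed here: an undefined ratio (0/0) is reported as 0.
    return a / b if b else 0.0


def weighted_cross_entropy(y_true, probs, w_pos=1.0, w_neg=1.0):
    """Average cross-entropy with separate weights for each class."""
    total = 0.0
    for y, p in zip(y_true, probs):
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Synthetic labels: 10 positives, 990 negatives; classifier says "negative" always.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)

# A model that outputs probability 0.01 for every example.
probs = [0.01] * 1000
plain_loss = weighted_cross_entropy(y_true, probs)
weighted_loss = weighted_cross_entropy(y_true, probs, w_pos=99.0)
print(accuracy, recall, f1, plain_loss, weighted_loss)
```

Accuracy looks excellent while recall and F1 are zero, and weighting the positive class makes the same predictions far more expensive, which is the pressure that pushes a trained model to stop ignoring the rare class.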
Q5. In a neural network, you add an L2 regularization term λ Σ w^2 to the cost function. How does this affect the gradient and optimization?
The new gradient is ∇J_original + 2λw.
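This adds a weight decay term that pushes weights towards zero, which helps reduce overfitting.
With learning rate η, each gradient descent step shrinks the weights by a factor (1 - 2λ·η) before the original gradient update is applied: w ← (1 - 2λ·η) w - η ∇J_original.
If the original cost is convex, the regularized cost becomes strongly convex, which improves convergence; neural network costs are generally non-convex, so there the guarantee does not apply directly, but the penalty still tends to help generalization.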
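To see that the regularized gradient step and the "shrink, then step" form are the same update, here is a minimal sketch for plain gradient descent. The function names, the weight and gradient values, and the settings lr = 0.1 and lam = 0.01 are hypothetical.

```python
def step_with_l2_gradient(w, grad_orig, lr, lam):
    """One gradient step on J_original + lam * sum(w^2)."""
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad_orig)]


def step_with_weight_decay(w, grad_orig, lr, lam):
    """Same step written as: shrink weights, then apply the original gradient."""
    shrink = 1.0 - 2.0 * lam * lr
    return [shrink * wi - lr * gi for wi, gi in zip(w, grad_orig)]


# Hypothetical weights and original-loss gradients.
w = [0.5, -1.2, 3.0]
grad_orig = [0.1, -0.4, 0.25]
lr, lam = 0.1, 0.01

a = step_with_l2_gradient(w, grad_orig, lr, lam)
b = step_with_weight_decay(w, grad_orig, lr, lam)
print(a)
print(b)
```

The two results agree up to floating-point rounding. This equivalence holds for plain gradient descent; optimizers that rescale gradients adaptively treat an L2 penalty and explicit weight decay differently.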
