
Bayes' Theorem (math-for-ai): Interview Questions

Q1. Scenario: A spam detection system uses Bayes' theorem. The prior probability of spam is 0.2, and the likelihood of the word "win" is 0.7 given spam and 0.1 given ham. What is the posterior probability that an email containing "win" is spam?
P(spam|win) = P(win|spam)*P(spam) / [P(win|spam)*P(spam) + P(win|ham)*P(ham)] = (0.7*0.2) / (0.7*0.2 + 0.1*0.8) = 0.14 / (0.14 + 0.08) = 0.14/0.22 ≈ 0.636. Naive Bayes classifiers apply this with multiple features under a conditional-independence assumption.
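A minimal Python sketch of this calculation (the function name posterior_spam and its arguments are illustrative; the numbers are the ones from the question):

def posterior_spam(p_spam, p_win_given_spam, p_win_given_ham):
    # Posterior P(spam | "win") via Bayes' theorem.
    p_ham = 1 - p_spam
    evidence = p_win_given_spam * p_spam + p_win_given_ham * p_ham  # P("win")
    return p_win_given_spam * p_spam / evidence

print(posterior_spam(0.2, 0.7, 0.1))  # ~0.636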

Q2. Scenario: A diagnostic test has 95% sensitivity and 90% specificity, and the disease prevalence is 5%. A patient tests positive. Use Bayes' theorem to calculate P(disease | positive). Show the steps.
Here P(+|disease) = sensitivity = 0.95, P(+|no disease) = 1 - specificity = 0.10, and P(no disease) = 1 - prevalence = 0.95. P(disease|+) = (0.95*0.05) / (0.95*0.05 + 0.10*0.95) = 0.0475 / (0.0475 + 0.095) = 0.0475/0.1425 ≈ 0.333, so the patient has only about a 33.3% chance of having the disease. Even with a highly accurate test, low prevalence makes a positive result less reliable; this base-rate effect is crucial in medical testing and fraud detection.
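The same arithmetic as a short Python sketch (names are illustrative); rerunning it with a different prevalence shows how strongly the base rate drives the posterior:

def posterior_disease(prevalence, sensitivity, specificity):
    # Posterior P(disease | positive test) via Bayes' theorem.
    p_pos_given_healthy = 1 - specificity  # false positive rate
    p_healthy = 1 - prevalence
    evidence = sensitivity * prevalence + p_pos_given_healthy * p_healthy  # P(+)
    return sensitivity * prevalence / evidence

print(posterior_disease(0.05, 0.95, 0.90))  # ~0.333
print(posterior_disease(0.50, 0.95, 0.90))  # ~0.905: same test, higher prevalence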

Q3. Scenario: In machine learning, Bayes' theorem is the basis of Bayesian inference. What is the difference between the prior and the posterior? Give a simple example.
Prior: the belief before seeing any data (e.g., assume a coin is fair, P(heads) = 0.5). Likelihood: the probability of the observed data given a hypothesis. Posterior: the updated belief after seeing the data. After observing 10 heads in 10 flips, the posterior probability that the coin is fair drops sharply. This updating rule is the foundation of Bayesian machine learning (Bayesian linear regression, Gaussian processes).
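A small sketch of this update, assuming a two-hypothesis model: the coin is either fair (p = 0.5) or heads-biased (p = 0.9, an invented alternative), with a 50/50 prior between them:

p_fair_prior = 0.5
p_heads_fair, p_heads_biased = 0.5, 0.9  # biased rate is an assumed alternative

likelihood_fair = p_heads_fair ** 10      # P(10 heads in 10 flips | fair)
likelihood_biased = p_heads_biased ** 10  # P(10 heads in 10 flips | biased)

evidence = likelihood_fair * p_fair_prior + likelihood_biased * (1 - p_fair_prior)
posterior_fair = likelihood_fair * p_fair_prior / evidence
print(posterior_fair)  # ~0.0028: fairness is now very unlikely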

Q4. Scenario: A company has two factories producing widgets. Factory A makes 60% of the widgets with a 2% defect rate; Factory B makes 40% with a 5% defect rate. If a widget is defective, what is the probability it came from Factory A?
P(A|defect) = (0.6*0.02) / (0.6*0.02 + 0.4*0.05) = 0.012 / (0.012 + 0.020) = 0.012/0.032 = 0.375. So a defective widget has a 37.5% chance of coming from A and 62.5% from B. This is a classic Bayes problem that appears in quality control and anomaly detection.
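The calculation generalizes to any number of sources. A minimal sketch (posteriors is an illustrative name) where the priors are the production shares and the likelihoods are the defect rates:

def posteriors(priors, likelihoods):
    # Normalized posteriors P(source | defect) for each source.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # total P(defect)
    return [j / evidence for j in joint]

print(posteriors([0.6, 0.4], [0.02, 0.05]))  # [0.375, 0.625]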

Q5. Scenario: Explain how the naive Bayes classifier "naively" assumes independence. Why does it often work despite this assumption?
It assumes the features are conditionally independent given the class: P(x1, x2, ..., xn | y) = Π_i P(xi | y). This is rarely true in practice, but the classifier often performs well because classification only requires ranking the classes correctly, not estimating their probabilities accurately; feature dependencies can also partially cancel out. It is simple, fast, and effective for text classification.
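A toy word-count naive Bayes illustrating the factorized likelihood (the four-document training corpus is invented, and add-one Laplace smoothing is one common way to handle unseen words):

from collections import Counter
import math

train = [("win cash prize now", "spam"),
         ("win a free prize", "spam"),
         ("meeting agenda for monday", "ham"),
         ("lunch at noon", "ham")]

class_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label):
    # log P(label) + sum_i log P(word_i | label), with add-one smoothing.
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for w in text.split():
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

def classify(text):
    return max(class_counts, key=lambda c: log_posterior(text, c))

print(classify("win a free prize now"))  # spam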
"