Statistical Learning Quiz: Questions and Answers
1. Multiple Choice Questions (3 points each, only one correct answer)
(1) What is the primary goal of statistical learning?
✔ Answer: A. Estimating the relationship between input (X) and output (Y)
👉 Explanation: The goal of statistical learning is to understand how a set of predictor variables (\(X\)) relates to an outcome variable (\(Y\)). This helps in making predictions and drawing inferences about the data.
(2) Why do we estimate \( f \) in statistical learning?
✔ Answer: B. To improve both prediction and inference
👉 Explanation: Estimating \( f \) allows us to either make accurate predictions (prediction) or understand the relationship between predictors and the response variable (inference).
(3) Which of the following is NOT a type of statistical learning method?
✔ Answer: C. Quantum Computing
👉 Explanation: Statistical learning includes parametric methods (e.g., linear regression), nonparametric methods (e.g., KNN), supervised learning (e.g., classification and regression), and unsupervised learning (e.g., clustering). Quantum computing is not a statistical learning method.
(4) In supervised learning, what is the main characteristic of the dataset?
✔ Answer: C. Both predictor (\( X \)) and response (\( Y \)) variables are observed
👉 Explanation: Supervised learning works with labeled data, meaning that for every input variable (\(X\)), the corresponding output variable (\(Y\)) is known.
(5) Which of the following is an example of a classification problem?
✔ Answer: B. Predicting the likelihood of a customer churning in a telecom company
👉 Explanation: Classification problems deal with categorical outcomes. Whether a customer churns (Yes/No) is a binary classification problem. Predicting house prices (option A) or total sales (option C) are regression problems.
(6) What is the primary advantage of using parametric methods?
✔ Answer: B. They simplify estimation by assuming a specific functional form
👉 Explanation: Parametric methods make assumptions about the shape of \( f \), which simplifies estimation and interpretation but may lead to bias if the assumption is incorrect.
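To make this concrete, here is a minimal sketch of a parametric fit: simple linear regression assumes \( f(X) = \beta_0 + \beta_1 X \), so estimation reduces to finding just two parameters by ordinary least squares. The data below are illustrative, not from the quiz.

```python
# A parametric method: simple linear regression via ordinary least squares.
# Assuming f has the form b0 + b1*x, only two numbers need to be estimated.

def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Exactly linear data, so OLS recovers the true parameters.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]           # y = 1 + 2x
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)                   # -> 1.0 2.0
```

If the true relationship were strongly nonlinear, this assumed form would produce biased estimates no matter how much data we collected, which is exactly the drawback noted above.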
(7) What is the main drawback of using a nonparametric method?
✔ Answer: A. It requires a large dataset to make accurate predictions
👉 Explanation: Nonparametric models do not assume a functional form, making them more flexible but requiring large amounts of data to produce reliable estimates.
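For contrast, a minimal sketch of a nonparametric method: K-Nearest Neighbors classification on made-up one-dimensional data (the points and the choice of k are illustrative assumptions). No functional form is assumed; predictions come directly from the training data, which is why sparse data gives unreliable estimates.

```python
# A nonparametric method: K-Nearest Neighbors classification.
# No shape is assumed for f; the training points themselves are the model.

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k closest training points.

    `train` is a list of (x, label) pairs; distance is |x - query|.
    """
    neighbors = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

# Illustrative 1-D data with two well-separated classes.
train = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (8.0, "B"), (8.5, "B"), (9.0, "B")]
print(knn_predict(train, 1.2))   # -> A  (all three nearest points are class A)
print(knn_predict(train, 8.7))   # -> B
```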
(8) The bias-variance tradeoff describes the tradeoff between:
✔ Answer: A. Underfitting and overfitting
👉 Explanation: A model with high bias (e.g., linear regression) may underfit, while a model with high variance (e.g., decision trees with deep splits) may overfit. The goal is to balance the two.
(9) Which of the following is a disadvantage of using a highly flexible model?
✔ Answer: A. It can be too complex and lead to overfitting
👉 Explanation: Highly flexible models may capture noise in the data rather than the true pattern, leading to poor generalization on new data.
(10) What is the difference between supervised and unsupervised learning?
✔ Answer: C. In supervised learning, both input (\( X \)) and output (\( Y \)) variables are used, whereas in unsupervised learning only the input (\( X \)) is given
👉 Explanation: Supervised learning uses labeled data (e.g., regression, classification), while unsupervised learning works with unlabeled data (e.g., clustering, dimensionality reduction).
2. Short Answer Questions
(1) Explain the key difference between parametric and nonparametric methods in statistical learning. Provide an example for each.
✔ Answer:
Parametric methods assume a specific functional form for \( f \). Example: Linear regression assumes a linear relationship between \( X \) and \( Y \).
Nonparametric methods do not make assumptions about the functional form. Example: K-Nearest Neighbors (KNN) predicts \( Y \) based on the majority class of the nearest neighbors.
(2) Provide an example of a supervised learning problem and an unsupervised learning problem. Explain why they fall under these categories.
✔ Answer:
Supervised Learning Example: Predicting whether an email is spam or not using labeled emails (Spam/Not Spam). This is supervised because we have labeled data.
Unsupervised Learning Example: Grouping customers into different segments based on their purchasing behavior. This is unsupervised because we do not have predefined labels.
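The customer-segmentation example can be sketched with a minimal 1-D k-means clustering, a common unsupervised method. The spend values and the choice of k = 2 are illustrative assumptions; note that no labels appear anywhere, which is the defining feature of unsupervised learning.

```python
# Unsupervised learning sketch: 1-D k-means on customer spend values.
# The algorithm alternates between assigning points to their nearest
# center and recomputing each center as the mean of its cluster.

def kmeans_1d(values, centers, iters=10):
    """Run k-means in one dimension; return final centers and clusters."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each point to the closest current center.
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Update each center to the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [10, 12, 11, 95, 100, 98]          # two obvious spending segments
centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
print(sorted(centers))                      # -> [11.0, 97.66...]
```

The algorithm discovers the low-spend and high-spend segments on its own; a human then interprets what each cluster means.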
(3) What is the bias-variance tradeoff? Describe its impact on model selection.
✔ Answer:
The bias-variance tradeoff describes the balance between simplicity (bias) and complexity (variance) in a model:
High Bias: The model is too simple (e.g., linear regression), leading to underfitting.
High Variance: The model is too complex (e.g., deep decision trees), leading to overfitting.
The optimal model finds a balance between bias and variance to minimize prediction error.
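The tradeoff can be demonstrated with two extreme models on illustrative data: a mean-only predictor (high bias, ignores \( X \) entirely) and a memorizing 1-nearest-neighbor predictor (high variance, zero training error). The data points below are made up for the sketch.

```python
# Bias-variance sketch: compare training vs. held-out error for a
# high-bias model (predict the mean) and a high-variance model (1-NN).

def mse(pairs, predict):
    """Mean squared error of `predict` over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in pairs) / len(pairs)

train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1)]   # roughly y = 2x + noise
test  = [(1.5, 3.0), (2.5, 5.1), (3.5, 6.9)]

# High bias: ignores X and always predicts the mean of Y (underfits).
mean_y = sum(y for _, y in train) / len(train)
def mean_model(x):
    return mean_y

# High variance: memorizes the training set, so training error is zero,
# but it generalizes poorly between observed points (overfits).
def one_nn(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

print(mse(train, mean_model), mse(test, mean_model))  # large error everywhere
print(mse(train, one_nn), mse(test, one_nn))          # zero on train, nonzero on test
```

Neither extreme minimizes test error; a model of intermediate flexibility (here, a fitted line) would do better, which is the point of the tradeoff.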
(4) Consider a situation where a company wants to predict whether a customer will respond to a marketing campaign. Should this problem be approached using regression or classification? Explain your answer.
✔ Answer:
This is a classification problem because the response variable (whether the customer responds: Yes/No) is categorical, not continuous. If the problem were about predicting the amount a customer would spend, it would be regression.
3. Long Answer Questions
(1) Describe a real-world scenario where you would use a supervised learning approach. Clearly define the predictor variables (\(X\)) and the response variable (\(Y\)).
✔ Answer:
Scenario: Predicting whether a loan applicant will default on a loan.
Predictor Variables (\(X\)): Annual income, credit score, loan amount, employment status.
Response Variable (\(Y\)): Default (Yes = 1, No = 0).
Justification: This is a supervised learning problem because we have labeled past data on whether applicants defaulted.
(2) A company is using a statistical learning model to predict customer churn (whether a customer will leave or stay).
(a) How should they evaluate the accuracy of their model?
✔ Answer:
Accuracy: Percentage of correct predictions.
Precision & Recall: Evaluate false positives/negatives.
Confusion Matrix: Breaks down correct/incorrect classifications.
AUC-ROC Curve: Summarizes the tradeoff between sensitivity and specificity across classification thresholds.
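The first three metrics above can be computed directly from a confusion matrix. A minimal sketch, using hand-made illustrative predictions (1 = churn, 0 = stay):

```python
# Computing accuracy, precision, and recall from true vs. predicted
# churn labels by tallying the four confusion-matrix cells.

def churn_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels (1 = churn)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy  = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted churners, how many churned
    recall    = tp / (tp + fn) if tp + fn else 0.0  # of actual churners, how many were caught
    return accuracy, precision, recall

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(churn_metrics(y_true, y_pred))  # -> (0.75, 0.75, 0.75)
```

Precision and recall matter here because churn data is often imbalanced: a model that predicts "stay" for everyone can have high accuracy yet zero recall.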
(b) What tradeoffs exist between using a flexible vs. a simple model for this task?
✔ Answer:
A simple model (e.g., logistic regression) is easy to interpret but may underfit.
A flexible model (e.g., deep neural networks) can capture complex relationships but may overfit and require more data.
The company should choose a model that balances interpretability and accuracy.
Conclusion
This answer key provides detailed explanations for every question in the Lecture 1 Sample Quiz.