Machine Learning Concepts: True or False?
General
1. Regression and clustering are types of supervised learning
FALSE: Clustering is unsupervised.
2. Clustering and dimensionality reduction are types of unsupervised learning
3. Machine learning is particularly useful when we try to solve a problem that is easy to program but for which data is scarce
FALSE: If the problem is easy to program, then there is no need for ML.
4. Preprocessing is a task that can often be automated
FALSE: Preprocessing requires significant time and expertise and trial and error.
5. In supervised learning, we attempt to predict a target value from feature values describing an object
6. In supervised learning, we always generate models with minimum training error
FALSE: In fact, we would prefer some small training error to zero error with an overfitted model, for example.
7. Empirical risk, the opposite of training error, serves as an approximation to the true risk
FALSE: Empirical risk is the same as training error, so not the opposite.
8. The activation functions of output neurons of a neural network are determined mainly by the nature of the target variable one wants to predict
TRUE
9. It is not possible to train a neural network for both regression and classification at the same time
FALSE: All we need to do is have a differentiable error function combining both regression error and classification error; a simple sum may do.
10. Bigger training sets help to reduce overfitting
TRUE: With more and more data overfitting becomes harder.
11. Backprop is an algorithm used in neural network learning to obtain partial derivatives of an error function with respect to its weights
TRUE
12. The negative log-likelihood can always be used as an error function in supervised learning
TRUE
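A minimal numpy sketch of item 12 with toy, illustrative values: for Bernoulli-distributed class labels the negative log-likelihood is the cross-entropy, and for Gaussian regression noise with fixed variance it reduces to squared error up to constants.
```python
import numpy as np

# Negative log-likelihood as an error function (item 12); toy values.
# Binary classification with Bernoulli outputs: NLL = cross-entropy.
y = np.array([1, 0, 1, 1])           # observed labels
p = np.array([0.9, 0.2, 0.7, 0.6])   # predicted P(y = 1 | x)
nll_clf = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Regression with Gaussian noise of fixed variance: NLL = squared error
# up to additive and multiplicative constants.
t = np.array([1.2, 0.4, 2.0])        # observed targets
m = np.array([1.0, 0.5, 1.8])        # predicted means
sigma = 1.0
nll_reg = 0.5 * np.sum((t - m) ** 2) / sigma**2 \
          + len(t) * np.log(sigma * np.sqrt(2 * np.pi))

print(nll_clf, nll_reg)
```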
13. In machine learning it is necessary to have classifiers with zero training error.
FALSE, zero training error may lead to overfitting for example.
14. Regularization is a framework in which poor goodness-of-fit can be compensated with complexity.
FALSE: It is the other way around: regularization penalizes complexity, so added complexity must be compensated by a sufficiently better fit to the data.
15. Unsupervised learning is about making predictions on unseen future examples.
FALSE, in unsupervised learning there is no notion of making predictions.
16. High training and test error is an indicator of high bias in a model.
17. High training error and low test error is an indicator of overfitting in a model.
FALSE, it would be low training and high test error.
Bayes and probabilities
1. Bayes theorem can be derived from the product rule of probability theory
2. Bayes theorem transforms prior distributions into posterior distributions
3. Bayes formula is used in Bayesian learning to obtain posterior distributions
TRUE
4. P(Y) = Σ_x P(Y|X = x) P(X = x) for X, Y discrete random variables
5. P(Y) = Σ_x P(X = x|Y) P(X = x) for X, Y discrete random variables
FALSE: This is not a valid marginalization; compare with the law of total probability in item 4 (see the snippet below).
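A quick numerical check of item 4, the law of total probability, using toy distributions in numpy:
```python
import numpy as np

# Law of total probability for discrete X, Y (item 4); toy distributions.
p_x = np.array([0.3, 0.7])             # P(X = x)
p_y_given_x = np.array([[0.1, 0.9],    # P(Y = y | X = x0)
                        [0.6, 0.4]])   # P(Y = y | X = x1)

p_y = p_y_given_x.T @ p_x              # P(Y) = sum_x P(Y | X = x) P(X = x)
print(p_y, p_y.sum())                  # a valid distribution: sums to 1

# The expression in item 5, sum_x P(X = x | Y) P(X = x), is not a
# marginalization of the joint and does not in general yield P(Y).
```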
6. Expert information on the domain is encoded into the model through the posterior distribution
FALSE: Through the prior distribution
7. The posterior distribution contains both expert information on the domain and information gathered through observation (data)
8. The likelihood function is a probability distribution over the possible values of the parameters for a model
FALSE: It is not a probability distribution
9. Naive Bayes assumes Gaussian input variables
FALSE: A particular case, Gaussian NB, does, but in general NB can be applied to any combination of input variable distributions
10. Gaussian Naive Bayes assumes Gaussian input variables
TRUE
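A small sketch of items 9-10 using scikit-learn's GaussianNB on toy data: the model fits one univariate Gaussian per class and per feature under the naive conditional-independence assumption.
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes: one Gaussian per class and per feature (items 9-10).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),    # class 0
               rng.normal(2, 1, size=(50, 2))])   # class 1
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)
print(model.theta_)                      # per-class, per-feature means
print(model.predict_proba([[1.0, 1.0]])) # posterior class probabilities
```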
11. Linear models are called linear because they are linear in their parameters.
12. By definition, Bayes rule chooses the class with highest posterior probability.
13. Bayesian classifiers lead to linear decision boundaries.
FALSE, not always, for example QDA leads to quadratic decision boundaries.
14. The bayesian classifier is optimal in the sense that it has the smallest generalization error among all classifiers.
15. The Mahalanobis distance is a generalized form of Euclidean distance that takes into account correlations among variables.
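Item 15 in one formula: d(x, μ) = sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)); with Σ = I it reduces to the Euclidean distance. A toy numpy check:
```python
import numpy as np

# Mahalanobis distance rescales by the covariance, so correlated directions
# are not over-counted; with Sigma = I it equals the Euclidean distance.
x = np.array([2.0, 1.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

diff = x - mu
d_mahalanobis = np.sqrt(diff @ np.linalg.solve(Sigma, diff))
d_euclidean = np.linalg.norm(diff)
print(d_mahalanobis, d_euclidean)
```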
16. If the class-conditional distributions in a classification problem are normal, then linear functions can achieve the best possible generalization error.
FALSE, in QDA we assume normal class-conditional distributions, but the optimal classifier is not linear in general (it is quadratic unless the class covariances are equal).
17. When using QDA or LDA in a practical problem with finite data that is normally distributed, there is no need to estimate generalization error because we know that these methods are optimal.
FALSE, with finite data we still need to estimate the parameters of the Gaussians and therefore we still need to check the quality of the trained models.
18. Laplace correction is particularly necessary in Naive Bayes classification if the dimensionality of the input data is very large.
19. The VC dimension of the k-NN classifier is infinite.
20. Classification is always easier than regression or clustering.
FALSE, it depends on the problem (data).
Regression
1. Least squares linear regression is obtained by assuming Gaussianity on the input variables
FALSE: Gaussianity is assumed for the error term of the target (output) variable, i.e. the response is Gaussian given the inputs, not the inputs themselves
2. Linear regression can produce non-linear predictions if we apply linear transformations on the input variables
FALSE: To produce non-linear predictions, we must apply non-linear transformations to the inputs (see the sketch below)
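A minimal numpy sketch of item 2 with toy data: the model stays linear in the parameters w, but a polynomial (non-linear) basis on x yields non-linear predictions.
```python
import numpy as np

# Linear regression with a non-linear basis: linear in w, non-linear in x.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(50)

Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])  # non-linear features
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # ordinary least squares
y_hat = Phi @ w                                          # non-linear predictions
print(w)
```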
3. The best choice in linear regression is to minimize square error
FALSE: Not necessarily; e.g. in the presence of outliers, it would be better to minimize absolute error
4. High bias models will tend to underfit
5. Low variance models will tend to overfit
FALSE: High variance models tend to overfit
6. Lasso regression uses a form of regularization that is useful in the presence of outliers
FALSE: Lasso uses the L1 norm to penalize complexity, which as a side effect performs feature selection; to handle outliers we would need an L1 (absolute) loss on the error term of the regularized objective
7. The GCV for ridge regression computes the LOOCV error exactly
FALSE: GCV is an approximation to LOOCV
8. Lasso and ridge regression both help in reducing overfitting
TRUE: It is the main point of regularization
9. Lasso regression is preferable to ridge regression because it produces sparse models
FALSE: It depends on the data; sparse models do not always work best (see the sketch below)
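A toy scikit-learn comparison for items 6-9: both penalties shrink the coefficients (reducing overfitting), but only the L1 penalty drives some of them exactly to zero. The penalty strengths below are illustrative, not tuned.
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data: only 2 of 10 features are informative.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
true_w = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ true_w + 0.5 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("lasso:", np.round(lasso.coef_, 2))  # many coefficients exactly 0 (sparse)
print("ridge:", np.round(ridge.coef_, 2))  # small but non-zero coefficients
```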
10. Cross-validation is a resampling method used to select a good model
TRUE
Model selection, resampling, and errors
1. Resampling methods are useful to learn a model’s parameters
FALSE: To learn a model’s parameters we use learning algorithms; resampling is for model selection and hyper-parameters
2. Resampling methods are useful to learn a model’s hyper-parameters
3. Cross-validation is used to estimate generalization error
4. Cross-validation is used for model selection
5. LOOCV is a type of resampling method that can be used as an alternative to cross-validation
FALSE: LOOCV is not an alternative but a particular case of cross-validation, namely k-fold with k equal to the number of examples (see the snippet below)
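A scikit-learn snippet illustrating item 5: leave-one-out produces exactly the same splits as k-fold with k = n. The dataset and model are chosen only for illustration.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# LOOCV is k-fold cross-validation with k = number of examples (item 5).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=len(X)))
print(loo_scores.mean(), kfold_scores.mean())  # identical: same splits
```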
6. In the presence of scarce data, k-fold cross-validation with high values of k is preferable to low values of k for estimating generalization if possible
7. Minimizing validation error is a good methodology to ensure good generalization
8. Minimizing training error is a good methodology to ensure good generalization
FALSE: Training error is not a good measure to use as it may produce overfitted models
9. Training error is always lower than test error
FALSE: Generally, it is, but it cannot be guaranteed to be so
10. Test data is typically used to estimate generalization error.
11. Validation data is typically used to perform model selection.
Clustering
1. K-means and EM are both methods for learning Mixture of Gaussian models
2. The EM algorithm refines a suboptimal solution obtained by k-means until a global optimum is found
FALSE: EM can get stuck on local optima as well
3. K-means is a particular case of EM for Gaussian Mixtures when covariance matrices are assumed diagonal
FALSE: Not diagonal, but isotropic or spherical with standard deviation tending to 0 (see the sketch below)
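A sketch of item 3 on toy, well-separated data: a Gaussian mixture constrained to spherical covariances and fitted by EM recovers essentially the same centres as k-means; k-means proper is the limit of vanishing variance.
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Spherical GMM (EM) vs k-means on two well-separated blobs (item 3).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(3, 0.3, size=(100, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, covariance_type="spherical",
                      random_state=0).fit(X)
print(kmeans.cluster_centers_)
print(gmm.means_)   # nearly identical centres on this toy data
```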
4. Mixing coefficients for the Gaussian mixture are estimated in EM directly from the best soft assignments obtained so far
5. In EM, the log-likelihood cannot decrease after each iteration
6. In k-means it is possible to get stuck on a local optimum; however, EM solves this problem
FALSE: EM can get stuck on local optima as well
7. It is impossible to evaluate the quality of a clustering result because we never have the ground truth
FALSE: We have internal indices like the CH (Calinski-Harabasz) index that assess clustering quality without ground truth (see the snippet below)
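A short illustration of item 7: the Calinski-Harabasz index scores a clustering without any ground truth, so it can be used, for example, to pick the number of clusters. Toy data with three true groups.
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Rank k-means clusterings by the CH index; no ground-truth labels needed.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(80, 2)) for c in (0, 3, 6)])

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(calinski_harabasz_score(X, labels), 1))  # highest near k = 3
```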
8. The EM algorithm is particularly suited to learn probabilistic models with partially observed data
TRUE
9. The E-M algorithm is an optimization algorithm that maximizes some likelihood function.
10. The E-M algorithm is guaranteed to find an optimum solution if the input data is distributed according to a mixture of Gaussians.
FALSE, the E-M algorithm only finds local optima.
11. If all elements in a mixture of Gaussians have equal covariance matrices, then E-M behaves like k-means.
FALSE, the covariance matrices also need to be isotropic (proportional to the identity matrix), with variance tending to zero.
12. Clustering makes little sense in practice since we have no way of knowing whether the result is fully truthful.
FALSE, clustering can be very useful in practice, and it is very much used.