Machine Learning Fundamentals: Comprehensive Lecture Notes
I. Introduction to Machine Learning (Lecture 01)
Key Concepts:
- Definition:
Machine Learning (ML) is a subfield of AI that uses data-driven algorithms to recognize patterns and make decisions without explicit programming.
- Why Learn ML?
- Automates tasks (e.g., image classification)
- Adapts to new data
- Provides insights from complex datasets
- Terminology:
- Training Example: An individual row (or data point) in your dataset.
- Feature: A measurable property or characteristic (e.g., pixel values, square footage).
- Target/Label: The “correct answer” or output you want to predict.
- Loss Function: A function measuring the difference between predictions and the true target.
Example:
For image classification (e.g., cat vs. dog), traditional methods required manual feature extraction. Modern ML leverages convolutional neural networks (CNNs) that automatically learn features from raw pixels.
II. Overview of ML Algorithms (Lecture 02)
Key Algorithms Covered:
- Logistic Regression:
- Used for binary classification (e.g., predicting pass/fail).
- K-Nearest Neighbors (KNN):
- Classifies new instances based on the majority label of the nearest training examples.
- Decision Trees:
- Splits data by asking a series of questions (e.g., “Is income > θ?”) to form a tree structure for classification or regression.
- Naïve Bayes:
- A generative classifier that applies Bayes’ rule with the assumption of feature independence.
- Random Forests:
- An ensemble of decision trees that improves predictive accuracy by averaging (or voting) over many trees.
- Linear Regression:
- Models the relationship between inputs (features) and a continuous output by fitting a line or hyperplane.
The lectures emphasize a common pattern: explanation, mathematical intuition, derivation, coding implementation, and practical examples.
III. Linear Regression (Lecture 03)
Key Points:
- Definition:
A statistical method that models the relationship between a dependent variable y and one or more independent variables x.
- Model Equation (Simple Linear Regression):
y = w₀ + w₁x + ϵ, where w₀ is the intercept, w₁ is the slope, and ϵ represents the error.
- Multiple Linear Regression:
Uses multiple features: y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ + ϵ
- Optimization:
Parameters are typically estimated by minimizing a loss function (e.g., Mean Squared Error) using methods like gradient descent.
Example Problem – House Price Prediction:
Given a dataset of houses with square footage and corresponding prices, fit a linear regression model.
- Data:
- Features: Square footage (e.g., 1000, 1500, 2000)
- Target: Price (e.g., $200k, $275k, $340k)
- Task:
- Estimate w0 and w1 such that the prediction y minimizes the squared error.
- Steps:
- Write the loss function: L(w₀, w₁) = (1/N) ∑ᵢ (yᵢ − (w₀ + w₁xᵢ))²
- Use gradient descent updates: w₀ := w₀ − η · ∂L/∂w₀, w₁ := w₁ − η · ∂L/∂w₁, where η is the learning rate.
- Interpretation:
The fitted line will help predict house prices based on square footage.
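To make these steps concrete, here is a minimal NumPy sketch of batch gradient descent on the toy house-price data above. The input scaling (square footage in thousands, price in $1000s), the learning rate, and the iteration count are illustrative choices, not values from the lecture.

```python
import numpy as np

# Toy data from the example: square footage (in 1000s of sq ft) and price (in $1000s)
x = np.array([1.0, 1.5, 2.0])
y = np.array([200.0, 275.0, 340.0])

w0, w1 = 0.0, 0.0   # initial intercept and slope
eta = 0.1           # learning rate (illustrative choice)

for _ in range(1000):
    residual = (w0 + w1 * x) - y
    # Gradients of L(w0, w1) = (1/N) * sum((y_i - (w0 + w1 * x_i))^2)
    grad_w0 = 2.0 * residual.mean()
    grad_w1 = 2.0 * (residual * x).mean()
    w0 -= eta * grad_w0
    w1 -= eta * grad_w1

print(f"w0 ≈ {w0:.1f}, w1 ≈ {w1:.1f}")  # closed-form least squares gives w0 ≈ 61.7, w1 = 140 here
```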
IV. Bias-Variance Tradeoff and Under/Overfitting (Lecture 05b)
Key Concepts:
- Bias:
Error due to overly simplistic assumptions in the learning algorithm. High bias can cause underfitting.
- Variance:
Error due to excessive sensitivity to fluctuations in the training data. High variance can cause overfitting.
- Underfitting vs. Overfitting:
- Underfitting: Model is too simple; both training and test errors are high.
- Overfitting: Model is too complex; low training error but high test error.
- Techniques to Address:
- Increase model complexity for underfitting.
- Use regularization, early stopping, or gather more data for overfitting.
Example Problem – Diagnosing Model Fit:
Suppose a model has a high training error and high test error. What does this indicate, and how might you improve the model?
- Answer:
This situation indicates underfitting (high bias). Possible solutions include adding more features, increasing model complexity, or reducing excessive regularization.
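One way to see this diagnosis in practice is to compare training and test error as model complexity grows. The sketch below uses polynomial degree as the complexity knob on synthetic data; the dataset and the degrees tried are made up for illustration and are not from the lecture.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic nonlinear data with noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple, reasonable, very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

# High train AND test error  -> underfitting (high bias)
# Low train, high test error -> overfitting (high variance)
```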
V. Regularization and Hyperparameter Tuning (Lecture 06)
Regularization Techniques:
- Ridge Regression (L2 Regularization):
Adds a penalty proportional to the sum of squared coefficients: L_ridge = L_OLS + λ ∑ⱼ βⱼ². Shrinks coefficients but does not set them exactly to zero.
- LASSO Regression (L1 Regularization):
Adds a penalty proportional to the sum of absolute values: L_lasso = L_OLS + λ ∑ⱼ |βⱼ|. Can force some coefficients to zero (feature selection).
- Elastic Net:
Combines L1 and L2 penalties.
Hyperparameter Tuning:
- Definition:
Hyperparameters are settings (e.g., regularization strength λ, learning rate) that are not learned from data.
- Common Methods:
- Grid Search: Exhaustively searches a predefined set of hyperparameters.
- Random Search: Randomly samples the parameter space.
- Bayesian Optimization: Uses probabilistic models to focus on promising hyperparameter regions.
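As a concrete illustration of grid search, the snippet below tunes the Ridge regularization strength with scikit-learn's GridSearchCV; note that scikit-learn calls λ `alpha`, and the candidate values and synthetic dataset here are placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Placeholder data: 100 samples, 5 features, linear signal plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=100)

# Exhaustively evaluate each candidate regularization strength with 5-fold cross-validation
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("best CV score (neg MSE):", search.best_score_)
```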
Example Problem – Regularized Loss Computation:
Given a linear regression model with coefficients β = [2.0, -0.5] and a regularization parameter λ = 1.0, compute the Ridge regularized loss for a single training example with error e = 3.
- Ordinary Squared Error: e² = 9
- Ridge Penalty: λ(β₁² + β₂²) = 1.0 × (2.0² + (-0.5)²) = 1.0 × (4 + 0.25) = 4.25
- Total Loss: 9 + 4.25 = 13.25
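The same arithmetic, checked in a few lines of Python (the variable names are just for this worked example):

```python
# Ridge-regularized loss for a single example: squared error + L2 penalty on the coefficients
beta = [2.0, -0.5]
lam = 1.0
error = 3.0

squared_error = error ** 2                       # 9.0
ridge_penalty = lam * sum(b ** 2 for b in beta)  # 1.0 * (4.0 + 0.25) = 4.25
total_loss = squared_error + ridge_penalty
print(total_loss)                                # 13.25
```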
VI. K-Nearest Neighbors (KNN) (Based on Lecture 02 and Related Content)
Concepts:
- Intuition:
Classify a new data point by finding the k closest training examples and using a majority vote (for classification) or averaging (for regression).
- Algorithm:
- Compute distance (e.g., Euclidean) between the new point and all training examples.
- Select the k nearest neighbors.
- For classification: assign the class that appears most frequently; for regression: average the neighbors’ outputs.
Example Problem – Fruit Classification with KNN:
Given a dataset of fruits with features [color intensity, size] and labels (apple, orange), classify a new fruit with features [0.7, 5.0] using k = 3.
- Steps:
- Compute distances to all labeled points.
- Identify the 3 closest fruits.
- Determine the majority label among these neighbors.
- Answer:
Based on the computed distances, if 2 out of 3 neighbors are apples, then classify the new fruit as an apple.
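A minimal sketch of the procedure using NumPy; the labeled fruit data below is made up for illustration and is not the lecture's dataset.

```python
import numpy as np
from collections import Counter

# Hypothetical training data: [color intensity, size] with fruit labels
X_train = np.array([[0.8, 4.5], [0.6, 5.5], [0.9, 4.0], [0.3, 7.0], [0.2, 8.0]])
y_train = np.array(["apple", "apple", "apple", "orange", "orange"])
x_new = np.array([0.7, 5.0])
k = 3

# 1. Euclidean distance from the new point to every training example
dists = np.linalg.norm(X_train - x_new, axis=1)
# 2. Indices of the k nearest neighbors
nearest = np.argsort(dists)[:k]
# 3. Majority vote among their labels
label = Counter(y_train[nearest]).most_common(1)[0][0]
print(label)  # "apple" for this made-up data
```

In practice the features would usually be scaled first, since the raw size values dominate the Euclidean distance here.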
VII. Decision Trees and Regression Trees (Lecture 07)
Key Points:
- Decision Trees:
- Build a tree structure by splitting data on features that best separate the target classes (for classification) or predict continuous outcomes (for regression).
- Impurity Measures:
- Entropy/Information Gain: Measures disorder; used to decide the best split.
- Gini Impurity: Another metric for classification splits.
- Regression Trees:
- Instead of class labels, these trees predict continuous values by partitioning the feature space.
Example Problem – Building a Simple Decision Tree Regressor:
You are given a small dataset of car ages and prices. Outline how to construct a decision tree that predicts car price based on age.
- Split Criterion:
Find the age threshold that minimizes the variance in car prices within the resulting groups.
- Recursive Splitting:
Continue splitting each subset until a stopping criterion is met (e.g., minimum number of samples per leaf).
- Prediction:
The predicted price for a new car is the average price in the leaf node where it falls.
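A minimal scikit-learn sketch of this outline, with placeholder car ages and prices (the lecture's actual dataset is not reproduced here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: car age in years -> price in $1000s
ages = np.array([[1], [2], [3], [5], [7], [10], [12]])
prices = np.array([30.0, 27.0, 24.0, 18.0, 14.0, 9.0, 7.0])

# min_samples_leaf is the stopping criterion mentioned above;
# splits are chosen to minimize within-group variance (squared error)
tree = DecisionTreeRegressor(min_samples_leaf=2, random_state=0)
tree.fit(ages, prices)

# The prediction is the average price in the leaf the new car falls into
print(tree.predict([[4]]))
```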
VIII. Ensemble Learning: Bagging, Random Forests, and Boosting (Lecture 08)
Concepts:
- Bagging (Bootstrap Aggregating):
- Build multiple models (often decision trees) on different bootstrapped samples of the training data.
- Aggregate the predictions (by majority vote or averaging) to reduce variance.
- Random Forests:
- An extension of bagging where each tree considers a random subset of features at each split.
- This further decorrelates trees and improves generalization.
- Boosting:
- Sequentially train weak learners (e.g., decision stumps), each focusing on the errors of the previous one.
- Examples include AdaBoost and Gradient Boosting.
- Comparison:
- Bagging/Random Forest: Mainly reduces variance.
- Boosting: Reduces bias and can achieve higher accuracy but may overfit if overdone.
Example Problem – Ensemble Prediction:
Given predictions from 3 decision trees for a classification problem (Tree1: Class A, Tree2: Class B, Tree3: Class A), determine the ensemble output using majority vote.
- Solution:
Majority vote yields Class A since it appears twice.
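The same majority vote in a couple of lines of Python:

```python
from collections import Counter

tree_predictions = ["A", "B", "A"]  # Tree1, Tree2, Tree3
ensemble_output = Counter(tree_predictions).most_common(1)[0][0]
print(ensemble_output)  # "A": Class A wins 2 votes to 1
```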
IX. Practice Problems (Based on Handsolved ML Problems – Lecture 09)
Practice Problems Overview:
- Problem 1:
Optimize a linear regression model by deriving the gradient descent update rule and implementing one iteration on a toy dataset.
- Problem 2:
Given training and test errors for a model, determine whether the model is overfitting, underfitting, or well-balanced. Propose improvements accordingly.
- Problem 3:
For a classification task using KNN, calculate the Euclidean distances for a set of test points and determine their predicted classes using k=5.
Work through these problems by writing down the steps, calculations, and verifying your answers against known results.
X. Probability and Distributions (Lecture 10)
Fundamental Topics:
- Probability Interpretations:
- Frequentist: Probability as long-run frequency (e.g., P(Heads) = 0.5 for a fair coin).
- Random Variables:
- Discrete:
- Uniform, Bernoulli, Binomial, and Degenerate distributions.
- Continuous:
- Gaussian (Normal) distribution, characterized by its mean μ and variance σ².
- Important Concepts:
- Union and Joint Probability:
- Union: P(A∪B)
- Joint: P(A∩B), with the product rule for independent events.
- Conditional Probability and Bayes’ Rule:
- Bayes’ theorem: P(A|B) = P(B|A)P(A)/P(B)
- Categorical and Multinomial Distributions:
- Categorical Distribution: One trial with outcomes represented via one-hot encoding.
- Multinomial Distribution: Extends categorical to multiple trials, counting occurrences in each category.
- Covariance and Multivariate Gaussian:
- Covariance: Measures the degree to which two variables change together.
- Multivariate Gaussian: Generalization of the normal distribution to multiple dimensions.
Example Problem – Applying Bayes’ Rule:
Suppose a medical test detects the disease 95% of the time when it is present and gives a false positive 5% of the time. If the prevalence of the disease is 1% in the population and a patient tests positive, calculate the probability they actually have the disease.
- Solution Outline:
- Define:
- P(Disease) = 0.01
- P(Positive|Disease) = 0.95
- P(Positive|No Disease) = 0.05
- Compute P(Positive) using the law of total probability.
- Apply Bayes’ theorem to find P(Disease|Positive).
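Carrying out the calculation in Python, using the numbers from the problem statement:

```python
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_no_disease = 0.05

# Law of total probability: P(Positive)
p_pos = p_pos_given_disease * p_disease + p_pos_given_no_disease * (1 - p_disease)

# Bayes' theorem: P(Disease | Positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # about 0.161, i.e. roughly a 16% chance despite the positive test
```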
XI. Generative Models and Naïve Bayes (Lecture 11)
Key Points:
- Generative Models:
- Model the joint probability P(x, y) and then use Bayes’ rule to compute the posterior P(y|x).
- Naïve Bayes Classifier:
- Assumes conditional independence between features given the class label.
- Calculation: P(y|x₁, x₂, …, xₙ) ∝ P(y) ∏ᵢ P(xᵢ|y)
- Application Areas:
- Text classification (e.g., spam filtering)
- Medical diagnosis
- Any domain where the “naïve” independence assumption is acceptable.
Example Problem – Naïve Bayes Text Classification:
Given a simplified dataset where emails are labeled as “spam” or “not spam” and feature probabilities P(word|spam) are provided, compute the posterior probability for a new email containing specific words.
- Steps:
- Compute the prior probabilities P(spam) and P(not spam).
- Multiply by the likelihoods P(wordi|spam) for each word in the email.
- Normalize to obtain the posterior probabilities.
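A tiny worked sketch of these steps; the priors and per-word likelihoods below are made up, since the lecture's actual probability tables are not reproduced here.

```python
# Hypothetical priors and word likelihoods for an email containing the words "free" and "meeting"
p_prior = {"spam": 0.4, "not spam": 0.6}
likelihood = {
    "spam":     {"free": 0.30, "meeting": 0.02},
    "not spam": {"free": 0.05, "meeting": 0.20},
}
email_words = ["free", "meeting"]

# Unnormalized posteriors: prior * product of word likelihoods (the naive independence assumption)
scores = {}
for label in p_prior:
    score = p_prior[label]
    for word in email_words:
        score *= likelihood[label][word]
    scores[label] = score

# Normalize so the posteriors sum to 1
total = sum(scores.values())
for label, score in scores.items():
    print(f"P({label} | email) = {score / total:.3f}")
```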
XII. Summary and Study Tips
Key Takeaways:
- Foundational Concepts:
Understand core definitions (features, labels, loss functions) and the general ML workflow.
- Model Selection:
Recognize the strengths and weaknesses of different algorithms (e.g., linear models vs. decision trees vs. ensemble methods).
- Practical Considerations:
Focus on the bias–variance tradeoff, regularization techniques, and the importance of hyperparameter tuning.
- Probability:
Be comfortable with basic probability theory and its application in generative models.
Study Tips:
- Practice Problems:
Work through the example problems provided, and try variations on them.
- Visualization:
Draw diagrams (e.g., decision trees, regression lines) to solidify your understanding.
- Coding Exercises:
Implement simple models in Python (using libraries like scikit-learn) to observe theoretical concepts in practice.
- Review Hand-Solved Problems:
Revisit the worked examples from Lecture 09 to understand common pitfalls and solution strategies.