Machine Learning Fundamentals: Comprehensive Lecture Notes

I. Introduction to Machine Learning (Lecture 01)

Key Concepts:

  • Definition:
    Machine Learning (ML) is a subfield of AI that uses data-driven algorithms to recognize patterns and make decisions without explicit programming.
  • Why Learn ML?
    • Automates tasks (e.g., image classification)
    • Adapts to new data
    • Provides insights from complex datasets
  • Terminology:
    • Training Example: An individual row (or data point) in your dataset.
    • Feature: A measurable property or characteristic (e.g., pixel values, square footage).
    • Target/Label: The “correct answer” or output you want to predict.
    • Loss Function: A function measuring the difference between predictions and the true target.

Example:
For image classification (e.g., cat vs. dog), traditional methods required manual feature extraction. Modern ML leverages convolutional neural networks (CNNs) that automatically learn features from raw pixels.

II. Overview of ML Algorithms (Lecture 02)

Key Algorithms Covered:

  • Logistic Regression:
    • Used for binary classification (e.g., predicting pass/fail).
  • K-Nearest Neighbors (KNN):
    • Classifies new instances based on the majority label of the nearest training examples.
  • Decision Trees:
    • Splits data by asking a series of questions (e.g., “Is income > θ?”) to form a tree structure for classification or regression.
  • Naïve Bayes:
    • A generative classifier that applies Bayes’ rule with the assumption of feature independence.
  • Random Forests:
    • An ensemble of decision trees that improves predictive accuracy by averaging (or voting) over many trees.
  • Linear Regression:
    • Models the relationship between inputs (features) and a continuous output by fitting a line or hyperplane.

The lectures emphasize a common pattern: explanation, mathematical intuition, derivation, coding implementation, and practical examples.


III. Linear Regression (Lecture 03)

Key Points:

  • Definition:
    A statistical method that models the relationship between a dependent variable y and one or more independent variables x.
  • Model Equation (Simple Linear Regression): y = w0 + w1x + ϵ where w0 is the intercept, w1 is the slope, and ϵ represents the error.
  • Multiple Linear Regression:
    Uses multiple features: y = β0 + β1x1 + β2x2 + ⋯ + βnxn + ϵ
  • Optimization:
    Parameters are typically estimated by minimizing a loss function (e.g., Mean Squared Error) using methods like gradient descent.

Example Problem – House Price Prediction:
Given a dataset of houses with square footage and corresponding prices, fit a linear regression model.

  1. Data:
    • Features: Square footage (e.g., 1000, 1500, 2000)
    • Target: Price (e.g., $200k, $275k, $340k)
  2. Task:
    • Estimate w0 and w1 so that the predictions ŷ = w0 + w1x minimize the mean squared error.
  3. Steps:
    • Write the loss function (mean squared error over the N training examples): L(w0, w1) = (1/N) Σᵢ₌₁ᴺ (yᵢ − (w0 + w1xᵢ))²
    • Use gradient descent updates: w0 := w0 − η ∂L/∂w0 and w1 := w1 − η ∂L/∂w1, where η is the learning rate (see the code sketch after this list).
  4. Interpretation:
    The fitted line will help predict house prices based on square footage.
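
A minimal sketch of these gradient descent steps in Python, using NumPy and the toy numbers above (prices kept in thousands of dollars, and the square-footage feature rescaled so a single learning rate behaves well); it is an illustration, not the lecture's reference implementation:

    import numpy as np

    # Toy data from the example: square footage and price (price in $1000s)
    x = np.array([1000.0, 1500.0, 2000.0])
    y = np.array([200.0, 275.0, 340.0])

    x_scaled = x / 1000.0        # rescale the feature so one learning rate works well
    w0, w1 = 0.0, 0.0            # intercept and slope
    eta = 0.1                    # learning rate
    N = len(x_scaled)

    for _ in range(5000):
        residual = (w0 + w1 * x_scaled) - y                 # prediction minus target
        grad_w0 = (2.0 / N) * residual.sum()                # dL/dw0
        grad_w1 = (2.0 / N) * (residual * x_scaled).sum()   # dL/dw1
        w0 -= eta * grad_w0
        w1 -= eta * grad_w1

    print(f"w0 = {w0:.1f}, w1 = {w1:.1f}")  # converges to the least-squares fit (about 61.7 and 140.0 here)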

IV. Bias-Variance Tradeoff and Under/Overfitting (Lecture 05b)

Key Concepts:

  • Bias:
    Error due to overly simplistic assumptions in the learning algorithm. High bias can cause underfitting.
  • Variance:
    Error due to excessive sensitivity to fluctuations in the training data. High variance can cause overfitting.
  • Underfitting vs. Overfitting:
    • Underfitting: Model is too simple; both training and test errors are high.
    • Overfitting: Model is too complex; low training error but high test error.
  • Techniques to Address:
    • Increase model complexity for underfitting.
    • Use regularization, early stopping, or gather more data for overfitting.

Example Problem – Diagnosing Model Fit:
Suppose a model has a high training error and high test error. What does this indicate, and how might you improve the model?

  • Answer:
    This situation indicates underfitting (high bias). Possible solutions include adding more features, increasing model complexity, or reducing excessive regularization.
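
One way to make this diagnosis concrete is to compare training and test error as model complexity changes. A minimal scikit-learn sketch, with polynomial degree as the complexity knob and purely synthetic data for illustration:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # noisy nonlinear target

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    for degree in (1, 3, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_tr, y_tr)
        tr_err = mean_squared_error(y_tr, model.predict(X_tr))
        te_err = mean_squared_error(y_te, model.predict(X_te))
        # degree 1 typically underfits (both errors high);
        # degree 15 typically overfits (training error well below test error)
        print(f"degree={degree:2d}  train MSE={tr_err:.3f}  test MSE={te_err:.3f}")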

V. Regularization and Hyperparameter Tuning (Lecture 06)

Regularization Techniques:

  • Ridge Regression (L2 Regularization):
    Adds a penalty proportional to the sum of squared coefficients: L_ridge = L_OLS + λ Σⱼ₌₁ⁿ βⱼ². Shrinks coefficients but does not set them exactly to zero.
  • LASSO Regression (L1 Regularization):
    Adds a penalty proportional to the sum of absolute values: L_lasso = L_OLS + λ Σⱼ₌₁ⁿ |βⱼ|. Can force some coefficients to zero (feature selection).
  • Elastic Net:
    Combines L1 and L2 penalties.

Hyperparameter Tuning:

  • Definition:
    Hyperparameters are settings (e.g., regularization strength λ, learning rate) that are not learned from data.
  • Common Methods:
    • Grid Search: Exhaustively searches a predefined set of hyperparameters.
    • Random Search: Randomly samples the parameter space.
    • Bayesian Optimization: Uses probabilistic models to focus on promising hyperparameter regions.
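
As an illustration of grid search from the list above, a minimal scikit-learn sketch that tunes the Ridge regularization strength (the alpha parameter corresponds to λ) by cross-validation; the data here is synthetic and only for demonstration:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import make_regression

    # Synthetic regression data purely for illustration
    X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

    # Exhaustive search over a predefined grid of regularization strengths
    param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
    search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)

    print("best alpha:", search.best_params_["alpha"])
    print("best CV MSE:", -search.best_score_)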

Example Problem – Regularized Loss Computation:
Given a linear regression model with coefficients β = [2.0, -0.5] and a regularization parameter λ = 1.0, compute the Ridge regularized loss for a single training example with error e = 3.

  1. Ordinary Squared Error: e² = 9
  2. Ridge Penalty: λ(β₁² + β₂²) = 1.0 × (2.0² + (−0.5)²) = 1.0 × (4 + 0.25) = 4.25
  3. Total Loss: 9 + 4.25 = 13.25
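
The same computation as a few lines of Python, for checking the arithmetic:

    # Verifying the worked example above
    beta = [2.0, -0.5]
    lam = 1.0
    error = 3.0

    squared_error = error ** 2                       # 9.0
    ridge_penalty = lam * sum(b ** 2 for b in beta)  # 1.0 * (4.0 + 0.25) = 4.25
    total_loss = squared_error + ridge_penalty       # 13.25
    print(total_loss)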

VI. K-Nearest Neighbors (KNN) (Based on Lecture 02 and Related Content)

Concepts:

  • Intuition:
    Classify a new data point by finding the k closest training examples and using a majority vote (for classification) or averaging (for regression).
  • Algorithm:
    1. Compute distance (e.g., Euclidean) between the new point and all training examples.
    2. Select the k nearest neighbors.
    3. For classification: assign the class that appears most frequently; for regression: average the neighbors’ outputs.

Example Problem – Fruit Classification with KNN:
Given a dataset of fruits with features [color intensity, size] and labels (apple, orange), classify a new fruit with features [0.7, 5.0] using k = 3.

  • Steps:
    1. Compute distances to all labeled points.
    2. Identify the 3 closest fruits.
    3. Determine the majority label among these neighbors.
  • Answer:
    Based on the computed distances, if 2 out of 3 neighbors are apples, then classify the new fruit as an apple.
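
A minimal from-scratch sketch of these steps; the labeled fruit data below is made up for illustration:

    import numpy as np
    from collections import Counter

    # Hypothetical training data: [color intensity, size] -> label
    X_train = np.array([[0.8, 5.2], [0.7, 4.8], [0.3, 7.1], [0.2, 6.8], [0.9, 5.5]])
    y_train = ["apple", "apple", "orange", "orange", "apple"]

    def knn_predict(x_new, X, y, k=3):
        # 1. Euclidean distance from the new point to every training example
        dists = np.linalg.norm(X - x_new, axis=1)
        # 2. Indices of the k nearest neighbors
        nearest = np.argsort(dists)[:k]
        # 3. Majority vote among the neighbors' labels
        votes = Counter(y[i] for i in nearest)
        return votes.most_common(1)[0][0]

    print(knn_predict(np.array([0.7, 5.0]), X_train, y_train, k=3))  # -> "apple" for this toy data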

VII. Decision Trees and Regression Trees (Lecture 07)

Key Points:

  • Decision Trees:
    • Build a tree structure by splitting data on features that best separate the target classes (for classification) or predict continuous outcomes (for regression).
    • Impurity Measures:
      • Entropy/Information Gain: Measures disorder; used to decide the best split.
      • Gini Impurity: Another metric for classification splits.
  • Regression Trees:
    • Instead of class labels, these trees predict continuous values by partitioning the feature space.

Example Problem – Building a Simple Decision Tree Regressor:
You are given a small dataset of car ages and prices. Outline how to construct a decision tree that predicts car price based on age.

  1. Split Criterion:
    Find the age threshold that minimizes the variance in car prices within the resulting groups.
  2. Recursive Splitting:
    Continue splitting each subset until a stopping criterion is met (e.g., minimum number of samples per leaf).
  3. Prediction:
    The predicted price for a new car is the average price in the leaf node where it falls.
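
A minimal sketch of the same outline using scikit-learn's DecisionTreeRegressor; the car age and price numbers are invented for illustration:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical data: car age in years -> price in $1000s
    ages = np.array([[1], [2], [3], [5], [7], [10], [12], [15]])
    prices = np.array([30.0, 27.0, 24.0, 19.0, 15.0, 10.0, 8.0, 5.0])

    # The default split criterion minimizes squared error (variance) within each group;
    # min_samples_leaf is the stopping rule mentioned in step 2.
    tree = DecisionTreeRegressor(min_samples_leaf=2, random_state=0)
    tree.fit(ages, prices)

    # The prediction is the average price in the leaf a new car falls into
    print(tree.predict([[4], [11]]))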

VIII. Ensemble Learning: Bagging, Random Forests, and Boosting (Lecture 08)

Concepts:

  • Bagging (Bootstrap Aggregating):
    • Build multiple models (often decision trees) on different bootstrapped samples of the training data.
    • Aggregate the predictions (by majority vote or averaging) to reduce variance.
  • Random Forests:
    • An extension of bagging where each tree considers a random subset of features at each split.
    • This further decorrelates trees and improves generalization.
  • Boosting:
    • Sequentially train weak learners (e.g., decision stumps), each focusing on the errors of the previous one.
    • Examples include AdaBoost and Gradient Boosting.
  • Comparison:
    • Bagging/Random Forest: Mainly reduces variance.
    • Boosting: Reduces bias and can achieve higher accuracy but may overfit if overdone.

Example Problem – Ensemble Prediction:
Given predictions from 3 decision trees for a classification problem (Tree1: Class A, Tree2: Class B, Tree3: Class A), determine the ensemble output using majority vote.

  • Solution:
    Majority vote yields Class A since it appears twice.
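
The majority vote itself takes only a few lines; a sketch using the three tree predictions from the example:

    from collections import Counter

    tree_predictions = ["A", "B", "A"]   # Tree1, Tree2, Tree3
    ensemble_vote = Counter(tree_predictions).most_common(1)[0][0]
    print(ensemble_vote)                 # -> "A"

In practice you would rarely vote by hand: scikit-learn's RandomForestClassifier (bagging plus random feature subsets) and GradientBoostingClassifier (boosting) perform the aggregation internally.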

IX. Practice Problems (Based on Handsolved ML Problems – Lecture 09)

Practice Problems Overview:

  • Problem 1:
    Optimize a linear regression model by deriving the gradient descent update rule and implementing one iteration on a toy dataset.
  • Problem 2:
    Given training and test errors for a model, determine whether the model is overfitting, underfitting, or well-balanced. Propose improvements accordingly.
  • Problem 3:
    For a classification task using KNN, calculate the Euclidean distances for a set of test points and determine their predicted classes using k=5.

Work through these problems by writing down the steps and calculations, then verify your answers against known results.


X. Probability and Distributions (Lecture 10)

Fundamental Topics:

  • Probability Interpretations:
    • Frequentist: Probability as long-run frequency (e.g., P(Heads) = 0.5 for a fair coin).
    • Bayesian: Probability as a degree of belief, updated as new evidence arrives (the view behind Bayes’ rule below).
  • Random Variables:
    • Discrete:
      • Uniform, Bernoulli, Binomial, and Degenerate distributions.
    • Continuous:
      • Gaussian (Normal) distribution, characterized by its mean μ and variance σ².
  • Important Concepts:
    • Union and Joint Probability:
      • Union: P(A∪B)
      • Joint: P(A∩B); for independent events, the product rule gives P(A∩B) = P(A)P(B).
    • Conditional Probability and Bayes’ Rule:
      • Bayes’ theorem: P(A|B) = P(B|A)P(A)/P(B)
  • Categorical and Multinomial Distributions:
    • Categorical Distribution: One trial with outcomes represented via one-hot encoding.
    • Multinomial Distribution: Extends categorical to multiple trials, counting occurrences in each category.
  • Covariance and Multivariate Gaussian:
    • Covariance: Measures the degree to which two variables change together.
    • Multivariate Gaussian: Generalization of the normal distribution to multiple dimensions.

Example Problem – Applying Bayes’ Rule:
Suppose a medical test is 95% accurate for both diseased and healthy patients (95% sensitivity, 5% false-positive rate). If the prevalence of the disease is 1% in the population and a patient tests positive, calculate the probability they actually have the disease.

  • Solution Outline:
    1. Define:
      • P(Disease) = 0.01
      • P(Positive|Disease) = 0.95
      • P(Positive|No Disease) = 0.05
    2. Compute P(Positive) using the law of total probability.
    3. Apply Bayes’ theorem to find P(Disease|Positive).
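
Carrying the outline through numerically, as a quick check in Python:

    p_disease = 0.01
    p_pos_given_disease = 0.95
    p_pos_given_healthy = 0.05

    # Law of total probability
    p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

    # Bayes' theorem
    p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
    print(round(p_disease_given_pos, 3))   # ≈ 0.161 — only about 16%, despite the "95% accurate" test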

XI. Generative Models and Naïve Bayes (Lecture 11)

Key Points:

  • Generative Models:
    • Model the joint probability P(x, y) and then use Bayes’ rule to compute the posterior P(y|x).
  • Naïve Bayes Classifier:
    • Assumes conditional independence between features given the class label.
    • Calculation: P(y|x1, x2, …, xn) ∝ P(y) ∏ᵢ₌₁ⁿ P(xi|y)
  • Application Areas:
    • Text classification (e.g., spam filtering)
    • Medical diagnosis
    • Any domain where the “naïve” independence assumption is acceptable.

Example Problem – Naïve Bayes Text Classification:
Given a simplified dataset where emails are labeled as “spam” or “not spam” and feature probabilities P(word|spam) are provided, compute the posterior probability for a new email containing specific words.

  • Steps:
    1. Compute the prior probabilities P(spam) and P(not spam).
    2. Multiply by the likelihoods P(wordi|spam) for each word in the email.
    3. Normalize to obtain the posterior probabilities.
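
A minimal sketch of these steps with made-up priors and word likelihoods (all numbers are hypothetical, chosen only to show the mechanics):

    # Hypothetical priors and per-word likelihoods P(word | class)
    priors = {"spam": 0.4, "not spam": 0.6}
    likelihoods = {
        "spam":     {"free": 0.30, "winner": 0.20, "meeting": 0.01},
        "not spam": {"free": 0.05, "winner": 0.01, "meeting": 0.20},
    }

    email_words = ["free", "winner"]

    # Unnormalized posteriors: P(y) * product of P(word | y)
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in email_words:
            score *= likelihoods[label][word]
        scores[label] = score

    # Normalize so the posteriors sum to 1
    total = sum(scores.values())
    posteriors = {label: score / total for label, score in scores.items()}
    print(posteriors)   # "spam" dominates for these made-up numbers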

XII. Summary and Study Tips

Key Takeaways:

  • Foundational Concepts:
    Understand core definitions (features, labels, loss functions) and the general ML workflow.
  • Model Selection:
    Recognize the strengths and weaknesses of different algorithms (e.g., linear vs. decision trees vs. ensemble methods).
  • Practical Considerations:
    Focus on bias–variance tradeoff, regularization techniques, and the importance of hyperparameter tuning.
  • Probability:
    Be comfortable with basic probability theory and its application in generative models.

Study Tips:

  • Practice Problems:
    Work through the example problems provided, and try variations on them.
  • Visualization:
    Draw diagrams (e.g., decision trees, regression lines) to solidify your understanding.
  • Coding Exercises:
    Implement simple models in Python (using libraries like scikit-learn) to observe theoretical concepts in practice.
  • Review Hand-Solved Problems:
    Revisit the worked examples from Lecture 09 to understand common pitfalls and solution strategies.