Machine Learning Concepts: Clustering, Classification, and Optimization
Clustering with K-Means
In k-means clustering, if the cluster centers are known, assigning points to clusters is straightforward: compute each point's distance to every center and pick the closest. Conversely, if the cluster assignments are known, the centers are easily recalculated by averaging the points within each cluster. The algorithm alternates these two steps until a stopping criterion is met, such as a threshold on the percentage decrease of the objective or no assignments changing between iterations. K-means has many equivalent global minima (relabeling the clusters leaves the solution unchanged), and in high-dimensional spaces poor local minima are common, which makes the optimization challenging.
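A minimal sketch of this alternation (Lloyd's algorithm), assuming NumPy; the function name `kmeans`, the random initialization, and the tolerance-based stopping rule are illustrative choices, not a prescribed implementation:

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-4, seed=0):
    """Alternate between assigning points to the nearest center and recomputing centers."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct training points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers barely move (checking assignment changes works too).
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels

# Toy usage: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```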
Probability and Statistics
We distinguish between discrete distributions (e.g., over the values 1, 2, 3) and continuous distributions (e.g., over real-valued quantities), such as the Uniform and Normal (Gaussian) distributions. Cross-validation, including K-fold and leave-one-out, assesses model performance by repeatedly splitting the data into training, validation, and test sets.
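A hedged sketch of K-fold splitting, assuming NumPy; the helper name `k_fold_splits` and the shuffling seed are illustrative only. Leave-one-out is the special case where the number of folds equals the number of samples:

```python
import numpy as np

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index pairs for K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)   # shuffle the indices once
    folds = np.array_split(idx, k)     # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Each of the 10 samples appears in the validation fold exactly once.
for train_idx, val_idx in k_fold_splits(10, k=5):
    print(len(train_idx), len(val_idx))
```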
- Joint Probability: P(A,B)
- Conditional Probability: P(A|B), where A is conditioned on B
- Independence: A and B are independent if P(A|B) = P(A), i.e., P(A,B) = P(A)P(B)
- Conditional Independence: A and B are conditionally independent given C if P(A|B, C) = P(A|C)
- Bayes’ Rule: P(A|B) = P(B|A) * P(A) / P(B)
Bayes’ Rule computes the conditional probability of one event given another. For instance, if A is “the patient has the disease” and B is “the test is positive,” then P(B|A) is the probability of a positive test given that the patient has the disease, P(A) is the prior probability of the disease, and P(B) is the overall probability of a positive test.
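A worked version of this example under purely hypothetical numbers (the 1% prevalence, 95% sensitivity, and 5% false-positive rate are made up for illustration):

```python
# Hypothetical numbers for the disease-testing example above.
p_disease = 0.01                # P(A): prior probability of disease
p_pos_given_disease = 0.95      # P(B|A): probability of a positive test given disease
p_pos_given_healthy = 0.05      # false-positive rate

# P(B): overall probability of a positive test (law of total probability).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # ~0.161, despite the 95% sensitivity
```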
Unbalanced Classes
A class imbalance occurs when one class is significantly more frequent than others. Metrics such as accuracy ((TP + TN) / (P + N)), precision (TP / (TP + FP)), and recall (TP / (TP + FN)) are used to evaluate performance in such scenarios. Precision is crucial when the quality of positive predictions matters, while recall is important for capturing most positive instances, even at the cost of lower accuracy; the sketch below shows why accuracy alone is misleading on imbalanced data.
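A small sketch of these metrics, assuming NumPy and binary 0/1 labels; the function `classification_metrics` is an illustrative helper, not a standard API:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from binary labels (1 = positive class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Imbalanced toy data: 9 negatives, 1 positive. A classifier that always predicts 0
# gets 90% accuracy but 0 recall and 0 precision.
print(classification_metrics([0] * 9 + [1], [0] * 10))
```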
Classification Algorithms
K-Nearest Neighbors (K-NN)
- Training points are stored for efficient spatial searching.
- To classify a point X, the K closest points in the training data are identified.
- The most frequent class among these neighbors is assigned to X.
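A brute-force sketch of this procedure, assuming NumPy; the helper `knn_predict` and the toy data are illustrative, and a spatial index (e.g., a k-d tree) would replace the linear distance scan in practice:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)  # count classes among the neighbors
    return votes.most_common(1)[0][0]             # most frequent class wins

# Toy usage: two 2-D classes.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # -> 0
```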
Naive Bayes Classifier
Training:
- Store P(c) for each class c from training data.
- For each feature x_d, construct a table or model for P(x_d|c) using the training data.
Testing:
- For each class c, compute the joint score P(c, X_test) = P(c) * p(x_1|c) * p(x_2|c) * … * p(x_d|c), which is proportional to P(c|X_test).
- Normalize over the classes to get P(c|X_test) = P(c, X_test) / sum_c'(P(c', X_test)).
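A sketch of both the training and testing steps for categorical features, assuming NumPy; the function names and the toy data are illustrative, and no smoothing is applied to feature values unseen for a class:

```python
import numpy as np
from collections import defaultdict

def train_naive_bayes(X, y):
    """Store P(c) and, for each feature d, a table of P(x_d | c) from training data."""
    classes, counts = np.unique(y, return_counts=True)
    priors = {c: n / len(y) for c, n in zip(classes, counts)}
    likelihoods = {c: [defaultdict(float) for _ in range(X.shape[1])] for c in classes}
    for c in classes:
        Xc = X[y == c]
        for d in range(X.shape[1]):
            values, v_counts = np.unique(Xc[:, d], return_counts=True)
            for v, n in zip(values, v_counts):
                likelihoods[c][d][v] = n / len(Xc)
    return priors, likelihoods

def predict_naive_bayes(priors, likelihoods, x):
    """Compute P(c) * prod_d P(x_d | c) for each class, then normalize over classes."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for d, v in enumerate(x):
            p *= likelihoods[c][d][v]   # 0 if this value was never seen for class c
        joint[c] = p
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()} if total > 0 else joint

# Toy usage with two binary features.
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
priors, likelihoods = train_naive_bayes(X, y)
print(predict_naive_bayes(priors, likelihoods, [1, 1]))
```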
Neural Networks
Recall that “affine” means linear (a dot product) plus an offset (the bias). Non-linearities between layers are essential; otherwise, the network as a whole can only compute an affine function. Weights (W), inputs (x), biases (b), and activations (a) are the key components. For regression, the network’s output can be used directly. For classification, the softmax function is applied to the outputs: softmax(z_i) = exp(z_i) / sum_j(exp(z_j)).
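A minimal forward pass for one hidden layer, assuming NumPy; the ReLU non-linearity, the layer sizes, and the random weights are illustrative choices, not part of the notes:

```python
import numpy as np

def softmax(z):
    """softmax(z_i) = exp(z_i) / sum_j exp(z_j), shifted for numerical stability."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """One hidden layer: affine -> non-linearity -> affine -> softmax."""
    a = np.maximum(0.0, W1 @ x + b1)   # affine transform followed by a ReLU non-linearity
    z = W2 @ a + b2                    # output-layer affine transform
    return softmax(z)                  # pseudo-probabilities over the classes

# Toy usage: 4 inputs, 5 hidden units, 3 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
print(forward(x, W1, b1, W2, b2))      # entries sum to 1
```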
Loss Functions
- Classification: Cross-entropy loss is used.
- Regression: Squared error loss is used.
The softmax function provides pseudo-probabilities.
Squared Error Loss
If the network output is a real number, the squared error between the output and the training targets is used as the loss function.
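A sketch of the two losses just described, assuming NumPy; `probs` stands for the softmax output and the example numbers are arbitrary:

```python
import numpy as np

def cross_entropy_loss(probs, target_class):
    """Classification loss: negative log of the softmax probability of the true class."""
    return -np.log(probs[target_class])

def squared_error_loss(y_pred, y_true):
    """Regression loss: squared difference between network output and target."""
    return (y_pred - y_true) ** 2

print(cross_entropy_loss(np.array([0.1, 0.7, 0.2]), target_class=1))  # ~0.357
print(squared_error_loss(2.5, 3.0))                                    # 0.25
```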
Regularization
Neural networks are prone to overfitting. A regularization term that encourages small weights, such as a penalty on the magnitude of the weights, is added to the objective function.
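A sketch of one common form of such a penalty, an L2 term, which is an assumption here since the notes do not fix a specific penalty; the coefficient `lam` is an illustrative hyperparameter:

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-3):
    """Add an L2 penalty lam * ||W||^2 to the data loss to encourage small weights."""
    penalty = lam * sum(np.sum(W ** 2) for W in weights)
    return data_loss + penalty

# Toy usage: same data loss, larger weights -> larger regularized objective.
W_small = [np.full((3, 3), 0.1)]
W_large = [np.full((3, 3), 1.0)]
print(l2_regularized_loss(0.5, W_small), l2_regularized_loss(0.5, W_large))
```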
Optimization
The objective function is a sum over all data points. Gradient descent adjusts weights to reduce error. Gradient computation can be based on:
- All of the data (full-batch gradient descent, expensive for large datasets).
- A single random data point (stochastic gradient descent, SGD).
- A small random subset of data points (mini-batch gradient descent).
Random point selection helps avoid poor local minima. Back-propagation efficiently computes gradients, avoiding redundant calculations.
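A sketch of the mini-batch variant, assuming NumPy; the least-squares example, learning rate, and batch size are illustrative choices (batch_size=1 recovers plain SGD, and batch_size=len(X) recovers full-batch gradient descent):

```python
import numpy as np

def sgd(w, X, y, grad_fn, lr=0.1, batch_size=1, n_steps=100, seed=0):
    """Gradient descent where each step uses a random (mini-)batch of data points."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # random subset
        w = w - lr * grad_fn(w, X[idx], y[idx])                   # step against the gradient
    return w

# Toy usage: least-squares linear regression with the gradient of the mean squared error.
def mse_grad(w, Xb, yb):
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)
print(sgd(np.zeros(2), X, y, mse_grad, batch_size=10, n_steps=500))  # close to [1.5, -0.5]
```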
Dimensionality
The dimensionality of data refers to the number of variables or features needed to describe a data point; it is sometimes called the “degrees of freedom.” For example, 3-dimensional data requires three numbers (x, y, z). Data can often be represented with fewer dimensions than the raw feature count by ignoring noise, meaning the intrinsic dimensionality is lower than the number of recorded features. Redundant features contribute no new information.
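A hypothetical illustration, assuming NumPy: 3-D points constructed to lie on a 2-D plane, so only two singular values of the data matrix are non-negligible (the plane z = x + y is made up for this example):

```python
import numpy as np

# 3-D points that actually live on a 2-D plane (z = x + y), so two numbers
# suffice to describe each point; the third coordinate is redundant.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))                      # 2 intrinsic coordinates
X = np.column_stack([latent[:, 0], latent[:, 1], latent[:, 0] + latent[:, 1]])

# The singular values reveal the intrinsic dimensionality: the third is ~0.
print(np.round(np.linalg.svd(X, compute_uv=False), 3))
```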