Machine Learning: From Perceptron to CNNs and LSTMs
PPT 1: Discriminative and Generative Probability
Discriminative Probability: p(c|x), modeled directly from the inputs.
Generative Probability: p(c) * p(x|c), i.e., a class prior times a class-conditional likelihood.
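A minimal sketch (with made-up numbers) of how a generative model's p(c) and p(x|c) combine into the discriminative quantity p(c|x) via Bayes' rule:

    # Toy illustration: a generative model stores p(c) and p(x|c);
    # Bayes' rule turns them into the discriminative quantity p(c|x).
    p_c = {"spam": 0.3, "ham": 0.7}              # prior p(c)
    p_x_given_c = {"spam": 0.05, "ham": 0.001}   # likelihood p(x|c) for one observed email x

    # Unnormalized posterior p(c) * p(x|c), then normalize by p(x) = sum_c p(c) p(x|c)
    joint = {c: p_c[c] * p_x_given_c[c] for c in p_c}
    p_x = sum(joint.values())
    p_c_given_x = {c: joint[c] / p_x for c in joint}
    print(p_c_given_x)                           # e.g. {'spam': ~0.955, 'ham': ~0.045}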
Examples:
- Classification: Spam filtering, face recognition
- Regression: Weather, stock trend prediction
- Ranking: Web search, finding similar images
- Collaborative Filtering: Recommendation
- Clustering: Grouping similar things
- Embeddings: Visualizing data
PPT 2: The Perceptron Algorithm
- Start with the all-zeros weight vector w1 = 0 and initialize t to 1. Scale all examples x to have Euclidean length 1 (this does not change which side of the hyperplane they fall on).
- Given example x, predict positive if and only if wt * x > 0.
- On a mistake on a positive example: w(t+1) = wt + x.
- On a mistake on a negative example: w(t+1) = wt - x.
- After each mistake, set t to t + 1.
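A minimal numpy sketch of this procedure (labels are assumed to be +1/-1; the function name and loop structure are mine, not from the slides):

    import numpy as np

    def perceptron(X, y, max_passes=100):
        """Perceptron: X is (n, d), y in {+1, -1}."""
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale examples to Euclidean length 1
        w = np.zeros(X.shape[1])                          # start with the all-zeros weight vector
        for _ in range(max_passes):
            mistakes = 0
            for x, label in zip(X, y):
                pred = 1 if (w @ x) > 0 else -1           # predict positive iff w.x > 0
                if pred != label:
                    w = w + label * x                     # +x on a positive mistake, -x on a negative one
                    mistakes += 1
            if mistakes == 0:                             # a full pass with no mistakes: training data separated
                break
        return w

For linearly separable data the outer loop stops as soon as a full pass makes no mistakes.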
Problems with the Perceptron Algorithm
- If training data is not linearly separable, there are no guarantees of convergence or training accuracy.
- Even if the training data is linearly separable, the perceptron can overfit.
- The averaged perceptron is an algorithm modification that helps with both issues.
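One common form of the averaged perceptron keeps a running sum of the weight vector after every example and predicts with the average; a short sketch of that idea (same +1/-1 label convention as above, not code from the slides):

    import numpy as np

    def averaged_perceptron(X, y, passes=10):
        """Averaged perceptron: predict with the average of all intermediate weight vectors."""
        w = np.zeros(X.shape[1])
        w_sum = np.zeros(X.shape[1])
        steps = 0
        for _ in range(passes):
            for x, label in zip(X, y):
                pred = 1 if (w @ x) > 0 else -1
                if pred != label:
                    w = w + label * x
                w_sum += w                     # accumulate after every example, mistake or not
                steps += 1
        return w_sum / steps                   # averaged weights are less sensitive to the last few updates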
Activation Functions
- Linear: y = z
- ReLU: y = max(0, z)
- Soft ReLU (softplus): y = log(1 + e^z)
- Hard Threshold: y = 1 if z > 0; y = 0 otherwise
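The four activations above written out in numpy (np.logaddexp gives a numerically stable softplus):

    import numpy as np

    def linear(z):
        return z

    def relu(z):
        return np.maximum(0.0, z)

    def soft_relu(z):                       # softplus: log(1 + e^z), computed stably
        return np.logaddexp(0.0, z)

    def hard_threshold(z):
        return (z > 0).astype(float)        # 1 if z > 0, else 0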
PPT 3: Expressive Power, Backpropagation, and Chain Rule
Expressive Power
Deep linear networks are no more expressive than linear regression, since a composition of linear maps is itself linear. Boolean circuits built from AND, OR, and NOT gates can be translated into a feed-forward neural network with hard-threshold units.
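A quick numpy check of the first claim: stacking linear layers collapses into a single linear map, so depth alone adds no expressive power (the random matrices are just for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((5, 4))        # first linear layer
    W2 = rng.standard_normal((3, 5))        # second linear layer
    x = rng.standard_normal(4)

    deep = W2 @ (W1 @ x)                    # two-layer "deep" linear network
    shallow = (W2 @ W1) @ x                 # equivalent single linear layer
    print(np.allclose(deep, shallow))       # True: same function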
Backpropagation
d f(x(t)) / dt = df/dx * dx/dt
Least Squares Model
z = wx + b, y = sigmoid(z), loss L = 1/2 (y - t)^2
Univariate Chain Rule Loss Derivative
dL/dw = (sigmoid(wx + b) – t) * sigmoid_derivative(wx + b) * x
dL/db = (sigmoid(wx + b) – t) * sigmoid_derivative(wx + b)
dL/dy = y – t
dL/dz = dL/dy * sigmoid_derivative(z)
dL/dw = dL/dz * x
dL/db = dL/dz
dL/dy is also written ybar; more generally, the derivative of the loss with respect to any variable v in the computation graph is called the error signal for v and denoted vbar.
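A minimal sketch of these formulas on a single training example, with a finite-difference check on dL/dw (the numbers are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def loss(w, b, x, t):
        y = sigmoid(w * x + b)
        return 0.5 * (y - t) ** 2

    w, b, x, t = 0.7, -0.2, 1.5, 1.0

    # Forward pass
    z = w * x + b
    y = sigmoid(z)

    # Backward pass (error signals), following the equations above
    y_bar = y - t                           # dL/dy
    z_bar = y_bar * y * (1 - y)             # dL/dz, using sigmoid'(z) = y(1 - y)
    w_bar = z_bar * x                       # dL/dw
    b_bar = z_bar                           # dL/db

    # Finite-difference check on dL/dw
    eps = 1e-6
    numeric = (loss(w + eps, b, x, t) - loss(w - eps, b, x, t)) / (2 * eps)
    print(np.isclose(w_bar, numeric))       # True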
Multivariate Chain Rule
Problem: What if the computation graph has a fan-out > 1?
Solution: Use the multivariate chain rule.
d(f(x(t), y(t)))/dt = df/dx * dx/dt + df/dy * dy/dt
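A tiny numeric illustration (my own example, not from the slides) with f(x, y) = x * y, x(t) = t^2, y(t) = sin(t): the contributions through x and y are summed, and a finite-difference check agrees:

    import numpy as np

    t = 0.8
    x, y = t ** 2, np.sin(t)

    # Multivariate chain rule: df/dt = df/dx * dx/dt + df/dy * dy/dt, with f(x, y) = x * y
    df_dt = y * (2 * t) + x * np.cos(t)

    # Finite-difference check on f(t) = t^2 * sin(t)
    eps = 1e-6
    f = lambda t: (t ** 2) * np.sin(t)
    numeric = (f(t + eps) - f(t - eps)) / (2 * eps)
    print(np.isclose(df_dt, numeric))       # True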
Two examples worked through with the multivariate chain rule are L2-regularized regression (where the weight w has fan-out 2: it feeds into both the prediction z and the regularizer R) and multiclass logistic regression.
L2 Model
z = wx + b, y = sigmoid(z), L = 1/2 (y - t)^2, R = 1/2 w^2, Lreg = L + lambda * R
Multiclass LR Model
z_l = sum_j(w_lj * x_j) + b_l, y_k = e^(z_k) / sum_l(e^(z_l)), L = -sum_k(t_k * log(y_k))
Backward Pass L2 Model
Lregbar = 1
Rbar = Lregbar * lambda
Lbar = Lregbar
ybar = Lbar * (y - t)
zbar = ybar * sigmoid_derivative(z)
wbar = zbar * x + Rbar * w
bbar = zbar
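The same backward pass as executable numpy, with a finite-difference check on wbar (the parameter values are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def reg_loss(w, b, x, t, lam):
        y = sigmoid(w * x + b)
        return 0.5 * (y - t) ** 2 + lam * 0.5 * w ** 2

    w, b, x, t, lam = 0.7, -0.2, 1.5, 1.0, 0.1

    # Forward pass
    z = w * x + b
    y = sigmoid(z)

    # Backward pass, mirroring the equations above
    Lreg_bar = 1.0
    R_bar = Lreg_bar * lam
    L_bar = Lreg_bar
    y_bar = L_bar * (y - t)
    z_bar = y_bar * y * (1 - y)             # sigmoid'(z) = y(1 - y)
    w_bar = z_bar * x + R_bar * w
    b_bar = z_bar

    eps = 1e-6
    numeric = (reg_loss(w + eps, b, x, t, lam) - reg_loss(w - eps, b, x, t, lam)) / (2 * eps)
    print(np.isclose(w_bar, numeric))       # True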
Backward Pass MLP (Multiclass LR)
Lbar = 1
ykbar = Lbar * (yk - tk)
wki(2)bar = ykbar * hi
bk(2)bar = ykbar
hibar = sum_k(ykbar * wki(2))
zibar = hibar * sigmoid_derivative(zi)
wij(1)bar = zibar * xj
bi(1)bar = zibar
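A vectorized numpy sketch of this backward pass for one example, assuming a sigmoid hidden layer and a softmax output trained with cross-entropy (for that pairing the error signal at the output layer works out to y - t, matching ykbar above); the layer sizes and random initialization are mine:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    D, H, K = 4, 5, 3                        # input, hidden, output sizes
    x = rng.standard_normal(D)
    t = np.eye(K)[1]                         # one-hot target
    W1, b1 = rng.standard_normal((H, D)), np.zeros(H)
    W2, b2 = rng.standard_normal((K, H)), np.zeros(K)

    # Forward pass
    z = W1 @ x + b1
    h = sigmoid(z)
    u = W2 @ h + b2
    y = np.exp(u - u.max()); y /= y.sum()    # softmax
    L = -np.sum(t * np.log(y))               # cross-entropy loss

    # Backward pass (error signals), mirroring the equations above
    y_bar = y - t                            # output-layer error for softmax + cross-entropy
    W2_bar = np.outer(y_bar, h)
    b2_bar = y_bar
    h_bar = W2.T @ y_bar
    z_bar = h_bar * h * (1 - h)              # sigmoid'(z) = h(1 - h)
    W1_bar = np.outer(z_bar, x)
    b1_bar = z_bar
    print(L, W1_bar.shape, W2_bar.shape)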
Computational Cost
- Forward pass: One add-multiply operation per weight.
- Backward pass: Two add-multiply operations per weight.
Rule of thumb: the backward pass is about twice as expensive as the forward pass. For an MLP, this means the cost is linear in the number of layers and quadratic in the number of units per layer.
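As a rough count with made-up layer widths, the rule of thumb works out like this:

    # Rough operation count for an MLP with these layer widths (numbers are made up).
    widths = [784, 1000, 1000, 10]
    n_weights = sum(a * b for a, b in zip(widths[:-1], widths[1:]))
    forward_ops = n_weights                  # ~1 add-multiply per weight
    backward_ops = 2 * n_weights             # ~2 add-multiply per weight
    print(n_weights, forward_ops, backward_ops)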
PPT 4 & 5: N-gram, GloVe, Gradient Problems, and LSTM
Problems with N-gram
- Data sparsity: Most n-grams never appear in the corpus, even if they are possible.
- The number of entries in the conditional probability table is exponential in the context length.
Solutions for Data Sparsity
- Use a short context (but this means the model is less powerful).
- Smooth the probabilities by adding imaginary counts.
- Make predictions using an ensemble of n-gram models with different n; the last two fixes are sketched below.
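A minimal sketch of those two fixes on a toy corpus: add-alpha smoothing ("imaginary counts") plus interpolation of a bigram and a unigram model (the corpus, alpha, and mixing weight are all made up):

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()
    V = sorted(set(corpus))

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus[:-1], corpus[1:]))

    def p_unigram(w, alpha=0.1):
        # Add-alpha smoothing: pretend every word was seen alpha extra times
        return (unigrams[w] + alpha) / (len(corpus) + alpha * len(V))

    def p_bigram(w, prev, alpha=0.1):
        return (bigrams[(prev, w)] + alpha) / (unigrams[prev] + alpha * len(V))

    def p_interpolated(w, prev, lam=0.7):
        # Ensemble of n-gram models with different n (here n = 1 and n = 2)
        return lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w)

    print(p_interpolated("sat", "cat"))      # nonzero even for unseen or sparse contexts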
GloVe Problems
- The co-occurrence matrix X is extremely large, so fitting the factorization using least squares is infeasible.
- Solution: Reweight the entries so that only nonzero counts matter.
- Word counts follow a heavy-tailed distribution, so the most common words would dominate the cost function.
- Solution: Approximate log(X_ij) instead of X_ij.
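Putting the two fixes together gives the GloVe-style weighted least-squares objective; a sketch with tiny random embeddings just to show the shape of the cost (x_max and alpha are the commonly cited defaults):

    import numpy as np

    def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
        """Weighted least-squares fit of w_i . w~_j + b_i + b~_j to log X_ij."""
        loss = 0.0
        for i, j in zip(*np.nonzero(X)):                 # only nonzero co-occurrence counts contribute
            f = min(1.0, (X[i, j] / x_max) ** alpha)     # reweighting caps the most common pairs
            err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
            loss += f * err ** 2
        return loss

    rng = np.random.default_rng(0)
    V, d = 6, 3                                          # toy vocabulary size and embedding dimension
    X = rng.integers(0, 5, size=(V, V)).astype(float)    # fake co-occurrence counts
    W, W_tilde = rng.standard_normal((V, d)), rng.standard_normal((V, d))
    b, b_tilde = np.zeros(V), np.zeros(V)
    print(glove_loss(X, W, W_tilde, b, b_tilde))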
Gradient Vanishing and Exploding Problems
- Solution 1: Gradient clipping (a sketch follows this list).
- Solution 2: Reverse the input sequence.
- Main Solution: LSTM
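For Solution 1, a minimal sketch of gradient clipping by global norm (the threshold is arbitrary):

    import numpy as np

    def clip_gradients(grads, max_norm=5.0):
        """Rescale a list of gradient arrays if their overall norm exceeds max_norm."""
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > max_norm:
            scale = max_norm / total_norm
            grads = [g * scale for g in grads]           # direction kept, magnitude capped
        return grads

    grads = [np.array([30.0, -40.0]), np.array([5.0])]   # exploding-ish gradients
    print(clip_gradients(grads))                          # overall norm reduced to 5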
CNN Non-linear Rectification
Convolution is a linear operation, so we need a nonlinearity between convolution layers; otherwise, two layers would be no more powerful than one. Rectification (ReLU) also makes the gradients sparse, which helps optimization.
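A tiny 1-D illustration: stacked convolutions with no nonlinearity collapse into a single convolution, while inserting a ReLU between them breaks the collapse (the filters are arbitrary):

    import numpy as np

    x = np.array([1.0, -2.0, 3.0, 0.5, -1.0])
    k1 = np.array([1.0, -1.0])                            # first convolution filter
    k2 = np.array([0.5, 0.5])                             # second convolution filter

    # Two convolutions with no nonlinearity collapse into one linear operation
    linear_stack = np.convolve(np.convolve(x, k1, mode="full"), k2, mode="full")
    single_conv = np.convolve(x, np.convolve(k1, k2, mode="full"), mode="full")
    print(np.allclose(linear_stack, single_conv))         # True: no extra power from depth

    # Inserting a ReLU between the two layers breaks the collapse
    relu = lambda v: np.maximum(0.0, v)
    nonlinear_stack = np.convolve(relu(np.convolve(x, k1, mode="full")), k2, mode="full")
    print(np.allclose(nonlinear_stack, single_conv))      # False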