Machine Learning: From Perceptron to CNNs and LSTMs
PPT 1: Discriminative and Generative Probability
Discriminative Probability: p(c|x), modeled directly from the inputs.
Generative Probability: p(c) * p(x|c), i.e., a class prior times a class-conditional likelihood.
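A minimal sketch (with made-up numbers) of how a generative model's p(c) and p(x|c) combine into the discriminative quantity p(c|x) via Bayes' rule:

    # Toy illustration: a generative model stores p(c) and p(x|c);
    # Bayes' rule turns them into the discriminative quantity p(c|x).
    p_c = {"spam": 0.3, "ham": 0.7}              # prior p(c)
    p_x_given_c = {"spam": 0.05, "ham": 0.001}   # likelihood p(x|c) for one observed email x

    # Unnormalized posterior p(c) * p(x|c), then normalize by p(x) = sum_c p(c) p(x|c)
    joint = {c: p_c[c] * p_x_given_c[c] for c in p_c}
    p_x = sum(joint.values())
    p_c_given_x = {c: joint[c] / p_x for c in joint}
    print(p_c_given_x)                           # e.g. {'spam': ~0.955, 'ham': ~0.045}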
Examples:
- Classification: Spam filtering, face recognition
- Regression: Weather, stock trend prediction
- Ranking: Web search, finding similar images
- Collaborative Filtering: Recommendation
- Clustering: Grouping similar things
- Embeddings: Visualizing data
PPT 2: The Perceptron Algorithm
- Start with the all-zeros weight vector w1 = 0 and initialize t to 1. Scale all examples x to have Euclidean length 1 (this does not change which side of the hyperplane they fall on).
- Given example x, predict positive if and only if wt * x > 0.
- On a mistake on a positive example: w(t+1) = wt + x.
- On a mistake on a negative example: w(t+1) = wt - x.
- After each mistake, set t to t + 1.
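A minimal numpy sketch of this procedure (labels are assumed to be +1/-1; the function name and loop structure are mine, not from the slides):

    import numpy as np

    def perceptron(X, y, max_passes=100):
        """Perceptron: X is (n, d), y in {+1, -1}."""
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale examples to Euclidean length 1
        w = np.zeros(X.shape[1])                          # start with the all-zeros weight vector
        for _ in range(max_passes):
            mistakes = 0
            for x, label in zip(X, y):
                pred = 1 if (w @ x) > 0 else -1           # predict positive iff w.x > 0
                if pred != label:
                    w = w + label * x                     # +x on a positive mistake, -x on a negative one
                    mistakes += 1
            if mistakes == 0:                             # a full pass with no mistakes: training data separated
                break
        return w

For linearly separable data the outer loop stops as soon as a full pass makes no mistakes.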
Problems with the Perceptron Algorithm
- If training data is not linearly separable, there are no guarantees of convergence or training accuracy.
- Even if the training data is linearly separable, the perceptron can overfit.
- The averaged perceptron is an algorithm modification that helps with both issues.
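One common form of the averaged perceptron keeps a running sum of the weight vector after every example and predicts with the average; a short sketch of that idea (same +1/-1 label convention as above, not code from the slides):

    import numpy as np

    def averaged_perceptron(X, y, passes=10):
        """Averaged perceptron: predict with the average of all intermediate weight vectors."""
        w = np.zeros(X.shape[1])
        w_sum = np.zeros(X.shape[1])
        steps = 0
        for _ in range(passes):
            for x, label in zip(X, y):
                pred = 1 if (w @ x) > 0 else -1
                if pred != label:
                    w = w + label * x
                w_sum += w                     # accumulate after every example, mistake or not
                steps += 1
        return w_sum / steps                   # averaged weights are less sensitive to the last few updates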
Activation Functions
- Linear: y = z
- ReLU: y = max(0, z)
- Soft ReLU (softplus): y = log(1 + e^z)
- Hard Threshold: y = 1 if z > 0; y = 0 otherwise
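The four activations above written out in numpy (np.logaddexp gives a numerically stable softplus):

    import numpy as np

    def linear(z):
        return z

    def relu(z):
        return np.maximum(0.0, z)

    def soft_relu(z):                       # softplus: log(1 + e^z), computed stably
        return np.logaddexp(0.0, z)

    def hard_threshold(z):
        return (z > 0).astype(float)        # 1 if z > 0, else 0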
PPT 3: Expressive Power, Backpropagation, and Chain Rule
Expressive Power
Deep linear networks are no more expressive than linear regression, since a composition of linear maps is itself linear. Boolean circuits built from AND, OR, and NOT gates can be translated into a feed-forward neural network with hard-threshold units.
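A quick numpy check of the first claim: stacking linear layers collapses into a single linear map, so depth alone adds no expressive power (the random matrices are just for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((5, 4))        # first linear layer
    W2 = rng.standard_normal((3, 5))        # second linear layer
    x = rng.standard_normal(4)

    deep = W2 @ (W1 @ x)                    # two-layer "deep" linear network
    shallow = (W2 @ W1) @ x                 # equivalent single linear layer
    print(np.allclose(deep, shallow))       # True: same function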
Backpropagation
d f(x(t)) / dt = df/dx * dx/dt
Least Squares Model
z = wx + b, y = sigmoid(z), loss L = 1/2 (y - t)^2
Univariate Chain Rule Loss Derivative
dL/dw = (sigmoid(wx + b) – t) * sigmoid_derivative(wx + b) * x
dL/db = (sigmoid(wx + b) – t) * sigmoid_derivative(wx + b)
dL/dy = y – t
dL/dz = dL/dy * sigmoid_derivative(z)
dL/dw = dL/dz * x
dL/db = dL/dz
dL/dy is also written ybar; more generally, the derivative of the loss with respect to any variable v in the computation graph is called the error signal for v and denoted vbar.
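A minimal sketch of these formulas on a single training example, with a finite-difference check on dL/dw (the numbers are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def loss(w, b, x, t):
        y = sigmoid(w * x + b)
        return 0.5 * (y - t) ** 2

    w, b, x, t = 0.7, -0.2, 1.5, 1.0

    # Forward pass
    z = w * x + b
    y = sigmoid(z)

    # Backward pass (error signals), following the equations above
    y_bar = y - t                           # dL/dy
    z_bar = y_bar * y * (1 - y)             # dL/dz, using sigmoid'(z) = y(1 - y)
    w_bar = z_bar * x                       # dL/dw
    b_bar = z_bar                           # dL/db

    # Finite-difference check on dL/dw
    eps = 1e-6
    numeric = (loss(w + eps, b, x, t) - loss(w - eps, b, x, t)) / (2 * eps)
    print(np.isclose(w_bar, numeric))       # True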
Multivariate Chain Rule
Problem: What if the computation graph has a fan-out > 1?
Solution: Use the multivariate chain rule.
d(f(x(t), y(t)))/dt = df/dx * dx/dt + df/dy * dy/dt
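A tiny numeric illustration (my own example, not from the slides) with f(x, y) = x * y, x(t) = t^2, y(t) = sin(t): the contributions through x and y are summed, and a finite-difference check agrees:

    import numpy as np

    t = 0.8
    x, y = t ** 2, np.sin(t)

    # Multivariate chain rule: df/dt = df/dx * dx/dt + df/dy * dy/dt, with f(x, y) = x * y
    df_dt = y * (2 * t) + x * np.cos(t)

    # Finite-difference check on f(t) = t^2 * sin(t)
    eps = 1e-6
    f = lambda t: (t ** 2) * np.sin(t)
    numeric = (f(t + eps) - f(t - eps)) / (2 * eps)
    print(np.isclose(df_dt, numeric))       # True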
Two examples worked through with the multivariate chain rule are L2-regularized regression (where the weight w has fan-out 2: it feeds into both the prediction z and the regularizer R) and multiclass logistic regression.
L2 Model
z = wx + b, y = sigmoid(z), L = 1/2 (y - t)^2, R = 1/2 w^2, Lreg = L + lambda * R
Multiclass LR Model
z_l = sum_j(w_lj * x_j) + b_l, y_k = e^(z_k) / sum_l(e^(z_l)), L = -sum_k(t_k * log(y_k))
Backward Pass L2 Model
Lregbar = 1
Rbar = Lregbar * lambda
Lbar = Lregbar
ybar = Lbar * (y - t)
zbar = ybar * sigmoid_derivative(z)
wbar = zbar * x + Rbar * w
bbar = zbar
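The same backward pass as executable numpy, with a finite-difference check on wbar (the parameter values are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def reg_loss(w, b, x, t, lam):
        y = sigmoid(w * x + b)
        return 0.5 * (y - t) ** 2 + lam * 0.5 * w ** 2

    w, b, x, t, lam = 0.7, -0.2, 1.5, 1.0, 0.1

    # Forward pass
    z = w * x + b
    y = sigmoid(z)

    # Backward pass, mirroring the equations above
    Lreg_bar = 1.0
    R_bar = Lreg_bar * lam
    L_bar = Lreg_bar
    y_bar = L_bar * (y - t)
    z_bar = y_bar * y * (1 - y)             # sigmoid'(z) = y(1 - y)
    w_bar = z_bar * x + R_bar * w
    b_bar = z_bar

    eps = 1e-6
    numeric = (reg_loss(w + eps, b, x, t, lam) - reg_loss(w - eps, b, x, t, lam)) / (2 * eps)
    print(np.isclose(w_bar, numeric))       # True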
Backward Pass MLP (Multiclass LR)
Lbar = 1
ykbar = Lbar * (yk - tk)
wki(2)bar = ykbar * hi
bk(2)bar = ykbar
hibar = sum_k(ykbar * wki(2))
zibar = hibar * sigmoid_derivative(zi)
wij(1)bar = zibar * xj
bi(1)bar = zibar
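A vectorized numpy sketch of this backward pass for one example, assuming a sigmoid hidden layer and a softmax output trained with cross-entropy (for that pairing the error signal at the output layer works out to y - t, matching ykbar above); the layer sizes and random initialization are mine:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    D, H, K = 4, 5, 3                        # input, hidden, output sizes
    x = rng.standard_normal(D)
    t = np.eye(K)[1]                         # one-hot target
    W1, b1 = rng.standard_normal((H, D)), np.zeros(H)
    W2, b2 = rng.standard_normal((K, H)), np.zeros(K)

    # Forward pass
    z = W1 @ x + b1
    h = sigmoid(z)
    u = W2 @ h + b2
    y = np.exp(u - u.max()); y /= y.sum()    # softmax
    L = -np.sum(t * np.log(y))               # cross-entropy loss

    # Backward pass (error signals), mirroring the equations above
    y_bar = y - t                            # output-layer error for softmax + cross-entropy
    W2_bar = np.outer(y_bar, h)
    b2_bar = y_bar
    h_bar = W2.T @ y_bar
    z_bar = h_bar * h * (1 - h)              # sigmoid'(z) = h(1 - h)
    W1_bar = np.outer(z_bar, x)
    b1_bar = z_bar
    print(L, W1_bar.shape, W2_bar.shape)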
Computational Cost
- Forward pass: One add-multiply operation per weight.
- Backward pass: Two add-multiply operations per weight.
Rule of thumb: the backward pass is about twice as expensive as the forward pass. For an MLP, this means the cost is linear in the number of layers and quadratic in the number of units per layer.
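As a rough count with made-up layer widths, the rule of thumb works out like this:

    # Rough operation count for an MLP with these layer widths (numbers are made up).
    widths = [784, 1000, 1000, 10]
    n_weights = sum(a * b for a, b in zip(widths[:-1], widths[1:]))
    forward_ops = n_weights                  # ~1 add-multiply per weight
    backward_ops = 2 * n_weights             # ~2 add-multiply per weight
    print(n_weights, forward_ops, backward_ops)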
PPT 4 & 5: N-gram, GloVe, Gradient Problems, and LSTM
Problems with N-gram
- Data sparsity: Most n-grams never appear in the corpus, even if they are possible.
- The number of entries in the conditional probability table is exponential in the context length.
Solutions for Data Sparsity
- Use a short context (but this means the model is less powerful).
- Smooth the probabilities by adding imaginary counts.
- Make predictions using an ensemble of n-gram models with different n; the last two fixes are sketched below.
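A minimal sketch of those two fixes on a toy corpus: add-alpha smoothing ("imaginary counts") plus interpolation of a bigram and a unigram model (the corpus, alpha, and mixing weight are all made up):

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()
    V = sorted(set(corpus))

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus[:-1], corpus[1:]))

    def p_unigram(w, alpha=0.1):
        # Add-alpha smoothing: pretend every word was seen alpha extra times
        return (unigrams[w] + alpha) / (len(corpus) + alpha * len(V))

    def p_bigram(w, prev, alpha=0.1):
        return (bigrams[(prev, w)] + alpha) / (unigrams[prev] + alpha * len(V))

    def p_interpolated(w, prev, lam=0.7):
        # Ensemble of n-gram models with different n (here n = 1 and n = 2)
        return lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w)

    print(p_interpolated("sat", "cat"))      # nonzero even for unseen or sparse contexts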
GloVe Problems
- The co-occurrence matrix X is extremely large, so fitting the factorization using least squares is infeasible.
- Solution: Reweight the entries so that only nonzero counts matter.
- Word counts follow a heavy-tailed distribution, so the most common words would dominate the cost function.
- Solution: Approximate log(X_ij) instead of X_ij.
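Putting the two fixes together gives the GloVe-style weighted least-squares objective; a sketch with tiny random embeddings just to show the shape of the cost (x_max and alpha are the commonly cited defaults):

    import numpy as np

    def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
        """Weighted least-squares fit of w_i . w~_j + b_i + b~_j to log X_ij."""
        loss = 0.0
        for i, j in zip(*np.nonzero(X)):                 # only nonzero co-occurrence counts contribute
            f = min(1.0, (X[i, j] / x_max) ** alpha)     # reweighting caps the most common pairs
            err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
            loss += f * err ** 2
        return loss

    rng = np.random.default_rng(0)
    V, d = 6, 3                                          # toy vocabulary size and embedding dimension
    X = rng.integers(0, 5, size=(V, V)).astype(float)    # fake co-occurrence counts
    W, W_tilde = rng.standard_normal((V, d)), rng.standard_normal((V, d))
    b, b_tilde = np.zeros(V), np.zeros(V)
    print(glove_loss(X, W, W_tilde, b, b_tilde))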
Gradient Vanishing and Exploding Problems
- Solution 1: Gradient clipping (a sketch follows this list).
- Solution 2: Reverse the input sequence.
- Main Solution: LSTM
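For Solution 1, a minimal sketch of gradient clipping by global norm (the threshold is arbitrary):

    import numpy as np

    def clip_gradients(grads, max_norm=5.0):
        """Rescale a list of gradient arrays if their overall norm exceeds max_norm."""
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > max_norm:
            scale = max_norm / total_norm
            grads = [g * scale for g in grads]           # direction kept, magnitude capped
        return grads

    grads = [np.array([30.0, -40.0]), np.array([5.0])]   # exploding-ish gradients
    print(clip_gradients(grads))                          # overall norm reduced to 5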
CNN Non-linear Rectification
Convolution is a linear operation, so we need a nonlinearity between convolution layers; otherwise, two layers would be no more powerful than one. Rectification (ReLU) also makes the gradients sparse, which helps optimization.
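A tiny 1-D illustration: stacked convolutions with no nonlinearity collapse into a single convolution, while inserting a ReLU between them breaks the collapse (the filters are arbitrary):

    import numpy as np

    x = np.array([1.0, -2.0, 3.0, 0.5, -1.0])
    k1 = np.array([1.0, -1.0])                            # first convolution filter
    k2 = np.array([0.5, 0.5])                             # second convolution filter

    # Two convolutions with no nonlinearity collapse into one linear operation
    linear_stack = np.convolve(np.convolve(x, k1, mode="full"), k2, mode="full")
    single_conv = np.convolve(x, np.convolve(k1, k2, mode="full"), mode="full")
    print(np.allclose(linear_stack, single_conv))         # True: no extra power from depth

    # Inserting a ReLU between the two layers breaks the collapse
    relu = lambda v: np.maximum(0.0, v)
    nonlinear_stack = np.convolve(relu(np.convolve(x, k1, mode="full")), k2, mode="full")
    print(np.allclose(nonlinear_stack, single_conv))      # False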