Understanding Gradient Descent and Regression Techniques

Gradient Descent

Definition: An optimization algorithm used in machine learning to minimize the cost function of a model by iteratively adjusting its parameters in the opposite direction of the gradient.

How Does Gradient Descent Work?

  1. Gradient descent is an optimization algorithm used to minimize the cost function of a model.
  2. The cost function measures how well the model fits the training data and is defined based on the difference between the predicted and actual values.
  3. The gradient of the cost function is the vector of partial derivatives of the cost with respect to the model’s parameters, and it points in the direction of the steepest ascent.
  4. The algorithm starts with an initial set of parameters and updates them in small steps to minimize the cost function.
  5. In each iteration of the algorithm, the gradient of the cost function with respect to each parameter is computed.
  6. The gradient tells us the direction of the steepest ascent, and by moving in the opposite direction, we can find the direction of the steepest descent.
  7. The size of the step is controlled by the learning rate, which determines how quickly the algorithm moves towards the minimum.
  8. The process is repeated until the cost function converges to a minimum, indicating that the model has reached the optimal set of parameters.
  9. There are different variations of gradient descent, including batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, each with its own advantages and limitations.
  10. Efficient implementation of gradient descent is essential for achieving good performance in machine learning tasks. The choice of the learning rate and the number of iterations can significantly impact the performance of the algorithm.
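
To make these steps concrete, here is a minimal sketch in Python of gradient descent on a simple one-dimensional cost function; the cost function, starting value, learning rate, and stopping rule are all arbitrary choices for illustration:

    def cost(w):
        # A simple convex cost function with its minimum at w = 3.
        return (w - 3) ** 2

    def gradient(w):
        # Derivative of the cost function with respect to w.
        return 2 * (w - 3)

    w = 10.0             # initial parameter value
    learning_rate = 0.1  # controls how large each update is

    for iteration in range(1000):
        grad = gradient(w)           # direction of steepest ascent
        step = learning_rate * grad  # step size
        w = w - step                 # move opposite to the gradient
        if abs(step) < 0.001:        # stop once the steps become tiny
            break

    print(w)  # close to 3, the minimizer of the cost function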

Gradient Descent Summary

The sum of the squared residuals is just one type of loss function; there are many other types of loss functions, but gradient descent works the same way. Specifically, the steps of the gradient descent algorithm are:

  1. Take the derivative of the loss function for each parameter in it. In other words, take the gradient of the loss function.
  2. Pick random values for the parameters.
  3. Plug the parameter values into the derivatives.
  4. Calculate the step sizes: (Step size = Slope x learning rate)
  5. New parameter = old parameter – step size
  6. Go back to step 3 and repeat until the step size is very small (e.g., 0.001 or less) or until a maximum number of iterations is reached (commonly around 1,000 epochs).
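
Applied to a simple linear regression whose loss is the sum of squared residuals, these steps might look like the following sketch; the data, learning rate, and thresholds are made up for the example:

    # Toy data for y ≈ intercept + slope * x.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 8.1, 9.8]

    intercept, slope = 0.0, 0.0  # step 2: pick starting values
    learning_rate = 0.01

    for epoch in range(1000):  # step 6: repeat up to 1,000 times
        # Step 3: plug the current parameters into the derivatives
        # of the sum of squared residuals.
        d_intercept = sum(-2 * (yi - (intercept + slope * xi))
                          for xi, yi in zip(x, y))
        d_slope = sum(-2 * xi * (yi - (intercept + slope * xi))
                      for xi, yi in zip(x, y))
        # Step 4: step size = slope of the loss x learning rate.
        step_intercept = d_intercept * learning_rate
        step_slope = d_slope * learning_rate
        # Step 5: new parameter = old parameter - step size.
        intercept -= step_intercept
        slope -= step_slope
        # Stop once both step sizes are 0.001 or less.
        if max(abs(step_intercept), abs(step_slope)) <= 0.001:
            break

    print(intercept, slope)  # roughly 0.14 and 1.96 for this toy data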

Stochastic Gradient Descent

Stochastic gradient descent updates the model’s parameters using the gradient of one training example at a time. It randomly selects a training example, computes the gradient of the cost function for that example, and updates the parameters in the opposite direction. Stochastic gradient descent is computationally efficient and can converge faster than batch gradient descent. However, it can be noisy and may not converge to the global minimum.
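
A minimal sketch of the idea on the same kind of toy linear-regression data, updating the parameters from one randomly chosen example per step; the settings are illustrative:

    import random

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 8.1, 9.8]

    intercept, slope = 0.0, 0.0
    learning_rate = 0.01

    for step in range(1000):
        # Randomly select a single training example.
        i = random.randrange(len(x))
        error = y[i] - (intercept + slope * x[i])
        # Gradient of the squared error for this one example only.
        d_intercept = -2 * error
        d_slope = -2 * x[i] * error
        # Update in the opposite direction of the gradient.
        intercept -= learning_rate * d_intercept
        slope -= learning_rate * d_slope

    print(intercept, slope)  # noisy, but close to the batch solution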

Difference Between Gradient Descent and Stochastic Gradient Descent

Gradient descent uses all the data points to calculate the loss and its derivatives, while stochastic gradient descent uses a single randomly selected example (or, in the mini-batch variant, a small random subset) for each update. Gradient descent can be computationally expensive and slow for large datasets; stochastic gradient descent is cheaper per update but can be noisy and may not converge to the global minimum.

Ridge Regression

Ridge regression builds on ordinary least-squares regression (often fit with gradient descent) by adding a penalty term to the loss function. The penalty shrinks the slope, producing a best-fit line that is less sensitive to the training data and therefore less prone to overfitting.

Ridge loss = sum of squared residuals + λ • slope²
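
A short sketch using scikit-learn's Ridge on made-up data; alpha is scikit-learn's name for the penalty strength λ, and its value here is arbitrary:

    from sklearn.linear_model import Ridge

    X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
    y = [2.1, 3.9, 6.2, 8.1, 9.8]

    model = Ridge(alpha=1.0)  # larger alpha = stronger penalty
    model.fit(X, y)

    # The slope is shrunk toward 0 compared with an unpenalized fit.
    print(model.intercept_, model.coef_)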

Lasso Regression

Lasso regression is a regularization algorithm that helps eliminate irrelevant parameters, performing feature selection while regularizing the model. It adds an L1 penalty term that can shrink the coefficients of insignificant predictor variables all the way to zero.
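
A sketch using scikit-learn's Lasso on synthetic data where only two of five features actually matter; the alpha value is an arbitrary example:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    # Only the first two features matter; the other three are noise.
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

    model = Lasso(alpha=0.1)
    model.fit(X, y)

    # Coefficients of the irrelevant features are driven to exactly 0.
    print(model.coef_)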

Elastic Net Regression

Elastic Net Regression uses the penalties from both Lasso and Ridge techniques to regularize the regression models. It combines the L1 penalty from lasso regression and the L2 penalty from ridge regression.
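
A sketch using scikit-learn's ElasticNet on the same kind of synthetic data; l1_ratio blends the two penalties (1.0 is pure lasso, 0.0 is pure ridge), and both values here are arbitrary:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

    model = ElasticNet(alpha=0.1, l1_ratio=0.5)
    model.fit(X, y)

    # Small coefficients are shrunk; useless ones may reach exactly 0.
    print(model.coef_)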

Advantages:

Lasso will eliminate many features and reduce overfitting in your linear model. Ridge will reduce the impact of features that are not important in predicting your y values. Elastic Net combines feature elimination from Lasso and feature coefficient reduction from the Ridge model to improve your model’s predictions.

Ridge Regression tends to do better when most variables are relevant. Lasso Regression gets rid of useless variables. Elastic Net Regression is good for many variables when we don’t know which ones are relevant.

Step Size for Gradient Descent

It is the amount by which a parameter (for example, the intercept) changes on each update as we search for the optimum. To avoid overshooting or undershooting the minimum, the step size should be larger when far from the minimum and smaller when close to it. Ideally, as the slope of the loss function approaches 0, the step size becomes smaller and smaller.

Step size = slope • learning rate
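
Because the step size is the slope times the learning rate, the steps shrink automatically as the slope flattens near the minimum, even with a fixed learning rate. A small illustration on an arbitrary cost function with its minimum at w = 3:

    def d_cost(w):
        # Slope of the cost function (w - 3)^2.
        return 2 * (w - 3)

    w, learning_rate = 10.0, 0.1
    for i in range(5):
        step = d_cost(w) * learning_rate  # step size = slope x learning rate
        w -= step
        print(f"step size = {step:.4f}, w = {w:.4f}")
    # The step sizes shrink as w approaches the minimum at 3.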

Learning Rate

A higher learning rate can be used to speed up the descent to the minimum. However, a learning rate that is too high can cause the algorithm to overshoot the minimum and miss it. A learning rate that is too low can cause the algorithm to take too long to find the minimum. Often, manual adjustments to the learning rate are needed to find an optimal balance between speed and accuracy.
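
A quick sketch of both failure modes on a toy cost function with its minimum at w = 3; the rates are arbitrary examples:

    def d_cost(w):
        return 2 * (w - 3)  # slope of the cost (w - 3)^2

    for learning_rate in (1.1, 0.0001, 0.1):
        w = 10.0
        for i in range(100):
            w -= learning_rate * d_cost(w)
        print(learning_rate, w)
    # 1.1    -> diverges: every step overshoots the minimum by more
    # 0.0001 -> barely moves: still far from 3 after 100 iterations
    # 0.1    -> converges close to the minimum at 3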

Stemming

Definition: The performance of sentiment algorithms can sometimes be improved if words are reduced to their root. With stemming, words with similar meanings can be conveyed with just one word, and this representative root word can then be weighted more accurately by the algorithm. Stemming is a blunt approach that normalizes words by removing their suffixes.

E.g., connect, connected, connecting, connection, and connections all reduce to the root “connect”.
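
For example, with NLTK's PorterStemmer (one common stemming implementation; other stemmers behave slightly differently):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    words = ["connect", "connected", "connecting", "connection", "connections"]

    # Every variant is reduced to the same root, "connect".
    print([stemmer.stem(w) for w in words])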

Stop Words

Stop words are words that do not have any importance in search queries. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
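
For example, filtering English stop words with NLTK (the sample sentence is made up):

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # one-time download of the stop-word lists
    stop_words = set(stopwords.words("english"))

    text = "the quick brown fox jumps in the park"
    filtered = [w for w in text.split() if w not in stop_words]
    print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'park']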

Sentiment Analysis

Sentiment analysis attempts to understand and summarize the feeling and intent behind large volumes of text. It is contextual mining of text that identifies and extracts subjective information from source material, helping a business understand the social sentiment around its brand, product, or service while monitoring online conversations.
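
As a small illustration, NLTK ships a rule-based sentiment analyzer (VADER) that scores text from negative to positive; this sketch assumes NLTK is installed:

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
    sia = SentimentIntensityAnalyzer()

    # Returns neg/neu/pos scores plus a compound score in [-1, 1],
    # near +1 for positive text and near -1 for negative text.
    print(sia.polarity_scores("I absolutely love this product!"))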

N-Grams

N-grams are contiguous sequences of words, symbols, or tokens in a document; in technical terms, they are neighboring sequences of items in a document. They are generated from groupings of two or more tokens. By counting how often these word groups appear in reviews with a specific rating, we can examine which recurring phrases are associated with high or low ratings of the topic in question, which can give significant insight into what people are really saying about it.
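
A minimal sketch that counts the most frequent bigrams (2-grams) in a made-up review, using only the Python standard library:

    from collections import Counter

    text = "the battery life is great and the battery charges fast"
    tokens = text.split()

    # Bigrams are pairs of neighboring tokens.
    bigrams = list(zip(tokens, tokens[1:]))

    # ('the', 'battery') recurs most often in this toy example.
    print(Counter(bigrams).most_common(3))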