Machine Learning Algorithms: Data Science Essentials
Posted on Mar 12, 2025 in Archaeology
Data & K-Nearest Neighbors (KNN)
- What is data? A collection of items & their characteristics: the features or attributes which describe the items & make them unique.
- Why is data important? We can use data to extract useful info & detect patterns which may not be visible to the naked eye, which can then be used to make predictions or decisions.
- Input vector: list of numbers which describe an item.
- KNN is a supervised learning algorithm, meaning it learns from labeled data.
- KNN finds similarities by comparing new data to already known data using math formulas called distance measures. Different situations call for different distance measures; choosing the right one is like picking the best transportation method for your data. You want the measure that most accurately represents the relationship between the features in your data. KNN then makes decisions based on a majority vote among the K nearest neighbors.
- KNN works by finding the closest points in the dataset to the new data point to be classified. K is the number you pick, which determines how many neighbors to consider. A distance measure determines how close points are to each other; a commonly used one is Euclidean distance, which is like measuring the straight-line distance between 2 points. Once KNN finds the K nearest neighbors, it looks at the class/category the majority of them belong to, called majority voting. If you choose a small K, KNN becomes sensitive to local patterns in the data, leading to overfitting, where the model learns the noise in the data & doesn't respond well to new data. If you choose a large K, KNN will be more stable & less sensitive, but it may underfit the data & fail to capture important patterns. (See the sketch after this list.)
- Pros: Simple, no explicit training, easy, can handle multiple classes
- Cons: can be slow with large datasets because distances must be calculated for each prediction, performance can degrade as the # of features increases because differences between points become less meaningful, can be biased if classes are imbalanced because the majority class overpowers the minority class
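To make the distance-plus-majority-vote idea concrete, here is a minimal KNN sketch in Python. The toy points, labels, choice of K, & helper names are invented for illustration only.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line (Euclidean) distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Distance from the query point to every labeled training point
    distances = sorted(zip((euclidean(x, query) for x in train_X), train_y))
    # Majority vote over the labels of the k closest neighbors
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy dataset: two features per point, two classes
train_X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (4.9, 5.1), k=3))  # prints "B"
```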
Decision Trees
- A tree of decisions which aim to list all possible decisions/outcomes.
- Internal nodes in the tree represent tests on features, ex. Is the outlook sunny?
- Branches represent outcomes of tests: if the outlook is sunny you follow the corresponding branch; if not, you follow the other.
- Leaf nodes represent final decisions, after answering all questions along a path, you arrive at a leaf node which tells you the final decision.
- Key aspects:
- Make predictions by splitting data based on features. For continuous features (such as temp), the tree checks whether the feature is greater or less than a certain threshold. For discrete features (such as outlook), the tree splits the data by possible values.
- Each path from root to leaf defines a region in the input space.
- Built using a greedy approach, starts empty, at each step picks the feature which best reduces loss or uncertainty.
- Information gain is used to choose the best feature to split on: it measures how well an attribute separates the training examples according to the target classification, & the feature with the highest information gain is chosen (see the worked sketch after this list).
- Uses entropy, which measures the impurity of a set of examples. A set with mixed classes has high entropy, while a set w/ only 1 class has low entropy.
- Conditional entropy measures the amount of uncertainty remaining given knowledge of another variable.
- Mutual info is the amount of info gained about 1 variable by knowing another.
- How to build a tree:
- Start w/ entire training dataset
- Pick the feature to split on which provides the most information about the target variable
- Split examples into groups based on feature value
- For each group
- If group empty return majority class from parent node
- If all examples in group belong to same class return class
- Otherwise repeat process from step 2
- What makes a good tree
- Not too small, large enough to capture important distinctions in data
- Not too big, small enough to avoid overfitting, shouldn’t memorize training examples but should generalize to new data
- As simple as possible while still fitting the data well (Occam's razor: favor the simplest explanation)
- Pros: simple, easy to handle discrete features, can deal w missing values & poorly scaled data, fast at test time
- Cons: prone to overfitting esp w noisy data, greedy algos may not find optimal tree, can be unstable where small changes in data lead to large changes in tree
- How to avoid overfitting
- Overfitting happens when tree learns training data too well, including noise, leading to high accuracy on training data but low accuracy on new data
- To avoid
- Stop growing tree when data split not statistically significant
- Grow tree fully & prune back
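Here is a small worked sketch of the entropy & information-gain calculation described above. The class counts are play-tennis-style toy numbers, used purely for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    # Impurity of a set of class labels: 0 when pure, high when mixed
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, groups):
    # Parent entropy minus the weighted entropy of the child groups
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Toy "play tennis" labels split by a hypothetical outlook feature
parent = ["yes"] * 9 + ["no"] * 5
sunny = ["yes"] * 2 + ["no"] * 3
overcast = ["yes"] * 4
rainy = ["yes"] * 3 + ["no"] * 2
print(information_gain(parent, [sunny, overcast, rainy]))  # ≈ 0.247
```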
Ensemble Learning
- Like asking a group of experts for opinions before making a decision. Instead of relying on 1 model, multiple models are combined to make a stronger & more accurate model. Most ensemble methods use a majority vote principle, final prediction based on what most individual models agree on. Also possible to weigh votes, resulting in certain models having more influence.
- Why use? To reduce errors & improve overall performance.
- If a model is too complex, it may result in overfitting, ensemble methods can help reduce by averaging out the predictions of multiple models.
- If a model is too simple, may result in underfitting, ensemble methods can combine weak models to create a strong model w lower bias.
- By combining models, ensemble methods can often result in higher accuracy than a single model.
- Several ways to create
- Train diff types of ML algorithms (ex. SVM, KNN) on the same dataset
- Train on diff datasets
- Bagging – create multiple training datasets by randomly sampling the original set w/ replacement; some data points may be included multiple times in a single set while others are left out (see the sketch after this list)
- Boosting – train models sequentially w each model focusing on examples previous models got wrong
- Randomly select subsets of features to train each model
- Types of ensemble methods
- Bagging (bootstrap aggregating)
- Create multiple datasets by sampling the original dataset with replacement
- Train a separate model on each dataset
- Average the predictions of all the models
- Reduces variance, but not bias
- Boosting
- Train models sequentially, with each model focusing on the examples that the previous models got wrong
- Use a weighted training set to focus on specific examples
- Often uses weak learners (models that perform slightly better than random)
- Reduces bias
- Increases variance
- Random Forests
- Type of bagging using decision trees as base models
- Sample data & randomly select subset of features to consider at each node of tree
- Often works well without much tuning
- Reduces overfitting because the output is based on a majority vote or averaging
- Pros: improved accuracy, less sensitive to noise & outliers, can generalize better to new data
- Cons: can be more complex than single models, can take longer to train, can be more difficult to interpret
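A minimal sketch of the bagging idea (bootstrap sampling plus majority voting), assuming a deliberately simple 1-nearest-neighbor base learner & a made-up 1-D toy dataset:

```python
import random
from collections import Counter

def nearest_label(train, query):
    # Trivial 1-NN base learner: label of the closest training point
    return min(train, key=lambda p: (p[0] - query) ** 2)[1]

def bagging_predict(data, query, n_models=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(data) points *with replacement*
        sample = [rng.choice(data) for _ in range(len(data))]
        votes.append(nearest_label(sample, query))
    # Final prediction is the majority vote across the bagged models
    return Counter(votes).most_common(1)[0][0]

# Toy 1-D dataset: (feature, label)
data = [(0.5, "A"), (1.0, "A"), (1.4, "A"), (3.9, "B"), (4.2, "B"), (4.8, "B")]
print(bagging_predict(data, query=3.5))  # expect "B"
```

A random forest follows the same pattern, except the base learners are decision trees & each split also considers only a random subset of features.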
Linear Regression
- Algorithm that helps us understand the relationship between variables by finding a line which best fits the data & represents the relationship between input features & the target variable.
- Relationship represented using linear equation
- y = wx + b (1 feature), where y is the predicted value, x is the input feature, w is the weight, & b is the bias
- y = w1x1 + w2x2 + … + b (multiple features), where w1 & w2 are the weights for features x1 & x2
- Algorithm learns best values for weights & bias based on training data, which determine slope & y intercept of the line
- Once model trained, input feature values can be plugged in to predict target value
- How does it work
- Algo learns from training set of data points where each point consists of input features & a corresponding target value
- A loss function measures how well the model's predictions match the actual target values; the goal is to minimize the loss. A common loss function is mean squared error (MSE), which calculates the avg squared difference between predicted & actual values
- Algo finds values of w & b that minimize loss function using
- Direct solution where optimal values are directly calculated by setting derivative of loss function to 0 & solving equation
- Iterative optimization, where optimization algos (like gradient descent) iteratively adjust the values of w & b until the loss is minimized (see the sketch after this list)
- To prevent overfitting, a regularizer can be added which will penalize large weights to encourage the model to find a simpler solution
- If the relationship between input features & the target var isn't linear, polynomial feature mapping can be used, which involves transforming the input features into polynomial features (ex. x, x^2, x^3) & then applying linear regression.
- Logistic regression
- A linear model with a logistic (sigmoid) nonlinearity, sometimes described as a log-linear model
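As a concrete illustration of the iterative-optimization route, here is a gradient-descent sketch that fits y = wx + b by minimizing MSE. The learning rate, epoch count, & toy data are arbitrary choices for the example.

```python
def fit_linear(xs, ys, lr=0.01, epochs=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors under the current parameters
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of MSE with respect to w and b
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step downhill on the loss surface
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated roughly from y = 2x + 1
xs = [0, 1, 2, 3, 4, 5]
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.2]
w, b = fit_linear(xs, ys)
print(round(w, 2), round(b, 2))  # ≈ 2.0 and ≈ 1.04
```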
Naive Bayesian
- Simple & commonly used ML algo for classification
- Based on bayes’ rule & called naive because it makes a strong assumption of conditional independence between features
- Bayes’ rule
- Formula that helps update your beliefs based on new evidence
- Expressed as P(A | B) = P(B | A) * P(A) / P(B)
- P(A | B) is the posterior probability of hypothesis A given evidence B
- P(B | A) is the likelihood of evidence B given hypothesis A
- P(A) is the prior probability of hypothesis A
- P(B) is the prior probability of evidence B
- Naive bayes assumes features are conditionally independent of each other given class label. Knowing the value of 1 feature doesn’t tell you anything about the val of another feature assuming you know the class. Often not true in real world scenarios, but naive bayes can perform well in practice.
- How does it work
- Learn P(Y|X): Instead of learning a direct mapping from inputs to outputs, Naive Bayes learns the probability of each class given the input features
- Pick the Most Probable Y: Given an input x = [x1, x2, …, xD], the goal is to find the class yk that maximizes the posterior probability P(Y=yk | x)
- Calculate the Probability: For each class yk, calculate the probability P(Y=yk | x1, x2, …, xd) using Bayes’ rule & the conditional independence assumption
- Output yk with the Highest Probability: Choose the class yk with the highest calculated probability
- Steps
- Calculate prior probability of each class
- Estimate conditional probability of each feature given each class
- Use bayes rule to calculate posterior probability of each class given input features
- Select class w highest posterior probability as predicted class
- Example
- Imagine you want to classify whether someone will play tennis based on weather conditions. The features are outlook, temperature, humidity, & wind. Naive Bayes would calculate the probability of playing tennis (yes) or not playing tennis (no) given the specific weather conditions, then predict the outcome with the higher probability. (A worked sketch follows this list.)
- Pros: often performs well in practice esp w limited data, popular choice for many classification tasks
- Cons: features usually not conditionally independent, can overfit w too little data, can cause problems if feature val never occurs w class in training data resulting in probability of zero
- Can be remedied with a maximum a posteriori (MAP) estimate, i.e. smoothing the counts with a prior
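A tiny sketch of the scoring step for the tennis example: multiply each class's prior by the conditional probabilities of the observed feature values & pick the larger score. The probability estimates below are hypothetical numbers of the kind you would get by counting a small training set.

```python
def naive_bayes_score(prior, likelihoods):
    # Unnormalized posterior: P(class) * product of P(feature_value | class)
    score = prior
    for p in likelihoods:
        score *= p
    return score

# Hypothetical estimates for observed weather = (outlook=sunny, humidity=high, wind=strong):
# P(play=yes) = 9/14, P(sunny|yes) = 2/9, P(high|yes) = 3/9, P(strong|yes) = 3/9, etc.
yes_score = naive_bayes_score(9 / 14, [2 / 9, 3 / 9, 3 / 9])
no_score = naive_bayes_score(5 / 14, [3 / 5, 4 / 5, 3 / 5])

# The class with the larger score wins; with these numbers the prediction is "no"
prediction = "yes" if yes_score > no_score else "no"
print(prediction, round(yes_score, 4), round(no_score, 4))
```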
Evaluation & Validation
- Processes to check how well a model works
- Evaluation: assessing how well model’s predictions match up to actual values to help estimate model’s predictive power on unseen data & compare diff models.
- Validation: evaluating model’s initial performance & adjusting if needed to ensure consistent performance & to detect issues such as over or underfitting
- Important because ensures model works on new data, helps find & prevent overfitting, provides confidence in model’s performance, helps choose best model & improves it, critical for real world applications & decision making
- Evaluation uses metrics to check performance of model, diff metrics used based on whether classification or regression problem.
- Classification metrics – used for problems where goal is to classify data
- Accuracy – proportion of correct predictions
- Precision – proportion of predicted positives that are actually positive
- Recall – proportion of actual positive instances that were correctly predicted
- F1-score – balance between precision & recall
- Regression metrics – used for problems where goal is to predict continuous value
- Mean Absolute Error (MAE) – avg abs difference between predicted & actual values
- Mean Squared Error (MSE) – avg squared difference between predicted & actual values
- Root Mean Squared Error (RMSE) – square root of MSE
- R-squared (R^2) – measures proportion of variance in actual values explained by model
- Key techniques
- Train-Test Split: Divide the dataset into 2 parts: a training set (typically 70-80%) & a test set (20-30%)
- the model is trained on the training set
- the model is evaluated on the test set
- Confusion Matrix: A table summarizing the performance of a classification model, showing true positives, true negatives, false positives, & false negatives (see the metrics sketch after this list)
- Cross-Validation: A technique to assess model performance on different data subsets
- K-Fold Cross-Validation: the data is split into K folds. the model is trained on K-1 folds & tested on the remaining fold. this is repeated K times, & the results are averaged
- Leave-1-Out Cross-Validation (LOOCV): An extreme case of k-fold cross-validation where k equals the # of samples in the dataset
- Stratified Splits: Ensure each subset represents the class proportions in the total dataset, which is useful when a particular class is rare but important
- Model Tuning
- Hyperparameters: Settings in a machine learning model that are not estimated during model fitting (e.g., k in k-nearest neighbors)
- Model Tuning: the process of selecting the best hyperparameter values for a model using cross-validation
- Over & underfitting
- Overfitting
- Signs – high training accuracy, low test accuracy
- Prevention – increase training data, feature selection, regularization, early stopping, ensemble methods
- Underfitting
- Signs – poor performance on training & test data
- Prevention/addressing – Increase model complexity, add relevant features, reduce regularization, train for more iterations
- Advanced metrics
- The Receiver Operating Characteristic (ROC) curve visualizes the trade-off between true positive rate & false positive rate, & the Area Under the Curve (AUC) quantifies overall model performance
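A short sketch computing accuracy, precision, recall, & F1 from the confusion-matrix counts; the actual & predicted label lists are a made-up toy example.

```python
def classification_metrics(actual, predicted, positive=1):
    # Confusion-matrix counts for the chosen positive class
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy labels: tp=3, tn=3, fp=1, fn=1 -> all four metrics equal 0.75
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
print(classification_metrics(actual, predicted))
```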
Bayesian Networks
- Visual way to show how diff things are related & depend on each other using probabilities, combines what is already known about how things are connected w actual data that’s been observed. Like a map that shows how likely diff events are based on what other events have happened.
- Key features
- Uses directed acyclic graph, making it a diagram w arrows that show direction of influence, no loops
- Shows which things depend directly on which other things; if 2 things are (conditionally) independent, there's no direct arrow between them
- Shows chance of diff combinations of events happening
- How it works
- Nodes & edges, arrows/edges show how 1 thing influences another
- Each node has a table that shows likelihood of it being in a certain state given states of things that point to it. Tables use conditional probabilities P(Xi | Parents(Xi)), to show how the “parents” (the things pointing to it) affect it
- Network represents overall chance of multiple things happening together, calculated by multiplying conditional probabilities for each thing given its parents
- Conditional independence means 2 things may be related in general but if you know about a 3rd thing they become independent
- If you know y then x doesn’t give you any new info about z
- D separation is a way to check if 2 things are conditionally independent by looking at the graph, turns question of statistical independence into question of connectivity in graph & helps simplify network by finding important variables.
- How to use
- Figure out chances of something given what you know, involves queries of form P(X | E), where X is what you’re curious about, & E is what you already know
- If you know A is true, what’s the chance that C is true
- Learning a bayesian network
- You either know how the network is structured or you have to learn it
- You can either have all the data or some of it
- You can get an expert to design the network & figure out what causes what
- You can learn the network from data, which would require a lot of data
- Pros: handles uncertainty, modular because it's built from separate pieces (nodes & their probability tables), lets you add your prior knowledge to the data, automatically picks out important features, & can be used to compute probabilities
- Example
- Imagine you have a new burglar alarm. It’s good at detecting burglaries, but sometimes it goes off because of minor earthquakes. You have neighbors, Ali & Veli, who call you when they hear the alarm. Ali sometimes calls when he hears the phone, & Veli sometimes misses the alarm because he likes loud music. With a Bayesian network, you can figure out how likely it is that there was a burglary based on who called you.
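To show how the network turns the alarm story into numbers, here is a sketch that multiplies each node's conditional probability given its parents to get a joint probability. The probability table values are illustrative, not estimated from data.

```python
# Factorization implied by the network:
# P(B, E, A, Ali, Veli) = P(B) * P(E) * P(A | B, E) * P(Ali | A) * P(Veli | A)
p_burglary = 0.001
p_earthquake = 0.002
# P(alarm | burglary, earthquake), keyed by the parents' truth values
p_alarm = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}
p_ali_given_alarm = {True: 0.90, False: 0.05}
p_veli_given_alarm = {True: 0.70, False: 0.01}

def joint(burglary, earthquake, alarm, ali_calls, veli_calls):
    # Multiply each node's probability given its parents
    prob = p_burglary if burglary else 1 - p_burglary
    prob *= p_earthquake if earthquake else 1 - p_earthquake
    pa = p_alarm[(burglary, earthquake)]
    prob *= pa if alarm else 1 - pa
    prob *= p_ali_given_alarm[alarm] if ali_calls else 1 - p_ali_given_alarm[alarm]
    prob *= p_veli_given_alarm[alarm] if veli_calls else 1 - p_veli_given_alarm[alarm]
    return prob

# Chance that both neighbors call, the alarm rings, but there is no burglary or earthquake
print(joint(False, False, True, True, True))  # ≈ 0.00063
```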
Support Vector Machines
- Tool used to categorize things into groups
- Popular because based on solid ideas & work well in practice
- Key ideas
- Main idea is to find a decision boundary that clearly separates diff classes
- Svm uses a linear discriminant function to set decision boundary between classes
- For 2 classes, if g(x) ≥ 0, then label x as positive, otherwise label it as negative
- when decision boundary is linear it’s a hyperplane, orientation determined by normal vector w & location determined by bias w0
- Svm tries to find a hyperplane that separates classes & maximizes margin around decision boundary
- How it works
- Algo looks for best line or hyperplane that separates data points into diff classes
- Aims to max distance between line & closest data pts from each class, this distance is the margin
- Data pts closest to decision boundary & influence its position are called support vectors, they are critical because they define margin & therefore the separating hyperplane
- Svm uses a soft margin which allows some data pts to be misclassified w a penalty making the model more flexible & preventing overfitting
- A dataset is completely separated if all instances of each class fall on the same side of a hyperplane w no overlap
- SVM uses kernel functions to handle non-linear data, which map the original input space into a higher-dimensional space where it's easier to find a linear boundary; examples include polynomial kernels & radial basis function (RBF) kernels (see the sketch after this list)
- To classify data into >2 categories there are several strategies
- Pose the problem as c 2-class problems, where the i'th problem is solved by a linear discriminant that separates points assigned to class yi from those not assigned to yi (one-vs-rest)
- Use c(c-1)/2 linear discriminants, 1 for every pair of classes (one-vs-one)
- Use c linear discriminants, 1 for each class, & assign x to class i if gi(x) > gj(x) for all j ≠ i
- Pros: work well even w/ a lot of features, use only a subset of the training pts (the support vectors) in the decision function which makes them memory efficient, & common kernels are provided but you can also specify custom kernel functions for the decision function
- Cons: you need to choose the right kernel function & regularization parameters which would require a lot of time & experimentation, svms can be computationally intensive esp for large datasets
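A short sketch of a soft-margin SVM with an RBF kernel, assuming scikit-learn is available; the tiny 2-D dataset & the C / gamma settings are only illustrative.

```python
from sklearn.svm import SVC

# Toy 2-D points: class 0 clusters near the origin, class 1 further out
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
y = [0, 0, 0, 1, 1, 1]

# C controls the soft margin: a smaller C tolerates more misclassified training
# points, a larger C tries harder to separate the training data exactly.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print(clf.predict([[0.3, 0.1], [1.8, 2.2]]))  # expect [0, 1]
print(clf.support_vectors_)                   # the points that define the margin
```

Swapping kernel="rbf" for "linear" or "poly" changes how the decision boundary can bend, & tuning C trades off margin width against training misclassifications.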