Machine Learning Fundamentals: Data Types, Algorithms, and Clustering
Data Types
Integer: Whole numbers, without decimals.
Double: Real numbers, i.e., numbers that can have a decimal part (stored as floating-point values).
Character: Data in text format, e.g., “Juan”.
Logical: Values of true (TRUE) or false (FALSE).
Data Structures
- Vector: The fundamental data structure in R. It stores an ordered set of values called elements. A vector can contain any number of elements, but all elements must be of the same type.
- Factors: A special data structure where R recognizes that the stored values represent different categories of a qualitative variable. For example: FEMALE, MALE.
- Lists: An ordered collection whose elements, unlike those of a vector, can be of different types. Lists have limited and specific use cases and are rarely used in this course.
- Data Frames: A typical dataset structure with rows representing individuals and columns representing variables. It can be understood as a list of vectors of the same length. It has two dimensions (rows and columns).
- Arrays: Similar to data frames but can only store values of one type. A two-dimensional array is a matrix; arrays can also have more than two dimensions.
Machine Learning
Uses and Abuses of Machine Learning:
Machines are very limited in their ability to understand problems. They are pure power without control. A machine may be more adept than a human at finding subtle patterns in data, but it still requires a human to drive the analysis and turn the results into action.
Real Success Stories of Machine Learning:
- Customer segmentation for targeted advertising
- Reducing fraudulent credit card transactions
- Development of algorithms for self-piloting drones and vehicles
- Optimizing energy use in homes and offices
- Discovery of genetic sequences associated with diseases
These are the steps to apply a Machine Learning algorithm:
- Data collection
- Data exploration and preparation
- Model training
- Model evaluation
- Model improvement
Types of Algorithms:
Supervised learning: Seeks a relationship between explanatory variables and one (or more) response variables. The data is labeled, and we know what we are looking for. Examples: Classification or numerical prediction.
Unsupervised learning: Aims to find patterns in the data to better understand a certain reality and make better decisions. Examples: Market basket analysis, segmentation.
Train Data and Test Data
To measure a model’s effectiveness, divide data into two groups:
- Train: Used to generate (train) the model. Normally 2/3 of the total sample, chosen at random.
- Test: Used to check the predictive quality of the model. This data has NOT been used to generate the model; it is NEW data. Normally 1/3 of the total sample, chosen at random.
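The random 2/3 and 1/3 split described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `train_test_split`, the fixed `seed`, and the 30 placeholder individuals are all assumptions made for the example.

```python
import random

def train_test_split(rows, train_fraction=2 / 3, seed=42):
    """Randomly split rows into a train part and a test part."""
    rng = random.Random(seed)   # fixed seed (an assumption) so the split is reproducible
    shuffled = list(rows)       # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)       # random order: every row is equally likely to land in train
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))          # 30 hypothetical individuals, identified by number
train, test = train_test_split(data)
print(len(train), len(test))    # 20 10
```

No individual appears in both parts, so the test rows are genuinely new to a model built on the train rows.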
Understanding How the k-NN Algorithm Works
- k-NN (k-nearest neighbors) classifies an unlabeled individual by assigning it to the group most common among the k labeled individuals with the most similar characteristics.
Advantages: Simple and effective; requires no initial assumptions about the underlying distribution; fast training phase, since no model is fitted (classifying new individuals can be slow on large datasets).
Disadvantages: Requires an appropriate choice of k; does not produce a model, which limits the ability to understand how the features relate to the category; non-numeric variables and missing data require additional processing.
Measuring Similarity with Distances
Euclidean distance is the most common choice. It becomes computationally expensive when the number of dimensions is large.
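For two individuals p and q described by n numeric variables, the Euclidean distance is sqrt((p1 - q1)^2 + ... + (pn - qn)^2). A minimal Python sketch follows; the function name `euclidean_distance` is chosen here for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # One squared difference per dimension, so the cost grows with the dimension count.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```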
Choosing the Value of k Appropriately:
- The decision about the value of k will determine how well the algorithm will generalize to future data.
- The goal is the right balance between overfitting and underfitting. Large values of k tend to overrepresent the most popular category; small values of k give too much weight to “rare” (possibly noisy) individuals.
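The effect of k can be seen in a small Python sketch of k-NN with majority voting. The function name `knn_predict` and the five labeled points are hypothetical, made up for this illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(labeled, new_point, k):
    """labeled: list of (features, label) pairs. Returns the majority label among the k nearest."""
    # Sort the labeled individuals by Euclidean distance to the new individual.
    by_distance = sorted(labeled, key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled data: two "A" individuals and three "B" individuals.
labeled = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
           ((5.0, 5.0), "B"), ((5.2, 4.8), "B"), ((4.8, 5.1), "B")]
new_point = (1.1, 1.0)   # clearly sits next to the "A" individuals

for k in (1, 3, 5):
    print(k, knn_predict(labeled, new_point, k))
# 1 A
# 3 A
# 5 B
```

With k = 5 every labeled individual votes, so the more numerous category "B" wins even though the new point lies next to the "A" individuals.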
Evaluating the Classification (TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives):
Precision: TP / (TP + FP). Measures the proportion of true positives among all the positives detected by the model.
Accuracy: (TP + TN) / (TP + TN + FP + FN). Measures the proportion of correct answers (well-classified observations) of the model.
Specificity: TN / (TN + FP). Measures the proportion of true negatives among all the real negatives.
Sensitivity or recall: TP / (TP + FN). Measures the proportion of true positives among all the real positives.
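The four formulas can be computed directly from the confusion-matrix counts. The Python sketch below uses hypothetical counts (40, 45, 5, 10) chosen only to make the arithmetic easy to follow; the function name is likewise an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four measures from confusion-matrix counts.

    Assumes every denominator is non-zero; no guard against division by zero.
    """
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
    }

# Hypothetical test set of 100 individuals: 50 real positives and 50 real negatives.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here precision is 40/45, accuracy is 85/100, specificity is 45/50, and sensitivity is 40/50.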
Cluster Analysis: Attempts to summarize the information contained in the rows of a dataset. The summary consists of grouping individuals based on their similarities (again measured with distances).
Introduction to the Algorithm:
K-Means is an unsupervised learning algorithm. This means that the data is not labeled, so we do not know in advance what we are looking for (unlike k-NN). Its purpose is not prediction or classification; it is learning and segmentation.
Advantages: Uses simple principles that can be explained without too much statistical terminology. Highly flexible, behaves well in many real-world cases.
Disadvantages: Not as sophisticated as more recent clustering algorithms. Requires a reasonable estimate of the number of clusters that exist. Not the most suitable for non-spherical clusters.
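The notes do not spell out the steps of K-Means, so the Python sketch below assumes the standard iteration (often called Lloyd's algorithm): assign each individual to its nearest center, move each center to the mean of its cluster, and repeat until nothing changes. The function name, the six points, and the starting centers are hypothetical; in practice the starting centers are usually picked at random.

```python
import math

def k_means(points, centers, max_iter=100):
    """Standard K-Means iteration from given starting centers. Returns (labels, centers)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of the points assigned to it.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # an empty cluster keeps its old center
        if new_centers == centers:              # nothing moved: the algorithm has converged
            break
        centers = new_centers
    return labels, centers

# Hypothetical data with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centers = k_means(points, centers=[(1, 1), (8, 8)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The number of clusters is fixed by how many starting centers are supplied, which is why a reasonable estimate of it is needed beforehand.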
For Marketing People: The main (and sometimes almost only) objective of a cluster analysis lies in how usable the segments found are, that is, in the possibility of describing each of them in a unique and precise way.
For Statisticians: The most important thing is the variability. Once we have the final result, we can split the total variance of the data into two parts: BETWEEN (between clusters) and INTRA (within clusters). We want to maximize the BETWEEN variability and minimize the INTRA variability. A split of about 70% BETWEEN to 30% INTRA is generally considered acceptable, although it depends on the case.
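The split into BETWEEN and INTRA can be sketched in Python for a given cluster assignment. The sketch works with sums of squares rather than variances; dividing both parts by the same number of individuals would not change their shares. The function name, the points, and the labels are hypothetical.

```python
def sum_of_squares_split(points, labels):
    """Return (between, within) sums of squares for a given cluster assignment."""
    def mean(rows):
        return tuple(sum(col) / len(rows) for col in zip(*rows))

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    overall = mean(points)                   # center of the whole dataset
    between = within = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        center = mean(members)
        # BETWEEN: how far each cluster center lies from the overall center, weighted by size.
        between += len(members) * sq_dist(center, overall)
        # INTRA: how far each individual lies from its own cluster center.
        within += sum(sq_dist(p, center) for p in members)
    return between, within

# Hypothetical data: two tight, well-separated clusters.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]
between, within = sum_of_squares_split(points, labels)
print(f"BETWEEN share: {between / (between + within):.1%}")
```

For these points BETWEEN is 147 and INTRA is about 2.67, so almost all of the variability lies between the clusters, far above the 70% rule of thumb.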
Choosing the Right Number of Clusters
There is no definitive answer. However, consider these aspects:
- Ideally, individuals within a cluster should be very similar to each other (low WITHIN variance) and very different from those in other clusters (high BETWEEN variance). The greater the BETWEEN variance, the better.
- Elbow Technique: Stop increasing K when adding a new cluster no longer adds a substantial amount of BETWEEN variance. This is a visual technique: plot the BETWEEN variance against K and look for the point where the curve bends and flattens.