Machine Learning Fundamentals: Data Types, Algorithms, and Clustering

Data Types

Integer: Whole numbers, without decimals.
Double: Rational numbers, i.e., with decimals.
Character: Data in text format (e.g., “Juan”).
Logical: Values of true (TRUE) or false (FALSE).

Data Structures

  1. Vector: The fundamental data structure in R. It stores an ordered set of values called elements. A vector can contain any number of elements, but all elements must be of the same type.
  2. Factors: A special data structure where R recognizes that the stored values represent different categories of a qualitative variable. For example: FEMALE, MALE.
  3. Lists: A data structure with limited and specific use cases, rarely used in this course.
  4. Data Frames: A typical database structure with rows representing individuals and columns representing variables. It can be understood as a list of vectors with the same length. It has two dimensions (rows and columns).
  5. Arrays: Similar to data frames, but they can only store values of a single type (the two-dimensional case is the matrix).
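The structures above can be illustrated in a few lines of R; the values used here are invented for illustration:

```r
# Vector: ordered elements, all of the same type
ages <- c(25, 31, 42)

# Factor: categories of a qualitative variable
sex <- factor(c("FEMALE", "MALE", "FEMALE"))
levels(sex)          # the distinct categories: "FEMALE" "MALE"

# List: can mix types and lengths
info <- list(name = "Juan", scores = c(8, 9))

# Data frame: a list of vectors with the same length (rows x columns)
df <- data.frame(age = ages, sex = sex)
dim(df)              # 3 rows, 2 columns

# Matrix (two-dimensional array): only one type of value
m <- matrix(1:6, nrow = 2)
```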

Machine Learning

Uses and Abuses of Machine Learning:
Machines are very limited in their ability to understand problems. They are pure power without control. A machine may be more adept than a human at finding subtle patterns in data, but it still requires a human to drive the analysis and turn results into action.

Real Success Stories of Machine Learning:

  • Customer segmentation for targeted advertising
  • Reducing fraudulent credit card transactions
  • Development of algorithms for self-piloting drones and vehicles
  • Optimizing energy use in homes and offices
  • Discovery of genetic sequences associated with diseases

These are the steps to execute a Machine Learning algorithm:

  1. Data collection
  2. Data exploration and preparation
  3. Model training
  4. Model evaluation
  5. Model improvement

Types of Algorithms:
Supervised learning: Seeks a relationship between explanatory variables and one (or more) response variables. The data is labeled, and we know what we are looking for. Examples: Classification or numerical prediction.
Unsupervised learning: Aims to find patterns in the data to better understand a certain reality and make better decisions. Examples: Market basket analysis, segmentation.

Train Data and Test Data
To measure a model’s effectiveness, divide the data into two groups:

  • Train: Used to generate (train) the model. Normally 2/3 of the total sample, chosen at random.
  • Test: Used to check the predictive quality of the model. This data has NOT been used to generate the model; it is NEW data. Normally 1/3 of the total sample, chosen at random.
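A minimal sketch of this split in R, using the built-in iris data set as a stand-in for any data frame (the 2/3 and 1/3 proportions follow the note above):

```r
set.seed(123)                              # reproducible random choice
df <- iris                                 # built-in example data set
n  <- nrow(df)

train_idx <- sample(n, size = round(2 * n / 3))
train <- df[train_idx, ]                   # ~2/3 of the rows: used to fit the model
test  <- df[-train_idx, ]                  # remaining ~1/3: NEW data for evaluation
```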

Understanding How the k-NN Algorithm Works

  • k-Nearest Neighbors (k-NN) classifies unlabeled individuals by assigning each one to the group whose already labeled individuals have similar characteristics: the k nearest neighbors determine the label.

Advantages: Simple and effective, requires no initial assumptions about the underlying distribution, fast execution.

Disadvantages: Requires an appropriate choice of k; does not produce a model, which limits the ability to understand how features relate to the category; non-numeric variables and missing data require additional processing.
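As a sketch, k-NN on the built-in iris data using the `class` package (shipped with standard R installations); the 100-row training split and k = 5 are illustrative choices, not prescribed by the notes:

```r
library(class)                           # provides knn()

set.seed(1)
idx   <- sample(nrow(iris), 100)         # ~2/3 of the 150 rows for training
train <- iris[idx, 1:4]                  # numeric features only
test  <- iris[-idx, 1:4]

# Predict the species of each test row from its k = 5 nearest neighbors
pred <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])         # proportion correctly classified
```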

Measuring Similarity with Distances
Euclidean distance is the most common choice (computationally expensive if the number of dimensions is large): d(x, y) = sqrt((x1 − y1)² + … + (xn − yn)²).
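In R, dist() computes the Euclidean distance by default; the two points below are invented:

```r
x <- c(1, 2, 3)
y <- c(4, 6, 3)

sqrt(sum((x - y)^2))        # Euclidean distance "by hand": 5
dist(rbind(x, y))           # same result with the built-in function
```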

Choosing the Value of k Appropriately:

  • The choice of the value of k determines how well the algorithm will generalize to future data.
  • It is a balance between overfitting and underfitting: large k’s tend to overrepresent the most popular category, while small k’s tend to overrepresent the impact of “rare” individuals.



In a confusion matrix, TP = true positives, TN = true negatives, FP = false positives, FN = false negatives:

Precision: TP / (TP + FP). Measures the proportion of true positives among all positives detected by the model.
Accuracy: (TP + TN) / (TP + TN + FP + FN). Measures the proportion of correct answers (well-classified observations) of the model.
Specificity: TN / (TN + FP). Measures the proportion of true negatives among all the real negatives.
Sensitivity or recall: TP / (TP + FN). Measures the proportion of true positives among all the real positives.
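The four metrics can be computed directly from the counts of a 2×2 confusion matrix; the counts below are invented for illustration:

```r
# Invented counts from a 2x2 confusion matrix
TP <- 40; TN <- 45; FP <- 5; FN <- 10

precision   <- TP / (TP + FP)                   # 0.889: detected positives that are real
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.85:  overall correct answers
specificity <- TN / (TN + FP)                   # 0.9:   real negatives correctly rejected
sensitivity <- TP / (TP + FN)                   # 0.8:   real positives correctly detected
```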

Cluster Analysis: Attempts to summarize the information contained in the rows of a database. This summary consists of grouping individuals based on their similarities (again measured with distances).

Introduction to the Algorithm:
K-Means is an unsupervised learning algorithm. This means that we do not know in advance what we are looking for (unlike with k-NN). Its purpose is not prediction or classification, but exploration and segmentation.

Advantages: Uses simple principles that can be explained without too much statistical terminology. Highly flexible, behaves well in many real-world cases.

Disadvantages: Not as sophisticated as other clustering methods. Requires a reasonable estimate, in advance, of the number of clusters that exist. Not the most suitable for non-spherical clusters.

For Marketing People: The main (and sometimes almost only) objective of a cluster analysis lies in the convenience of the segments found, that is, in the possibility of defining them in a unique and precise way.
For Statisticians: The most important thing is the variability. Once we have the final result, we can divide the total variance of the data into two: BETWEEN and INTRA. We want to maximize BETWEEN variability and minimize INTRA variability. A 70-30 balance between the two is generally considered acceptable, although it depends on the case.
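R’s kmeans() reports exactly this variance decomposition. A sketch on the numeric columns of iris; the choice of k = 3 is illustrative:

```r
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

km$betweenss / km$totss      # share of BETWEEN variance (want it high)
km$tot.withinss / km$totss   # share of INTRA variance (want it low)
```

With the 70-30 guideline above, a BETWEEN share around 0.7 or higher would be considered acceptable.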

Choosing the Right Number of Clusters (II)
There is no definitive answer. However, consider these aspects:

  • Ideally, individuals within a cluster should be very similar to each other (low INTRA variance) and very different from individuals in other clusters (high BETWEEN variance). The greater the BETWEEN variance, the better.
  • Elbow Technique: Stop increasing K when adding a new cluster does not add a substantial amount of BETWEEN variance. A visual technique.
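A sketch of the elbow technique in R: run kmeans() for increasing K, plot the total INTRA (within-cluster) variance, and stop at the bend in the curve. The range 1 to 10 is an illustrative choice:

```r
data <- iris[, 1:4]
set.seed(1)

# Total INTRA variance for each candidate K
wss <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "K (number of clusters)",
     ylab = "Total INTRA (within-cluster) variance")
# The "elbow" is the K after which adding a cluster barely reduces INTRA variance
```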

