R Programming and Machine Learning Essentials
What is R?
R is a free, open-source programming language and software environment for statistical computing and graphics, widely used by academics and professionals in statistics and data science.
Data Types in R
R stores data using different structures and types. Let’s start with data types:
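- Integer: Whole numbers (e.g., 3L, 5L, 7L). The L suffix requests an integer; a bare 3 is stored as a double.
- Double: Real numbers stored in floating point (e.g., 8.2, 3.55). This is R's default numeric type.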
- Character: Text data (e.g., “Juan”, “Laura”).
- Logical: Boolean values (TRUE or FALSE).
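As a quick check of the types listed above, the following sketch uses typeof() to inspect a few values. The variable names are illustrative only and do not come from the course material.

```r
# Illustrative values; variable names are assumptions for this sketch.
whole   <- 3L       # integer
decimal <- 8.2      # double
name    <- "Juan"   # character (text)
flag    <- TRUE     # logical

typeof(whole)    # "integer"
typeof(decimal)  # "double"
typeof(name)     # "character"
typeof(flag)     # "logical"
typeof(3)        # "double"
```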
Data Structures in R
R uses various data structures to organize data:
- Vectors
- Factors
- Lists
- Data Frames
- Matrices
1) Vector
The fundamental data structure in R, storing an ordered set of elements of the same type.
- A vector can contain any number of elements.
- An empty vector has length zero (e.g., numeric(0)); calling c() with no arguments returns NULL.
- Missing values are coded as NA.
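The points above can be tried out with a short sketch. The ages are made-up values chosen for illustration.

```r
# Made-up data: four ages, one of them missing.
ages <- c(23, 35, NA, 41)

length(ages)               # 4
is.na(ages)                # FALSE FALSE  TRUE FALSE
mean(ages)                 # NA, because one value is missing
mean(ages, na.rm = TRUE)   # 33, computed after dropping the NA

# Mixing types in one vector forces a common type (here: character).
mixed <- c(1, "a", TRUE)
typeof(mixed)              # "character"

is.null(c())               # TRUE
length(numeric(0))         # 0
```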
2) Factors
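A special structure for storing categorical data. Each value is stored as an integer code that points to a text label (a level), which can reduce memory usage when labels repeat and which some algorithms require.
- The set of levels can be declared explicitly, so it can include categories that do not appear in the data.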
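A minimal sketch of a factor with a declared level that is absent from the data. The blood-type values are invented for illustration.

```r
# Invented data: "AB" is a valid category but nobody in the sample has it.
blood <- factor(c("A", "O", "A", "B"),
                levels = c("A", "B", "AB", "O"))

levels(blood)        # "A"  "B"  "AB" "O"
as.integer(blood)    # 1 4 1 2  (the integer codes behind the labels)
table(blood)         # counts per level; AB appears with a count of 0
```

Declaring the levels up front keeps the empty category visible in tables and plots instead of silently dropping it.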
3) Lists
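An ordered collection of elements that need not be of the same type; each element can be a vector, another list, or any other R object. Lists are useful for bundling related but heterogeneous pieces of information.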
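For illustration, a small list that mixes several types. The field names and values are assumptions made for this sketch.

```r
# Hypothetical record: each element has a different type or length.
person <- list(
  name     = "Laura",
  age      = 31L,
  scores   = c(8.2, 3.55),
  enrolled = TRUE
)

person$name             # "Laura"
person[["scores"]][2]   # 3.55
length(person)          # 4 elements
str(person)             # compact overview of the list's structure
```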
4) Data Frames
The most useful structure for this course, similar to a database table or spreadsheet, with rows (individuals) and columns (variables).
- Data frames are lists of vectors with equal length.
- Rows and columns can be named.
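A small sketch of building and querying a data frame. The students and their values are made up for illustration.

```r
# Made-up data: three individuals (rows) and three variables (columns).
students <- data.frame(
  name   = c("Juan", "Laura", "Ana"),
  age    = c(23L, 31L, 27L),
  passed = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
rownames(students) <- c("s1", "s2", "s3")

dim(students)             # 3 3
students$age              # 23 31 27  (one column, as a vector)
students["s2", "name"]    # "Laura"   (row by name, column by name)
is.list(students)         # TRUE: a data frame is a list of columns
```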
5) Matrices
Similar to data frames in that they have rows and columns, but all values share a single type; typically used for mathematical operations.
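A brief sketch of a numeric matrix and two common operations on it. The values are arbitrary.

```r
# matrix() fills column by column, so m is:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

dim(m)          # 2 3
t(m)            # transpose: 3 rows, 2 columns
p <- m %*% t(m) # matrix product, a 2 x 2 matrix
p
```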
Introduction to Machine Learning
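A machine learning algorithm learns patterns from the data it is given and uses them to make predictions or to describe structure in the data. Each algorithm is designed for a particular kind of problem, so the right choice depends on the task and on the data available.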
Machine Learning in Practice
Steps to execute a machine learning algorithm:
- Data collection
- Data exploration and preparation
- Model training
- Model evaluation
- Model improvement
Types of Algorithms
- Supervised Learning: Uses labeled data to learn the relationship between predictor variables and a known outcome (e.g., classification, numeric prediction).
- Unsupervised Learning: Finds patterns in unlabeled data to describe its structure (e.g., market basket analysis, segmentation).
Train and Test Data
Data is split into training and testing sets to measure how well the model performs on data it has not seen:
- Train: Used to generate the model (typically 2/3 of the data).
- Test: Used to check the model’s predictive quality (typically 1/3 of the data).
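One way to make such a split is to sample row indices at random. The sketch below uses R's built-in iris data set as a stand-in, which is an assumption of this example and not a data set named in these notes.

```r
# Fix the random seed so the split is reproducible.
set.seed(42)

n         <- nrow(iris)                                  # 150 rows
train_idx <- sample(seq_len(n), size = round(2/3 * n))   # random 2/3 of row numbers

train <- iris[train_idx, ]    # rows used to fit the model
test  <- iris[-train_idx, ]   # remaining rows, held back for evaluation

nrow(train)   # 100
nrow(test)    # 50
```

Sampling at random, instead of taking the first two thirds of the rows, avoids a biased split when the data are sorted.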
Understanding the k-NN Algorithm
The k-NN (k-nearest neighbors) algorithm classifies an unlabeled individual by assigning it the most common label among the k labeled individuals most similar to it.
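To make the idea concrete, here is a minimal base-R sketch of k-NN using Euclidean distance and a majority vote. The function knn_predict() and the toy points are hypothetical, written only for this illustration; packaged implementations exist and would normally be preferred.

```r
# Hypothetical helper: classify one new point by majority vote of its
# k nearest labeled neighbors (Euclidean distance).
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  diffs   <- sweep(as.matrix(train_x), 2, new_x)   # subtract new_x from every row
  d       <- sqrt(rowSums(diffs^2))                # distance to each labeled point
  nearest <- order(d)[seq_len(k)]                  # indices of the k closest points
  votes   <- table(train_y[nearest])               # count labels among them
  names(votes)[which.max(votes)]                   # most frequent label
}

# Toy data: two well-separated groups.
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1),
                 c(6, 6), c(7, 6), c(6, 7))
train_y <- c("red", "red", "red", "blue", "blue", "blue")

knn_predict(train_x, train_y, c(1.5, 1.5), k = 3)   # "red"
knn_predict(train_x, train_y, c(6.5, 6.5), k = 3)   # "blue"
```

In practice the variables are usually rescaled first, because a variable measured on a large scale would otherwise dominate the distance.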
Measuring Similarities
- Euclidean distance
- Manhattan distance
- Minkowski distance
- Mahalanobis distance
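The four distances above can be computed in base R with dist() and mahalanobis(). The two points and the covariance matrix S are arbitrary values chosen for this sketch; note that mahalanobis() returns the squared distance.

```r
a   <- c(1, 2)
b   <- c(4, 6)
pts <- rbind(a, b)

d_euc <- as.numeric(dist(pts, method = "euclidean"))         # sqrt(3^2 + 4^2) = 5
d_man <- as.numeric(dist(pts, method = "manhattan"))         # 3 + 4 = 7
d_min <- as.numeric(dist(pts, method = "minkowski", p = 3))  # (3^3 + 4^3)^(1/3), about 4.50

# Assumed covariance matrix: variance 4 for the first variable, 1 for the second.
S      <- diag(c(4, 1))
d2_mah <- mahalanobis(a, center = b, cov = S)   # squared distance: 9/4 + 16/1 = 18.25
d_mah  <- sqrt(d2_mah)
```

Minkowski distance with p = 2 reduces to Euclidean distance and with p = 1 to Manhattan distance; Mahalanobis distance additionally accounts for the variances and correlations of the variables.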
Cluster Analysis
Summarizes information by grouping individuals based on similarities.
K-Means
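An unsupervised learning algorithm for segmentation: it assigns each individual to one of k clusters, where each cluster is represented by the mean (centroid) of its members.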
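A minimal sketch using kmeans() from base R. The iris measurements, the choice of 3 clusters, and nstart = 25 are assumptions made for illustration only.

```r
set.seed(1)                      # k-means starts from random centroids

x   <- scale(iris[, 1:4])        # standardize so no variable dominates the distance
fit <- kmeans(x, centers = 3, nstart = 25)   # 25 random starts, best one kept

table(fit$cluster)     # number of individuals in each cluster
fit$centers            # centroid of each cluster (in standardized units)
fit$tot.withinss       # within-cluster variability (sum of squares)
fit$betweenss          # between-cluster variability (sum of squares)
```

The total sum of squares (fit$totss) equals fit$tot.withinss plus fit$betweenss, so lowering within-cluster variability and raising between-cluster variability are two views of the same goal.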
Choosing the Right Number of Clusters
- For Marketing: Segments should be practical to act on and easy to describe.
- For Statisticians: Maximize between-cluster variability and minimize within-cluster variability.
- Both perspectives should be weighed when making the final decision.
Choosing Clusters (II)
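- Clusters should be homogeneous internally and clearly different from one another.
- Elbow Technique: A visual method that plots within-cluster variability against the number of clusters and looks for the point (the "elbow") after which adding clusters yields only small reductions.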
- Clusters must be useful and definable.
- Very small clusters are worth keeping only if they differ markedly from the rest.
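The elbow technique mentioned above can be sketched as follows. The iris data, the range of 1 to 8 clusters, and nstart = 25 are assumptions chosen for illustration.

```r
set.seed(1)

x  <- scale(iris[, 1:4])
ks <- 1:8   # candidate numbers of clusters

# Total within-cluster sum of squares for each candidate k.
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Look for the k after which the curve flattens out.
plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

The elbow is read off the plot by eye, so it is a guide and not a rule; checking cluster sizes with table() on the chosen solution helps spot clusters that are too small to be useful.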