R Programming and Machine Learning Essentials

Posted on Nov 21, 2024 in Mathematics

What is R?

R is a free, open-source programming language and software environment for statistical computing and graphics, widely used by academics and professionals in statistics and data science.

Data Types in R

R stores data using different structures and types. Let’s start with data types:

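Integer: Whole numbers, written in code with an L suffix (e.g., 3L, 5L, 7L); a bare 3 is stored as a double.
Double: Real numbers stored in floating point, R's default numeric type (e.g., 8.2, 3.55).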
Character: Text data (e.g., “Juan”, “Laura”).
Logical: Boolean values (TRUE or FALSE).
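
As a quick check of the four types above, here is a minimal sketch using typeof(), which reports how R stores a value. The variable names are only illustrative.

```r
x_int <- 3L      # the L suffix requests an integer
x_dbl <- 8.2     # numeric literals are doubles by default
x_chr <- "Juan"
x_lgl <- TRUE

typeof(x_int)    # "integer"
typeof(x_dbl)    # "double"
typeof(x_chr)    # "character"
typeof(x_lgl)    # "logical"
typeof(3)        # "double": a whole number written without L is still a double
```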

Data Structures in R

R uses various data structures to organize data:

Vectors
Factors
Lists
Data Frames
Matrices

1) Vectors

The vector is the fundamental data structure in R: an ordered set of elements that all share the same type.

A vector can contain any number of elements.
An empty vector has length 0 (e.g., numeric(0)); calling c() with no arguments returns NULL.
Missing values are coded as NA.
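
The points above can be tried out in a short sketch. The values in ages are made up for illustration.

```r
ages <- c(23, 35, NA, 41)    # NA marks a missing value
length(ages)                 # 4
is.na(ages)                  # FALSE FALSE  TRUE FALSE
mean(ages)                   # NA, because one element is missing
mean(ages, na.rm = TRUE)     # 33

mixed <- c(1, "a", TRUE)     # mixing types coerces every element to character
typeof(mixed)                # "character"

is.null(c())                 # TRUE: c() with no arguments returns NULL
length(numeric(0))           # 0: a typed empty vector
```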

2) Factors

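A factor is a special structure for storing categorical data. Each value is stored as an integer code that points to a label (level), which can reduce memory usage, and many modeling functions rely on factors to treat a variable as categorical.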

The set of levels can include categories that do not appear in the data, so absent categories are still recognized.
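
A minimal sketch of a factor with a level that never occurs in the data. The sizes vector and the "xlarge" category are hypothetical.

```r
sizes <- c("small", "large", "small", "medium")

# "xlarge" is declared as a level even though no observation has it
f <- factor(sizes, levels = c("small", "medium", "large", "xlarge"))

levels(f)            # "small" "medium" "large" "xlarge"
as.integer(f)        # 1 3 1 2: the integer codes stored internally
counts <- table(f)   # xlarge is listed with a count of 0
counts
```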

3) Lists

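A list is an ordered collection of elements that need not share a type; an element can be a vector, a data frame, or even another list. Lists are useful for bundling related but heterogeneous objects, such as the output of a model.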
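
As an illustration, the sketch below builds a small list whose elements have different types. The person object and its fields are made up.

```r
person <- list(
  name   = "Laura",
  scores = c(8.2, 3.55),
  active = TRUE
)

person$name              # "Laura"
person[["scores"]][2]    # 3.55
length(person)           # 3
types <- sapply(person, typeof)
types                    # name: "character", scores: "double", active: "logical"
```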

4) Data Frames

The data frame is the most useful structure for this course. It resembles a database table, with rows (individuals) and columns (variables).

A data frame is a list of vectors of equal length; each column can have a different type.
Rows and columns can be named.
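
A minimal sketch of a data frame with named rows and columns. All values are invented for illustration.

```r
df <- data.frame(
  name   = c("Juan", "Laura", "Ana"),
  age    = c(31, 27, 44),
  smoker = c(FALSE, TRUE, FALSE),
  row.names = c("id1", "id2", "id3")
)

dim(df)             # 3 3: three individuals, three variables
names(df)           # "name" "age" "smoker"
df["id2", "age"]    # 27: select by row name and column name
mean(df$age)        # 34
is.list(df)         # TRUE: a data frame is a list of columns
```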

5) Matrices

A matrix is similar to a data frame but stores values of a single type in two dimensions; it is typically used for mathematical operations.
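
For example, the sketch below creates a small matrix and applies two common operations, the transpose and the matrix product.

```r
m <- matrix(1:6, nrow = 2)   # filled column by column
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

dim(m)             # 2 3
t(m)               # transpose, a 3 x 2 matrix
p <- m %*% t(m)    # matrix product, a 2 x 2 matrix
p
#      [,1] [,2]
# [1,]   35   44
# [2,]   44   56
```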

Introduction to Machine Learning

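Machine learning algorithms learn patterns from data and use them to make predictions or uncover structure. Each algorithm is designed for a particular kind of problem, so the right choice depends on the data available and the question being asked.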

Machine Learning in Practice

Steps to execute a machine learning algorithm:

Data collection
Data exploration and preparation
Model training
Model evaluation
Model improvement

Types of Algorithms

Supervised Learning: Uses labeled data to learn the relationship between input variables and a known outcome (e.g., classification, regression).
Unsupervised Learning: Finds patterns in unlabeled data to describe its underlying structure (e.g., market basket analysis, segmentation).

Train and Test Data

The data is split into training and test sets so the model can be evaluated on observations it did not see during training:

Train: Used to generate the model (typically 2/3 of the data).
Test: Used to check the model’s predictive quality (typically 1/3 of the data).
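
A minimal sketch of a random 2/3 and 1/3 split in base R, assuming the built-in iris data set (150 rows) as stand-in data. The seed value is arbitrary.

```r
set.seed(42)                      # arbitrary seed, so the split is reproducible
n <- nrow(iris)                   # 150 rows

# Draw 2/3 of the row numbers at random for training
train_idx <- sample(n, size = round(2/3 * n))

train <- iris[train_idx, ]        # used to fit the model
test  <- iris[-train_idx, ]       # held back to check predictions

nrow(train)   # 100
nrow(test)    # 50
```

Sampling the rows at random matters here: iris is sorted by species, so taking the first 100 rows would leave one species almost entirely out of the training set.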

Understanding the k-NN Algorithm

The k-NN (k-nearest neighbors) algorithm classifies an unlabeled individual by finding the k labeled individuals most similar to it and assigning the most common class among them.
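
To make the idea concrete, here is a minimal base R sketch that classifies one new point by majority vote among its k nearest labeled neighbors, using Euclidean distance. The helper knn_predict and the toy data are hypothetical and written only for illustration; in practice a tested implementation such as knn() from the class package would normally be used.

```r
# Hypothetical helper: predict the class of one new observation
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  diffs <- sweep(train_x, 2, as.numeric(new_x))  # subtract new_x from every row
  d <- sqrt(rowSums(diffs^2))                    # Euclidean distance to each row
  nearest <- order(d)[seq_len(k)]                # indices of the k closest rows
  votes <- table(train_y[nearest])               # count the labels among them
  names(votes)[which.max(votes)]                 # most common label (first on ties)
}

# Toy labeled data: two well-separated groups
train_x <- data.frame(x = c(1, 1.2, 0.8, 5, 5.1, 4.9),
                      y = c(1, 0.9, 1.1, 5, 4.8, 5.2))
train_y <- c("A", "A", "A", "B", "B", "B")

knn_predict(train_x, train_y, c(1.1, 1.0), k = 3)   # "A"
knn_predict(train_x, train_y, c(4.8, 5.0), k = 3)   # "B"
```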

Measuring Similarities

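Euclidean distance: Straight-line distance between two points.
Manhattan distance: Sum of the absolute differences along each variable.
Minkowski distance: A generalization that includes Manhattan (p = 1) and Euclidean (p = 2) as special cases.
Mahalanobis distance: Takes into account the scale of the variables and the correlation between them.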
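
These distances are available in base R through dist() and mahalanobis(). The sketch below compares them on two made-up points. For the Mahalanobis case an identity covariance matrix is assumed purely for illustration, which makes it coincide with the Euclidean distance; with real data the covariance would be estimated, for example with cov().

```r
a <- c(1, 2)
b <- c(4, 6)
pts <- rbind(a, b)

d_euc <- as.numeric(dist(pts, method = "euclidean"))         # sqrt(3^2 + 4^2) = 5
d_man <- as.numeric(dist(pts, method = "manhattan"))         # 3 + 4 = 7
d_min <- as.numeric(dist(pts, method = "minkowski", p = 3))  # (3^3 + 4^3)^(1/3), about 4.498

# mahalanobis() returns the squared distance, so take the square root
d_mah <- sqrt(mahalanobis(a, center = b, cov = diag(2)))     # 5 with identity covariance
```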

Cluster Analysis

Cluster analysis summarizes information by grouping individuals based on their similarities.

K-Means

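K-means is an unsupervised learning algorithm used for segmentation: it partitions individuals into k clusters by assigning each one to the nearest cluster center (mean).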
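
A minimal sketch with the kmeans() function from base R, assuming the numeric columns of the built-in iris data as example input. The choice of 3 clusters, the seed, and nstart = 20 are arbitrary illustration values.

```r
set.seed(1)                    # k-means starts from random centers
x <- iris[, 1:4]               # numeric columns only; the species label is not used

# nstart = 20 runs the algorithm from 20 random starts and keeps the best result
fit <- kmeans(x, centers = 3, nstart = 20)

fit$size                       # number of individuals in each cluster
fit$centers                    # mean of each variable within each cluster
head(fit$cluster)              # cluster assigned to the first few individuals

# Compare the clusters found with the known species, for interpretation only
table(fit$cluster, iris$Species)
```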

Choosing the Right Number of Clusters

For Marketing: Focus on whether the segments are practical to act on and easy to define.

For Statisticians: Maximize between-cluster variability and minimize within-cluster variability.

Both approaches are crucial for effective decision-making.
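
The statistical criterion can be read directly from a kmeans() fit. The sketch below again assumes the iris data, standardized with scale(), and 3 clusters as arbitrary example choices.

```r
set.seed(1)
fit <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)

fit$tot.withinss   # within-cluster variability (to be minimized)
fit$betweenss      # between-cluster variability (to be maximized)
fit$totss          # total variability

# The total splits exactly into the within and between parts
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)   # TRUE

# Share of the total variability explained by the clustering
ratio <- fit$betweenss / fit$totss
ratio
```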

Choosing Clusters (II)

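Clusters should be homogeneous internally and clearly different from one another.
Elbow Technique: A visual method that plots within-cluster variability against the number of clusters; the “elbow” where the curve flattens suggests a suitable number.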
Clusters must be useful and definable.
Clusters should not be too small unless they are markedly different from the rest.
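
A minimal sketch of the elbow technique in base R, assuming the standardized iris data and a range of 1 to 8 clusters as arbitrary example choices.

```r
set.seed(1)
x <- scale(iris[, 1:4])        # standardize so every variable has equal weight
ks <- 1:8                      # candidate numbers of clusters

# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve always falls as k grows, so the elbow is a judgment call rather than a strict rule. Checking the size element of the chosen fit helps spot clusters that are too small to be useful.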