Principal Component Analysis and Cluster Analysis
Principal Component Analysis (PCA)
PCA is a mathematical method that uses an orthogonal transformation to convert a set of correlated variables into uncorrelated variables called the principal components. The first component has the highest variance (it captures the most variation in the data), followed by the second, the third, and so on, and the components are uncorrelated with one another (remember the orthogonal directions). Normalizing the data becomes extremely important when the predictors are measured in different units. The dataset should contain numeric variables only; if there are any non-numeric variables, we need to exclude them with bracket notation or with the `subset` function. Each eigenvector consists of p values (the loadings), which describe how that component is built from the original variables, and the eigenvectors are uncorrelated (orthogonal). The idea is to choose a small subset of the most useful components and thereby compress the high-dimensional data down to a shorter description.
Using the `prcomp` Function
The `prcomp` function is used to obtain the principal components. By default, it centers the variables to have a mean equal to zero. With the parameter `scale. = TRUE`, it also scales the variables to have a standard deviation equal to 1, which is the standardization we need before extracting the components.
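As a minimal sketch, assuming the data are stored in a data frame called `hep` (the name is illustrative), the components could be obtained like this:

```r
# Keep only the numeric columns (prcomp requires numeric input)
hep_num <- hep[, sapply(hep, is.numeric)]

# Center each variable (mean 0) and scale to unit variance before the rotation
hepPCA <- prcomp(hep_num, scale. = TRUE)
```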
`print(hepPCA)` returns the standard deviation of each of the PCs and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables.
The rotation matrix provides the principal component loadings; each of its columns contains a principal component loading vector.
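For example, continuing with the illustrative `hepPCA` object:

```r
print(hepPCA)     # standard deviations and the rotation (loadings) matrix
hepPCA$sdev       # standard deviation of each principal component
hepPCA$rotation   # loadings: one column per principal component
```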
Scree Plot
The `plot` method shows the variance explained by each component; this is known as a scree plot. Components 1 and 2 capture almost all the variability in the data, so we will keep them. (A common criterion: after standardization each original variable has variance 1, so only a component that explains more than 1.0 unit of variance is really helpful, but this rule is not absolute.)
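A sketch of the scree plot, again using the illustrative `hepPCA` object:

```r
# Variance explained by each component (the variances are the squared sdev values)
screeplot(hepPCA, type = "lines", main = "Scree plot")

# The same information drawn by hand
plot(hepPCA$sdev^2, type = "b", xlab = "Component", ylab = "Variance")
```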
Summary Method
The `summary` method describes the importance of the PCs, giving information on how much variance each component covers. The first row gives the standard deviation associated with each PC, the second row the proportion of the variance in the data explained by each component, and the third row the cumulative proportion of explained variance.
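For example, continuing with the illustrative `hepPCA` object:

```r
summary(hepPCA)

# The underlying matrix with the three rows described above
summary(hepPCA)$importance
```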
`biplot(hepPCA)` shows the PCA projections for both data points and variables in the same plot.
We can use the `predict` function if we observe new data and want to obtain their principal component scores.
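A sketch, assuming `new_hep` is a data frame of new observations containing the same numeric variables (the name is hypothetical):

```r
# Project the new observations onto the existing components;
# the centering and scaling stored in hepPCA are applied automatically
predict(hepPCA, newdata = new_hep)
```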
Cluster Analysis
K-means
This procedure is much more efficient than hierarchical clustering when we have large datasets. It is an algorithm that tries to cluster data based on their similarity, and the clusters it produces do not form a tree. The main idea is to generate k clusters, each represented by a mean center (centroid), with the clusters as far apart from one another as possible.
We need to specify how many clusters we want to consider.
- Step 1: Partition the items into K initial clusters.
- Step 2: Assign each item to the cluster whose centroid (mean) is closest.
- Step 3: Update the cluster centroids.
- Step 4: Reassign each item to its nearest centroid, recalculating the centroid of the cluster receiving the item and of the cluster losing it.
- Step 5: Update the cluster centroids.
Repeat the process until no more reassignments are made.
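In R this procedure is available through the `kmeans` function. A minimal sketch, reusing the illustrative numeric data frame `hep_num` from the PCA example (the choice of three clusters and the seed are arbitrary):

```r
set.seed(123)                        # k-means starts from random centroids
hep_scaled <- scale(hep_num)         # standardize so no variable dominates
km <- kmeans(hep_scaled, centers = 3, nstart = 25)

km$cluster    # cluster assignment of each observation
km$centers    # final cluster centroids
```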
To choose the optimal number of clusters, we calculate the within-group variability for different runs of the K-means function and choose a number of clusters for which this variability is small while the number of clusters is not too high. To this end, we generate a vector, SSW, to store the sum of the within-group variances, and then we plot the values stored in the vector.
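A sketch of this procedure, reusing `hep_scaled` from above (the upper limit of 10 clusters is an arbitrary choice):

```r
SSW <- numeric(10)
for (k in 1:10) {
  SSW[k] <- kmeans(hep_scaled, centers = k, nstart = 25)$tot.withinss
}

# Look for the "elbow": the point after which SSW stops dropping sharply
plot(1:10, SSW, type = "b",
     xlab = "Number of clusters", ylab = "Within-group variability")
```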
Hierarchical Clustering
We don’t need to specify how many clusters we want.
- Step 1: It starts out by putting each observation into its own separate cluster.
- Step 2: It then examines all the distances between all the observations and puts together the two closest ones to form a new cluster.
- Step 3: Once a cluster has been formed, we’ll determine which observation will join it based on the distance between the cluster and the observation.
Some of the methods that have been proposed to do this are to take the minimum distance between an observation and any member of the cluster (single linkage), to take the maximum distance (complete linkage), to take the average distance (average linkage), or to use a measure that minimizes the distances between observations within the cluster (Ward's method).
Euclidean Distance
Using Euclidean distance only makes sense if all the variables have the same units. The algorithm starts from the distance matrix between all the points and chooses the smallest distance. These two points are joined together, and the algorithm then looks for the next smallest distance, continuing in this way until we have the dendrogram. The Euclidean distance measures the distance between two points, first in the plane and then in space; it generalizes to spaces of three or more dimensions and gives the length of the segment defined by two points on a line, in the plane, or in higher-dimensional spaces. It is based on the Pythagorean Theorem.
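In R the distance matrix can be obtained with the `dist` function, which uses the Euclidean distance by default. A minimal sketch, again on the illustrative standardized data:

```r
hep_std <- scale(hep_num)                   # put all variables on the same scale
d <- dist(hep_std, method = "euclidean")    # pairwise distance matrix

# The same distance computed by hand for the first two observations (Pythagoras)
sqrt(sum((hep_std[1, ] - hep_std[2, ])^2))
```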
We use the `hclust` function on this distance matrix to carry out the clustering. By default, it defines the distance between two clusters to be the maximum distance between their individual components (complete linkage).
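A minimal sketch, reusing the distance matrix `d` from the sketch above:

```r
hc <- hclust(d, method = "complete")   # complete linkage is the default
plot(hc)                               # draws the dendrogram described below
# Other linkage choices: method = "single", "average", or "ward.D2"
```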
Dendrogram
This is a tree that lists the objects which are clustered along the x-axis, and the distance at which the cluster was formed along the y-axis. If there’s a big difference along the y-axis between the last merged cluster and the currently merged one, that indicates that the clusters formed are probably doing a good job in showing us the structure of the data.
Mahalanobis Distance
The Mahalanobis distance takes the covariances into account. Using the Mahalanobis distance makes sense when the variables have different units. It is a way to determine the similarity between two multidimensional random variables, and it differs from the Euclidean distance in that it takes into account the correlation between the variables.
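R provides the `mahalanobis` function, which returns squared Mahalanobis distances from a chosen centre. A minimal sketch, measuring how far each observation of the illustrative data frame `hep_num` lies from the multivariate mean:

```r
centre <- colMeans(hep_num)     # multivariate mean
covmat <- cov(hep_num)          # sample covariance matrix

d2      <- mahalanobis(hep_num, center = centre, cov = covmat)  # squared distances
d_mahal <- sqrt(d2)                                             # Mahalanobis distances
```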