Covariance matrix (S) –> The diagonal holds the variances (spread) and the off-diagonal entries hold the covariances (sign and orientation of each pairwise relationship). It measures the linear dependence between observations of each pair of variables. Its rank is the number of linearly independent variables in the data matrix, which is also the number of eigenvalues different from 0. The larger the variance, the larger the spread.
Generalized variance (|S|) –> When |S| = 0, some variables are linear combinations of the others. An eigenvalue close to 0 whose eigenvector has all entries roughly equal indicates that the sum of the variables is nearly constant. In the example, the second eigenvalue is also very close to 0 and its eigenvector is (−0.07, −0.29, −0.07, 0.91, 0.06, −0.1, −0.12, −0.05, −0.22); since all entries are close to 0 except the fourth (0.91), that variable has very small variance, i.e. it is almost constant across the observed data.
Why |S| = 0 –> Two or more variables sum up to a constant; two variables are identical or differ only in mean or variance; new variables were created as sums of the original ones.
Correlation matrix (R) –> Covariances are difficult to interpret, so it is often more useful to work with correlations, which are always between −1 and 1. R is square and symmetric, with 1s on the principal diagonal and the pairwise Pearson correlation coefficients elsewhere, each measuring the linear relationship between the row and column variables.
Pearson correlation coefficient, interpretation –> rcorr(as.matrix(medifis[,-1])). 1–0.9, very high: strong linear relationship, the variables are measuring almost the same thing. 0.9–0.6, high: strong relationship, you can be confident that these two variables are connected in some way. 0.6–0.3, moderate: moderate relationship, you might be on to something or you might not; if it is close to 0.3 it is better to call it low or weak, not moderate. 0.3–0, negligible: very weak relationship, it may be an artifact of the data set and in fact there may be no linear relationship at all.
Coefficient of determination (R²) –> It measures how well a linear combination of the other variables predicts Xj. Example: height (V2) is the variable best linearly explained by the others (R² = 0.93), followed by foot length (V4, R² = 0.89); the worst linearly explained is skull diameter (V7, R² = 0.51).
Q-Q plot –> Plots the data quantiles against those of a reference distribution together with a straight line; the closer the points are to the line, the more the data resemble the distribution being checked. Used to check whether the data are normally distributed.
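A minimal R sketch of these checks, assuming the medifis data frame from the notes (first column dropped); rcorr() comes from the Hmisc package:
library(Hmisc)
X <- medifis[, -1]
S <- cov(X)                    # covariance matrix: variances on the diagonal
det(S)                         # generalized variance |S|; near 0 signals linear dependences
eigen(S)$values                # number of non-zero eigenvalues = rank of S
rcorr(as.matrix(X))            # correlation matrix R: Pearson coefficients with p-values
R2 <- 1 - 1 / diag(solve(cor(X)))   # R^2 of each variable regressed on all the others
qqnorm(X[, 2]); qqline(X[, 2])      # Q-Q plot to check normality of one variable (e.g. V2)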
Partial correlations –> cor2pcor(cor(medifis[-1])). They remove the influence of the remaining variables and are used to explore the relationship between a pair of variables after taking the values of the other variables into account. They should be compared with the corresponding ordinary correlations. Interpretation: 1. Partial and ordinary correlations are approximately equal: the relationship between the variables of interest cannot be explained by the remaining variables upon which we are conditioning. 2. Partial correlations closer to zero than ordinary correlations: the relationship between the variables of interest might be explained by their common relationships with the variables upon which we are conditioning. 3. Partial correlations farther from zero than ordinary correlations: this rarely happens. Example: the strongest partial relationships are between V5 and V2 (arm length and height, 0.515) and between V4 and V2 (foot length and height, 0.48).
Effective dependence –> 1-det(cor(medifis[-1]))^(1/6) = 0.77: linear dependences explain 77% of the variability in the data set.
Mahalanobis distance –> Can be used to detect outliers. If the data come from a p-dimensional normal distribution, the squared distances follow a χ² distribution with p degrees of freedom.
Variables involved in linear dependences, via eigenvalues/eigenvectors –> 1. Take the eigenvalues close to 0 (0.0…). 2. Go to the eigenvectors of those eigenvalues and keep the entries that are clearly different from 0 (roughly beyond ±0.1); the corresponding variables are the ones involved in a possible linear relationship in the original data set.
Box-Cox –> The p-value depends on the value of lambda being tested.
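A minimal R sketch of these diagnostics, assuming medifis as above; cor2pcor() comes from the corpcor package, the 1/6 exponent matches the notes' example (the general form is 1/(p−1)), and the 0.05 eigenvalue threshold is only illustrative:
library(corpcor)
X <- medifis[, -1]
R <- cor(X)
cor2pcor(R)                        # partial correlations, to be compared with R
1 - det(R)^(1/6)                   # effective dependence coefficient
d2 <- mahalanobis(X, colMeans(X), cov(X))    # squared Mahalanobis distances
which(d2 > qchisq(0.99, df = ncol(X)))       # flag potential outliers against chi-square_p
e <- eigen(R)                      # the covariance matrix S can be used instead
small <- which(e$values < 0.05)    # eigenvalues close to 0
e$vectors[, small]                 # entries far from 0 mark the variables involved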
LR test, lambda = 0: LRT = 3.3667, df = 1, p-value = 0.06586
H0: lambda = 0 vs H1: lambda != 0; p-value > 0.05, so the log transformation is acceptable.
LR test, lambda = 1: LRT = 18.8333, df = 1, p-value = 1.4563e-05
H0: lambda = 1 vs H1: lambda != 1; p-value too low, so we reject H0 (no transformation).
Two tests are run: lambda = 0 (which stands for the logarithmic transformation) and lambda = 1 (no transformation is needed to improve the data). In this case the p-value for lambda = 1 is too small, so we reject the hypothesis that lambda = 1 is an adequate transformation value. However, we can use the log transformation because the p-value for lambda = 0 is greater than 0.05.
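A minimal R sketch that produces this kind of output, assuming a positive variable x; powerTransform() and testTransform() are from the car package:
library(car)
bc <- powerTransform(x)
summary(bc)                      # includes LR tests that lambda equals 0 and 1
testTransform(bc, lambda = 0)    # H0: lambda = 0 (log transformation)
testTransform(bc, lambda = 1)    # H0: lambda = 1 (no transformation)
# If lambda = 0 is not rejected (p > 0.05) use log(x); rejecting lambda = 1 means
# leaving the data untransformed is not supported.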
PCA –> With PCA we seek to reduce the dimensionality of a data set (condense the information in the variables) while retaining as much as possible of the variation present in the data.
PCA goals –> Summarize a data set with the help of a small number of synthetic variables (the principal components). Visualize the position (resemblance) of individuals. Visualize how variables are correlated. Interpret the synthetic variables.
PCA applications –> 1. Dimension reduction. 2. Visualization. 3. Feature extraction. 4. Data compression. 5. Smoothing of data. 6. Detection of outliers. 7. Preliminary step for further analyses.
How PCA summarizes the information –> The first PC is obtained so that it represents the main part of the variation. The second PC captures a smaller amount of variation, and so on; the last PC captures the smallest amount of variation. Main requirement: the PCs, Z1, …, Zr, need to capture the most variation in the data X. Convenient requirement: to avoid a PC capturing the same variation as other PCs (i.e. to avoid redundant information), we may also require them to be mutually orthogonal, so they are uncorrelated with each other.
Role of each element in PCA –> Eigenvalues provide information about the amount of variability captured by each principal component. Scores (PCs) provide coordinates to graphically represent the objects in a lower-dimensional space. Loadings provide the correlation between variables and components; they often help to interpret the components, as they measure the contribution of each individual variable to each principal component, but they do not indicate the importance of a variable to a component in the presence of the others. Eigenvectors are unit-scaled loadings; they are the coefficients of the orthogonal transformation (rotation) of the variables into PCs.
Components to retain –> Include just enough components to explain a predetermined portion of the variation, for instance 70% or 90% (these percentages generally become smaller as p or n increases). || Exclude the components whose eigenvalues are less than the average, or less than one if the components have been extracted from the correlation matrix. || Look for an elbow in the scree plot; this point is considered to be where the large eigenvalues cease and the small eigenvalues begin. || Kaiser's rule: retain the PCs with eigenvalue λk > 1 (Jolliffe: > 0.7).
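A minimal R sketch of a PCA and of the retention criteria above, assuming a numeric data frame X (e.g. the cereal nutrition variables used in the notes):
pca <- prcomp(X, scale. = TRUE)       # PCs extracted from the correlation matrix
summary(pca)                          # proportion of variance captured by each PC
pca$sdev^2                            # eigenvalues; Kaiser: keep lambda_k > 1 (Jolliffe > 0.7)
screeplot(pca, type = "lines")        # look for the elbow
pca$rotation                          # eigenvectors (unit-scaled loadings)
pca$x                                 # scores: coordinates of the individuals on the PCs
pca$rotation %*% diag(pca$sdev)       # loadings: correlations between variables and PCs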
The first principal component is a linear combination of fiber, potassium and protein (and, with smaller and negative weights, calories and carbohydrates). Without being a nutritionist, cereals with a high value on this first PC can be classified as healthier than the others. The second PC is a linear combination of sugars, calories and fat (and, with a smaller and negative weight, carbohydrates), a possible indicator of unhealthiness. These first two PCs account for 58.39% of the variance in the data set. It seems that the healthier cereals are on shelf 3, followed by shelf 1 and shelf 2 (see the Dim 1 column, or the distance: the smaller it is, the stronger the presence of that component). The variables best represented in this two-dimensional graph are fiber, potassium, sugars and carbohydrates. Fiber and potassium are also highly correlated. Sodium is the variable with the worst representation on this map.
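A minimal R sketch of how these conclusions can be read off a PCA, assuming the numeric nutrition columns are in cereals_num and the shelf factor in cereals$shelf (hypothetical names); FactoMineR and factoextra are one common toolchain for this:
library(FactoMineR)
library(factoextra)
res <- PCA(cereals_num, scale.unit = TRUE, graph = FALSE)
res$eig                                   # % of variance per component (first two here: 58.39%)
res$var$cos2[, 1:2]                       # quality of representation of each variable on the first plane
fviz_pca_var(res)                         # correlation circle: fiber and potassium plot close together
fviz_pca_ind(res, habillage = cereals$shelf)   # individuals coloured by shelf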