How R Has Changed How We Look at Principal Component Analyses

A typical PCA table that I trawled from the internet.

This post isn’t really very anole-specific, but because lots of studies of anoles use principal component analyses, I think it’s at least tangentially relevant.

PCA is a way to reduce the variation in a data set to a few dimensions by constructing new variables, called PC axes, that combine sets of highly intercorrelated variables into a smaller number of variables. I won’t go into the details of the method here, because Ambika Kamath explains it all in a post she wrote on her blog a while back.

What I want to mention here is how we interpret these new statistical axes. Back in my day, computer programs spit out a matrix of numbers like the one above, which we called “loadings.” These values represented how strongly an individual’s value for each variable was correlated with that individual’s score on the new axes. So, for example, in the table above, scores on PC axis 1 correlate most strongly with an individual’s values for the top four variables (sodium, calcium, magnesium, and chloride concentrations) and most weakly with melt percentage and some other variables.

Now, everyone uses the computer program R to conduct PCAs, and R, too, spits out “loadings.” But those are not your father’s loadings (or my loadings). Rather, those values are the coefficients of the equation that defines each PC axis (a PC axis is a linear combination of all the variables). Thus, in the example above, individuals that scored high on PC 1 would have the largest concentrations of the top four variables, whereas an individual’s melt percentage would have little impact on its score on PC 1. Back in the day, we could also access those values, but we called them “coefficients.”
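For concreteness, here is a minimal sketch in R of how to get both quantities from the same fit. (This is just an illustration: the built-in iris measurements stand in for data like the water-chemistry table above, and prcomp is base R’s PCA function.)

    # PCA on standardized variables (i.e., on the correlation matrix);
    # the iris measurements are stand-in data for illustration.
    dat <- scale(iris[, 1:4])
    pca <- prcomp(dat)

    # What R reports: the eigenvector coefficients that define each PC
    # axis (prcomp labels them the "rotation").
    pca$rotation

    # The old-style "loadings": the correlation between each original
    # variable and the individuals' scores on each PC axis.
    cor(dat, pca$x)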

Does this really matter? Only to the extent that what much of the literature used to call “coefficients” is now called “loadings,” and what used to be called “loadings” apparently isn’t routinely spit out by R. And, more importantly, most R users are completely unaware of the switcheroo.
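Part of the confusion is that base R’s two PCA functions even use different names for the same matrix of coefficients; a quick illustration, with the same stand-in data:

    dat <- scale(iris[, 1:4])
    prcomp(dat)$rotation    # called the "rotation" here
    princomp(dat)$loadings  # called "loadings" here -- the same
                            # eigenvectors, possibly with flipped signs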

Ambika did a very preliminary analysis to see whether the values of coefficients (new “loadings”) and correlations (old “loadings”) are very different. Her tentative conclusion is that they aren’t, so maybe this doesn’t matter much, but it might be worth looking into more.
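A rough check along those lines is easy to do in R. For a PCA on standardized data, the two sets of values differ within each axis only by a constant factor (the standard deviation of that PC), so the relative pattern across variables is identical; continuing the sketch above:

    dat <- scale(iris[, 1:4]); pca <- prcomp(dat)   # as in the sketch above
    coefs <- pca$rotation       # the new "loadings" (coefficients)
    corrs <- cor(dat, pca$x)    # the old "loadings" (correlations)

    # Each column of the ratio is constant and equals that PC's
    # standard deviation:
    corrs / coefs
    pca$sdev

    # Overall, the two kinds of values track each other closely:
    plot(as.vector(coefs), as.vector(corrs),
         xlab = "coefficients", ylab = "correlations")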

Jonathan Losos


4 Comments

  1. One of my issues with PCA is more about why it is being used in the first place than about this naming issue. I would agree that the results being spat out by R and what was reported in the past are both useful statistics. Personally, I use SAS as my preferred statistical tool, but that is a matter of preference.

    My issue stems from whether or not one should be using PCA in a given scenario at all; in particular, for questions of species divergence, DFA (Discriminant Function Analysis) may be the better tool of choice. This comes down to the assumptions of the statistical test, which must be met for an analysis to hold.

  2. Thank Goodness I am comfortably retired! Skip

  3. A typical output from a Principal Components Analysis in SAS back in the 1980s contained the elements of the eigenvectors. These were untransformed coefficients (also called direction cosines) that gave the contribution of each variable to the rotation of each axis in the n-dimensional phenotype space. Loadings were introduced by psychometricians to aid in the interpretation of PCAs and as an approximation of the solution to Factor Analysis. PCA and Factor Analysis were often used to reduce the dimensionality of a large number of variables.

    The “loadings” produced by R’s PCA functions are the elements of each eigenvector from the decomposition of the covariance (or correlation) matrix. That is, PCA in R is more like the original version of PCA.

    Also, major axis regression is more similar to PCA than to ordinary linear regression. The deviations from the fitted line are assumed to be orthogonal to the axis, whereas in linear regression the deviations are measured as the distance of the response variable from the fitted line.

  4. David Cannatella

    The uncertainty about what the column values (“loadings” in a vague sense) are in a PC analysis has been around ever since I was in grad school. The rule was that one should always look at the documentation for the software (a similar warning applies to determining whether a correlation or covariance matrix was used for the calculation of the PCs).

    As I remember, the “coefficients” for a column (PC) are part of the equation specifying the axis. They can be used to convert the original data to the scores on PC1, PC2, etc. that are plotted on the biplots.

    The “correlations” for a particular column are the product of the coefficient and the square root of the eigenvalue (variance) for that column. The latter term is equivalent to the standard deviation, so the correlations are “de-standardized” and their magnitudes have meaning.

    So, the coefficients and correlations are related, but the scaling factor (the square root of the eigenvalue) varies among PCs, so the relationship is not 1:1 across all PCs.

    Folks who do morphometrics usually find the “correlation” type to be more useful.

    At least that’s how I remember it; someone correct me if I’m wrong. (A quick R check of this algebra is sketched below.)
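    A minimal sketch checking that algebra in R (assuming a PCA on standardized data, i.e., the correlation matrix, and again using the built-in iris measurements as stand-in data):

        dat <- scale(iris[, 1:4])
        pca <- prcomp(dat)

        # The coefficients define the axes: they convert the original
        # data into the PC scores that get plotted.
        scores <- dat %*% pca$rotation
        all.equal(scores, pca$x, check.attributes = FALSE)   # TRUE

        # Correlations = coefficient * sqrt(eigenvalue); prcomp's sdev
        # is the square root of each eigenvalue.
        corrs <- pca$rotation %*% diag(pca$sdev)
        all.equal(unname(corrs), unname(cor(dat, pca$x)))    # TRUE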
