Factor Analysis and PCA Guide

FACTOR ANALYSIS AND PRINCIPAL COMPONENTS ANALYSIS
This booklet contains copies of the slides presented in the classes. They are intended to save you having to scribble things down as I talk. I believe these slides contain technically correct details of the analysis, but if you detect any errors please let me know a.s.a.p.
The cautionary review article below is for the keen: Preacher, K.J. and MacCallum, R.C. (2003). Repairing Tom Swift's Electric Factor Analysis Machine. Understanding Statistics, 2, 13-43. (Obtainable from http://www.unc.edu/~rcm/papers/tomswift.pdf)
The following site is Marley Watkins' home page of useful software, including a package that does automatic parallel analyses: http://www.public.asu.edu/~mwwatkin/
Chris Fife-Schaw
Course texts:
Tabachnick, B.G. and Fidell, L.S. (2007). Using Multivariate Statistics (5th edn.). Boston: Pearson/Allyn and Bacon.
Field, A. (2009). Discovering Statistics Using SPSS (3rd edn.). London: Sage.
CFS Factor Analysis, last modified November 25, 2010
FACTOR ANALYSIS
Aims:
In exploratory mode (EFA = Exploratory Factor Analysis):
To describe (and possibly explain) the observed correlations between variables as parsimoniously as possible, with as few factors as possible.
"Orderly simplification of a number of interrelated measures" (Burt, 1940)
"To identify the not-directly-observable factors based on a set of observable variables" (Norusis, 1988)
In 'confirmatory' mode (CFA = Confirmatory Factor Analysis):
To test the adequacy of a theoretical prediction about the number of factors that underlie responses to a set of items from a given domain.
To support predictions about which variables (items) are 'caused' by which factors.
'Confirmatory' factor analyses are better conducted using a Structural Equation Modeling (SEM) package such as LISREL or EQS.
EXAMPLE ITEMS
a) Statistics is so much fun I have trouble controlling myself in class
b) I have pleasant dreams about mathematical symbols at night
c) When I attend stats on Monday afternoons I feel like killing myself
d) SPSS Incorporated is probably the single most socially wonderful organisation ever to have straddled the planet
e) If you like using SPSS you must be some sort of social deviant
f) If I had 1000 I would not hesitate to rush out and buy a copy of SPSS for Windows
BASIC ASSUMPTIONS
1) The items can be thought of as linear, bi-polar traits.
2) You can legitimately correlate the items with one another.
3) If A and B are correlated either: A causes B, B causes A, or A and B are both caused by something else (there are other possibilities but these will do for now).
4) Underlying dimensions, or factors, can be used to explain [some] complex phenomena. Observed correlations between variables [tests, items etc.] result from their sharing common [causal] factors.
5) Not much point to Factor Analysis if there aren't many big correlations between the items.
6) Your items/variables cover all major aspects of the research domain. Factor analysis cannot reveal factors to account for correlations between variables you have not measured. Obvious but.....
DATA REQUIREMENTS
1) The number of items in the analysis should be large enough to allow for at least 3 items per hypothesised factor (often difficult to know in advance). T+F suggest 6 per factor!
2) The sample should have at least 3 times as many members as there are items. Ideally samples should be bigger. T+F say (from Comrey and Lee, 1992):
   50    = very poor
   100   = poor
   200   = fair
   300   = good
   500   = very good
   1000+ = excellent
STAGES TO THE ANALYSIS (EFA)
1) Compute the correlation matrix
2) Extract factors
3) Rotate factors to achieve Simple Structure
4) Interpretation
5) Confirmation: preferably by collecting more data or by conducting CFA on fresh data using a SEM package
EXTRACTION (1)
1) Eigen analysis - program looks for 'clumps', or 'chunks', of variance in the correlation matrix. Each of these 'clumps' of variance can be thought of as the variance of the factors. The 'clumps' do not overlap - the factors account for separate portions of variance in the matrix.
2) Each item is correlated with each 'clump' or factor. These are called factor loadings (or structural coefficients - see later).
Factor analysis solutions are found using matrix algebra manipulations but I will spare you this.
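For readers who would like a glimpse of the matrix algebra, the extraction step can be sketched in a few lines. This is an illustration in Python with NumPy rather than SPSS, the correlation matrix R is invented, and it performs a principal-components style eigen analysis, so treat it as a sketch of the idea and not as what any particular package does internally.

```python
import numpy as np

# Hypothetical correlation matrix for four items (values invented for illustration).
R = np.array([
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
])

# Eigen analysis: eigh handles symmetric matrices and returns ascending eigenvalues.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]        # largest 'clump' of variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Loadings: correlation of each item (row) with each component (column).
loadings = eigenvectors * np.sqrt(eigenvalues)

print(np.round(eigenvalues, 3))   # variance accounted for by each component
print(np.round(loadings, 3))
```

The eigenvalues sum to the number of items (4 here) because each standardised item contributes a variance of 1, and the columns of `loadings` are mutually orthogonal, which is the sense in which the 'clumps' do not overlap.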
A (POSSIBLY CONFUSING!) WAY TO THINK ABOUT FACTOR EXTRACTION
Think of each bi-polar variable as a line with two ends(!). Lay them out so that the angle (θ) between them represents the correlation between the two variables, so that cosine θ = r.
Recall: cosine 0 degrees = 1; cosine 90 degrees = 0.
r is an (indirect) indication of shared variation between two variables.
Can I come up with an imaginary variable (a factor or axis or component) that shares a lot of variability with both A and B at the same time and might even be the cause of A and B's correlation? To do this I would need to create a new imaginary variable (a line) that had the smallest possible angle/highest possible correlation with A and B.
So that I can locate my two original lines in space precisely I need to create another factor/axis/component that is unrelated to this first factor/axis/component and therefore uncorrelated with it. Following the cosine = r rule this means that the second axis must be at right angles to the first one.
Loadings (Structural Coefficients - see later)
Essentially, this is how much of each factor each variable has.
[Figure: loading plot with two axes at right angles, each running from -1 to +1.]
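The cosine = r picture can be checked numerically. The sketch below (Python with NumPy, with an invented correlation of 0.5 between A and B) finds the two axes by eigen analysis of the 2 x 2 correlation matrix and compares the loadings on the first axis with the cosine of half the angle between A and B. It assumes a principal-components style extraction.

```python
import numpy as np

r = 0.5                                    # hypothetical correlation between A and B
theta = np.degrees(np.arccos(r))           # angle between the two lines, in degrees

R = np.array([[1.0, r], [r, 1.0]])
eigenvalues, eigenvectors = np.linalg.eigh(R)   # ascending order

# First axis: largest eigenvalue. Second axis: at right angles to the first.
first_axis = eigenvectors[:, 1] * np.sqrt(eigenvalues[1])
second_axis = eigenvectors[:, 0] * np.sqrt(eigenvalues[0])

print(round(float(theta), 1))                       # angle between A and B
print(np.round(np.abs(first_axis), 3))              # loadings of A and B on axis 1
print(round(float(np.cos(np.radians(theta / 2))), 3))  # cosine of half the angle
```

The first axis bisects the angle between A and B, so each variable's loading on it equals the cosine of half that angle; the (absolute) loadings on the second axis equal the sine of half the angle.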
All methods start by identifying as many factors/axes/components as there are items. This doesn't help to simplify your data, so there are a number of ways to help you decide how many to keep:
a) Extract only factors that have EIGEN values greater than 1. This is sometimes referred to as KAISER'S criterion or the KAISER-GUTTMAN rule. Eigen values are an indication of the proportion of variance in the correlation matrix that is explained by the factor. As every variable contributes a variance of 1 in a correlation matrix there is little point in extracting a factor that explains less variance than one variable does.
b) Scree test. Suggested by Raymond Cattell, it is a graphical method that uses a plot of Eigen values against numbers of factors. This gives a descending gradient that looks like a pile of scree. When the gradient levels out, additional factors are not really telling you much more about the data. Extract the number of factors indicated before the graph levels out.
[Figure: Scree Plot - Eigenvalue (vertical axis) against Factor Number (horizontal axis, 1 to 10).]
c) Look for factors that have only one or two items loading highly on them - these factors, if theoretically uninterpretable (see below), can probably be omitted.
d) If you have a theoretical prediction about the number of factors, extract that number and see if these produce Simple Structure and interpretable factors. This assumes you already have a theory about how many factors there are and is thus a weak kind of confirmatory factor analysis.
e) The parallel approach involves comparing your data set with an equivalent sized data set made up of random normally distributed numbers (SPSS syntax for doing this is included below). You run the EFA on both data sets and then compare the tables of eigenvalues associated with the factors. You extract only those factors where the eigenvalue in your data set exceeds that in the random data set. For example, if your 3rd factor has an eigenvalue of 1.7 and the 3rd factor in the random data set FA has a value of 1.5 then you can extract the third factor. If your 4th factor has a value of 1.3 and the 4th factor in the random data set FA has a value of 1.4 then you would not extract the 4th factor.
SPSS syntax for producing a random data matrix.
TITLE Parallel Analysis - After Thompson and Daniel, 1996.
INPUT PROGRAM.
+ LOOP #1 = 1 to 100000.
COMMENT Change 100000 above to the n in the sample being modeled.
+ DO REPEAT V = V1 TO V50.
COMMENT Change V50 above to be V-whatever the number of your variables is.
+ COMPUTE V = RND (NORMAL(5/6) + 3).
COMMENT Change 5 above to the maximum response value, and 3 to the
COMMENT middle response value (e.g. 3 is the mid point for a 1 to 5
COMMENT Likert style scale).
+ IF (V LT 1)V = 1.
+ IF (V GT 5)V = 5.
COMMENT Change the 5s above to the max poss response value.
+ END REPEAT.
+ END CASE.
+ END LOOP.
+ END FILE.
END INPUT PROGRAM.
Notes: 1) Some versions of SPSS require the use of the SET SEED command to generate different random numbers on a given run - see manuals. 2) No data will appear in the data window until you ask for an analysis to be done.
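For comparison, the same parallel approach can be sketched outside SPSS. The Python/NumPy code below is an illustration under stated assumptions: `random_eigenvalues` and `factors_to_extract` are hypothetical helper names, the random data mimic the COMPUTE line in the syntax above (normal scores with SD of max/6 around the mid point, rounded and clipped), and the counting rule stops at the first factor that fails the comparison, which is one common reading of the procedure.

```python
import numpy as np

def random_eigenvalues(n_cases, n_items, max_value=5, mid_value=3, seed=0):
    """Eigenvalues (largest first) of the correlation matrix of random Likert-style data."""
    rng = np.random.default_rng(seed)
    # Normal scores centred on the mid point, rounded to whole numbers
    # and clipped to the response range 1..max_value.
    data = rng.normal(loc=mid_value, scale=max_value / 6, size=(n_cases, n_items))
    data = np.clip(np.rint(data), 1, max_value)
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def factors_to_extract(observed, random):
    """Count leading factors whose observed eigenvalue exceeds the random one."""
    count = 0
    for obs, rnd in zip(observed, random):
        if obs <= rnd:
            break
        count += 1
    return count

# The 3rd and 4th values come from the worked example in the text;
# the first two in each list are invented to complete the illustration.
observed = [4.2, 2.5, 1.7, 1.3]
random_example = [1.7, 1.6, 1.5, 1.4]
print(factors_to_extract(observed, random_example))
print(np.round(random_eigenvalues(n_cases=239, n_items=10), 2))
```

In practice you would replace `observed` with the eigenvalues from your own analysis and set `n_cases` and `n_items` to match your data set. Changing `seed` plays the same role as SET SEED in SPSS.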
A WORD ON TERMINOLOGY
Each observed variable X (item/question etc.) is thought of as being the result of a linear combination of unobserved common factors and some unique-to-the-variable factor.
$$X_i = \underbrace{A_{i1}F_1 + A_{i2}F_2 + \dots + A_{ik}F_k}_{\text{Communality}} + \underbrace{U_i}_{\text{Uniqueness}}$$
a) Having worked out how much communality each item has, you can choose to ignore the item's unique and error variance. If you choose to include the uniqueness and error variance you are doing a PRINCIPAL COMPONENTS ANALYSIS.
Factor Analysis and Principal Components Analysis are often (usually) confused. Remember, 'true' Factor Analysis excludes uniqueness and error variance while Principal Components Analysis does not. However, the term 'Factor Analysis' is often used generically to describe all these classes of analysis, while 'Principal Components' is more normally used only when it is technically correct to use it.
b) The extraction method you choose depends on the measurement theory underlying your research. Accepted wisdom seems to be that Principal Components Analysis is the most robust, providing you have a reasonable sample size (>200). Use it when you are primarily interested in reducing a large number of items/variables to a smaller number of components. If you are using EFA to generate theory then Preacher and MacCallum (2003) argue for the use of a true factor analysis approach.
There are other methods of extraction: Maximum Likelihood, Principal Axis Factoring, Alpha Factoring, Image Factoring (see T+F and SPSS manuals for details).
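The difference between analysing all of the variance and analysing only the shared variance can be illustrated numerically. The sketch below (Python with NumPy, invented correlation matrix) compares an eigen analysis of the full correlation matrix with one where the 1s on the diagonal are replaced by communality estimates. Using squared multiple correlations as the estimates, and stopping after a single step, are assumptions made for brevity; packages typically refine the estimates iteratively.

```python
import numpy as np

# Hypothetical correlation matrix for four items (invented for illustration).
R = np.array([
    [1.0, 0.5, 0.4, 0.3],
    [0.5, 1.0, 0.4, 0.3],
    [0.4, 0.4, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
])

# Principal components: analyse R as it stands (1s on the diagonal = all variance).
pca_eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Factor-analysis style: replace the 1s with communality estimates so that only
# shared variance is analysed. Squared multiple correlations are one common estimate.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)
fa_eigenvalues = np.sort(np.linalg.eigvalsh(R_reduced))[::-1]

print(np.round(smc, 3))              # initial communality estimate per item
print(np.round(pca_eigenvalues, 3))  # sums to the number of items
print(np.round(fa_eigenvalues, 3))   # sums to the total estimated communality
```

Because the reduced matrix leaves out unique and error variance, its eigenvalues add up to less than the number of items, and the first factor accounts for less variance than the first principal component.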
ROTATION
THURSTONE'S SIMPLE STRUCTURE
Invariance - when a factor appears it should appear when similar analyses are done on related items
Uniqueness - if same domain is investigated identical configurations should appear (note this is a different use of the term 'uniqueness')
1) Each item should not load substantially on all the factors
2) If there are N factors there should be at least N non-significant loadings on each factor
3) For every pair of factors there should be several variables with zero loadings on one factor but, at the same time, significant loadings on the other
4) For every pair of factors a large proportion of the loadings should have non-significant (~0) values on both factors where there are 4+ factors
5) For every pair of factors there should only be a small proportion of loadings with significant values in both factors
There are several different ways to rotate the factors.
1) VARIMAX - this is one of the most commonly seen and used to be the default in SPSS. It assumes that you want 'orthogonal', uncorrelated factors, i.e. the factors share no variance with one another.
2) OBLIQUE - called 'Direct Oblimin' in SPSS for Windows. This provides a 'non-orthogonal' solution and the rotated factors can be correlated.
In the real world it is rare for a psychological theory to propose truly uncorrelated factors. Forcing rotated factors to be uncorrelated may misrepresent your data. Unless you have a strong 'orthogonal' hypothesis opt for Oblique rotation. This is not the default option in SPSS! There are other methods but use these only if you know what you are doing.
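To see what an orthogonal rotation is trying to achieve, the toy example below (Python with NumPy) rotates a two-factor solution through every whole-degree angle and keeps the one that maximises the varimax criterion, i.e. the variance of the squared loadings within each factor. The loadings are invented, and the brute-force search is purely illustrative; it is not the algorithm SPSS or any other package uses.

```python
import numpy as np

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return np.var(loadings ** 2, axis=0).sum()

def rotation(degrees):
    """2 x 2 orthogonal rotation matrix for the given angle."""
    a = np.radians(degrees)
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

# Hypothetical simple-structure loadings, deliberately rotated 30 degrees away
# to mimic an unrotated solution in which every item loads on both factors.
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.9, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
unrotated = simple @ rotation(30)

# Brute-force search: try each whole-degree rotation and keep the best one.
angles = np.arange(0, 90)
scores = [varimax_criterion(unrotated @ rotation(a)) for a in angles]
best_angle = int(angles[int(np.argmax(scores))])
rotated = unrotated @ rotation(best_angle)

print(best_angle)
print(np.round(unrotated, 2))
print(np.round(rotated, 2))
```

The rotated loadings recover the simple structure, although the factors may come back in a different order or with reversed signs. Each item's communality (the row sum of squared loadings) is unchanged by the rotation; the variance is only redistributed between the factors.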
INTERPRETATION
Rotation was done to make the factors more interpretable. Now you must interpret. Look at the matrix of loadings (structural coefficients - see later) of the items on the factors. If you did an oblique rotation you should look at the 'pattern matrix', which is a matrix of unique loadings (i.e. the unique contribution of the factor to the variance in the item/variable).
1) Look at the loadings of items on each factor. If you have achieved simple structure there should be few (no) items loading highly on more than one factor. 'High' conventionally means above 0.4. Remember these are like (but not the same as) correlations, so as a guide the loading squared gives a rough estimate of the proportion of variance the item shares with the factor. 0.4 squared gives 0.16, i.e. 16% shared variance. Some people argue that you can go as low as 0.3 but this is not advised. There is little point in trying to interpret a factor using items that share less variance than this.
2) Start with the highest loading item (in absolute terms) and work down to the lower loading items. High loading items probably have more 'to do' with the factor.
OBLIQUELY ROTATED FACTOR MATRIX (OBLIMIN)
PATTERN MATRIX
        Factor I   Factor II
Qa        0.92       0.08
Qb        0.96      -0.06
Qc       -0.95       0.01
Qd       -0.00       0.86
Qe       -0.01      -0.95
Qf        0.01       0.92

FACTOR CORRELATION MATRIX:
             FACTOR I   FACTOR II
FACTOR I       1.00
FACTOR II      0.46       1.00
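With an oblique rotation the pattern matrix and the structure matrix differ because the factors are correlated. The sketch below (Python with NumPy) uses the pattern coefficients and the factor correlation of 0.46 from the example above and applies the standard relationship structure = pattern x factor correlation matrix. It also flags which coefficients pass the conventional 0.4 cut-off.

```python
import numpy as np

items = ["Qa", "Qb", "Qc", "Qd", "Qe", "Qf"]

# Pattern matrix and factor correlation matrix from the oblimin example above.
pattern = np.array([
    [ 0.92,  0.08],
    [ 0.96, -0.06],
    [-0.95,  0.01],
    [-0.00,  0.86],
    [-0.01, -0.95],
    [ 0.01,  0.92],
])
phi = np.array([[1.00, 0.46],
                [0.46, 1.00]])

# Structure matrix: correlations between items and (correlated) factors.
structure = pattern @ phi

# 'High' loadings by the conventional 0.4 rule, judged on the pattern matrix.
high = np.abs(pattern) > 0.4

for name, p_row, s_row, h_row in zip(items, pattern, structure, high):
    print(name, np.round(p_row, 2), np.round(s_row, 2), h_row)
```

Every item has exactly one high pattern coefficient, which is what Simple Structure looks like. The structure coefficients are noticeably larger on the 'other' factor because they include variance shared through the factor correlation, which is why the pattern matrix is the one to read when interpreting an oblique solution.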
COMMON PITFALLS/PROBLEMS
1) Confusing confirmatory and exploratory approaches. Don't be disappointed when a reasonable looking EFA is not confirmed in a CFA - the criteria for getting a good CFA are tough.
2) Inappropriate samples
3) Using small samples
4) Strange item distributions - items don't have to be normally distributed but the solution is likely to be better if they are. Bi-modal or extremely skewed or kurtotic items are usually better excluded.
5) Correlations - relationships must be linear - check bi-variate scatterplots
6) Exclude items that are perfectly correlated with other items. Note none of the items should be mathematical combinations of other items in the analysis.
7) Trying to do FA when correlation matrix has only small (<0.3) values in it. Check the KMO statistic.
8) Outliers have an undue influence on correlations. Screen them first.
9) Communalities of <0.3 - item is unreliable
10) Confusing factors with scales
11) Struggling to fit factors to prior hypotheses
12) Trying to define a factor with only one or two items loading significantly on it
13) Failing to achieve Simple Structure
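The KMO check mentioned in the list above can be sketched as follows. This Python/NumPy function follows the usual definition of the overall Kaiser-Meyer-Olkin measure (sum of squared correlations divided by the sum of squared correlations plus squared partial correlations, off-diagonal cells only); the function name `kmo` and the example matrix are illustrative assumptions, and SPSS reports the same statistic directly.

```python
import numpy as np

def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure for a correlation matrix R."""
    inv = np.linalg.inv(R)
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale                   # partial correlations between item pairs
    off = ~np.eye(R.shape[0], dtype=bool)    # mask selecting off-diagonal cells
    r2 = (R[off] ** 2).sum()
    p2 = (partial[off] ** 2).sum()
    return r2 / (r2 + p2)

# Hypothetical correlation matrix: three items that all correlate 0.6.
R = np.array([[1.0, 0.6, 0.6],
              [0.6, 1.0, 0.6],
              [0.6, 0.6, 1.0]])
kmo_value = kmo(R)
print(round(float(kmo_value), 3))
```

Values near 1 mean the correlations are large relative to the partial correlations, i.e. there is plenty of common variance to analyse; values below about 0.5 suggest factor analysis is not sensible for that matrix.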
GOOD REPORTING PRACTICES The following article outlines what is now regarded as good practice for publishing factor analyses (both EFA and CFA) in APA and other classy journals. The article is: Thompson, B. and Daniel, L.G. (1996). Factor analytic evidence for the construct validity of scores: A historical overview and some guidelines. Educational and Psychological Measurement, 56(2), 197-208. (It is in the library!)
1) Do not refer to pattern matrix coefficients as loadings. The term loadings has multiple meanings and what most of us call loadings are better termed structural coefficients.
2) Always report full tables of structural coefficients (loadings) - don't use the BLANK command in SPSS.
3) The use of the Kaiser-Guttman rule for selecting the number of factors to extract (the default in SPSS) is naff. The article recommends either bootstrapping approaches (not easily done in SPSS) or parallel approaches which can be done in SPSS.
4) Always report the method of factor extraction (e.g. principal components, principal axis etc.) and the method of rotation (e.g. Varimax, Oblimin).
5) Report the criteria used for deciding how many factors to extract (multiple methods, if available, are to be preferred).
6) Report all item communalities.
7) Label factors with Roman numerals (e.g. Factor I, Factor II ..)
Report coefficients to 2-3 dec. places - no more.
FACTOR ANALYSIS SPSS EXERCISE
This simple exercise will get you to use SPSS for Windows to run a factor analysis on a set of 10 items. These items all deal with people's perceptions of themselves and the example data comes from 239 teenage females in Swindon (ESRC 16-19 data set). Note that we have nearly 24 times as many cases as items.
The Data
Responses to each item were on a 5-point Likert scale with respondents being asked to indicate how much they agreed or disagreed with the following statements:
Q1 If I can't do a job the first time I keep trying until I can do it
Q2 I do not know how to handle social gatherings
Q3 I give up easily
Q4 I am happy to be the person I am
Q5 I feel that I am as worthwhile as anybody else
Q6 I seem to be capable of dealing with most problems that come up in life
Q7 I avoid trying to learn new things when they look too difficult for me
Q8 If I could, I would be a very different person to the one I am now
Q9 I sometimes cannot help but wonder if anything is worthwhile
Q10 I am often troubled by the emptiness in my life
A score of 1 indicated strong agreement with the item, 5 indicated strong disagreement. A copy of the SPSS for Windows data file is to be found on the network drive and is called: O:\cfslabcl\factor.sav There are no missing data.
1) Start SPSS for Windows and retrieve the data file. Throughout the session please try not to change these data as they will be used by other classes.
2) Conduct a factor analysis on these 10 items. Although I know that there is a structure that underlies these items, you will have to deal with this factor analysis as an exploratory factor analysis (EFA). Work through the menus to find FACTOR and see what you need to include. Remember factor analysis is fundamentally about reducing data (this is a clue). As a first stab at an exploratory analysis I would suggest that you use the Principal Axis Factoring extraction method, adopt the (rather unsatisfactory) Kaiser-Guttman 'eigenvalues greater than 1' extraction criterion and rotate the factors obliquely (Direct Oblimin). Ask for the scree plot and get SPSS to display the inter-item correlation matrix. To make life easier you can ask for the structural coefficients to be ordered by size in the tables (in the Options). If you have time later you can try the parallel approach described in the notes. Ask for the factor loading plots too.
In the descriptives box tick the box for the KMO test. This is the Kaiser-Meyer-Olkin measure of sampling adequacy, and figures less than 0.5 suggest that there is too little common variance in the correlation matrix to make factor analysis sensible; it is a kind of check to see whether you perhaps shouldn't be doing EFA in the first place.
Try to interpret the factors, bearing in mind that you are not really concerned with interpreting the items which have low (<0.4) structural coefficients/loadings on the factors. Read the 'Pattern Matrix' not the 'Structure matrix'.
REMEMBER - in a real research situation you would not simply do one factor analysis like this and leave things at that. The present example has been selected because it is particularly clear(?) - normally you will experience much more confusing factor patterns and you may have to run several different factor analyses using different criteria (e.g. numbers of factors extracted) before you are satisfied that you know the structure underlying your data.
3) Re-run the analysis using the VARIMAX rotation criterion and see how it differs from the Direct Oblimin solution given above. Remember to read the rotated factor matrix here because there is no pattern matrix to look at with VARIMAX rotations.
If in doubt at any stage press the 'HELP' button before asking one of the demonstrators to help.
The Exercise
Please fill in the answers to the following before you leave today. Beware trick questions!
1) Which two items are the most highly correlated with each other? ______________
   Is this a surprise? __________ Why? ____________________________
2) How many factors does the default 'eigen values greater than 1' selection criterion suggest should be extracted? ___ (=X)
   How many factors does the Scree plot suggest? ___
3) Look at the X factor solution. (X = the number suggested by the answer to the first part of Q2)
   What percentage of the total variance is accounted for by the first X factors? ____%
   What is/are the correlation(s) between these X factors? _____
   Has the analysis achieved 'Simple Structure'? YES NO
4) How would you interpret factor I? _________________________
   And factor II? ______________________________
   And factor III? ______________________________
   And factor IV? ______________________________
5) What is the main difference between the oblique and varimax solutions? ___________________________________________________________________
   Why is this? ___________________________________________________________________
   Which solution is better and why? ___________________________________________________________________
