Professional Documents
Culture Documents
Michael Friendly
-3.1
York University
Always fun
2.3
7.0
Admit?: Yes
Brown
Admit?: No
Wifes rating
Very Often
4.4
-2.2
Never fun Fairly Often Very Often Always fun
40 35 30
557
-3.1
Husbands Rating
Topics:
2.3 7.0
Sqrt(frequency)
25 20 15 10 5 0 -5 0 2 4 6 8 10 Number of males 12
4.4
2 2 tables and fourfold displays Sieve diagrams Observer agreement Correspondence analysis
Brown
2 / 58 2 x 2 tables
n-way tables
Correspondence analysis and MCA (CORRESP macro) R: most of these in the vcd package
fourfold(), sieve(), mosaic(), agreementplot(), . . . more general Correspondence analysis: ca package
3 / 58
4 / 58
2 x 2 tables
Standard analysis
Output:
Statistics for Table of gender by admit Statistic DF Value Prob -----------------------------------------------------Chi-Square 1 92.2053 <.0001 Likelihood Ratio Chi-Square 1 93.4494 <.0001 Continuity Adj. Chi-Square 1 91.6096 <.0001 Mantel-Haenszel Chi-Square 1 92.1849 <.0001 Phi Coefficient 0.1427
How to analyse these data? How to visualize & interpret the results? Does it matter that we collapsed over Department?
Quarter circles: radius nij area frequency Independence: Adjoining quadrants align Odds ratio: ratio of areas of diagonally opposite cells Condence rings: Visual test of H0 : = 1 adjoining rings overlap
Sex: Male 1198 1493
Data in Table 2 had been pooled over departments Stratied analysis: one fourfold display for each department Each 2 2 table standardized to equate marginal frequencies Shading: highlight departments for which Ha : i = 1
Department: A
Sex: Male 512 Admit?: Yes 313 Admit?: Yes Admit?: No 353
Department: B
Sex: Male 207 Admit?: Yes Admit?: No
Department: C
Sex: Male 120 205 Admit?: No 202 391 Sex: Female
89
19 Sex: Female
17 Sex: Female
Admit?: Yes
Admit?: No
Department: D
Sex: Male 138 Admit?: Yes 279 Admit?: Yes Admit?: No
Department: E
Sex: Male 53 138 Admit?: Yes Admit?: No
Department: F
Sex: Male 22 351 Admit?: No 24 317 Sex: Female
1278
131
244
94
Sex: Female
Only one department (A) shows association; A = 0.349 women (0.349)1 = 2.86 times as likely as men to be admitted.
8 / 58
2 x 2 tables
Fourfold displays
2 x 2 tables
Fourfold displays
*/ */ */ */
Other graphical methods can show these eects. (This ignores possibility of structural bias against women: dierential funding of elds to which women are more likely to apply.)
8 9 10 11 12 9 / 58 2 x 2 tables Odds ratio plots
Aggregate data: rst sum over departments, using the TABLE macro:
%table(data=berkeley, out=berk2, var=Admit Gender, /* omit dept */ weight=count, /* frequency variable */ order=data); %ffold(data=berk2, var=Admit Gender);
10 / 58 Two-way tables
Black 5 15 20 68 108
Blond 16 10 94 7 127
0.5
With a 2 test (PROC FREQ) we can tell that hair-color and eye-color are associated. The more important problem is to understand how they are associated. Some graphical methods:
A B C D E F
1.0
0.5
0.0
Department
Two-way tables
Sieve diagrams
Two-way tables
Sieve diagrams
Sieve diagrams
Height/width marginal frequencies, ni + , n+j Area expected frequency, m ij ni + n+j Shading observed frequency, nij , color: sign(nij m ij ). Independence: Shown when density of shading is uniform.
Green Hazel
5 15 29 54 14 14 16 10
Eye Color
Eye Color
Blue
39.2
103.9
25.8
46.1
215
This display shows expected frequencies, assuming independence, as # boxes within each cell The boxes are all of the same size (equal density)
Blue
20
84
17
94
Brown
40.1
106.3
26.4
47.2
220
Brown
68
119
26
108
286
71
127
592
Black
Red
Blond
Black
Brown
Red
Blond
Hair Color
13 / 58 14 / 58 Two-way tables Sieve diagrams
Two-way tables
Sieve diagrams
Sieve diagrams
Eect ordering: Reorder rows/cols to make the pattern coherent
Sieve diagrams
Vision classication data for 7477 women
Unaided distant vision data
High
Right Eye Grade
Blue
20
84
17
94
Eye Color
Green Hazel
5 15
29 54
14 14
16 10
Brown
68
119
26
Low
Black Brown Red Blond
High
Low
Hair Color
15 / 58
Two-way tables
Sieve diagrams
Two-way tables
Sieve diagrams
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A Dept
F E D
18 / 58 Observer Agreement
Two-way tables
Sieve diagrams
Observer Agreement
Inter-observer agreement often used as to assess reliability of a
subjective classication or assessment procedure
square table, Rater 1 x Rater 2 Levels: diagnostic categories (normal, mildly impaired, severely impaired)
Admitted
F E D
Admit
A Dept
Measures of Agreement:
Intraclass correlation: ANOVA framework multiple raters! Cohens : compares the observed agreement, Po = pii , to agreement expected by chance if the two observers ratings were independent, Pc = pi + p+i . Po Pc = 1 Pc
Rejected
19 / 58
20 / 58
Observer Agreement
Cohens kappa
Observer Agreement
Cohens kappa
Cohens
Properties of Cohens :
perfect agreement: = 1 minimum may be < 0; lower bound depends on marginal totals Unweighted : counts only diagonal cells (same category assigned by both observers). Weighted : allows partial credit for near agreement. (Makes sense only when the categories are ordered .)
Cohens : Example
The table below summarizes responses of 91 married couples to a questionnaire item, Sex is fun for me and my partner (a) Never or occasionally, (b) fairly often, (c) very often, (d) almost always.
Weights:
Cicchetti-Alison (inverse integer spacing) vs. Fleiss-Cohen (inverse square spacing)
1 2/3 1/3 0 Integer Weights 2/3 1/3 1 2/3 2/3 1 1/3 2/3 0 1/3 2/3 1 Fleiss-Cohen 1 8/9 8/9 1 5/9 8/9 0 5/9 Weights 5/9 0 8/9 5/9 1 8/9 8/9 1
--------- Wife's Rating -------Husband's Never Fairly Very Almost Rating fun often Often always | SUM --------------------------------------------------+------Never fun 7 7 2 3 | 19 Fairly often 2 8 3 7 | 20 Very often 1 5 4 9 | 19 Almost always 2 8 9 14 | 33 --------------------------------------------------+------SUM 12 28 18 33 | 91
22 / 58
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
23 / 58
24 / 58
Observer Agreement
Cohens kappa
Observer Agreement
Cohens kappa
Example: Diagnostic classication of mulitiple sclerosis by two neurologists, for two populations (Landis and Koch, 1977)
Winnipeg patients NO rater: Cert Prob Pos Doubt -------------------Winnipeg rater: Certain MS Probable Possible Doubtful MS 38 33 10 3 5 11 14 7 0 3 5 3 1 0 6 10 Cert Prob Pos Doubt -------------------5 3 2 1 3 11 13 2 0 4 3 4 0 0 4 14 New Orleans patients
Analysis:
proc freq; tables strata * rater1 * rater2 / agree;
26 / 58
27 / 58
28 / 58
Observer Agreement
Cohens kappa
Observer Agreement
Cohens kappa
Homogeneity test: H0 : 1 = 2 = = k
Statistic Chi-Square DF Pr > ChiSq -----------------------------------------------Simple Kappa 0.9009 1 0.3425 Weighted Kappa 1.1889 1 0.2756 Total Sample Size = 218
Construction:
BN =
k i
2 nii
k i
ni + n+i
Never fun Fairly Often Never fun Fairly Often Very Often
Very Often
Always fun
Always fun
Husbands Rating
31 / 58 32 / 58
Observer Agreement
Observer Agreement
Wifes rating
Add shaded rectangles, size sum of frequencies, Abi , within b steps of main diagonal weighted measure of agreement,
w BN = k i q b =1
wb Abi ]
ni + n+i
Never fun Fairly Often Never fun Fairly Often Very Often
Very Often
w1
w2
Always fun
Always fun
Husbands Rating
33 / 58 Observer Agreement Observer Agreement Chart Observer Agreement Observer Agreement Chart 34 / 58
agreeplot macro
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Wife
Very Often
proc format; value rating 1='Never_fun' 2='Fairly_often' 3='Very_often' 4='Almost_always'; data sexfun; format Husband Wife rating.; do Husband = 1 to 4; do Wife = 1 to 4; input count @@; output; end; end; datalines; 7 7 2 3 2 8 3 7 1 5 4 9 2 8 9 14 ; *-- Convert numbers to formatted values; %table(data=sexfun, var=Husband Wife, char=true, weight=count, out=table); %agreeplot(data=table, var=Husband Wife, title=Husband and Wife Sexual Fun);
33
18
Fairly Often
28
To preserve ordering, integer values are used for Husband and Wife A SAS format is used to provide value labels The table macro converts numeric character
35 / 58
Never Fun
Husband
36 / 58
Observer Agreement
Marginal homogeneity
Observer Agreement
Marginal homogeneity
Doubtful
1 2 3 4 5 6 7 8 9 10
Winnipeg Neurologist
Possible
Winnipeg Neurologist
Probable
Probable
Certain
11 12 13 14 15 16 17 18
Certain Certain
Probable
Possible
Doubtful
Certain
Probable
Possible
Doubtful
agreemar.sas title2 'Testing equal marginal proportions'; proc catmod data=ms; weight count; response marginals; model win_diag * no_diag = _response_ / oneway; repeated neuro 2 / _response_= neuro;
agreemar.sas title2 'Testing equal means'; proc catmod data=ms; weight count; response means; model win_diag * no_diag = _response_ / oneway; repeated neuro 2 / _response_= neuro;
Output:
Testing equal marginal proportions Analysis of Variance Source DF Chi-Square Pr > ChiSq -------------------------------------------Intercept 3 222.62 <.0001 Neuro 3 10.54 0.0145 Residual 0 . .
Output:
Testing equal means Analysis of Variance Source DF Chi-Square Pr > ChiSq -------------------------------------------Intercept 1 570.61 <.0001 Neuro 1 7.97 0.0048 Residual 0 . .
Correspondence analysis
Basic ideas
Correspondence analysis
Basic ideas
Correspondence analysis
Correspondence analysis (CA)
* HAIR COLOR
Green
Analog of PCA for frequency data: account for maximum % of 2 in few (2-3) dimensions nds scores for row (xim ) and column (yjm ) categories on these dimensions uses Singular Value Decomposition of residuals from independence, dij = (nij mij )/ mij d ij = m xim yjm n m=1
M
optimal scaling : each pair of scores for rows (xim ) and columns (yjm ) have highest possible correlation (= m ). plots of the row (xim ) and column (yjm ) scores show associations
-0.5 -1.0
-0.5
0.5
1.0
Interpretation: row/column points near each other are positively associated Dim 1: 89.4% of 2 (dark light) Dim 2: 9.5% of 2 (RED/Green vs. others)
41 / 58 Correspondence analysis Basic ideas Correspondence analysis Basic ideas 42 / 58
CORRESP macro
Uses PROC CORRESP for analysis Produces labeled plots of the category points in either 2 or 3 dimensions Many graphic options; can equate axes automatically See: http://datavis.ca/sasmac/corresp.html
Raw category responses (case form), or cell frequencies (frequency form), classied by 2 or more factors (e.g., output from PROC FREQ) Obs 1 2 3 4 ... 15 16 Eye Brown Brown Brown Brown Green Green HAIR BLACK BROWN RED BLOND RED BLOND Count 68 119 26 7 14 16
43 / 58
R
The ca package provides 2-way CA, MCA and more plot(ca(data)) gives reasonable (but not yet beautiful) plots Other R packages: caGUI, vegan, ade4, FactoMiner, ...
44 / 58
Correspondence analysis
Basic ideas
Correspondence analysis
Basic ideas
corresp2a.sas data haireye; input EYE $ BLACK BROWN RED BLOND ; datalines; Brown 68 119 26 7 Blue 20 84 17 94 Hazel 15 54 14 10 Green 5 29 14 16 ;
ods rtf; /* ODS destination: rtf, html, latex, ... */ ods graphics on; proc corresp data=haireye short; id eye; /* row variable */ var black brown red blond; /* col variables */ ods graphics off; ods rtf close;
46 / 58
Row and column points are distinguished by the _TYPE_ variable: OBS vs. VAR
Column Coordinates Dim1 BLACK BROWN RED BLOND -.504562 -.148253 -.129523 0.835348
Correspondence analysis
Basic ideas
Correspondence analysis
CA in R
CA in R: the ca package
> HairEye <- margin.table(HairEyeColor, c(1, 2)) > library(ca) > ca(HairEye) Principal inertias 1 Value 0.208773 Percentage 89.37% ... (eigenvalues): 2 3 0.022227 0.002598 9.52% 1.11%
* HAIR COLOR
Green
0.4
Blue
BLOND
0.2
Black
Brown
Blue
Blond
0.0
Brown
Dim 1 (89.4%)
0.4 0.2 0.0 0.2 0.4 0.6 0.8
0.4
-0.5 -1.0
0.2
Hazel
Red
-0.5
0.0
0.5
1.0
Green
50 / 58
Multi-way tables
Correspondence analysis can be extended to n-way tables in several ways:
Stacking approach
n-way table attened to a 2-way table, combining several variables interactively Each way of stacking corresponds to a loglinear model Ordinary CA of the attened table visualization of that model Associations among stacked variables are not visualized
I x J x K table
I J J K J K
The variables combined are treated interactively Each way of stacking corresponds to a loglinear model
(I J ) K [AB][C] I (J K ) [A][BC] J (I K ) [B][AC]
Here, I only describe the stacking approach, and only with SAS
In SAS 9.3, the MCA option with PROC CORRESP provides some reasonable plots. For R, see the ca package the mjca() function is much more general
51 / 58
52 / 58
Correspondence analysis
Multi-way tables
Correspondence analysis
Multi-way tables
PROC CORRESP: Use TABLES statement and option CROSS=ROW or CROSS=COL. E.g., for model [A B] [C],
proc corresp cross=row; tables A B, C; weight count;
M M M M M F F F F F
54 / 58
CA Graph:
0.7
Dimension 2 (33.0%)
%include catdata(suicide); *-- equate axes!; axis1 order=(-.7 to .7 by .7) length=6.5 in label=(a=90 r=0); axis2 order=(-.7 to .7 by .7) length=6.5 in; %corresp(data=suicide, weight=count, tables=%str(age sex, method), options=cross=row short, vaxis=axis1, haxis=axis2);
suicide5.sas
Gas
10-20 * F
25-35 * F
0.0
40-50 * F Jump
40-50 * M
Output:
Inertia and Chi-Square Decomposition Singular Values 0.32138 0.23736 0.09378 0.04171 0.02867 Principal ChiInertias Squares Percents 12 24 36 48 60 ----+----+----+----+----+--0.10328 5056.91 60.41% ************************* 0.05634 2758.41 32.95% ************** 0.00879 430.55 5.14% ** 0.00174 85.17 1.02% 0.00082 40.24 0.48% ------------0.17098 8371.28 (Degrees of Freedom = 45)
55 / 58
Hang 55-65 * M
70-90 * M
56 / 58
Correspondence analysis
Multi-way tables
Summary: Part 2
Summary: Part 2
>8
Fourfold displays
Odds ratio: ratio of areas of diagonally opposite quadrants Condence rings: visual test of H0 : = 1 Shading: highlight strata for which Ha : = 1
FEMALE
4:8
2:4
Sieve diagrams
Rows and columns marginal frequencies area expected Shading observed frequencies Simple visualization of pattern of association SAS: sieveplot macro; R: sieve()
Agreement
Cohens : strength of agreement Agreement chart: visualize weighted & unweighted agreement, marginal homogeneity SAS: agreeplot macro; R: agreementplot()
<-8
Standardized residuals:
Correspondence analysis
Decompose 2 for association into 1 or more dimensions scores for row/col categories CA plots: Interpretation of how the variables are related SAS: corresp macro; R: ca()
57 / 58 58 / 58
10-20
25-35
40-50
55-65
>65