
Visualizing Categorical Data with SAS and R

Michael Friendly

Part 4: Model-based methods for categorical data


York University

Short Course, 2012

Web notes: datavis.ca/courses/VCD/

[Title-slide figures: log odds of admission by department and gender (logit(Admit) = Dept DeptA*Gender); linear and logit regressions of Probability (Improved) on age (Arthritis treatment data); hair colour by eye colour; right eye grade by left eye grade (unaided distant vision data); Sqrt(frequency) by number of males]

Topics:

Logit models
    Plots for logit models
    Diagnostic plots for generalized linear models

Logistic regression models
    Logistic regression: Binary response
    Model plots
    Effect plots for generalized linear models
    Influence measures and diagnostic plots

Logit models
For a binary response, each loglinear model is equivalent to a logit model (logistic regression, with categorical predictors),
e.g., Admit ⊥ Gender | Dept (conditional independence ≡ [AD][DG]):

$$\log m_{ijk} = \mu + \lambda_i^A + \lambda_j^D + \lambda_k^G + \lambda_{ij}^{AD} + \lambda_{jk}^{DG}$$

So, for admitted (i = 1) and rejected (i = 2), we have:

$$\log m_{1jk} = \mu + \lambda_1^A + \lambda_j^D + \lambda_k^G + \lambda_{1j}^{AD} + \lambda_{jk}^{DG} \qquad (7)$$

$$\log m_{2jk} = \mu + \lambda_2^A + \lambda_j^D + \lambda_k^G + \lambda_{2j}^{AD} + \lambda_{jk}^{DG} \qquad (8)$$

Thus, subtracting (7)-(8), terms not involving Admit will cancel:

$$L_{jk} = \log m_{1jk} - \log m_{2jk} = \log(m_{1jk}/m_{2jk}) = \text{log odds of admission}$$

$$= (\lambda_1^A - \lambda_2^A) + (\lambda_{1j}^{AD} - \lambda_{2j}^{AD})$$

$$= \alpha + \beta_j^{Dept} \qquad \text{(renaming terms)}$$

where,
    $\alpha$: overall log odds of admission
    $\beta_j^{Dept}$: effect on admissions of department $j$
    ⟹ associations among predictors are assumed, but don't appear in the logit model


Logit models
Other loglinear models have similar, simpler forms as logit models, where only the relations of the response to the predictors appear in the equivalent logit model.

Admit ⊥ Gender ⊥ Dept (mutual independence ≡ [A][D][G])

$$\log m_{ijk} = \mu + \lambda_i^A + \lambda_j^D + \lambda_k^G$$

$$\equiv L_{jk} = (\lambda_1^A - \lambda_2^A) = \alpha \qquad \text{(constant log odds)}$$

Admit ⊥ Gender | Dept, except for Dept. A

$$\log m_{ijk} = \mu + \lambda_i^A + \lambda_j^D + \lambda_k^G + \lambda_{ij}^{AD} + \lambda_{jk}^{DG} + \delta_{(j=1)} \lambda_{ik}^{AG}$$

$$\equiv L_{jk} = \log(m_{1jk}/m_{2jk}) = \alpha + \beta_j^{Dept} + \delta_{(j=1)} \beta^{Gender}$$

where,
    $\beta_j^{Dept}$: effect on admissions for department $j$,
    $\delta_{(j=1)} \beta^{Gender}$: 1 df term for effect of gender in Dept. A.

Logit models: Overview

Fitting procedures
    PROC CATMOD, PROC LOGISTIC
    PROC GENMOD / dist=poisson
    SPSS: Logistic regression, Loglinear → Logit, Generalized Linear Models
    R: glm(), gnm()

Visualization procedures
    CATPLOT macro: plot predicted, observed log odds from CATMOD
    INFLGLIM macro: influence plots for generalized linear models
    HALFNORM macro: half-normal plot of residuals for generalized linear models

SAS craft
    All SAS procedures → output dataset with obs., fitted values, residuals, diagnostics, etc.
    New model → new output dataset
    Plotting steps remain the same
    Similar ideas for SPSS, R
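The equivalence between a loglinear model and its logit form can be checked numerically. The following is a minimal R sketch, not taken from the course files; it assumes that R's built-in UCBAdmissions table holds the same Berkeley data, and all object names are illustrative. Under [AD][DG], the fitted log odds of admission should depend on department only, and should equal the marginal log odds within each department.

```r
# Berkeley admissions as a frequency data frame (Admit varies fastest)
berkeley <- as.data.frame(UCBAdmissions)

# Loglinear model [AD][DG], fitted as a Poisson GLM
loglin.mod <- glm(Freq ~ Admit * Dept + Dept * Gender,
                  family = poisson, data = berkeley)
berkeley$fit <- fitted(loglin.mod)

# Fitted log odds of admission, L_jk, for each Dept x Gender cell
adm  <- berkeley[berkeley$Admit == "Admitted", ]
rej  <- berkeley[berkeley$Admit == "Rejected", ]
L.jk <- log(adm$fit / rej$fit)

# Marginal log odds of admission by department: alpha + beta_j
marg   <- apply(UCBAdmissions, c(1, 3), sum)      # Admit x Dept table
L.dept <- log(marg["Admitted", ] / marg["Rejected", ])
L.dept.cells <- unname(L.dept[as.character(adm$Dept)])

# Side-by-side comparison: Gender does not appear in the logit model
check <- data.frame(Dept = adm$Dept, Gender = adm$Gender,
                    L.jk = round(L.jk, 3),
                    L.dept = round(L.dept.cells, 3))
print(check)
```

The two columns of log odds agree within each department, which is the numerical counterpart of the statement that associations among predictors do not appear in the logit model.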


Plots for logit models


Fit: PROC CATMOD; plot: CATPLOT macro
Model: Admit ~ Gender + Dept ≡ loglinear [AD] [AG] [DG]

    proc catmod order=data data=berkeley;
       weight freq;
       response / out=predict;
       model admit = dept gender / ml;
    run;
    %catplot(data=predict, xc=dept, class=gender,
             type=FUNCTION, z=1.96, legend=legend1);

Plots observed and predicted on the logit scale (type=FUNCTION)
Main effects model → parallel profiles
Probabilities on a separate scale (added below)

[Figure: "Model: logit(Admit) = Dept Gender". Log Odds (Admitted), with Probability (Admitted) on the right-hand scale, vs. Department; separate curves for Gender (Female, Male)]

Logit models: details

Model: Admit ~ Gender + Dept ≡ [AD] [AG] [DG]

catberk2.sas

    %include catdata(berkeley);
    proc catmod order=data data=berkeley;
       weight freq;
       response / out=predict;
       model admit = dept gender / ml;
    run;

PROC CATMOD output: Overall tests and goodness of fit

    Maximum Likelihood Analysis of Variance

    Source              DF   Chi-Square   Pr > ChiSq
    -------------------------------------------------
    Intercept            1       262.49       <.0001
    dept                 5       534.78       <.0001
    gender               1         1.53       0.2167

    Likelihood Ratio     5        20.20       0.0011

No effect of Gender; big effect of Dept
LR test (vs. saturated model): Model doesn't fit well
Why? How to modify?

Plots for logit models: Output data set

PROC CATMOD output data set: observed & predicted, probabilities & logits

    dept  gender  admit   _TYPE_     _OBS_   _PRED_  _SEPRED_
    A     Male            FUNCTION   0.492    0.582     0.069
    A     Male    Admit   PROB       0.621    0.642     0.016
    A     Male    Reject  PROB       0.379    0.358     0.016
    A     Female          FUNCTION   1.544    0.682     0.099
    A     Female  Admit   PROB       0.824    0.664     0.022
    A     Female  Reject  PROB       0.176    0.336     0.022
    B     Male            FUNCTION   0.534    0.539     0.086
    B     Male    Admit   PROB       0.630    0.631     0.020
    B     Male    Reject  PROB       0.370    0.369     0.020
    B     Female          FUNCTION   0.754    0.639     0.116
    B     Female  Admit   PROB       0.680    0.654     0.026
    B     Female  Reject  PROB       0.320    0.346     0.026
    ...
    F     Male            FUNCTION  -2.770   -2.724     0.158
    F     Male    Admit   PROB       0.059    0.062     0.009
    F     Male    Reject  PROB       0.941    0.938     0.009
    F     Female          FUNCTION  -2.581   -2.625     0.158
    F     Female  Admit   PROB       0.070    0.068     0.010
    F     Female  Reject  PROB       0.930    0.932     0.010

This contains both the observed and fitted logit values (_TYPE_='FUNCTION') and probabilities (_TYPE_='PROB')
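For readers working in R, the main-effects logit model above can be fitted with glm(). This is a hedged sketch rather than part of the course files: it assumes the built-in UCBAdmissions table matches the berkeley dataset used by PROC CATMOD, and the object names (berk, berk.logit) are illustrative. If the data are the same, the residual deviance should correspond to the Likelihood Ratio line of the CATMOD output.

```r
# Grouped binomial form: one row per Dept x Gender cell
tab  <- UCBAdmissions                               # Admit x Gender x Dept
berk <- expand.grid(Gender = dimnames(tab)$Gender,
                    Dept   = dimnames(tab)$Dept)
berk$admitted <- as.vector(tab["Admitted", , ])     # Gender varies fastest
berk$rejected <- as.vector(tab["Rejected", , ])

# Main-effects logit model: logit(Admit) = Dept + Gender
berk.logit <- glm(cbind(admitted, rejected) ~ Dept + Gender,
                  family = binomial, data = berk)

# Residual deviance = likelihood ratio G^2 against the saturated model
G2 <- deviance(berk.logit)
df <- df.residual(berk.logit)
p  <- pchisq(G2, df, lower.tail = FALSE)
print(round(c(G2 = G2, df = df, p = p), 4))

# Observed and fitted logits, analogous to the _TYPE_='FUNCTION' rows
berk$obs  <- log(berk$admitted / berk$rejected)
berk$pred <- predict(berk.logit, type = "link")
print(berk[, c("Dept", "Gender", "obs", "pred")], digits = 3)
```

Note that R uses reference-cell (treatment) coding by default, so individual coefficients are parameterized differently from CATMOD's effect coding, while fitted logits and the overall test are unaffected.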

CATPLOT macro

Plot logit values (_TYPE_='FUNCTION') or probabilities (_TYPE_='PROB')
With PSCALE macro, can plot on logit scale, with probability scale on right.

catberk2.sas

    %pscale(lo=-4, hi=3, anno=pscale);
    title 'Model: logit(Admit) = Dept Gender'
          a=-90 'Probability (Admitted)';
    axis1 order=(-3 to 2) offset=(4)
          label=(a=90 'Log Odds (Admitted)');
    axis2 label=('Department') offset=(4);
    %catplot(data=predict, class=gender, xc=dept,
             type=FUNCTION,     /* plot logit values */
             z=1.96,            /* show 1.96 x SE -> 95% CI */
             anno=pscale);      /* add probability scale */

[Figure: "Model: logit(Admit) = Dept Gender". Log Odds (Admitted), with Probability (Admitted) on the right-hand scale, vs. Department; separate curves for Gender (Female, Male)]

No effect of Gender, except in Dept A (Females more likely admitted!)

Fitting and graphing other models

Change MODEL statement → new fitted values
Plotting step remains the same
Admit ⊥ Gender | Dept, except for Dept. A ≡ Admit ~ Dept + $\delta_{(j=1)}$ Gender

    proc catmod order=data data=berkeley;
       response / out=predict;
       model admit = dept dept1AG / ml;
    %catplot(data=predict, xc=dept, class=gender,
             type=FUNCTION, z=1.96, legend=legend1);

[Figure: "logit(Admit) = Dept DeptA*Gender". Log Odds (Admitted) vs. Department; separate curves for Gender (Female, Male)]

Fitting and graphing other models: details

Model: Admit ⊥ Gender | Dept, except for Dept. A
Need to define a dummy variable for effect of Gender in Dept. A

catberk6.sas

    %include catdata(berkeley);
    data berkeley;
       set berkeley;
       *-- Dummy variable for Gender in Dept A;
       dept1AG = (gender='F') * (dept=1);
       format dept dept.;
    proc catmod order=data data=berkeley;
       weight freq;
       population dept gender;
       direct dept1AG;
       response / out=predict;
       model admit = dept dept1AG / ml;
    run;
    ...

Fitting and graphing other models: details

PROC CATMOD output:

    Maximum Likelihood Analysis of Variance

    Source              DF   Chi-Square   Pr > ChiSq
    -------------------------------------------------
    Intercept            1       291.22       <.0001
    dept                 5       571.45       <.0001
    dept1AG              1        16.04       <.0001

    Likelihood Ratio     5         2.68       0.7489

    Analysis of Maximum Likelihood Estimates

                              Standard        Chi-
    Parameter      Estimate      Error      Square   Pr > ChiSq
    ------------------------------------------------------------
    Intercept       -0.6685     0.0392      291.22       <.0001
    dept      A      1.1606     0.0705      271.21       <.0001
              B      1.2113     0.0802      227.95       <.0001
              C      0.0528     0.0687        0.59       0.4426
              D      0.00358    0.0727        0.00       0.9607
              E     -0.4210     0.0871       23.34       <.0001
    dept1AG          1.0521     0.2627       16.04       <.0001

Fits well! How to interpret?

Fitting and graphing other models: details

PROC CATMOD: observed and predicted logits:

catberk6.sas

    proc print data=predict;
       id dept gender;
       var _obs_ _pred_ _sepred_;
       format _numeric_ 6.3 dept dept.;
       where(_type_='FUNCTION');
    run;

    dept  gender   _OBS_   _PRED_  _SEPRED_
    A     M        0.492    0.492     0.072
    A     F        1.544    1.544     0.253
    B     M        0.534    0.543     0.086
    B     F        0.754    0.543     0.086
    C     M       -0.536   -0.616     0.069
    C     F       -0.660   -0.616     0.069
    D     M       -0.704   -0.665     0.075
    D     F       -0.622   -0.665     0.075
    E     M       -0.957   -1.090     0.095
    E     F       -1.157   -1.090     0.095
    F     M       -2.770   -2.676     0.152
    F     F       -2.581   -2.676     0.152
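One way to read the dept1AG estimate: it is the difference in log odds of admission between women and men in Dept A, so exp(1.0521), about 2.86, is the corresponding odds ratio, while all other departments share a single logit for both genders. The sketch below refits the model in R as a check. It is an illustration under the assumption that the built-in UCBAdmissions table matches the course data; the dummy variable is built here in R, not by the SAS data step.

```r
# Grouped binomial form: one row per Dept x Gender cell
tab  <- UCBAdmissions                               # Admit x Gender x Dept
berk <- expand.grid(Gender = dimnames(tab)$Gender,
                    Dept   = dimnames(tab)$Dept)
berk$admitted <- as.vector(tab["Admitted", , ])
berk$rejected <- as.vector(tab["Rejected", , ])

# Dummy variable: 1 for women applying to Dept A, 0 otherwise
berk$dept1AG <- as.numeric(berk$Dept == "A" & berk$Gender == "Female")

# logit(Admit) = Dept + DeptA*Gender
berk.mod2 <- glm(cbind(admitted, rejected) ~ Dept + dept1AG,
                 family = binomial, data = berk)

b.AG       <- unname(coef(berk.mod2)["dept1AG"])    # log odds ratio, F vs M in Dept A
odds.ratio <- exp(b.AG)
G2         <- deviance(berk.mod2)                   # LR test vs. saturated model
df         <- df.residual(berk.mod2)

print(round(c(estimate = b.AG, odds.ratio = odds.ratio, G2 = G2, df = df), 3))
```

With this parameterization the residual deviance pools the gender-by-admission associations within departments B to F, which is why a small value indicates that gender matters only in Dept A.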

Fitting and graphing other models: details

catberk6.sas

    title 'logit(Admit) = Dept DeptA*Gender';
    %catplot(data=predict, x=dept, class=gender,
             type=FUNCTION,    /* plot the log odds */
             z=1.96);          /* 95% error bars */

[Figure: "logit(Admit) = Dept DeptA*Gender". Log Odds (Admitted) vs. Department; separate curves for Gender (Female, Male)]

Diagnostic plots for Generalized Linear Models

INFLGLIM macro: Influence plots for generalized linear models (Williams, 1987)
    Fit: PROC GENMOD; calculates additional diagnostic measures (Hat value, Cook's D, etc.)
    Plot: measures of residual (GY = $\Delta\chi^2$, $\chi^2$ residual) vs. leverage (GX = hat value), bubble size (area, radius) ∼ Cook's D.
    → which cells have undue impact on fitted model?

INFLGLIM macro: Example

Berkeley data, model [AD][GD] ≡ $L_{jk} = \alpha + \beta_j^{Dept}$

genberk1.sas

    %include catdata(berkeley);
    *-- make a cell ID variable, joining factors;
    data berkeley;
       set berkeley;
       cell = trim(put(dept,dept.)) || gender || trim(put(admit,yn.));
    run;
    %inflglim(data=berkeley, class=dept gender admit,
              resp=freq, model=admit|dept gender|dept,
              dist=poisson, id=cell,
              gx=hat, gy=streschi);

All cells which do not fit ($|r_i| > 2$) are for department A.
Males applying to dept A have large leverage → large influence (Cook's D)

Influence plots in R

The influencePlot() function in the car package gives similar plots:

berkeley-diag.R

    berkeley <- as.data.frame(UCBAdmissions)
    ...
    berk.mod <- glm(Freq ~ Dept * (Gender+Admit), data=berkeley, family="poisson")
    influencePlot(berk.mod, id.n=3, id.col="red")

[Figure: Studentized Residuals vs. Hat-Values; labeled cells include AFAdm, AMRej, FMRej, BMRej, BMAdm, AFRej, AMAdm]

Diagnostic plots for Generalized Linear Models

HALFNORM macro: Half-normal plot of residuals (Atkinson, 1981)
    Plot ordered absolute residuals, $|r|_{(i)}$, vs. expected normal values, $|z|_{(i)}$
    Standard normal confidence envelope not suitable for GLMs
    Simulate reference line and envelope with simulated confidence intervals

genberk1.sas

    %halfnorm(data=berkeley, class=dept gender admit,
              resp=freq, model=dept|gender dept|admit,
              dist=poisson, id=cell);
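The simulation idea behind the half-normal plot can be sketched in base R. This is an illustration of the general technique, not a port of the HALFNORM macro: the number of simulations (19), the use of standardized deviance residuals, and the quantile approximation for the expected half-normal order statistics are all assumptions made here for the example.

```r
berkeley <- as.data.frame(UCBAdmissions)
form     <- Freq ~ Dept * (Gender + Admit)           # loglinear model [AD][GD]
berk.mod <- glm(form, data = berkeley, family = poisson)

n    <- nrow(berkeley)
nsim <- 19                    # 19 simulations give a rough 95% envelope
set.seed(2012)

# Ordered absolute standardized deviance residuals from the fitted model
obs <- sort(abs(rstandard(berk.mod)))

# Simulate responses from the fitted model, refit, collect ordered |residuals|
sims <- replicate(nsim, {
  d <- berkeley
  d$Freq <- rpois(n, fitted(berk.mod))
  sort(abs(rstandard(glm(form, data = d, family = poisson))))
})

lower  <- apply(sims, 1, min)     # envelope
upper  <- apply(sims, 1, max)
middle <- apply(sims, 1, mean)    # simulated reference line

# Approximate expected half-normal order statistics
z <- qnorm((1:n + n - 1/8) / (2 * n + 1/2))

plot(z, obs, ylim = range(obs, upper),
     xlab = "Expected value of half normal quantile",
     ylab = "Absolute Std Deviance Residual")
lines(z, middle, lty = 2)
lines(z, lower)
lines(z, upper)
```

Points falling well above the upper envelope are the cells the model fits poorly; refitting inside the simulation is what makes the envelope appropriate for a GLM rather than relying on normal theory.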

[Figure: half-normal plot. Absolute Std Deviance Residual vs. Expected value of half normal quantile; labeled cells: AF-, AM-, AM+, AF+, EF+]

Points with largest |residual| labeled
The model fits well, except in department A.

Logistic regression models

Response variable
    Binary response: success/failure, vote: yes/no
    Binomial data: x successes in n trials (grouped data)
    Ordinal response: none < some < severe depression
    Polytomous response: vote Liberal, Tory, NDP, Green

Explanatory variables
    Quantitative regressors: age, dose
    Transformed regressors: √age, log(dose)
    Polynomial regressors: age², age³, ...
    Categorical predictors: treatment, sex
    Interaction regressors: treatment × age, sex × age

Logistic regression models: Binary response

For a binary response, $Y \in \{0, 1\}$, want to predict $\pi = \Pr(Y = 1 \mid x)$
Linear regression will give predicted values outside $0 \le \pi \le 1$
Logistic model:
    $\text{logit}(\pi) \equiv \log[\pi/(1-\pi)]$ avoids this problem
    logit is interpretable as "log odds" that Y = 1
Probit (normal transform) model → similar predictions, but is less interpretable

[Figure: Probability vs. Predictor (-3 to 3), comparing Linear, Logistic and Normal curves]

Logistic regression models: Binary response

Quantitative predictor: Linear and Logit regression on age
Except in extremes, linear and logistic models give similar predicted values

[Figure: "Arthritis treatment data: Linear and Logit Regressions on Age". Probability (Improved) vs. AGE (20 to 80)]

Logistic regression models: Binary response

For a binary response, $Y \in \{0, 1\}$, let $x$ be a vector of $p$ regressors, and $\pi_i$ be the probability, $\Pr(Y = 1 \mid x)$.
The logistic regression model is a linear model for the log odds, or logit, that Y = 1, given the values in $x$,

$$\text{logit}(\pi_i) \equiv \log\left(\frac{\pi_i}{1-\pi_i}\right) = \alpha + x_i^T \beta = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$$

An equivalent (non-linear) form of the model may be specified for the probability, $\pi_i$, itself,

$$\pi_i = \{1 + \exp(-[\alpha + x_i^T \beta])\}^{-1}$$

The logistic model is a linear model for the log odds, but also a multiplicative model for the odds of "success",

$$\frac{\pi_i}{1-\pi_i} = \exp(\alpha + x_i^T \beta) = \exp(\alpha) \exp(x_i^T \beta)$$

so, increasing $x_{ij}$ by 1 increases $\text{logit}(\pi_i)$ by $\beta_j$, and multiplies the odds by $e^{\beta_j}$.

Logistic regression models: Binary response

Fitting
    PROC LOGISTIC (or ROBUST macro: M-estimation)
    Data:
        Frequency form (from PROC FREQ) when all predictors are discrete
        Case form when any predictors are quantitative
    Models:
        CLASS statement (V7+) → no need for dummy variables
        discrete predictors can specify order and parameterization (effect, polynomial, reference cell)
        MODEL statement allows GLM syntax, e.g.,

            proc logistic;
               class Sex Treat;
               model Better = Sex | Treat | Age @2;

        ≡ Better = Sex Treat Age Sex*Treat Sex*Age Treat*Age
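The additive (logit) and multiplicative (odds) readings of the logistic model can be verified with a few lines of R. The intercept and slope below are hypothetical values chosen only for illustration; qlogis() and plogis() are the base R logit and inverse-logit functions.

```r
logit     <- function(p) log(p / (1 - p))        # same as qlogis(p)
inv.logit <- function(eta) 1 / (1 + exp(-eta))   # same as plogis(eta)

alpha <- -1.0     # hypothetical intercept
beta  <-  0.5     # hypothetical slope
x     <- 0:4

eta  <- alpha + beta * x        # linear predictor (log odds)
p    <- inv.logit(eta)          # probabilities, always inside (0, 1)
odds <- p / (1 - p)

# A unit increase in x adds beta to the logit ...
step.logit <- diff(logit(p))
# ... and multiplies the odds by exp(beta)
step.odds  <- odds[-1] / odds[-length(odds)]

print(rbind(step.logit, step.odds))
print(exp(beta))
```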

Logistic regression models: Binary response

Visualization
    Goal: see and understand the data and fitted model
    LOGODDS macro: Plot observed responses, fitted and smoothed probabilities
    Model plots:
        OUTPUT statement → fitted $\hat\pi_i$, lower/upper $(1-\alpha)$ CI, and/or fitted logit, $(\alpha + x_i^T \beta) \pm z_{1-\alpha/2}\,\mathrm{se}(\text{logit})$
        Plot with standard procedures (PROC GCHART, GPLOT)
        Utility macros (BARS, LABEL, POINTS, PSCALE, etc.) for custom displays
    Effect plots → plot hierarchical subset of effects, averaging over those not included.
    INFLOGIS macro: Influence plots for logistic regression models
    ADDVAR macro: Added variable plots for new predictors or transformations of old

Example: Arthritis treatment data

Predictors: Sex, Treatment (treated, placebo), Age
Response: improvement (none, some, marked)
Consider first as binary response: None vs. (Some or Marked) = "Better"

Data in case form:

arthrit.sas

    data arthrit;
       length treat $7. sex $6. ;
       input id treat $ sex $ age improve @@ ;
       case = _n_;
       better = (improve > 0);     *-- Make binary response;
    datalines ;
    57 Treated Male   27 1    9 Placebo Male   37 0
    46 Treated Male   29 0   14 Placebo Male   44 0
    77 Treated Male   30 0   73 Placebo Male   50 0
    ... (observations omitted)
    56 Treated Female 69 1   42 Placebo Female 66 0
    43 Treated Female 70 1   15 Placebo Female 66 1
    71 Placebo Female 68 1    1 Placebo Female 74 2
    ;

LOGODDS macro: Empirical logit plots

Problems with visualizing discrete outcomes:
    Linearity: Is a linear relation realistic?
    Smoothing: Discrete data often requires smoothing to see!

The LOGODDS macro:
    Show the data: Plot (0/1) responses [stacked or jittered]
    Divide X into groups (e.g., deciles), empirical logit, $\log\left(\frac{y_i + 1/2}{n_i - y_i + 1/2}\right)$, for each
    Linear logistic regression, plus smoothed curve (LOWESS macro)

    %include catdata(arthrit);
    %logodds(data=arthrit,
             x=age, y=Better,    /* vars to plot */
             smooth=0.5,         /* LOWESS smoothing parameter */
             plot=logit);        /* plot on logit scale */

[Figure: Log Odds Better=1 vs. AGE (20 to 80)]
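The empirical logit calculation itself is simple to sketch in R. The data below are simulated, hypothetical stand-ins for the arthritis data (the real observations are not reproduced here), and grouping by quintiles rather than deciles is an arbitrary choice for a small sample; this is not a translation of the LOGODDS macro.

```r
set.seed(42)
n      <- 84
age    <- round(runif(n, 23, 74))                   # hypothetical ages
better <- rbinom(n, 1, plogis(-2.6 + 0.05 * age))   # hypothetical 0/1 outcome

# Divide age into groups at its quintiles
breaks <- unique(quantile(age, probs = seq(0, 1, 0.2)))
grp    <- cut(age, breaks = breaks, include.lowest = TRUE)

y  <- tapply(better, grp, sum)       # successes per group
ni <- tapply(better, grp, length)    # cases per group

# Empirical logit: the +1/2 keeps it finite when y = 0 or y = n
emp.logit <- log((y + 0.5) / (ni - y + 0.5))
mid.age   <- tapply(age, grp, mean)

# Linear logistic fit, drawn on the logit scale with the group logits
fit <- glm(better ~ age, family = binomial)
plot(mid.age, emp.logit, xlim = range(age),
     ylim = range(emp.logit, predict(fit)),
     xlab = "AGE", ylab = "Log Odds Better=1")
abline(reg = fit)
```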

Smoothing the binary observations

Can also use direct smoothing:

[Figure: "Arthritis data: linear logistic and lowess smooth". Prob (Better) vs. Age]

SAS: PROC LOESS, lowess macro; R: lowess()
There is a hint that the relation may be non-linear
But data is thin at the extremes

PROC LOGISTIC: Model fitting and plotting

Specify ordering of response levels (order= or descending options)
Specify parameterizations for CLASS variables
OUTPUT statement to get fitted logits and probabilities

glogist1c.sas

    proc logistic data=arthrit descending;
       class sex (ref=last) treat (ref=first) / param=ref;
       model better = sex treat age;
       output out=results p=prob l=lower u=upper
              xbeta=logit stdxbeta=selogit / alpha=.33;

The output includes:

    Type III Analysis of Effects

                        Wald
    Effect    DF    Chi-Square    Pr > ChiSq
    sex        1        6.2576        0.0124
    treat      1       10.7596        0.0010
    age        1        5.5655        0.0183

    Analysis of Maximum Likelihood Estimates

                                       Standard          Wald
    Parameter          DF   Estimate      Error    Chi-Square   Pr > ChiSq
    Intercept           1    -4.5033     1.3074       11.8649       0.0006
    sex     Female      1     1.4878     0.5948        6.2576       0.0124
    treat   Treated     1     1.7598     0.5365       10.7596       0.0010
    age                 1     0.0487     0.0207        5.5655       0.0183

    Odds Ratio Estimates

                                  Point          95% Wald
    Effect                     Estimate     Confidence Limits
    sex   Female vs Male          4.427       1.380    14.204
    treat Treated vs Placebo      5.811       2.031    16.632
    age                           1.050       1.008     1.093

Parameter estimates (reference cell coding):
    $\beta_1 = 1.49$ ⟹ odds of improvement for Females are $e^{1.49} = 4.43$ times those for Males
    $\beta_2 = 1.76$ ⟹ odds of improvement for Treated are $e^{1.76} = 5.81$ times those for Placebo
    $\beta_3 = 0.0487$ ⟹ odds ratio = 1.05 ⟹ odds of improvement increase 5% each year.
    Over 10 years, odds of improvement are multiplied by $e^{10 \times 0.0487} = 1.63$, a 63% increase.

PROC LOGISTIC: Full-model plots

Full-model plots display the fitted (predicted) values over all combinations of predictors:
    Plot fitted values from the dataset specified on the OUTPUT statement
    Plot either predicted probabilities or logits
    Confidence intervals or standard errors allow showing error bars

The first few observations from the results dataset:

    id  sex   treat    age  better   prob   lower  upper   logit  selogit
    57  Male  Treated   27       1  0.194   0.103  0.334  -1.427    0.758
     9  Male  Placebo   37       0  0.063   0.032  0.120  -2.700    0.725
    46  Male  Treated   29       0  0.209   0.115  0.350  -1.330    0.728
    14  Male  Placebo   44       0  0.086   0.047  0.152  -2.358    0.658
    77  Male  Treated   30       0  0.217   0.122  0.357  -1.281    0.713
    73  Male  Placebo   50       0  0.112   0.065  0.188  -2.066    0.622
    ...

prob: predicted probabilities, with CI (lower, upper)
logit: predicted logit, with standard error selogit
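The odds ratios and the first fitted value can be reproduced by hand from the printed estimates. The short R sketch below only does arithmetic on the coefficients reported in the PROC LOGISTIC output; the vector names are illustrative, and small discrepancies in the last digit are expected because the printed coefficients are rounded.

```r
# Estimates as printed in the PROC LOGISTIC output (reference cell coding)
b <- c(Intercept = -4.5033, sexFemale = 1.4878,
       treatTreated = 1.7598, age = 0.0487)

# Odds ratios: exponentiated coefficients
odds.ratios <- exp(b[-1])
print(round(odds.ratios, 3))

# Effect of a 10-year difference in age on the odds
or.10yr <- unname(exp(10 * b["age"]))
print(round(or.10yr, 2))

# Fitted logit and probability for case id 57: Male, Treated, age 27
logit57 <- unname(b["Intercept"] + b["treatTreated"] + 27 * b["age"])
prob57  <- plogis(logit57)
print(round(c(logit = logit57, prob = prob57), 3))
```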

PROC LOGISTIC: Full-model plots

Basic plots:
    Plot either logit or probability vs. one predictor (continuous or most levels)
    Separate curves for one factor (= factor)
    Separate panels for all others (BY statement)

    proc gplot data=results;
       plot (logit prob) * age = treat;        /* separate curves */
       by sex;                                 /* separate panels */
       symbol1 v=circle i=join l=3 c=black;    /* placebo */
       symbol2 v=dot    i=join l=1 c=red;      /* treated */

SYMBOL statement: define the point value (v=), interpolate option (i=), line style (l=), color (c=), etc.

PROC LOGISTIC: Model plots

Enhanced plots:
    Plot on logit scale, with probability scale at right (PSCALE macro)
    Show 67% error bars ≈ ±1 se (BARS macro)
    Custom legend and panel labels (LABEL macro)

[Figure: two panels, Female and Male. Log Odds Improved, with Probability Improved on the right-hand scale, vs. Age (20 to 80); separate lines for Treated and Placebo]

PROC LOGISTIC: Full-model plots

Enhanced plots:

glogist1c.sas

    *-- Error bars, on logit scale;
    %bars(data=results, var=logit, class=age, cvar=treat, by=age,
          barlen=selogit, out=bars);

    *-- Custom legends and panel labels;
    %label(data=results, y=logit, x=age, xoff=1, cvar=treat, by=sex,
           subset=last.treat, out=label1, pos=6, text=treat);
    %label(data=results, y=2.5, x=20, size=2, by=sex,
           subset=first.sex, out=label2, pos=6, text=sex);

    *-- Probability scales at right;
    %pscale(out=pscale, byvar=sex, byval=%str('Female','Male'));

    *-- Join ANNOTATE datasets;
    data bars;
       set label1 label2 bars pscale;
    proc sort;
       by sex;

glogist1c.sas

    title ' '
          h=1.8 a=-90 'Probability Improved'   /* right axis label */
          h=2.5 a=-90 ' ';                     /* extra space */
    goptions hby=0;                            /* suppress BY values */
    proc gplot data=results;
       plot logit * age = treat /
            vaxis=axis1 haxis=axis2 hm=1 vm=1
            nolegend anno=bars frame;
       by sex;
       axis1 label=(a=90 'Log Odds Improved') order=(-3 to 3);
       axis2 order=(20 to 80 by 10) offset=(2,6);
       symbol1 v=+ i=join l=3 c=black;
       symbol2 v=- i=join l=1 c=red;
       label age='Age';
    run;

[Figure: two panels, Female and Male. Log Odds Improved, with Probability Improved on the right-hand scale, vs. Age (20 to 80); separate lines for Treated and Placebo]

Models with interactions

Plotting fitted values
    Only need to change the MODEL statement
    Output dataset automatically incorporates all model terms
    Plotting steps remain exactly the same

    proc logistic data=arthrit descending;
       class sex (ref=last) treat (ref=first) / param=ref;
       model better = treat sex | age @2;
       output out=results p=prob l=lower u=upper
              xbeta=logit stdxbeta=selogit / alpha=.33;

Effect plots: basic ideas

Show a given effect (and low-order relatives) controlling for other model effects.

Effect plots for generalized linear models: Details

For simple models, full model plots show the complete relation between response and all predictors.
Fox (1987): For complex models, often wish to plot a specific main effect or interaction (including lower-order relatives), controlling for other effects
    Fit full model to data with linear predictor (e.g., logit) $\eta = X\beta$ and link function $g(\mu) = \eta$ → estimate $b$ of $\beta$ and covariance matrix $V(b)$ of $b$.
    Vary each predictor in the term over its range
    Fix other predictors at "typical" values (mean, median, proportion in the data)
    → "effect" model matrix, $X^*$
    Calculate fitted effect values, $\hat\eta^* = X^* b$.
    Standard errors are square roots of $\mathrm{diag}(X^* V(b) X^{*T})$
    Plot $\hat\eta^*$, or values transformed back to scale of response, $g^{-1}(\hat\eta^*)$.

Note: This provides a general means to visualize interactions in all linear and generalized linear models.

Effect plots software

General method
    Create a grid of values for predictors in the effect (EXPGRID macro)
    Fix other predictors at "typical" values (mean, median, proportion in the data)
    Concatenate grid with data
    Fit model → output data set → fitted values in the grid
    Standard errors automatically calculated
    Plot fitted values in the grid

EFFPLOT macro
    Works with PROC REG, PROC GLM, PROC LOGISTIC, PROC GENMOD
    Uses MEANPLOT macro to do the plotting
    Some limitations: can't plot correct standard errors

SAS 9.3 ODS Graphics
    Several procedures now do effects-like plots: LOGISTIC, GLM, GLIMMIX
    Easy; PROC LOGISTIC quite flexible

R: effects package
    Most general: Handles linear models (lm()), generalized linear models (glm()), multinomial (multinom()) and proportional-odds (polr()) models.
    allEffects(model) calculates effects for all high-order terms in model
    plot(allEffects(model)) plots them
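The effect calculation described above (vary the focal predictor, fix the others at typical values, then compute fitted values and standard errors from the effect model matrix) can be carried out by hand in base R. This sketch is an illustration of the idea, not the implementation used by the effects package or the EFFPLOT macro. It assumes the built-in UCBAdmissions data, uses Gender as the focal predictor, and fixes the Dept dummy columns at their proportions in the data.

```r
berkeley <- as.data.frame(UCBAdmissions)
berkeley$admitted <- as.numeric(berkeley$Admit == "Admitted")

# Main-effects logit model, fitted in frequency form via case weights
mod <- glm(admitted ~ Dept + Gender, weights = Freq,
           family = binomial, data = berkeley)
b <- coef(mod)      # estimate b of beta
V <- vcov(mod)      # covariance matrix V(b)

# "Typical" values: frequency-weighted column means of the model matrix
X       <- model.matrix(mod)
w       <- berkeley$Freq
typical <- colSums(X * w) / sum(w)

# Effect model matrix X*: vary Gender, hold everything else at typical values
Xstar <- rbind(Male = typical, Female = typical)
Xstar[, "GenderFemale"] <- c(0, 1)

eta <- drop(Xstar %*% b)                          # fitted effect values (logits)
se  <- sqrt(diag(Xstar %*% V %*% t(Xstar)))       # their standard errors

# Transform back to the probability scale, with approximate 95% limits
eff <- data.frame(logit = eta, se = se,
                  prob  = plogis(eta),
                  lower = plogis(eta - 1.96 * se),
                  upper = plogis(eta + 1.96 * se))
print(round(eff, 3))
```

Because confidence limits are computed on the logit scale and then back-transformed, they stay inside (0, 1) and are asymmetric around the fitted probability.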

Effect plots: Example

Cowles and Davis (1987): Volunteering for a psychology experiment
    Predictors: Sex, Neuroticism, Extraversion
    strong interaction, Neuroticism × Extraversion

Effect plots: SAS 9.3 ODS Graphics

cowles-logistic-eff.sas

    proc logistic data=cowles outest=parm descending ;
       class Sex;
       model Volunteer = Sex Extraver | Neurot / lackfit ;
       effectplot slicefit(x=Extraver sliceby=Neurot) / at(sex=1.5) noobs;
       effectplot slicefit(x=Neurot sliceby=Extraver) / at(sex=1.5) noobs;
       effectplot contour(x=Neurot y=Extraver) / at(sex=1.5) noobs;
    run;

Effect plots: SAS 9.3 ODS Graphics

cowles-logistic-eff.sas

    proc logistic data=cowles outest=parm descending ;
       class Sex;
       model Volunteer = Sex Extraver | Neurot / lackfit ;
       effectplot contour(x=Neurot y=Extraver) / at(sex=1.5) noobs;
    run;

SAS 9.2: ODS Graphics

arthritis-logistic-ods.sas

    %include catdata(arthrit);
    ods graphics on;
    proc logistic data=arthrit descending
         plots(only)=(effect(plotby=sex sliceby=treat showobs
                             clband alpha=0.33));
       class sex (ref=last) treat (ref=first) / param=ref;
       model better = sex treat age / clodds=wald;
    run;
    ods graphics off;

Effect plots with the effects package in R

    > library(effects)     ## load the effects package
    > data(Cowles)
    > mod.cowles <- glm(volunteer ~ sex + neuroticism*extraversion,
    +                   data=Cowles, family=binomial)
    > summary(mod.cowles)

    Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
    (Intercept)              -2.358207   0.501320  -4.704 2.55e-06 ***
    sexmale                  -0.247152   0.111631  -2.214  0.02683 *
    neuroticism               0.110777   0.037648   2.942  0.00326 **
    extraversion              0.166816   0.037719   4.423 9.75e-06 ***
    neuroticism:extraversion -0.008552   0.002934  -2.915  0.00355 **
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1933.5 on 1420 degrees of freedom
    Residual deviance: 1897.4 on 1416 degrees of freedom
    AIC: 1907.4

Effect plots with the effects package in R

Calculate effects for all model terms, plot neuro:extra:

    > eff.cowles <- allEffects(mod.cowles,
    +     xlevels=list(neuroticism=0:24,
    +                  extraversion=seq(0, 24, 8)))
    >
    > plot(eff.cowles, 'neuroticism:extraversion', ylab="Prob(Volunteer)",
    +     ticks=list(at=c(.1,.25,.5,.75,.9)), layout=c(4,1), aspect=1)

[Figure: "neuroticism*extraversion effect plot". Prob(Volunteer) vs. neuroticism, in four panels by level of extraversion]

Extended example: Arrests for Marihuana Possession

Context & background
    In Dec. 2002, the Toronto Star examined the issue of racial profiling, by analyzing a data base of 600,000+ arrest records from 1996-2002.
    They focused on a subset of arrests for which police action was discretionary, e.g., simple possession of small quantities of marijuana, where the police could:
        Release the arrestee with a summons, like a parking ticket
        Bring to police station, hold for bail, etc.: harsher treatment
    Response variable: released (Yes, No)
    Main predictor of interest: skin-colour of arrestee (black, white)

Extended example: Arrests for Marihuana Possession

Data
    Control variables:
        year, age, sex
        employed, citizen: Yes, No
        checks: Number of police data bases (previous arrests, previous convictions, parole status, etc.) in which the arrestee's name was found.

    > library(effects)
    > data(Arrests)
    > some(Arrests)
         released colour year age  sex employed citizen checks
    915        No  Black 2001  35 Male      Yes     Yes      4
    1568      Yes  White 2002  21 Male      Yes     Yes      0
    2981      Yes  White 2000  23 Male      Yes     Yes      2
    3381      Yes  Black 1998  23 Male       No     Yes      2
    3516      Yes  White 2002  22 Male      Yes     Yes      0
    4128       No  White 2001  29 Male      Yes     Yes      1
    4142      Yes  Black 1998  23 Male      Yes     Yes      3
    4634      Yes  White 2001  18 Male      Yes     Yes      0
    4732      Yes  White 1999  21 Male      Yes     Yes      3
    5183      Yes  White 1999  19 Male      Yes     Yes      0

Extended example: Arrests for Marihuana Possession

Model

To allow possibly non-linear effects of year, we treat it as a factor:

    > Arrests$year <- as.factor(Arrests$year)

Logistic regression model with all main effects, plus interactions of colour:year and colour:age

    > arrests.mod <- glm(released ~ employed + citizen + checks + colour *
    +     year + colour * age, family = binomial, data = Arrests)
    > Anova(arrests.mod)

    Analysis of Deviance Table (Type II tests)

    Response: released
                LR Chisq Df Pr(>Chisq)
    employed      72.673  1  < 2.2e-16 ***
    citizen       25.783  1  3.820e-07 ***
    checks       205.211  1  < 2.2e-16 ***
    colour        19.572  1  9.687e-06 ***
    year           6.087  5  0.2978477
    age            0.459  1  0.4982736
    colour:year   21.720  5  0.0005917 ***
    colour:age    13.886  1  0.0001942 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Effect plots: colour

Evidence for different treatment of blacks and whites ("racial profiling"), controlling (adjusting) for other factors

    > plot(effect("colour", arrests.mod), multiline = FALSE,
    +     ylab = "Probability(released)")

[Figure: "colour effect plot". Probability(released) vs. colour (Black, White)]
Effect plots: Interactions

The story turned out to be more nuanced than reported by the Toronto Star, as shown in effect plots for interactions with colour.

    > plot(effect("colour:year", arrests.mod), multiline = TRUE, ...)

[Figure: "colour*year effect plot". Probability(released) vs. year (1997 to 2002); separate lines for colour (Black, White)]

Up to 2000, strong evidence for differential treatment of blacks and whites
Also evidence to support Police claim of effect of training to reduce racial effects in treatment

Effect plots: Interactions

The story turned out to be more nuanced than reported by the Toronto Star, as shown in effect plots for interactions with colour.

    > plot(effect("colour:age", arrests.mod), multiline = TRUE, ...)

[Figure: "colour*age effect plot". Probability(released) vs. age (10 to 60); separate lines for colour (Black, White)]

Opposite age effects for blacks and whites:
    Young blacks treated more harshly than young whites
    Older blacks treated less harshly than older whites
56 / 77

Eect plots

Arrests

Eect plots

Arrests

Effect plots: allEffects

All model effects can be viewed together using plot(allEffects(mod))

> arrests.effects <- allEffects(arrests.mod, xlevels = list(age = seq(15, 45, 5)))
> plot(arrests.effects, ylab = "Probability(released)", ask = FALSE)

[Figure: panels of effect plots for employed, citizen, checks, colour*year and colour*age; vertical axis Probability(released); the interaction panels are split by colour: Black, colour: White]

Effect plots: SAS

Arrests-logistic.sas
proc logistic data=arrests descending;
   class colour year sex citizen employed;
   model released = colour|year colour|age sex employed citizen checks;
   effectplot interaction (x=year sliceby=colour) / clm alpha=0.33 noobs;
   effectplot slicefit (x=age sliceby=colour) / clm alpha=0.33 obs(fringe jitter);
run;

NB: These plots are computed at average levels of quantitative variables, but at reference levels of class variables: Sex=Male, citizen=Yes, employed=Yes
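The idea behind an effect plot can be sketched in base R without the effects package: predict over a grid of the focal predictors while holding the remaining predictors at typical values. This is only an illustration under stated assumptions. The data are simulated (they are not the Arrests data), the coefficients in the simulation are made up, and the variable names simply mirror those used above.

```r
# Sketch: computing a "colour*age"-style effect by hand with predict().
# Simulated data only; NOT the Toronto Arrests data.
set.seed(1)
n <- 500
dat <- data.frame(
  colour = factor(sample(c("Black", "White"), n, replace = TRUE)),
  age    = round(runif(n, 15, 45)),
  checks = rpois(n, 1.5)
)
# Assumed (made-up) true model with a colour-by-age interaction
eta <- 1.5 + 0.02 * (dat$age - 30) * ifelse(dat$colour == "Black", 1, -1) -
  0.35 * dat$checks
dat$released <- rbinom(n, 1, plogis(eta))

mod <- glm(released ~ colour * age + checks, data = dat, family = binomial)

# Grid over the focal predictors; the other quantitative predictor
# is held at its mean.
grid <- expand.grid(
  colour = levels(dat$colour),
  age    = seq(15, 45, 5),
  checks = mean(dat$checks)
)
grid$prob <- predict(mod, newdata = grid, type = "response")

# One line per colour, analogous to multiline = TRUE
eff <- with(grid, tapply(prob, list(age, colour), mean))
matplot(as.numeric(rownames(eff)), eff, type = "b", pch = 1,
        xlab = "age", ylab = "Probability(released)")
legend("bottomright", legend = colnames(eff), lty = 1:2, col = 1:2)
```

Changing the value at which `checks` is held fixed shifts both curves, which is why the conditioning values matter when comparing effect plots from different software.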
Influence measures and diagnostic plots

Leverage: Potential impact of an individual case ~ distance from the centroid in space of predictors
Residuals: Which observations are poorly fitted?
Influence: Actual impact of an individual case ~ leverage × residual
C, CBAR: analogs of Cook's D in OLS ~ standardized change in regression coefficients when i-th case is deleted.
DIFCHISQ, DIFDEV: $\Delta\chi^2$ when i-th case is deleted.

[Figure: Arthritis treatment data. Change in Pearson chi-square vs. estimated probability and vs. leverage (hat value); bubble size: influence coefficient C]

PROC LOGISTIC: printed output with the influence option

proc logistic data=arthrit descending;
   model better = sex treat age / influence;

Too much output, doesn't highlight unusual cases, ...
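The influence measures listed above can be computed directly from a fitted glm object. The following base R sketch uses the one-step formulas given in the PROC LOGISTIC documentation (Pearson residual r, deviance residual d, leverage h). The data are simulated for illustration and are not the arthritis data; the variable names are placeholders.

```r
# Sketch: case-deletion diagnostics for a logistic regression in base R.
# Simulated data only; NOT the arthritis data.
set.seed(2)
n <- 80
dat <- data.frame(age   = round(runif(n, 25, 75)),
                  treat = rbinom(n, 1, 0.5))
dat$better <- rbinom(n, 1, plogis(-3 + 0.05 * dat$age + 1.2 * dat$treat))

mod <- glm(better ~ age + treat, data = dat, family = binomial)

h  <- hatvalues(mod)                     # leverage
rp <- residuals(mod, type = "pearson")   # Pearson residuals
rd <- residuals(mod, type = "deviance")  # deviance residuals

# One-step approximations to the effect of deleting each case
C        <- rp^2 * h / (1 - h)^2
CBAR     <- rp^2 * h / (1 - h)
DIFCHISQ <- CBAR / h                     # equals rp^2 / (1 - h)
DIFDEV   <- rd^2 + CBAR

infl <- data.frame(pred = fitted(mod), hat = h, DIFCHISQ, DIFDEV, C)

# List only the few most influential cases instead of printing everything
print(round(head(infl[order(-infl$C), ], 5), 3))
```

For a binomial model, R's `cooks.distance()` is the same quantity as C divided by the number of estimated coefficients, so the two rank cases identically.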

PROC LOGISTIC: plotting diagnostic measures with the plots option

proc logistic data=arthrit descending
     plots(only label)=(leverage dpc);
   class sex (ref=last) treat (ref=first) / param=ref;
   model better = sex treat age;
run;

Influence measures and diagnostic plots: Influence plots

The option plots(label)=dpc gives plots of $\Delta\chi^2$ (DIFCHISQ, DIFDEV) vs. $\hat{p}$
Points are colored according to the influence measure C.
The two bands of points correspond to better = {0, 1}

INFLOGIS macro

Specialized version of INFLGLIM macro for logistic regression
Plots a measure of change in $\chi^2$ (DIFCHISQ or DIFDEV) vs. predicted probability or leverage.
Bubble symbols show actual influence (C or CBAR)
Shows standard cutoffs for "large" values
Flexible labeling of unusual cases

[Figure: Arthritis treatment data. Change in Pearson chi-square vs. estimated probability and vs. leverage (hat value); bubble size: influence coefficient C]

INFLOGIS macro: Example

logist1b.sas
%include data(arthrit);
%inflogis(data=arthrit,
     class=sex treat,        /* CLASS variables  */
     y=better,               /* response         */
     x=sex treat age,        /* predictors       */
     id=case,                /* case ID          */
     gy=DIFCHISQ,            /* graph ordinate   */
     gx=PRED HAT,            /* graph abscissas  */
     loptions=descending);

Printed output lists cases with large leverage, residual or influence:

case  better  sex     treat    age  pred  hat  difchisq  difdev  c
   1       1  Male    Treated   27  .806  .09     4.578   3.695  0.451
  22       1  Male    Placebo   63  .807  .06     4.460   3.565  0.290
  30       1  Female  Placebo   31  .818  .05     4.749   3.657  0.261
  34       1  Female  Placebo   33  .803  .05     4.296   3.464  0.224
  55       0  Female  Treated   58  .172  .03     4.970   3.676  0.160
  77       0  Female  Treated   69  .108  .03     8.498   4.712  0.276

INFLOGIS macro: Example

[Figure: Arthritis treatment data. Change in Pearson chi-square vs. estimated probability; bubble size: influence coefficient C]

INFLOGIS macro: Example

[Figure: Arthritis treatment data. Change in Pearson chi-square vs. leverage (hat value); bubble size: influence coefficient C]

Diagnostic plots in R
In R, plotting a glm object gives the "regression quartet"

arth.mod1 <- glm(Better ~ Age + Sex + Treatment, data = Arthritis, family = 'binomial')
plot(arth.mod1)

[Figure: regression quartet for arth.mod1. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance contours)]

Diagnostic plots in R

library(car)
influencePlot(arth.mod1)

[Figure: Arthritis data: influencePlot. Studentized residuals vs. hat values, with unusual cases labeled]

Donner Party: A graphic tale of survival & influence

History:
Apr-May, 1846: Donner/Reed families set out from Springfield, IL to CA
Jul: Bridger's Fort, WY, 87 people, 23 wagons

Donner Party: A graphic tale of survival & influence

History:
"Hastings Cutoff", untried route through Salt Lake Desert, Wasatch Mtns. (90 people)
Worst recorded winter: Oct 31 blizzard
Missed by 1 day, stranded at "Truckee Lake" (now Donner's Lake, Reno)
Rescue parties sent out ("Dire necessity", "Forlorn hope", ...)
Relief parties from CA: 42 survivors (Mar-Apr, '47)

The Donner Party: Who lived and died?


Other analyses, e.g., (Ramsay and Schafer, 1997):
Log Odds (survive) linear with Age
Odds (survive | Women / survive | Men) = 4.9
(Ignored children)

NAME                  AGE  MALE  SURVIVED  DEATH
Antoine                23     1         0  29DEC46
Breen, Edward          13     1         1  .
Breen, Margaret I.      1     0         1  .
Breen, James            5     1         1  .
Breen, John            14     1         1  .
Breen, Mary            40     0         1  .
Breen, Patrick         51     1         1  .
Breen, Patrick Jr.      9     1         1  .
Breen, Peter            3     1         1  .
Breen, Simon            8     1         1  .
Burger, Charles        30     1         0  27DEC46
Denton, John           28     1         0  26FEB47
Dolan, Patrick         40     1         0  27DEC46
Donner, Elitha Cumi    13     0         1  .
Donner, Eliza Poor      3     0         1  .
Donner, Elizabeth      45     0         0  14MAR47
Donner, Francis E.      6     0         1  .
Donner, George         62     1         0  18MAR47
Donner, George Jr.      9     1         1  .
...

Empirical logit plots

Is a linear logistic model satisfactory for these data?
Discrete data often require smoothing to see the pattern!

%logodds(data=donner, y=Died, x=Age, smooth=0.5);

[Figure: smoothed plot of Probability Died=1 vs. Age (0 to 70)]

Relation with Age is quadratic: youngest and oldest most likely to perish.
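A comparable smoothed plot can be sketched in base R with `lowess()`. This is an illustration only: the data are simulated with a made-up U-shaped risk (they are not the Donner Party data), and the span `f = 0.5` is assumed to play a role similar to `smooth=0.5` in the macro call above.

```r
# Sketch: smoothing a binary response to judge the shape of its relation to age.
# Simulated data only; NOT the Donner Party data.
set.seed(3)
n <- 90
age <- round(runif(n, 1, 70))
# Assumed (made-up) U-shaped risk: youngest and oldest more likely to die
died <- rbinom(n, 1, plogis(-1 + 0.003 * (age - 30)^2))

# Jittered 0/1 points with a lowess curve; f is the smoother span
plot(age, jitter(died, 0.1), xlab = "Age", ylab = "Probability Died=1")
sm <- lowess(age, died, f = 0.5)
lines(sm, lwd = 2)

# Overlay the fitted linear-logistic curve for comparison
lin <- glm(died ~ age, family = binomial)
ord <- order(age)
lines(age[ord], fitted(lin)[ord], lty = 2)
legend("top", legend = c("lowess", "linear logistic"), lty = 1:2)
```

Where the smoothed curve bends away from the monotone logistic curve, a non-linear term in age is worth trying.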

Quadratic model?

Fit: Pr(Death) ~ Age + Age$^2$ + Male
Statistical evidence for Age$^2$ equivocal:
Wald $\chi^2(1) = 2.84$, p = 0.09; but LR $G^2(1) = 4.40$, p = 0.03.

...
Analysis of Maximum Likelihood Estimates

            Parameter  Standard  Wald        Pr >
Variable    Estimate   Error     Chi-Square  Chi-Square
INTERCPT    -1.7721    0.5673    9.7588      0.0018
AGE          0.0168    0.0184    0.8355      0.3607
AGE2         0.00208   0.00123   2.8439      0.0917
MALE         1.3745    0.5066    7.3617      0.0067

Males: odds of death are exp(1.3745) = 3.95 times those of females, controlling for Age, Age$^2$

Quadratic model?

Visual evidence is persuasive (but the data are thin at older ages)

[Figure: fitted Probability of Death vs. Age (0 to 70), separate curves for Men and Women]
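The Wald statistics and the odds multiplier quoted for the quadratic model follow from the printed estimates by simple arithmetic, which can be checked in base R. The numbers below are copied from the table of maximum likelihood estimates; because the printed values are rounded, the recomputed statistics may differ in the last digits.

```r
# Reproduce the Wald statistics and the odds ratio from the printed estimates.
est <- c(INTERCPT = -1.7721, AGE = 0.0168, AGE2 = 0.00208, MALE = 1.3745)
se  <- c(INTERCPT = 0.5673,  AGE = 0.0184, AGE2 = 0.00123, MALE = 0.5066)

wald <- (est / se)^2                            # Wald chi-square, 1 df each
p    <- pchisq(wald, df = 1, lower.tail = FALSE)
print(round(cbind(estimate = est, se = se, wald = wald, p = p), 4))

# Odds multiplier for males, holding Age and Age^2 fixed
or_male <- exp(est["MALE"])
print(round(or_male, 2))

# p-value for the likelihood-ratio statistic quoted for AGE2: G^2(1) = 4.40
p_lr <- pchisq(4.40, df = 1, lower.tail = FALSE)
print(round(p_lr, 3))
```

The Wald test uses only the estimate and its standard error, while the likelihood-ratio test compares the fits of the models with and without AGE2, which is why the two can disagree in a small sample.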

Who was influential?

NAME               Died  Age  M?  PRED  StuRes  Hat  DifDev  C
Breen, Patrick        0   51   1  .921  -2.365  .09    6.25  1.294
Reed, James           0   46   1  .856  -2.054  .08    4.40  0.575
Donner, Elizabeth     1   45   0  .571   1.139  .14    1.24  0.136
Donner, Tamsen        1   44   0  .541   1.183  .12    1.35  0.135
Graves, Elizabeth     1   47   0  .630   1.050  .16    1.04  0.137

Why are they influential?

Patrick Breen, James Reed: Older men who survived
Elizabeth & Tamsen Donner, Elizabeth Graves: Older women who died

Moral lessons of this story:
Don't try to cross the Donner Pass in late October; if you do, bring food
Plots of fitted models show only what is included in the model
Discrete data often need smoothing (or non-linear terms) to see the pattern
Always examine model diagnostics, preferably graphic

Summary: Part 4

Logit models
Analogous to ANOVA models for a binary response
Equivalent to loglinear model, including interaction of all predictors
Fitting: SAS: PROC CATMOD, PROC LOGISTIC; R: glm()
Visualization: plot fitted logits (or probabilities) vs. factors (CATPLOT macro)

Logistic regression
Analogous to regression models for a binary response
Coefficients: increment to log odds per unit change in X; exp(coefficient): multiplier of odds per unit change in X
Discrete responses: smoothing often useful
Visualization: plot fitted logits (or probabilities) vs. predictors

Effect plots
Plot a main effect or interaction in the context of a more complex model
Shows that effect controlling for (averaged over) all other model effects
SAS: EFFPLOT macro; R: effects package

Influence & diagnostics
Influence plots highlight unusual cases/cells with large impact on the fitted model
Probability plots of residuals help to check model assumptions
SAS: INFLGLIM macro, HALFNORM macro; R: plot(my.glm), influencePlot(my.glm)