Pages 22*-22+.
Key Features
At a very basic level, the relationship between a continuous response variable (Y) and a continuous explanatory variable (X) may be represented using a line of best fit, where Y is predicted, at least to some extent, by X. If this relationship is linear, it may be appropriately represented mathematically using the straight-line equation 'Y = α + βx', as shown in Figure 1 (this line was computed using the least-squares procedure; see Ryan, 1997). The relationship between variables Y and X is described using the equation of the line of best fit, with α indicating the value of Y when X is equal to zero (also known as the intercept) and β indicating the slope of the line (also known as the regression coefficient). The regression coefficient β describes the change in Y that is associated with a unit change in X. As can be seen from Figure 1, β only provides an indication of the average expected change (the observed data are scattered around the line), making it important to also interpret the confidence intervals for the estimate (the large-sample 95% two-tailed approximation of the confidence intervals can be calculated as β ± 1.96 s.e. β).
In addition to the model parameters and confidence intervals for β, it is useful to also have an indication of how well the model fits the data. Model fit can be determined by comparing the observed scores of Y (the values of Y from the sample of data) with the expected values of Y (the values of Y predicted by the regression equation). The difference between these two values (the deviation, or residual as it is also called) provides an indication of how well the model predicts each data point. Squaring these deviations (which removes the negative signs) and adding them up across all the data points provides a simple measure of the degree to which the data deviate from the model overall.
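The least-squares line and the large-sample confidence interval described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name fit_line and the data values are hypothetical and are not taken from the article, and the 1.96 multiplier is the large-sample approximation quoted in the text (for a sample this small a t-distribution multiplier would normally be used instead).

```python
import math


def fit_line(x, y):
    """Least-squares estimates for Y = alpha + beta * x, plus s.e.(beta).

    Hypothetical helper written for illustration only.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of X and sum of cross-products of X and Y about their means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                    # slope (regression coefficient)
    alpha = mean_y - beta * mean_x      # intercept (value of Y when X = 0)
    # Residual sum of squares around the fitted line
    rss = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    # Standard error of beta: sqrt(residual variance / Sxx), with n - 2 df
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    return alpha, beta, se_beta


# Hypothetical data, invented for this sketch
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

alpha, beta, se_beta = fit_line(x, y)
lower = beta - 1.96 * se_beta   # large-sample 95% limits: beta +/- 1.96 s.e.
upper = beta + 1.96 * se_beta
print(f"Y = {alpha:.3f} + {beta:.3f} x; 95% CI for beta: ({lower:.3f}, {upper:.3f})")
```

The interval is centred on the estimate of β, so a wide interval signals that the observed points are scattered widely around the line relative to the spread of X.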
The sum of all the squared residuals is known as the residual sum of squares (RSS) and provides a measure of model fit for an OLS regression model. A poorly fitting model will deviate markedly from the data and will consequently have a relatively large RSS, whereas a good-fitting model will not deviate markedly from the data and will consequently have a relatively small RSS (a perfectly fitting model will have an RSS equal to zero, as there will be no deviation between observed and expected values of Y). It is important to understand how the RSS statistic (or the deviance as it is also known; see Agresti, 1996, pages 96-97) operates, as it is used to determine the significance of individual variables and groups of variables in a regression model. A graphical illustration of the residuals for a simple regression model is provided in Figure 2. Detailed examples of calculating deviances from residuals for null and simple regression models can be found in Hutcheson and Moutinho, 2008.
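As a rough illustration of how the RSS separates a poorly fitting model from a well-fitting one, the sketch below computes it for a null model (every case predicted by the mean of Y) and for a straight line. The data and the line's parameters are hypothetical values chosen for the example, not figures from the article.

```python
def rss(observed, expected):
    """Residual sum of squares: squared (observed - expected), summed over cases."""
    return sum((o - e) ** 2 for o, e in zip(observed, expected))


# Hypothetical data, invented for this sketch
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

# Null model, Y = alpha: the least-squares prediction is the mean of Y for every case
mean_y = sum(y) / len(y)
rss_null = rss(y, [mean_y] * len(y))

# Simple model, Y = alpha + beta * x, with assumed parameters alpha = 0, beta = 2
alpha, beta = 0.0, 2.0
rss_line = rss(y, [alpha + beta * xi for xi in x])

print(f"RSS (null model):   {rss_null:.2f}")
print(f"RSS (simple model): {rss_line:.2f}")
```

The line leaves far less unexplained deviation than the null model, which is the comparison the significance tests in the following paragraphs build on.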
The deviance is an important statistic as it enables the contribution made by explanatory variables to the prediction of the response variable to be determined. If, by adding a variable to the model, the deviance is greatly reduced, the added variable can be said to have had a large effect on the prediction of Y for that model. If, on the other hand, the deviance is not greatly reduced, the added variable can be said to have had a small effect on the prediction of Y for that model. The change in the deviance that results from the explanatory variable being added to the model is used to determine the significance of that variable's effect on the prediction of Y in that model. To assess the effect that a single explanatory variable has on the prediction of Y, one simply compares the deviance statistics before and after the variable has been added to the model. For a simple OLS regression model, the effect of the explanatory variable can be assessed by comparing the RSS statistic for the full regression model (Y = α + βx) with that for the null model (Y = α). The difference in deviance between the nested models can then be tested for significance using an F-test computed from the following equation.
\[
F_{(df_p - df_{p+q}),\; df_{p+q}} = \frac{RSS_p - RSS_{p+q}}{(df_p - df_{p+q})\,(RSS_{p+q} / df_{p+q})}
\]
where p represents the null model, Y = α, p+q represents the model Y = α + βx, and df are the degrees of freedom associated with the designated model. It can be seen from this equation that the F-statistic is simply based on the difference in the deviances between the two models as a fraction of the deviance of the full model, whilst taking account of the number of parameters. In addition to the model-fit statistics, the R-square statistic is also commonly quoted and provides a measure that indicates the percentage of variation in the response variable that is 'explained' by the model. R-square, which is also known as the coefficient of multiple determination, is defined as
\[
R^2 = \frac{RSS_p - RSS_{p+q}}{RSS_p}
\]
and basically gives the percentage of the deviance in the response variable that can be accounted for by adding the explanatory variable into the model. Although R-square is widely used, it never decreases, and will typically increase, as variables are added to the model (the deviance can only go down when additional variables are added to a model). One solution to this problem is to calculate an adjusted R-square statistic (R²ₐ), which takes into account the number of terms entered into the model and does not necessarily increase as more terms are added. Adjusted R-square can be derived using the following equation
\[
R^2_a = R^2 - \frac{k\,(1 - R^2)}{n - k - 1}
\]
where n is the number o% cases used to construct the model and k is the number o% terms in the model (not including the constant).
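The three statistics above depend only on the deviances, the degrees of freedom, n and k, so they can be sketched as small Python functions. The function names and the RSS values below are hypothetical, and R-square is taken here as the proportional reduction in deviance relative to the null model, (RSS_p - RSS_p+q) / RSS_p; that reading is an assumption based on the surrounding description.

```python
def f_statistic(rss_p, df_p, rss_pq, df_pq):
    """F for nested models: change in deviance per df, over full-model deviance per df."""
    return ((rss_p - rss_pq) / (df_p - df_pq)) / (rss_pq / df_pq)


def r_square(rss_p, rss_pq):
    """Proportion of the null-model deviance removed by the fuller model."""
    return (rss_p - rss_pq) / rss_p


def adjusted_r_square(r2, n, k):
    """R-square penalised for k terms (constant excluded) fitted to n cases."""
    return r2 - k * (1 - r2) / (n - k - 1)


# Hypothetical deviances for n = 20 cases, invented for this sketch
n = 20
rss_null, df_null = 100.0, 19   # Y = alpha
rss_one, df_one = 40.0, 18      # one explanatory variable
rss_two = 39.5                  # a second, nearly useless, variable added

f_one = f_statistic(rss_null, df_null, rss_one, df_one)
r2_one = r_square(rss_null, rss_one)
r2_two = r_square(rss_null, rss_two)
adj_one = adjusted_r_square(r2_one, n, k=1)
adj_two = adjusted_r_square(r2_two, n, k=2)

print(f"F for the first variable: {f_one:.2f}")
print(f"R-square:          {r2_one:.3f} -> {r2_two:.3f}")
print(f"Adjusted R-square: {adj_one:.3f} -> {adj_two:.3f}")
```

In this made-up case the second variable barely lowers the deviance, so R-square creeps up while adjusted R-square falls, which is the behaviour the adjustment is meant to expose.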
(calculated using software) is Ice cream consumption = 0.207 + 0.003 temperature. The parameter for α (0.207) indicates the predicted consumption when temperature is equal to zero. It should be noted that although the parameter α is required to make predictions of ice cream consumption at any given temperature, the prediction of consumption at a temperature of zero might be of limited usefulness, particularly when the observed data do not include a temperature of zero in their range (predictions should only be made within the limits of the sampled values). The parameter β indicates that for each unit increase in temperature, ice cream consumption increases by 0.003 units. The significance of the relationship between temperature and ice cream consumption can be estimated by comparing the deviance statistics for the two nested models in the table below: one that includes temperature and one that does not. This difference in deviance can be assessed for significance using the F-statistic.
Model                              deviance (RSS)   df   change in deviance   F-statistic   P-value
consumption = α                    0.1255           29
consumption = α + β temperature    0.0500           28   0.0755               42.28         <.0001
On the basis of this analysis, outdoor temperature would appear to be significantly related to ice cream consumption, with each unit increase in temperature being associated with an increase of 0.003 units in ice cream consumption. Using these statistics it is a simple matter to also compute the R-square statistic for this model, which is 0.0755/0.1255, or 0.60. Temperature "explains" 60% of the deviance in ice cream consumption (i.e., when temperature is added to the model, the deviance in the Y variable is reduced by 60%).
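The reported statistics can be checked by hand from the deviances and degrees of freedom given for the two models (RSS 0.1255 on 29 df for the null model, RSS 0.0500 on 28 df for the model including temperature). The adjusted R-square line is not reported in the text; it assumes n = 30 cases, inferred from the 29 degrees of freedom of the null model, and k = 1.

\[
F_{1,\,28} = \frac{0.1255 - 0.0500}{(29 - 28)\,(0.0500 / 28)} = \frac{0.0755}{0.001786} \approx 42.28
\]
\[
R^2 = \frac{0.1255 - 0.0500}{0.1255} = \frac{0.0755}{0.1255} \approx 0.60
\]
\[
R^2_a = 0.6016 - \frac{1 \times (1 - 0.6016)}{30 - 1 - 1} \approx 0.59
\]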
Within the range of the data collected in this study, temperature and income appear to be significantly related to ice cream consumption.
Conclusion
OLS regression is one of the major techniques used to analyse data and forms the basis of many other techniques (for example ANOVA and the generalised linear models; see Rutherford, 2001). The usefulness of the technique can be greatly extended with the use of dummy variable coding to include grouped explanatory variables (see Hutcheson and Moutinho, 2008, for a discussion of the analysis of experimental designs using regression) and data transformation methods (see, for example, Fox, 2002). OLS regression is particularly powerful as it is relatively easy to also check the model assumptions, such as linearity, constant variance and the effect of outliers, using simple graphical methods (see Hutcheson and Sofroniou, 1999).
Further Reading
Agresti, A. (1996). An Introduction to Categorical Data Analysis. John Wiley and Sons, Inc.
Fox, J. (2002). An R and S-Plus Companion to Applied Regression. London: Sage Publications.
Hutcheson, G. D. and Moutinho, L. (2008). Statistical Modeling for Management. Sage Publications.
Hutcheson, G. D. and Sofroniou, N. (1999). The Multivariate Social Scientist. London: Sage Publications.
Koteswara, R. K. (1970). Testing for the Independence of Regression Disturbances. Econometrica, 38: 97-117.
Rutherford, A. (2001). Introducing ANOVA and ANCOVA: a GLM approach. London: Sage Publications.
Ryan, T. P. (1997). Modern Regression Methods. Chichester: John Wiley and Sons.
Graeme Hutcheson
Manchester University