Learn.
Chapter 12
Section 12.1
How Can We Use Several Variables to Predict a Response?
Regression Models
The model that contains only two variables, x and y, is called a bivariate model. Its regression equation is:
\[ \mu_y = \alpha + \beta x \]
Suppose there are two predictors, denoted by x1 and x2. This is called a multiple regression model.
The regression equation for this multiple regression model with two predictors is:
\[ \mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 \]
The multiple regression model relates the mean μy of a quantitative response variable y to a set of explanatory variables x1, x2, .... With three explanatory variables:
\[ \mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \]
The corresponding sample prediction equation is:
\[ \hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 \]
Agresti/Franklin Statistics, 9 of 141
The house selling prices data set contains observations on 100 home sales in Florida in November 2003. A multiple regression analysis was done with selling price as the response variable and with house size (x1) and lot size (x2) as the explanatory variables.
Prediction equation:
\[ \hat{y} = -10{,}536 + 53.8x_1 + 2.84x_2 \]
The residual tells us that the actual selling price was $37,724 higher than predicted
You should not use many explanatory variables in a multiple regression model unless you have lots of data. A rough guideline is that the sample size n should be at least 10 times the number of explanatory variables.
Plotting Relationships
Always look at the data before doing a multiple regression.
Most software has the option of constructing scatterplots on a single graph for each pair of variables.
The simplest way to interpret a multiple regression equation looks at it in two dimensions as a function of a single explanatory variable. We can look at it this way by fixing values for the other explanatory variable(s).
For example, fixing the lot size at x2 = 30,000 sq. ft. gives:
\[ \hat{y} = 74{,}676 + 53.8x_1 \]
Since the slope coefficient of x1 is 53.8, the predicted selling price for houses with a lot size of 30,000 sq. ft. increases by $53.80 for every square foot increase in house size.
In summary, an increase of a square foot in house size has a larger impact on the selling price ($53.80) than an increase of a square foot in lot size ($2.84). We can compare slopes for these explanatory variables because their units of measurement are the same (square feet). Slopes cannot be compared when the units differ.
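As a small check on the slope interpretation, the sketch below evaluates the fixed-lot-size line quoted in the text (74,676 + 53.8x1) at a few house sizes. The house sizes and the function name are hypothetical, chosen only for illustration.

```python
# Prediction line from the text for homes on a 30,000 sq. ft. lot.
def predicted_price_lot_30000(house_size_sqft):
    return 74_676 + 53.8 * house_size_sqft

# The house sizes below are hypothetical, chosen only for illustration.
for size in (1500, 2000, 2500):
    increase = predicted_price_lot_30000(size + 1) - predicted_price_lot_30000(size)
    print(f"{size} sq. ft.: one more square foot adds ${increase:.2f}")

# Slopes quoted in the text, both in dollars per square foot.
house_slope = 53.8
lot_slope = 2.84
ratio = house_slope / lot_slope
print(f"House-size slope is about {ratio:.0f} times the lot-size slope")
```

The increase is the same at every starting size, which is what a constant slope means. The ratio is meaningful only because both slopes are measured in dollars per square foot.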
The multiple regression model assumes that the slope for a particular explanatory variable is identical for all fixed values of the other explanatory variables.
In multiple regression, a slope describes the effect of an explanatory variable while controlling effects of the other explanatory variables in the model.
Bivariate regression has only a single explanatory variable. A slope in bivariate regression describes the effect of that variable while ignoring all other possible explanatory variables.
One of the main uses of multiple regression is to identify potential lurking variables and control for them by including them as explanatory variables in the model.
For all students at Walden Univ., the prediction equation for y = college GPA and x1= H.S. GPA and x2= study time is:
Find the predicted college GPA of a student who has a H.S. GPA of 3.5 and who studies 3 hrs. per day. a. 3.67 b. 3.005 c. 3.175 d. 3.4
For all students at Walden Univ., the prediction equation for y = college GPA and x1= H.S. GPA and x2= study time is:
For students with fixed study time, what is the change in predicted college GPA when H.S. GPA increases from 3.0 to 4.0? a. 1.13 b. 0.0078 c. 0.643 d. 1.00
Section 12.2 Extending the Correlation and R-Squared for Multiple Regression
Multiple Correlation
To summarize how well a multiple regression model predicts y, we analyze how well the observed y values correlate with the predicted ŷ values. The multiple correlation is the correlation between the observed y values and the predicted ŷ values. It is denoted by R.
For each subject, the regression equation provides a predicted value ŷ. Each subject has an observed y-value and a predicted y-value.
The correlation computed between all pairs of observed y-values and predicted y-values is the multiple correlation, R. The larger the multiple correlation, the better are the predictions of y by the set of explanatory variables.
The R-value always falls between 0 and 1. In this way, the multiple correlation R differs from the bivariate correlation r between y and a single variable x, which falls between -1 and +1.
R-squared
For predicting y, the square of R describes the relative improvement from using the prediction equation instead of using the sample mean, ȳ.
The error in using the prediction equation to predict y is summarized by the residual sum of squares:
\[ \sum (y - \hat{y})^2 \]
The error in using ȳ to predict y is summarized by the total sum of squares:
\[ \sum (y - \bar{y})^2 \]
The proportional reduction in error is:
\[ R^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \]
The better the predictions are using the regression equation, the larger R² is. For multiple regression, R² is the square of the multiple correlation, R.
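The two descriptions of R² (proportional reduction in error, and the square of the correlation between observed and predicted values) can be compared numerically. The sketch below is a minimal illustration in plain Python. The data are hypothetical, and `fit_least_squares` is a small helper written for this example that solves the least squares normal equations; it is not the output of any particular statistics package.

```python
import math

def fit_least_squares(rows, y):
    """Return [a, b1, ..., bk] minimizing the residual sum of squares."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def correlation(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical data: (x1, x2) pairs and a response y.
rows = [(1.2, 10), (1.5, 8), (1.7, 15), (2.0, 12),
        (2.4, 20), (2.8, 18), (3.0, 25), (3.5, 22)]
y = [110, 125, 150, 160, 205, 215, 250, 265]

coefs = fit_least_squares(rows, y)
y_hat = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in rows]
y_bar = sum(y) / len(y)

total_ss = sum((yi - y_bar) ** 2 for yi in y)
residual_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
r_squared = (total_ss - residual_ss) / total_ss
multiple_r = correlation(y, y_hat)

print(f"R-squared from sums of squares: {r_squared:.4f}")
print(f"Square of multiple correlation: {multiple_r ** 2:.4f}")
```

For a least squares fit that includes an intercept, the two printed values agree up to rounding error.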
For the 100 observations on y = selling price, x1 = house size, and x2 = lot size, a table called the ANOVA (analysis of variance) table was created. The table displays the sums of squares in the SS column.
The R² value can be computed from the sums of squares in the table:
\[ R^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = 0.711 \]
Using house size and lot size together to predict selling price reduces the prediction error by 71%, relative to using ȳ alone to predict selling price.
\[ R = \sqrt{R^2} = \sqrt{0.711} = 0.84 \]
There is a strong association between the observed and the predicted selling prices. House size and lot size very much help us to predict selling prices.
If we used a bivariate regression model to predict selling price with house size as the predictor, the r² value would be 0.58.
If we used a bivariate regression model to predict selling price with lot size as the predictor, the r² value would be 0.51.
The multiple regression model has R² = 0.71, so it provides better predictions than either bivariate model.
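The arithmetic behind these comparisons is short enough to check directly. The sketch below uses only the R² and r² values quoted in the text; the variable names are illustrative.

```python
import math

r2_multiple = 0.711   # house size and lot size together
r2_house = 0.58       # house size alone
r2_lot = 0.51         # lot size alone

# Multiple correlation is the positive square root of R-squared.
multiple_r = math.sqrt(r2_multiple)
print(f"R = {multiple_r:.2f}")
print(f"Error reduction relative to the sample mean: {r2_multiple:.0%}")

# How much R-squared improves over the better of the two bivariate models.
gain_over_best_single = r2_multiple - max(r2_house, r2_lot)
print(f"Gain over the best single predictor: {gain_over_best_single:.3f}")
```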
Properties of R²
The previous example showed that R² for the multiple regression model was larger than r² for a bivariate model using only one of the explanatory variables. A key property of R² is that it cannot decrease when predictors are added to a model.
R² falls between 0 and 1. The larger the value, the better the explanatory variables collectively predict y.
R² = 1 only when all residuals are 0, that is, when all regression predictions are perfect.
R² = 0 when the correlation between y and each explanatory variable equals 0.
R² gets larger, or at worst stays the same, whenever an explanatory variable is added to the multiple regression model.
The value of R² does not depend on the units of measurement.
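The property that R² cannot decrease when a predictor is added can be seen on a small example. The sketch below fits one, two, and three predictors to hypothetical data using a plain-Python least squares helper written for this illustration (`fit_least_squares` and `r_squared` are not from any library), and prints R² for each model.

```python
def fit_least_squares(rows, y):
    """Return [a, b1, ..., bk] minimizing the residual sum of squares."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(rows, y):
    coefs = fit_least_squares(rows, y)
    y_hat = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in rows]
    y_bar = sum(y) / len(y)
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    residual_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return (total_ss - residual_ss) / total_ss

# Hypothetical data; x3 is an arbitrary extra predictor.
x1 = [1.2, 1.5, 1.7, 2.0, 2.4, 2.8, 3.0, 3.5]
x2 = [10, 8, 15, 12, 20, 18, 25, 22]
x3 = [3, 1, 4, 1, 5, 9, 2, 6]
y = [110, 125, 150, 160, 205, 215, 250, 265]

r2_one = r_squared([(a,) for a in x1], y)
r2_two = r_squared(list(zip(x1, x2)), y)
r2_three = r_squared(list(zip(x1, x2, x3)), y)

for label, value in (("x1", r2_one), ("x1, x2", r2_two), ("x1, x2, x3", r2_three)):
    print(f"Predictors {label}: R-squared = {value:.4f}")
```

Each added predictor leaves R² the same or larger, even when the new variable has little to do with y. This is why a rising R² alone does not show that an added predictor is useful.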
The single predictor in the data set that is most strongly associated with y is the house's real estate tax assessment (r² = 0.679).
When we add house size as a second predictor, R² goes up from 0.679 to 0.730. As other predictors are added, R² continues to go up, but not by much.
R² does not increase much after a few predictors are in the model. When there are many explanatory variables but the correlations among them are strong, once you have included a few of them in the model, R² usually doesn't increase much more when you add additional ones.
This does not mean that the additional variables are uncorrelated with the response variable. It merely means that they don't add much new power for predicting y, given the values of the predictors already in the model.
In a data set used to predict body weight (in pounds), three predictors were used: height, percent body fat, and age. Their correlations with total body weight were:
Height: 0.745
Percent body fat: 0.390
Age: -0.187
Which explanatory variable gives by itself the best prediction of weight?
a. Height b. Percent body fat c. Age
With height as the sole predictor, what is r²?
a. 0.745 b. 0.555 c. 0.625 d. 0.825
If percent body fat is added to the model, R² = 0.66. If age is then added to the model, R² = 0.67. Once you know height and % body fat, does age seem to help in predicting weight?
a. No b. Yes
Assumptions required when using a multiple regression model to make inferences about the population:
The data were gathered using randomization.
The response variable y has a normal distribution at each combination of values of the explanatory variables, with the same standard deviation.
Consider a particular parameter, β1. If β1 = 0, the mean of y is identical for all values of x1, at fixed values of the other explanatory variables.
So, H0: β1 = 0 states that y and x1 are statistically independent, controlling for the other variables. This means that once the other explanatory variables are in the model, it doesn't help to have x1 in the model.
Assumptions:
Each explanatory variable has a straight-line relation with μy, with the same slope for all combinations of values of other predictors in the model.
Data gathered with randomization.
Normal distribution for y with same standard deviation at each combination of values of other predictors in model.
Test statistic:
\[ t = \frac{b_1 - 0}{se} \]
The College Athletes data set comes from a study of 64 University of Georgia female athletes. The study measured several physical characteristics, including total body weight in pounds (TBW), height in inches (HGT), the percent of body fat (%BF), and age.
The results of fitting a multiple regression model for predicting weight using the other variables:
Let ŷ = predicted weight, x1 = height, x2 = % body fat, and x3 = age. Then:
\[ \hat{y} = -97.7 + 3.43x_1 + 1.36x_2 - 0.96x_3 \]
For athletes having fixed values for x1 and x2, the predicted weight decreases by 0.96 pounds for a 1-year increase in age, and the ages vary only between 17 and 23.
Conduct a hypothesis test to determine whether age helps to predict weight, if you already know height and percent body fat.
The 64 female athletes were a convenience sample, not a random sample. Caution should be taken when making inferences about all female college athletes.
1. Hypotheses: H0: β3 = 0, Ha: β3 ≠ 0
2. Test statistic:
\[ t = \frac{b_3 - 0}{se} \]
Age does not significantly predict weight if we already know height and % body fat.
A 95% confidence interval for a slope parameter is:
\[ \text{estimated slope} \pm t_{.025}(se) \]
Construct and interpret a 95% CI for β3, the effect of age while controlling for height and % body fat.
At fixed values of x1 and x2, we infer that the population mean of weight changes very little (and maybe not at all) for a 1-year increase in age. The confidence interval contains 0, so age may have no effect on weight once we control for height and % body fat.
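The interval formula can be sketched in a few lines. The slope estimate (-0.96) and sample size (64) come from the text. The standard error below is a HYPOTHETICAL value used only to show the mechanics, since the text does not report it, and the degrees of freedom assume the usual convention df = n minus the number of regression parameters.

```python
b_age = -0.96          # estimated slope for age, from the prediction equation
n, n_params = 64, 4    # 64 athletes; intercept plus three slopes
df = n - n_params      # assumes df = n - (number of parameters)
t_025 = 2.000          # t-table value with right-tail probability 0.025 at df = 60
se_age = 0.65          # HYPOTHETICAL standard error, for illustration only

margin = t_025 * se_age
lower, upper = b_age - margin, b_age + margin
contains_zero = lower <= 0 <= upper

print(f"df = {df}")
print(f"95% CI for the age slope: ({lower:.2f}, {upper:.2f})")
print(f"Interval contains 0: {contains_zero}")
```

When the interval contains 0, a slope of 0 (no effect of age at fixed height and % body fat) is plausible, which matches the conclusion of the two-sided test at the 0.05 level.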
A standard deviation parameter, σ, describes variability of the observations around the regression equation. Its sample estimate is:
\[ s = \sqrt{\frac{\text{Residual SS}}{df}} = \sqrt{\frac{\sum (y - \hat{y})^2}{df}} \]
For female athletes at particular values of height, % of body fat, and age, estimate the standard deviation of their weights.
Begin by finding the Mean Square Error:
\[ \text{MSE} = \frac{\text{Residual SS}}{df} = 102.2 \]
Notice that this value (102.2) appears in the MS column in the ANOVA table.
\[ s = \sqrt{102.2} = 10.1 \]
This value is also displayed in the ANOVA table. For athletes with certain fixed values of height, % body fat, and age, the weights vary with a standard deviation of about 10 pounds.
If the conditional distributions of weight are approximately bell-shaped, about 95% of the weight values fall within about 2s = 20 pounds of the true regression line.
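The step from the mean square error to s is a single square root. The sketch below reproduces it from the MSE quoted in the text. The degrees of freedom and the implied residual sum of squares are derived here under the usual convention df = n minus the number of regression parameters, which is an assumption rather than a figure taken from the ANOVA table.

```python
import math

mse = 102.2            # mean square error, from the text
n, n_params = 64, 4    # 64 athletes; intercept plus three slopes
df = n - n_params      # assumes df = n - (number of parameters)

s = math.sqrt(mse)                 # residual standard deviation
implied_residual_ss = mse * df     # since MSE = Residual SS / df

print(f"s = {s:.1f} pounds")
print(f"2s = {2 * s:.0f} pounds")
print(f"Implied residual SS = {implied_residual_ss:.0f} on {df} df")
```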
H0: β1 = β2 = β3 = 0
When H0 is true, the expected value of the F test statistic is approximately 1. When H0 is false, F tends to be larger than 1. The larger the F test statistic, the stronger the evidence against H0.
2. Hypotheses: H0: β1 = β2 = β3 = 0
3. Test statistic:
\[ F = \frac{\text{mean square for regression}}{\text{mean square error}} \]
For the 64 female college athletes, the regression model for predicting y = weight using x1 = height, x2 = % body fat, and x3 = age is summarized in the ANOVA table on the next page.
H0: β1 = β2 = β3 = 0
The observed F statistic is 40.48. The corresponding P-value is 0.000. We can reject H0 at the 0.05 significance level. We conclude that at least one predictor has an effect on weight.
The F-test tells us that at least one explanatory variable has an effect. If the explanatory variables are chosen sensibly, at least one should have some predictive power. The F-test result tells us whether there is sufficient evidence to make it worthwhile to consider the individual effects, using t-tests.
The individual t-tests identify which of the variables are significant (controlling for the other variables).
If a variable turns out not to be significant, it can be removed from the model. In this example, age can be removed from the model.
1. The regression equation approximates well the true relationship between the predictors and the mean of y.
2. The data were gathered randomly.
3. y has a normal distribution with the same standard deviation at each combination of predictors.
To test Assumption 3 (the conditional distribution of y is normal at any fixed values of the explanatory variables):
Construct a histogram of the standardized residuals. The histogram should be approximately bell-shaped.
Nearly all the standardized residuals should fall between -3 and +3. Any residual outside these limits is a potential outlier.
For the house selling price data, a MINITAB histogram of the standardized residuals for the multiple regression model predicting selling price by the house size and the lot size was created and is displayed on the following page.
The residuals are roughly bell-shaped about 0. They fall between about -3 and +3. No severe nonnormality is indicated.
Plots of residuals against each explanatory variable help us check for potential problems with the regression model. Ideally, the residuals should fluctuate randomly about 0. There should be no obvious change in trend or change in variation as the values of the explanatory variable increase.
Indicator Variables
Regression models specify categories of a categorical explanatory variable using artificial variables, called indicator variables. The indicator variable for a particular category is binary.
In the house selling prices data set, the city region in which a house is located is a categorical variable. The indicator variable x for region equals 1 for homes in the NW and 0 for homes not in the NW.
The coefficient of the indicator variable x is the difference between the mean selling prices for homes in the NW and for homes not in the NW.
Output from the regression model for selling price of home using house size and region
Find and plot the lines showing how predicted selling price varies as a function of house size, for homes in the NW and for homes not in the NW.
\[ \hat{y} = -15{,}258 + 78.0x_1 + 30{,}569x_2 \]
For homes not in the NW, x2 = 0. The prediction equation then simplifies to:
\[ \hat{y} = -15{,}258 + 78.0x_1 \]
For homes in the NW, x2 = 1. The prediction equation then simplifies to:
\[ \hat{y} = 15{,}311 + 78.0x_1 \]
Both lines have the same slope, 78. For homes in the NW and for homes not in the NW, the predicted selling price increases by $78 for each square-foot increase in house size. The figure portrays a separate line for each category of region (NW, not NW).
The coefficient of the indicator variable is 30,569. For any fixed value of house size, we predict that the selling price is $30,569 higher for homes in the NW.
The line for homes in the NW is above the line for homes not in the NW. The predicted selling price is higher for homes in the NW. The P-value of 0.000 for the test for the coefficient of the indicator variable suggests that this difference is statistically significant.
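The two parallel lines can be checked by coding the prediction equation with the indicator variable directly. The coefficients below are the ones quoted in the text; the function name and the house sizes are hypothetical, chosen for illustration.

```python
def predicted_price(house_size_sqft, in_nw):
    """Prediction equation with an indicator for the NW region."""
    x2 = 1 if in_nw else 0
    return -15_258 + 78.0 * house_size_sqft + 30_569 * x2

# Intercept of the NW line: the common intercept plus the indicator coefficient.
nw_intercept = predicted_price(0, True)
print(f"NW intercept: {nw_intercept:,.0f}")

# The gap between the two lines is the same at every house size.
for size in (1000, 2000, 3000):   # hypothetical house sizes
    gap = predicted_price(size, True) - predicted_price(size, False)
    print(f"{size} sq. ft.: NW minus not-NW = ${gap:,.0f}")
```

Because the model has no interaction term, the indicator shifts the line up or down without changing its slope.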
Is There Interaction?
For two explanatory variables, interaction exists between them in their effects on the response variable when the slope of the relationship between μy and one of them changes as the value of the other changes.
Examples of a categorical response variable with two possible outcomes:
A voter's choice in an election (Democrat or Republican), with explanatory variables: annual income, political ideology, religious affiliation, and race.
Whether a credit card holder pays their bill on time (yes or no), with explanatory variables: family income and the number of months in the past year that the customer paid the bill on time.
Denote the possible outcomes for y as 0 and 1. Use the generic terms failure (for outcome = 0) and success (for outcome = 1).
The population mean of the scores equals the population proportion of 1 outcomes (successes). That is, μy = p.
The proportion, p, also represents the probability that a randomly selected subject has a success outcome.
A straight-line model is usually inadequate. A more realistic model has a curved S-shape instead of a straight-line trend.
A regression equation for an S-shaped curve for the probability of success p is:
\[ p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} \]
An Italian study with 100 randomly selected Italian adults considered factors that are associated with whether a person possesses at least one travel credit card. The table on the next page shows results for the first 15 people on this response variable and on the person's annual income (in thousands of euros).
Let x = annual income and let y = whether the person possesses a travel credit card (1 = yes, 0 = no).
Substituting the α and β estimates into the logistic regression model formula yields:
\[ \hat{p} = \frac{e^{-3.52 + 0.105x}}{1 + e^{-3.52 + 0.105x}} \]
Find the estimated probability of possessing a travel credit card at the lowest and highest annual income levels in the sample, which were x = 12 and x = 65.
For x = 12 thousand euros, the estimated probability of possessing a travel credit card is:
\[ \hat{p} = \frac{e^{-2.26}}{1 + e^{-2.26}} = 0.09 \]
For x = 65 thousand euros, the estimated probability of possessing a travel credit card is:
\[ \hat{p} = \frac{e^{3.305}}{1 + e^{3.305}} = 0.97 \]
Annual income has a strong positive effect on having a credit card. The estimated probability of having a travel credit card changes from 0.09 to 0.97 as annual income changes over its range.
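The logistic prediction equation is easy to evaluate directly. The sketch below uses the rounded estimates quoted in the text (-3.52 and 0.105); the function name is hypothetical.

```python
import math

def prob_travel_card(income_thousands):
    """Estimated probability of holding a travel credit card at a given income."""
    linear = -3.52 + 0.105 * income_thousands
    return math.exp(linear) / (1 + math.exp(linear))

p_low = prob_travel_card(12)    # lowest income in the sample
p_high = prob_travel_card(65)   # highest income in the sample

print(f"x = 12: estimated probability = {p_low:.3f}")
print(f"x = 65: estimated probability = {p_high:.3f}")
```

With these rounded coefficients the value at x = 65 comes out slightly below the 0.97 quoted above. The small gap is presumably due to rounding in the reported estimates. The overall picture is unchanged: the estimated probability rises from under 0.10 to over 0.95 across the income range.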
A three-variable contingency table from a survey of senior high-school students is shown on the next page. The students were asked whether they had ever used alcohol, cigarettes, or marijuana.
Let y indicate marijuana use, coded (1 = yes, 0 = no).
Let x1 be an indicator variable for alcohol use (1 = yes, 0 = no).
Let x2 be an indicator variable for cigarette use (1 = yes, 0 = no).
With a, b1, and b2 denoting the estimated coefficients, the logistic regression prediction equation has the form:
\[ \hat{p} = \frac{e^{a + b_1 x_1 + b_2 x_2}}{1 + e^{a + b_1 x_1 + b_2 x_2}} \]
For those who have not used alcohol or cigarettes, x1 = x2 = 0 and:
\[ \hat{p} = \frac{e^{a}}{1 + e^{a}} = 0.005 \]
For those who have used alcohol and cigarettes, x1 = x2 = 1 and:
\[ \hat{p} = \frac{e^{a + b_1 + b_2}}{1 + e^{a + b_1 + b_2}} = 0.628 \]
The probability that students have tried marijuana seems to depend greatly on whether they've used alcohol and cigarettes.