AUI

EMBA
Session: 2010-2012

Advanced Quantitative Methods & Statistics


Case Study 2
Rachid ZAÏR
zair@one.ma

Managerial Report
1. Simple linear regression models
1.1. Annual amount charged as a function of annual income
▪ Scatter plot

▪ Simple regression equation


Annual Amount Charged = 2203.999 + 40.479 × Annual Income ($1000s)
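
As a rough illustration of how the fitted equation above is used, the sketch below evaluates it for a given income. Only the two coefficients come from the report; the function name and the example income of 50 (that is, $50,000) are assumptions made for illustration.

```python
# Coefficients reported in the simple regression equation above.
INTERCEPT = 2203.999
SLOPE_INCOME = 40.479  # dollars charged per additional $1000 of annual income


def predict_from_income(income_thousands: float) -> float:
    """Predicted Annual Amount Charged for an income expressed in $1000s."""
    return INTERCEPT + SLOPE_INCOME * income_thousands


if __name__ == "__main__":
    # Hypothetical household earning $50,000 per year (not from the case data).
    print(round(predict_from_income(50), 3))
```

The slope reads as the expected change in the Annual Amount Charged for each additional $1000 of income, so predictions are only meaningful within the income range covered by the sample.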

▪ Validity of the model


• Inference about the slope

Since the p-value = 9.01×10⁻⁷ < 5%, the linear relationship is accepted (meaning that there is
sufficient evidence to affirm at the 95% confidence level that Annual Income affects the Annual Amount
Charged).
• Measures of errors
○ R-square

R² = 0.3981. Though acceptable, the linear model accounts for only 39.81% of the variation of the
Annual Amount Charged.
○ Standard error

S=731.731. This is a relatively high standard error as it represents about 18% of the mean Annual
Amount Charged.
• Graphical analysis of residuals
○ Linearity, independence and equal variance assumptions could be checked graphically on the plot
of residuals:

○ Normality assumption could be checked visually through the normal probability plot of residuals:

1.2. Annual amount charged as a function of household size


▪ Scatter plot

▪ Simple regression equation


Annual Amount Charged = 2581.941 + 404.128 × Household Size

▪ Validity of the model
• Inference about the slope

Since the p-value = 2.86×10⁻¹⁰ < 5%, the linear relationship is accepted (meaning that
there is sufficient evidence to state at the 5% significance level that Household Size affects the Annual
Amount Charged).
• Measures of errors
○ R-square

R² = 0.5667.
The linear model accounts for 56.67% of the variation of the Annual Amount Charged. Household Size
is a better predictor of the Annual Amount Charged than Annual Income.
○ Standard error

S=620.793. Though lower than the standard error of the previous model, this standard error is still
relatively high as it represents about 15% of the mean Annual Amount Charged.
• Graphical analysis of residuals
○ Linearity, independence and equal variance assumptions could be checked graphically on the
residuals plot:

○ Normality assumption could also be checked through the normal probability plot of residuals:

2. Multiple linear regression model


▪ Multiple regression equation
Annual Amount Charged = 1304.904 + 33.133 × Annual Income ($1000s) + 356.295 × Household Size
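
To make the reading of the two slopes concrete, here is a minimal sketch that evaluates the multiple regression equation above. The coefficients are those of the report; the function name and the example household ($50,000 income, 3 members) are hypothetical.

```python
# Coefficients from the multiple regression equation reported above.
B0 = 1304.904          # intercept
B_INCOME = 33.133      # per additional $1000 of annual income
B_HOUSEHOLD = 356.295  # per additional household member


def predict_amount_charged(income_thousands: float, household_size: int) -> float:
    """Predicted Annual Amount Charged from income ($1000s) and household size."""
    return B0 + B_INCOME * income_thousands + B_HOUSEHOLD * household_size


if __name__ == "__main__":
    # Hypothetical household: $50,000 income and 3 members.
    base = predict_amount_charged(50, 3)
    # One more member at the same income shifts the prediction by B_HOUSEHOLD.
    plus_one = predict_amount_charged(50, 4)
    print(round(base, 3), round(plus_one - base, 3))
```

Each slope is interpreted with the other variable held constant, which is why the difference between the two predictions equals the Household Size coefficient.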

▪ Validity of the model


• Inference about the slopes
P-values for the intercept and the slopes of the two independent variables (Annual Income and
Household Size) are all well below the significance level as indicated by this excerpt:
                   p-value
Intercept          3.28664E-08
Income ($1000s)    7.68206E-11
Household Size     3.12342E-14

The linear relationship is thus accepted (meaning that there is sufficient evidence to state at the 95%
confidence level that Annual Income and Household Size collectively affect the Annual Amount
Charged).
• Measures of errors
R-square             0.825561086
Adjusted R-square    0.818138154
Standard error       398.0910071

The adjusted R-square of the fit is relatively high (> 80%) and is notably higher than either of the
R-square values obtained for the previous simple regression models.
In combination, Annual Income and Household Size explain 81.81% of the variation of the
Annual Amount Charged, taking into account the number of variables and the sample size. Thus, taken
together, these two independent variables bring more explanatory power than when taken in
isolation.
The standard error of the model, though it has decreased, is still high (about 10% of the mean).
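
The adjustment for the number of variables and the sample size can be checked by hand. The sketch below applies the standard adjusted R-square formula, 1 - (1 - R²)(n - 1)/(n - k - 1), to the reported R-square, using n = 50 observations and k = 2 predictors as in the degrees-of-freedom computations of this report. The function name is an assumption.

```python
def adjusted_r_square(r_square: float, n: int, k: int) -> float:
    """Adjusted R-square for a model with k predictors fitted on n observations."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)


if __name__ == "__main__":
    # R-square as reported for the multiple regression model; n = 50, k = 2.
    print(round(adjusted_r_square(0.825561086, n=50, k=2), 6))
```

The result agrees with the reported adjusted R-square of 0.818138154 up to rounding.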
• Graphical analysis of residuals
○ Linearity, independence and equal variance assumptions could be checked graphically on the
residuals plot:

○ Normality assumption could be checked through the normal probability plot of residuals:

3. Testing the existence of a linear relationship in the multiple regression model


▪ F-test for the overall significance of the model
• H0: “β1 = β2 = 0” (no linear relationship)
• H1: “At least one βi ≠ 0” (at least one independent variable affects the Annual Amount Charged).
• F-test statistic:

F = MSR / MSE = 111.21
• Significance level:

α = 5%
• Degrees of freedom:

df1 = k = 2
df2 = n - k - 1 = 50 - 2 - 1 = 47
• F-test critical value:
FCV = FINV(5%, 2, 47) = 3.195
• Conclusion - Since the F-test statistic is in the rejection region (being far greater than the F critical
value), we reject the null hypothesis and conclude that there is enough statistical evidence to state at the 95%
confidence level that at least one independent variable affects the Annual Amount Charged.
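
The F statistic can be cross-checked without the ANOVA table, because MSR / MSE is algebraically equal to (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below uses the R-square reported for the multiple regression model (0.825561086) and compares the result with the critical value quoted above. The function and constant names are assumptions.

```python
F_CRITICAL = 3.195  # critical value quoted in the report for alpha = 5%, df = (2, 47)


def f_statistic(r_square: float, n: int, k: int) -> float:
    """Overall F statistic (equivalent to MSR / MSE) computed from R-square."""
    explained_per_df = r_square / k
    unexplained_per_df = (1 - r_square) / (n - k - 1)
    return explained_per_df / unexplained_per_df


if __name__ == "__main__":
    f_value = f_statistic(0.825561086, n=50, k=2)
    # Reject H0 when the statistic exceeds the critical value.
    print(round(f_value, 2), f_value > F_CRITICAL)
```

The value obtained this way matches the reported 111.21 up to rounding in the last digit.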
▪ Individual significance of the independent variables
• T-test for the annual income variable
○ H0: “β1 = 0” (Annual Income does not affect the Annual Amount Charged)
○ H1: “β1 ≠ 0” (Annual Income affects the Annual Amount Charged).

○ T-test statistic:

T = b1 / Sb1 = 33.133 / 3.96 = 8.350

○ Significance level:

α = 5%

○ Degrees of freedom:

df = n - k - 1 = 50 - 2 - 1 = 47

○ T-test critical value:

TCV = TINV(5%, 47) = 2.01

○ Conclusion - Since the T-test statistic is in the rejection region (being greater than the T critical
value), we reject the null hypothesis and conclude that there is enough statistical evidence to state with
95% confidence level that the Annual Income affects the Annual Amount Charged.

• T-test for the household size variable


○ H0: “β2 = 0” (Household Size does not affect the Annual Amount Charged)
○ H1: “β2 ≠ 0” (Household Size affects the Annual Amount Charged).

○ T-test statistic:

T = b2 / Sb2 = 356.295 / 33.20 = 10.73

○ Significance level:

α = 5%

○ Degrees of freedom:

df = n - k - 1 = 50 - 2 - 1 = 47

○ T-test critical value:

TCV = TINV(5%, 47) = 2.01

○ Conclusion - Since the T-test statistic is in the rejection region (being greater than the T critical
value), we reject the null hypothesis and conclude that there is enough statistical evidence to state with
95% confidence level that the Household Size affects the Annual Amount Charged.
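
Both t-tests follow the same recipe: divide the coefficient by its standard error and compare the absolute value with the critical value. The sketch below applies it to the figures quoted above. The function names are assumptions, and because the standard errors are rounded in the report, the recomputed statistics may differ from the reported ones in the second decimal.

```python
T_CRITICAL = 2.01  # two-tailed critical value quoted in the report (alpha = 5%, df = 47)


def t_statistic(coefficient: float, standard_error: float) -> float:
    """t statistic for H0: beta = 0."""
    return coefficient / standard_error


def is_significant(t_value: float, critical: float = T_CRITICAL) -> bool:
    """Two-tailed decision: reject H0 when |t| exceeds the critical value."""
    return abs(t_value) > critical


if __name__ == "__main__":
    # (coefficient, standard error) pairs as quoted in the report.
    slopes = {"Annual Income": (33.133, 3.96), "Household Size": (356.295, 33.20)}
    for name, (b, sb) in slopes.items():
        t_value = t_statistic(b, sb)
        print(f"{name}: t = {t_value:.2f}, reject H0: {is_significant(t_value)}")
```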

4. Checking for the existence of interaction between the two explanatory variables

To check for interaction between the two independent variables, we test whether a multiple
regression model can be built that includes a third variable constructed as follows:

X3=X1×X2

With:

X1: Annual Income.

X2: Household Size.

The new model will be written as follows:

Y = b0 + b1×X1 + b2×X2 + b3×X3

T-test null and alternative hypotheses:


• H0: “β3 = 0” (There is no interaction between Annual Income and Household Size).
• H1: “β3 ≠ 0” (There is interaction between Annual Income and Household Size).
Slopes test results for the new model:
                                           Coefficients     P-value
Intercept                                  1301.482189      0.00466269
X1 : Income ($1000s)                       33.2175634       0.002577635
X2 : Household Size                        357.281731       0.003762056
X3 : Annual Income and Household Size      -0.023698297     0.993022952

The p-value for the slope test corresponding to the third variable is very high (0.993). We thus fail to reject the null
hypothesis: at the 5% significance level, there is not enough statistical evidence to conclude that there is
interaction between the two explanatory variables.
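
A quick way to see how little the interaction term matters is to compare the predictions of the model with X3 against those of the two-variable model from the multiple regression section. The sketch below uses the coefficients reported for both models; the function names and the example household ($50,000 income, 3 members) are hypothetical.

```python
def predict_additive(x1: float, x2: float) -> float:
    """Two-variable model: X1 = income ($1000s), X2 = household size."""
    return 1304.904 + 33.133 * x1 + 356.295 * x2


def predict_with_interaction(x1: float, x2: float) -> float:
    """Model extended with the interaction variable X3 = X1 * X2."""
    x3 = x1 * x2
    return 1301.482189 + 33.2175634 * x1 + 357.281731 * x2 - 0.023698297 * x3


if __name__ == "__main__":
    # Hypothetical household, not taken from the case data.
    x1, x2 = 50, 3
    gap = predict_with_interaction(x1, x2) - predict_additive(x1, x2)
    print(round(gap, 2))
```

For this household the two predictions differ by well under one dollar, which is consistent with a near-zero interaction coefficient.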

5. Need for additional explanatory variables


Across the three linear models constructed above, the coefficient of determination kept increasing while the standard
error kept decreasing, showing a steady improvement in the quality of fit of the constructed linear models.
However, the standard error of the third model remained high (about 10% of the mean of the predicted variable). This
suggests that additional independent variables should be sought for inclusion in the model.
Candidate predictor variables need to be relevant and independent of the ones already in the
model in order to enhance the overall quality of the fit and avoid the problem of multicollinearity. Here are some
variables I can think of:

(1) Cash flow – This variable would capture the situation of many people who have a high income
but limited access to it, since most of it is used to pay off bills and loans. This variable could be more
difficult to collect, but it should add predictive power to the model and should be largely
independent of the ones already in the model.
(2) Percentage of females in household – This variable would refine the information about the internal
structure of households. It is expected to add to the predictive power of the model, based on the common belief
that females like shopping (particularly by card) more than males. This variable is easy to collect.
(3) Purchase preference: cash or credit – People in certain households may be more inclined to
pay by cash than by credit card. This dummy variable could add some predictive power to the model,
although it might overlap (interact) with variable (2).
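
Before adding any of these candidates, a simple first screen for multicollinearity is to look at the correlation between the candidate and each predictor already in the model. The sketch below is one way to do it in plain Python. All data values are made up for illustration (no such data appears in the case), and the function names are assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def screen_candidate(candidate, existing):
    """Correlation of a candidate predictor with each existing predictor."""
    return {name: pearson_r(candidate, values) for name, values in existing.items()}


if __name__ == "__main__":
    # Made-up values for illustration only; not taken from the case data.
    existing = {
        "income": [30, 45, 60, 75, 90],
        "household_size": [2, 4, 3, 5, 1],
    }
    cash_flow = [10, 12, 25, 20, 30]
    for name, r in screen_candidate(cash_flow, existing).items():
        print(f"{name}: r = {r:.2f}")
```

A correlation close to +1 or -1 with an existing predictor would signal that the candidate carries largely redundant information. Pairwise correlations are only a first check, since a candidate can also be nearly a combination of several predictors at once.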
