
Assignment of BRM on Regression

Submitted by:

Vivek Gupta

10BSP0054
DIFFERENCE BETWEEN R-SQUARE AND ADJUSTED R-SQUARE:

R2 is a statistic that will give some information about the goodness of fit of a model. In regression, the R2 coefficient of
determination is a statistical measure of how well the regression line approximates the real data points. An R2 of 1.0 indicates
that the regression line perfectly fits the data.

Adjusted R2 is a modification of R2 that adjusts for the number of explanatory terms in a model. Unlike R2, the adjusted R2
increases only if the new term improves the model more than would be expected by chance. The adjusted R2 can be negative,
and will always be less than or equal to R2.

Adjusted R2 is not always better than R2: adjusted R2 will be more useful only if the R2 is calculated based on a sample, not
the entire population. For example, if our unit of analysis is a state, and we have data for all counties, then adjusted R2 will not
yield any more useful information than R2.
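To make the relationship concrete, here is a minimal sketch (illustrative data only, not from this assignment) of how both measures are computed; the adjusted value applies the standard correction 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of explanatory variables:

```python
import numpy as np

# Illustrative data: 100 observations, one explanatory variable.
rng = np.random.default_rng(0)
n, k = 100, 1
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)

# Ordinary least squares fit of y on a constant and x.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(r2, adj_r2)                       # adjusted R-square is always <= R-square
```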

GAUSS-MARKOV THEOREM

In statistics, the Gauss-Markov theorem, named after Carl Friedrich Gauss and Andrey Markov, states that in a linear regression model in which the errors have expectation zero and are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares estimator. Here "best" means giving the lowest possible mean squared error of the estimate. The errors need not be normal, nor independent and identically distributed.

Suppose we have

$$Y_i = \sum_{j=1}^{K} \beta_j X_{ij} + \varepsilon_i, \qquad i = 1, \dots, n.$$

The Gauss-Markov assumptions state that

$$E(\varepsilon_i) = 0, \qquad \operatorname{Var}(\varepsilon_i) = \sigma^{2} < \infty, \qquad \operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \ \text{ for } i \neq j.$$

A linear estimator of $\beta_j$ is a linear combination

$$\widehat{\beta}_j = c_{1j} Y_1 + \cdots + c_{nj} Y_n,$$

in which the coefficients $c_{ij}$ are not allowed to depend on the underlying coefficients $\beta_j$, since those are not observable, but are allowed to depend on the values $X_{ij}$, since these data are observable. (The dependence of the coefficients on each $X_{ij}$ is typically nonlinear; the estimator is linear in each $Y_i$ and hence in each random $\varepsilon_i$, which is why this is "linear" regression.) The estimator is said to be unbiased if and only if

$$E(\widehat{\beta}_j) = \beta_j$$

regardless of the values of $X_{ij}$. Now, let $\sum_{j} \lambda_j \beta_j$ be some linear combination of the coefficients. Then the mean squared error of the corresponding estimation is

$$E\!\left[\left(\sum_{j} \lambda_j \left(\widehat{\beta}_j - \beta_j\right)\right)^{2}\right],$$

i.e., it is the expectation of the square of the weighted sum (across parameters) of the differences between the estimators and the corresponding parameters to be estimated. (Since we are considering the case in which all the parameter estimates are unbiased, this mean squared error is the same as the variance of the linear combination.) The best linear unbiased estimator (BLUE) of the vector of parameters $\beta_j$ is the one with the smallest mean squared error for every vector $\lambda$ of linear combination parameters. This is equivalent to the condition that

$$V\!\left(\widetilde{\beta}\right) - V\!\left(\widehat{\beta}\right)$$

is a positive semi-definite matrix for every other linear unbiased estimator $\widetilde{\beta}$.

The ordinary least squares estimator (OLS) is the function

$$\widehat{\beta} = (X'X)^{-1} X'Y$$

of $Y$ and $X$ that minimizes the sum of squares of residuals (misprediction amounts):

$$\sum_{i=1}^{n} \left( Y_i - \widehat{Y}_i \right)^{2} = \sum_{i=1}^{n} \left( Y_i - \sum_{j=1}^{K} \widehat{\beta}_j X_{ij} \right)^{2}.$$
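A minimal numerical sketch (made-up data, not part of the assignment) of the closed-form OLS solution above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix with an intercept column
beta_true = np.array([3.0, 0.4])
y = X @ beta_true + rng.normal(scale=0.5, size=n)        # zero-mean, equal-variance errors

# OLS estimator: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta_hat
print(beta_hat)                 # estimates of the coefficients
print(np.sum(residuals ** 2))   # the minimized sum of squared residuals
```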


VARIANCE INFLATION FACTOR
In statistics, the variance inflation factor (VIF) quantifies the severity of multicollinearity in an ordinary least
squares regression analysis. It provides an index that measures how much the variance of an estimated regression coefficient
(the square of the estimate's standard deviation) is increased because of collinearity.

Definition

Consider the following linear model with k independent variables:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon.$$

The standard error of the estimate of $\beta_j$ is the square root of the $(j+1,\,j+1)$ element of $\sigma^2 (X'X)^{-1}$, where $\sigma$ is the standard deviation of the error term and $X$ is the regression design matrix, a matrix such that $X_{i,\,j+1}$ is the value of the $j$th covariate for the $i$th case or observation, and $X_{i,1}$ equals 1 for all $i$. It turns out that the square of this standard error, the variance of the estimate of $\beta_j$, can be equivalently expressed as

$$\operatorname{Var}(\widehat{\beta}_j) = \frac{\sigma^2}{(n-1)\,\widehat{\operatorname{Var}}(X_j)} \cdot \frac{1}{1 - R_j^{2}},$$

where $R_j^{2}$ is the $R^2$ obtained from regressing $X_j$ on the other covariates; the second factor, $1/(1 - R_j^{2})$, is the variance inflation factor for $\widehat{\beta}_j$.

Calculation and analysis


The VIF can be calculated and analyzed in three steps:

Step one

Calculate k different VIFs, one for each $X_i$, by first running an ordinary least squares regression that has $X_i$ as a function of all the other explanatory variables in the first equation.

If $i = 1$, for example, the equation would be

$$X_1 = c_0 + c_2 X_2 + c_3 X_3 + \cdots + c_k X_k + e,$$

where $c_0$ is a constant and $e$ is the error term.

Step two

Then, calculate the VIF factor for $\widehat{\beta}_i$ with the following formula:

$$\mathrm{VIF}(\widehat{\beta}_i) = \frac{1}{1 - R_i^{2}},$$

where $R_i^{2}$ is the coefficient of determination of the regression equation in step one.
Step three

Analyze the magnitude of multicollinearity by considering the size of $\mathrm{VIF}(\widehat{\beta}_i)$. A common rule of thumb is that if $\mathrm{VIF}(\widehat{\beta}_i) > 5$ then multicollinearity is high. A cut-off value of 10 has also been proposed (see Kutner et al.).

Some software calculates the tolerance, which is just the reciprocal of the VIF. The choice of which to use is a matter of personal preference of the researcher.
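The three steps can be sketched directly in code (the data below are made up for illustration; this is not the assignment's data set):

```python
import numpy as np

def r_squared(y, X):
    """R-square of an OLS regression of y on X (X should include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)      # deliberately collinear with x1
x3 = rng.normal(size=n)
predictors = np.column_stack([x1, x2, x3])

# Steps one and two: regress each X_i on the other predictors, then VIF_i = 1 / (1 - R_i^2).
vifs = []
for i in range(predictors.shape[1]):
    others = np.delete(predictors, i, axis=1)
    X_aux = np.column_stack([np.ones(n), others])
    vifs.append(1.0 / (1.0 - r_squared(predictors[:, i], X_aux)))

# Step three: values well above 5 (or 10) flag severe multicollinearity.
print(vifs)
print([1.0 / v for v in vifs])   # the corresponding tolerances
```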

Interpretation

The square root of the variance inflation factor tells you how much larger the standard error is, compared with what it would
be if that variable were uncorrelated with the other independent variables in the equation.
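For example (an illustrative figure, not taken from the output later in this assignment), a VIF of 4 means the standard error is $\sqrt{4} = 2$ times as large as it would be if that variable were uncorrelated with the other predictors.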

MULTICOLLINEARITY
In statistics, multicollinearity refers to a situation in which several independent variables in a multiple regression model are closely correlated to one another. Multicollinearity can cause strange results when attempting to study how well individual independent variables contribute to an understanding of the dependent variable. In general, multicollinearity can cause wide confidence intervals and strange P values for independent variables.

Multicollinearity suggests that several of the independent variables are closely linked in some way. Once the collinear variables are identified, it may be helpful to study whether there is a causal link between the variables. The simplest way to resolve multicollinearity problems is to reduce the number of collinear variables until there is only one remaining out of the set. Sometimes, after some study, it may be possible to identify one of the variables as being extraneous. Alternatively, it may be possible to combine two or more closely related variables into a single variable.

Why is multicollinearity a problem?

If your goal is simply to predict Y from a set of X variables, then multicollinearity is not a problem. The predictions will still
be accurate, and the overall R2 (or adjusted R2) quantifies how well the model predicts the Y values.

If your goal is to understand how the various X variables impact Y, then multicollinearity is a big problem. One problem is that the individual P values can be misleading (a P value can be high even though the variable is important). The second problem is that the confidence intervals on the regression coefficients will be very wide. The confidence intervals may even include zero, which means you can't even be confident whether an increase in the X value is associated with an increase, or a decrease, in Y. Because the confidence intervals are so wide, excluding a subject (or adding a new one) can change the coefficients dramatically and may even change their signs.

Heteroscedasticity
[Figure: plot with random data showing heteroscedasticity]

In statistics, a sequence of random variables is heteroscedastic, or heteroskedastic, if the random variables have
different variances. The term means "differing variance" and comes from the Greek "hetero" ('different') and "skedasis"
('dispersion'). In contrast, a sequence of random variables is called homoscedastic if it has constant variance.

Suppose there is a sequence of random variables $\{Y_t\}_{t=1}^{n}$ and a sequence of vectors of random variables $\{X_t\}_{t=1}^{n}$. In dealing with conditional expectations of $Y_t$ given $X_t$, the sequence $\{Y_t\}_{t=1}^{n}$ is said to be heteroskedastic if the conditional variance of $Y_t$ given $X_t$ changes with $t$. Some authors refer to this as conditional heteroscedasticity to emphasize the fact that it is the sequence of conditional variances that changes and not the unconditional variance. In fact, it is possible to observe conditional heteroscedasticity even when dealing with a sequence of unconditionally homoscedastic random variables; however, the opposite does not hold.
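A small simulation (illustrative only) makes this concrete: the conditional variance of $Y_t$ below grows with $t$, so the sequence is heteroscedastic:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 500
t = np.arange(1, T + 1)

# The error standard deviation grows with t, so Var(Y_t | t) is not constant.
sigma_t = 0.5 * np.sqrt(t)
y = 1.0 + 0.2 * t + rng.normal(scale=sigma_t)

# Compare the spread of early versus late observations around the known trend line.
resid = y - (1.0 + 0.2 * t)
print(resid[:100].std(), resid[-100:].std())   # the later residuals are far more dispersed
```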

When using some statistical techniques, such as ordinary least squares (OLS), a number of assumptions are typically made.
One of these is that the error term has a constant variance. This might not be true even if the error term is assumed to be drawn
from identical distributions.
For example, the error term could vary or increase with each observation, something that is often the case with cross-sectional or time series measurements. Heteroscedasticity is often studied as part of econometrics, which frequently deals with data exhibiting it. White's influential paper (White 1980) used "heteroskedasticity" instead of "heteroscedasticity", whereas subsequent econometrics textbooks such as Gujarati et al.'s Basic Econometrics (2009) use "heteroscedasticity".

With the advent of robust standard errors, which allow for inference without specifying the conditional second moment of the error term, testing for conditional homoscedasticity is not as important as it was in the past.

The econometrician Robert Engle won the 2003 Nobel Memorial Prize in Economic Sciences for his studies on regression analysis in the presence of heteroscedasticity, which led to his formulation of the autoregressive conditional heteroskedasticity (ARCH) modelling technique.

Examples

Heteroscedasticity often occurs when there is a large difference among the sizes of the observations.

A classic example of heteroscedasticity is that of income versus expenditure on meals. As one's income increases, the
variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating less
expensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with
higher incomes display a greater variability of food consumption.

Imagine you are watching a rocket take off nearby and measuring the distance it has travelled once each second. In the first
couple of seconds your measurements may be accurate to the nearest centimetre, say. However, 5 minutes later as the rocket
recedes into space, the accuracy of your measurements may only be good to 100 m, because of the increased distance,
atmospheric distortion and a variety of other factors. The data you collect would exhibit heteroscedasticity.

Homoscedasticity

[Figure: plot with random data showing homoscedasticity]

In statistics, a sequence or a vector of random variables is homoscedastic if all random variables in the sequence or vector have the same finite variance. This is also known as homogeneity of variance. The complementary notion is called heteroscedasticity. The assumption of homoscedasticity simplifies mathematical and computational treatment. Serious violations of homoscedasticity (assuming a distribution of data is homoscedastic when in actuality it is heteroscedastic) result in overestimating the goodness of fit as measured by the Pearson coefficient.

Assumptions of a regression model


As used in describing simple linear regression analysis, one assumption of the fitted model (to ensure that the least-squares
estimators are each a best linear unbiased estimator of the respective population parameters, by the Gauss-Markov theorem) is
that the standard deviations of the error terms are constant and do not depend on the x-value. Consequently, each probability
distribution for y (response variable) has the same standard deviation regardless of the x-value (predictor). In short, this
assumption is homoscedasticity. Homoscedasticity is not required for the estimates to be unbiased, consistent, and
asymptotically normal.

Durbin-Watson statistic

In statistics, the Durbin-Watson statistic is a test statistic used to detect the presence of autocorrelation (a relationship between values separated from each other by a given time lag) in the residuals (prediction errors) from a regression analysis. It is named after James Durbin and Geoffrey Watson. However, the small sample distribution of this ratio was derived in a path-breaking article by John von Neumann. Durbin and Watson applied this statistic to the residuals from least squares regressions, and developed bounds tests for the null hypothesis that the errors are serially independent (not autocorrelated) against the alternative that they follow a first-order autoregressive process. Later, John Denis Sargan and Alok Bhargava developed several von Neumann-Durbin-Watson type test statistics for the null hypothesis that the errors on a regression model follow a process with a unit root against the alternative hypothesis that the errors follow a stationary first-order autoregression.
Computing and interpreting the Durbin-Watson statistic

If $e_t$ is the residual associated with the observation at time $t$, then the test statistic is

$$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^{2}}{\sum_{t=1}^{T} e_t^{2}},$$

where $T$ is the number of observations. Since $d$ is approximately equal to $2(1 - r)$, where $r$ is the sample autocorrelation of the residuals, $d = 2$ indicates no autocorrelation. The value of $d$ always lies between 0 and 4. If the Durbin-Watson statistic is substantially less than 2, there is evidence of positive serial correlation. As a rough rule of thumb, if Durbin-Watson is less than 1.0, there may be cause for alarm. Small values of $d$ indicate that successive error terms are, on average, close in value to one another, or positively correlated. If $d > 2$, successive error terms are, on average, much different in value from one another, i.e., negatively correlated. In regressions, this can imply an underestimation of the level of statistical significance.
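A short sketch of the statistic above on simulated residuals (not the assignment's residuals):

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared first differences over the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(4)

# Positively autocorrelated residuals (AR(1) with coefficient 0.8) push d well below 2.
e = np.zeros(200)
for s in range(1, 200):
    e[s] = 0.8 * e[s - 1] + rng.normal()

print(durbin_watson(e))                      # noticeably less than 2
print(durbin_watson(rng.normal(size=200)))   # independent residuals: close to 2
```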

To test for positive autocorrelation at significance level $\alpha$, the test statistic $d$ is compared to lower and upper critical values ($d_{L,\alpha}$ and $d_{U,\alpha}$):

If $d < d_{L,\alpha}$, there is statistical evidence that the error terms are positively autocorrelated.

If $d > d_{U,\alpha}$, there is statistical evidence that the error terms are not positively autocorrelated.

If $d_{L,\alpha} < d < d_{U,\alpha}$, the test is inconclusive.

To test for negative autocorrelation at significance level $\alpha$, the test statistic $(4 - d)$ is compared to lower and upper critical values ($d_{L,\alpha}$ and $d_{U,\alpha}$):

If $(4 - d) < d_{L,\alpha}$, there is statistical evidence that the error terms are negatively autocorrelated.

If $(4 - d) > d_{U,\alpha}$, there is statistical evidence that the error terms are not negatively autocorrelated.

If $d_{L,\alpha} < (4 - d) < d_{U,\alpha}$, the test is inconclusive.

The critical values, $d_{L,\alpha}$ and $d_{U,\alpha}$, vary by level of significance ($\alpha$), the number of observations, and the number of predictors in the regression equation. Their derivation is complex; statisticians typically obtain them from the appendices of statistical texts.
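The decision rules above can be wrapped in a small helper; the critical values would come from published tables, and the numbers below are placeholders for illustration only (the d of 2.097 is the Durbin-Watson value from the one-variable model summary later in this assignment):

```python
def dw_decision(d, d_lower, d_upper):
    """Classify a Durbin-Watson statistic d against tabulated critical values d_L and d_U."""
    if d < d_lower:
        return "evidence of positive autocorrelation"
    if (4 - d) < d_lower:
        return "evidence of negative autocorrelation"
    if d_upper < d < (4 - d_upper):
        return "no evidence of autocorrelation"
    return "inconclusive"

# Placeholder critical values: look up d_L and d_U for your n, number of predictors, and alpha.
print(dw_decision(2.097, d_lower=1.65, d_upper=1.69))
```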

An important note is that the Durbin-Watson statistic, while displayed by many regression analysis programs, is not relevant
in many situations. For instance, if the error distribution is not normal, if there is higher-order autocorrelation, or if the
dependent variable is in a lagged form as an independent variable, this is not an appropriate test for autocorrelation. A
suggested test that does not have these limitations is the Breusch-Godfrey (Serial Correlation LM) Test.

REGRESSION ANALYSIS OF ONE INDEPENDENT VARIABLE


Variables Entered/Removed (b)

Model   Variables Entered          Variables Removed   Method
1       X6 - Product Quality (a)   .                   Enter

a. All requested variables entered.
b. Dependent Variable: X19 - Satisfaction

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .486 (a)   .237       .229                1.0467                       2.097

Change Statistics: R Square Change = .237, F Change = 30.358, df1 = 1, df2 = 98, Sig. F Change = .000

a. Predictors: (Constant), X6 - Product Quality
b. Dependent Variable: X19 - Satisfaction


Analysis of Model summary:

As predictors are added to the model, each predictor will explain some of the variance in the dependent variable simply due to chance.

One could continue to add predictors to the model, which would continue to improve the ability of the predictors to explain the dependent variable, although some of this increase in R Square would be simply due to chance variation in that particular sample.

The adjusted R Square attempts to yield a more honest estimate of the R Square for the population. The value of the R Square is 0.237, while the value of the adjusted R Square is 0.229. There is not much difference because we are dealing with only one variable.

In general, when the number of observations is very large compared to the number of predictors, the values of R Square and adjusted R Square will be very close.
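As a check, the adjusted value follows from the correction formula given in the R-square section, with n = 100 observations and k = 1 predictor (df2 = 98 in the output):

$$\text{adjusted } R^2 = 1 - (1 - 0.237)\,\frac{100 - 1}{100 - 1 - 1} = 1 - 0.763 \times \frac{99}{98} \approx 0.229.$$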
ANOVA (b)

Model 1        Sum of Squares   df   Mean Square   F        Sig.
Regression     33.260           1    33.260        30.358   .000 (a)
Residual       107.367          98   1.096
Total          140.628          99

a. Predictors: (Constant), X6 - Product Quality
b. Dependent Variable: X19 - Satisfaction

Coefficients (a)

Model 1                 B       Std. Error   Beta    t       Sig.
(Constant)              3.676   .598                 6.151   .000
X6 - Product Quality    .415    .075         .486    5.510   .000

(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)

a. Dependent Variable: X19 - Satisfaction

Analysis of ANOVA:

The independent variable explains 23.7 percent of the variance (R Square) in satisfaction level, which is highly significant, as indicated by the F value of 30.358 in the table above.

The P value is compared to the significance level, typically 0.05, and if it is smaller we can conclude that the independent variable reliably predicts the dependent variable.
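The R Square and F value quoted above can be recovered directly from the ANOVA table:

$$R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}} = \frac{33.260}{140.628} \approx 0.237, \qquad F = \frac{MS_{\text{regression}}}{MS_{\text{residual}}} = \frac{33.260}{1.096} \approx 30.3,$$

where the small gap to the reported F of 30.358 is only rounding in the printed mean squares.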
Analysis of Coefficients:

An examination of the t value (5.510, significant at the 0.05 level) indicates that product quality contributes to the prediction of satisfaction level.

REGRESSION EQUATION : Y = 3.676 + 0.415X1 (unstandardized coefficients from the table above)


R SQUARE : 0.237
F VALUE : 30.358
REGRESSION ANALYSIS OF ALL INDEPENDENT VARIABLES:
Variables Entered/Removed (b)

Model   Variables Entered                                   Variables Removed   Method
1       X18 - Delivery Speed, X8 - Technical Support,       .                   Enter
        X6 - Product Quality, X15 - New Products,
        X7 - E-Commerce Activities, X10 - Advertising,
        X13 - Competitive Pricing, X16 - Order & Billing,
        X17 - Price Flexibility, X14 - Warranty & Claims,
        X12 - Salesforce Image, X9 - Complaint Resolution,
        X11 - Product Line (a)

a. All requested variables entered.
b. Dependent Variable: X19 - Satisfaction


Analysis of Model summary:

As predictors are added to the model, each predictor will explain some of the variance in the dependent variable simply due to chance.

One could continue to add predictors to the model, which would continue to improve the ability of the predictors to explain the dependent variable, although some of this increase in R Square would be simply due to chance variation in that particular sample.

The adjusted R Square attempts to yield a more honest estimate of the R Square for the population. The value of the R Square is 0.804, while the value of the adjusted R Square is 0.774. There is a noticeable difference because we are dealing with all thirteen independent variables.

As more variables are added to the model, the difference between R Square and adjusted R Square tends to increase.
ANOVA (b)

Model 1        Sum of Squares   df   Mean Square   F        Sig.
Regression     113.044          13   8.696         27.111   .000 (a)
Residual       27.584           86   .321
Total          140.628          99

a. Predictors: (Constant), X18 - Delivery Speed, X8 - Technical Support, X6 - Product Quality, X15 - New
Products, X7 - E-Commerce Activities, X10 - Advertising, X13 - Competitive Pricing, X16 - Order & Billing,
X17 - Price Flexibility, X14 - Warranty & Claims, X12 - Salesforce Image, X9 - Complaint Resolution, X11 -
Product Line

b. Dependent Variable: X19 Satisfaction

Analysis of Anova:

The independent variables together explain 80.4 percent of the variance (R Square) in satisfaction level, which is highly significant, as indicated by the F value of 27.111 in the table above.

The P value is compared to the significance level, typically 0.05, and if it is smaller we can conclude that the independent variables reliably predict the dependent variable.
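These figures, and the adjusted R Square quoted earlier, can again be recovered from the ANOVA table (n = 100, k = 13):

$$R^2 = \frac{113.044}{140.628} \approx 0.804, \qquad F = \frac{8.696}{0.321} \approx 27.1, \qquad \text{adjusted } R^2 = 1 - (1 - 0.804)\,\frac{99}{86} \approx 0.774.$$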
Coefficientsa

Model                        Unstandardized Coefficients   Standardized Coefficients

                             B          Std. Error         Beta                        t          Sig.

1 (Constant) -1.336 1.120 -1.192 .236

X6 - Product Quality .377 .053 .442 7.161 .000

X7 - E-Commerce Activities -.456 .137 -.268 -3.341 .001

X8 - Technical Support .035 .065 .045 .542 .589

X9 - Complaint Resolution .154 .104 .156 1.489 .140

X10 - Advertising -.034 .063 -.033 -.548 .585

X11 - Product Line .362 .267 .400 1.359 .178

X12 - Salesforce Image .827 .101 .744 8.155 .000

X13 - Competitive Pricing -.047 .048 -.062 -.985 .328

X14 - Warranty & Claims -.107 .126 -.074 -.852 .397

X15 - New Products -.003 .040 -.004 -.074 .941

X16 - Order & Billing .143 .105 .111 1.369 .175

X17 - Price Flexibility .238 .272 .241 .873 .385

X18 - Delivery Speed -.249 .514 -.154 -.485 .629

a. Dependent Variable: X19 Satisfaction


Analysis of Coefficients:

An examination of the t values indicates which independent variables contribute to the prediction of the dependent variable, i.e. satisfaction level: Product Quality (X6), E-Commerce Activities (X7), and Salesforce Image (X12) are significant at the 0.05 level, while the remaining variables are not.

-REGRESSION EQUATION (unstandardized coefficients from the table above): Y = -1.336 + 0.377X6 - 0.456X7 + 0.035X8 + 0.154X9 - 0.034X10 + 0.362X11 + 0.827X12 - 0.047X13 - 0.107X14 - 0.003X15 + 0.143X16 + 0.238X17 - 0.249X18

-R SQUARE : 0.804
-F VALUE : 27.111
