
Business Analytics

Basic Statistics & Linear Regression


Day 2 & 3

Agenda

Basic Statistics

Predictive modeling using Linear Regression



Basic Statistics
I. The Central Limit Theorem

II. Hypothesis testing



Sampling
A probability sample is a sample selected such that each item or person in the population being studied
has a known likelihood of being included in the sample.

The sampling distribution of the sample mean is a probability distribution consisting of all possible sample
means of a given sample size selected from a population.

Need for Sampling:

The physical impossibility of checking all items in the population.


The cost of studying all the items in a population.
The sample results are usually adequate.
Contacting the whole population would often be time-consuming.



Central Limit Theorem

For a population with mean μ and variance σ², the sampling distribution of the means of all possible samples of size n generated from the population will be approximately normally distributed.
The mean of the sampling distribution is equal to μ and the variance is equal to σ²/n.
How is variance related to standard error?

As the sample size gets large (typically > 30), the sampling distribution becomes almost normal regardless of the shape of the population.
Central Limit Theorem
Introduction:
It is perhaps one of the most important results in statistics.
It provides the basis for large-sample inference about a population mean when the population
distribution is unknown.
It also provides the basis for large-sample inference about a population proportion, for example, in
opinion polls and surveys.
Definition:
If X1, X2, …, Xn is a sequence of independent, identically distributed (iid) random variables with finite mean μ and finite (non-zero) variance σ², then the distribution of (X̄ − μ)/(σ/√n) approaches the standard normal distribution, N(0,1), as n → ∞.
μ is the mean of the population from which X1, X2, …, Xn have been extracted.
X̄ is the sample mean, calculated as X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ.
For large n, (X̄ − μ)/(σ/√n) and (ΣXᵢ − nμ)/√(nσ²) have an approximate N(0, 1) distribution,
OR
X̄ ~ N(μ, σ²/n)
ΣXᵢ ~ N(nμ, nσ²)
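The theorem is easy to see by simulation. A minimal R sketch (added for illustration, not from the original deck); the exponential population and n = 50 are arbitrary choices:

```r
# Simulate the CLT: sample means from a skewed (exponential) population
# are approximately N(mu, sigma^2/n) once n is reasonably large.
set.seed(42)
n <- 50                      # sample size
mu <- 1; sigma <- 1          # exponential(rate = 1) has mean 1 and sd 1
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))
mean(sample_means)           # close to mu = 1
sd(sample_means)             # close to sigma/sqrt(n) = 0.1414
hist(sample_means, breaks = 50, freq = FALSE,
     main = "Sampling distribution of the mean (n = 50)")
curve(dnorm(x, mean = mu, sd = sigma / sqrt(n)), add = TRUE, lwd = 2)
```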



The Central Limit Theorem
Example:
It is assumed that the number of claims arriving at an insurance company per working day has a
mean of 40 and a standard deviation of 12. A survey was conducted over 50 working days. Find the
probability that the sample mean number of claims arriving per working day was less than 35.
Solution:
We have μ = 40, σ = 12, n = 50.
The central limit theorem states that X̄ ~ N(40, 12²/50).
We want P(X̄ < 35):
P(X̄ < 35) = P(Z < (35 − 40)/(12/√50))
= P(Z < −2.946) = 1 − P(Z < 2.946)
= 1 − 0.9984 = 0.0016
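The same probability can be reproduced in R (values taken from the example above):

```r
# P(sample mean < 35) when mu = 40, sigma = 12, n = 50
mu <- 40; sigma <- 12; n <- 50
z <- (35 - mu) / (sigma / sqrt(n))          # -2.946
pnorm(z)                                    # 0.0016, matching 1 - 0.9984
pnorm(35, mean = mu, sd = sigma / sqrt(n))  # same answer in one step
```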



Sampling Error
The sampling error is the difference between a sample statistic and its corresponding population parameter. It is found by subtracting the value of the parameter from the value of the statistic.

For example, suppose a poll was conducted where the population included all students in a school and the sample was one class. If the sample had a mean GPA of 3.4 and the population's mean GPA was 3.2, then the sampling error was 0.2.

Standard Error of sample mean
It is the standard deviation of the distribution of the sample means

When the standard deviation of the population is known, the standard error of the sample mean is calculated as:

Standard error of the sample mean = σ/√n (standard deviation of the population divided by the square root of the sample size)

Example: The mean hourly wage for Houston farm workers is $13.50, with a population standard deviation of $2.90. Calculate and interpret the standard error of the sample mean for a sample size of 30.

Answer: Because the population standard deviation is known, the standard error of the sample mean is $2.90/√30 = $0.53.

Point Estimate & Confidence Interval
Point estimates: These are the single (sample) values used to estimate population parameters
Confidence interval: It is a range of values in which the population parameter is expected to lie
Confidence interval takes on the following form where n ≥ 30:

CI = μ ± Z·σ (true for a population distribution)
where μ is the mean of the population and σ is the standard deviation of the population.
For a sample mean:
CI = point estimate ± (reliability factor × standard error)
CI = X̄ ± Z·(σ/√n)
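A short R sketch of the sample-mean interval, using the Houston farm-worker numbers from the earlier slide:

```r
# 95% confidence interval for a sample mean with known population sigma
xbar <- 13.50; sigma <- 2.90; n <- 30
z <- qnorm(0.975)                  # 1.96 for 95% confidence
se <- sigma / sqrt(n)              # 0.53, the standard error
c(lower = xbar - z * se, upper = xbar + z * se)
```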



Hypothesis Testing

A statistical hypothesis test is a method of making statistical decisions from and about
experimental data.

Null-hypothesis testing answers the question: "How well do the findings fit the possibility that chance factors alone might be responsible?"

Example: Does your score of 6/10 imply that I am a good teacher?



Key steps in Hypothesis Testing

Null Hypothesis (H0): The hypothesis that the researcher wants to reject

Alternate Hypothesis (Ha): The hypothesis which is concluded if there is sufficient evidence to reject the null hypothesis

Test Statistic

Rejection/Critical Region

Conclusion



Launching a niche course for MBA students?
Sam, a brand manager for a leading financial training center, wants to introduce a new niche finance course for MBA
students. He met some industry stalwarts and found that with the skills acquired by attending such a course, the
students would be able to land a good job.
He meets a random sample of 100 students and discovers the following characteristics of the market
Mean household income = $20,000
Interest level in students = high
Current knowledge of students for the niche concepts = low
Sam strongly believes the course would be adequately profitable if the students have the buying power for the course. They would be able to afford the course only if the mean household income is greater than $19,000.
Would you advise Sam to introduce the course?
What should be the hypothesis?
o Hint: What is the point at which the decision changes (19,000 or 20,000)?
o What about the alternate hypothesis?
What other information do you need to ensure that the right decision is arrived at?
o Hint: confidence intervals/ significance levels?
o Hint: Is there any other factor apart from mean, which is important? How do I move from population parameters to
standard errors?
What is the risk still remaining, when you take this decision?
o Hint: Type-I/II errors?
o Hint: P-value



Criterion for Decision Making

To reach a final decision, Sam has to make a general inference (about the population) from the
sample data.

Criterion: Mean income across all households in the market area under consideration.
If the mean population household income is greater than $19,000, then Sam should introduce the course into the new market.

Sam's decision making is equivalent to either accepting or rejecting the hypothesis:


The population mean household income in the new market area is greater than $19,000

The term one-tailed signifies that all z-values that would cause Sam to reject H0, are in just one
tail of the sampling distribution
μ → Population Mean
H0: μ ≤ $19,000
Ha: μ > $19,000

Identifying the Critical Sample Mean Value Sampling Distribution
[Chart: sampling distribution centered on μ = $19,000, with the critical value (X̄c) marked in the right tail]

Sample mean values greater than $19,000 (that is, X̄ values on the right-hand side of the sampling distribution centered on μ = $19,000) suggest that H0 may be false.
More importantly, the farther to the right X̄ lies, the stronger the evidence against H0.

Reject H0 if the sample mean exceeds Xc

Computing the Criterion Value

The standard deviation for the sample of 100 households is $4,000. The standard error of the mean (s_X̄) is given by:
s_X̄ = s/√n = $4,000/√100 = $400
The critical mean household income X̄c is obtained through the following two steps:
Determine the critical z-value, zc. For α = 0.05: zc = 1.645.
Substitute the values of zc, s_X̄, and μ (under the assumption that H0 is "just" true):
X̄c = μ + zc·s_X̄ = $19,000 + 1.645 × $400 = $19,658

In this case, since the observed sample statistic ($20,000) is greater than the critical value ($19,658), the null hypothesis is rejected =>

Decision Rule
If the sample mean household income is greater than $19,658, reject the null hypothesis and
introduce the new course
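The same computation in R (all numbers from the slide):

```r
# One-tailed critical value for Sam's test
mu0 <- 19000; s <- 4000; n <- 100
se <- s / sqrt(n)            # 400
zc <- qnorm(0.95)            # 1.645 for alpha = 0.05, one-tailed
xc <- mu0 + zc * se          # critical value, about 19,658
xbar <- 20000
xbar > xc                    # TRUE => reject H0 and introduce the course
```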
Test Statistic

The value of the test statistic is simply the z-value corresponding to X̄ = $20,000:
Z = (X̄ − μ)/s_X̄ = (20,000 − 19,000)/400 = 2.5
Here, s_X̄ is the standard error.

[Chart: sampling distribution centered on μ = $19,000 (Z = 0), with the rejection region beyond X̄c = $19,658 (Zc = 1.645) at α = 0.05, and the observed X̄ = $20,000 (Z = 2.5) inside it]

There is a significant difference between the hypothesized population parameter and the observed sample statistic => mean income > $19,000 => launch the course.

Errors in Estimation

Please note: you are inferring for a population based only on a sample.
This is no proof that your decision is correct; it is just a hypothesis. There is still a chance that your inference is wrong.
How do I quantify the probability of error in inference?
Type I and Type II Errors:
A Type I error occurs if the null hypothesis is rejected when it is true.
A Type II error occurs if the null hypothesis is not rejected when it is false.

Inference \ Actual    | H0 is True                 | H0 is False
Infer H0 is True      | Correct Decision           | Type-II Error
                      | (Confidence Level = 1 − α) | (P(Type-II Error) = β)
Infer H0 is False     | Type-I Error               | Correct Decision
                      | (Significance Level = α)   | (Power = 1 − β)

Significance Level:
α → significance level: the upper-bound probability of a Type I error
1 − α → confidence level: the complement of the significance level
The power of a test is the probability of correctly rejecting the null.



P-Value: Actual Significance Level
The p-value is the smallest level of significance at which the null hypothesis can be rejected.
P-value: the probability of obtaining an observed value of X̄ (from the sample) as high as $20,000 or more when the actual population mean (μ) is only $19,000 = 0.00621.
It is the calculated probability of rejecting the null hypothesis (H0) when that hypothesis is true (a Type I error).

[Chart: sampling distribution centered on μ = $19,000 (Z = 0) with the α = 0.05 rejection region; the p-value of 0.00621 is the tail area beyond X̄ = $20,000]

The actual significance level of 0.00621 in this case means that the odds are less than 62 out of 10,000 that the sample mean income of $20,000 would have occurred entirely due to chance (when the population mean income is $19,000).
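The p-value itself is one line in R (values from the slide):

```r
# p-value for the observed sample mean of $20,000
z <- (20000 - 19000) / 400   # 2.5
1 - pnorm(z)                 # 0.00621
```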
Some variations in the Z-Test
What if Sam surveyed the market and found that the student behavior is estimated to be:
They would find the training too expensive if their household income is < US$19,000, and hence would not have the buying power for the course?
They would perceive the training to be of inferior quality if their household income is > US$19,000, and hence would not buy the training?
How would the decision criteria change? What should be the testing strategy?
Hint: From the question wording infer: Two tailed testing
Appropriately modify the significance value and other parameters
Use the Z-test
Appropriate change in the decision making and testing process:
Students will not attend the course if:
The household income >$19,000 and the students perceive the course to be inferior
The household income is <$19,000
This becomes a two-tailed test wherein the students will join the course only when the household income lies within particular boundaries, i.e. the household income should be neither very high nor very low.



Two- Tailed Test

Now the test is modified to a two-tailed test, which signifies that the z-values that would cause Sam to reject H0 lie in both tails of the sampling distribution.
μ → Population Mean
H0: μ = $19,000
Ha: μ ≠ $19,000
Since we are checking for a significant difference on both ends, it is a two-tailed test, with α/2 = 0.025 in each tail.
The lower boundary = μ − Z(α/2)·σ_X̄ = 19,000 − 1.96 × 400 = $18,216
The upper boundary = μ + Z(α/2)·σ_X̄ = 19,000 + 1.96 × 400 = $19,784

[Chart: two-tailed rejection regions (α/2 = 0.025 each) around μ = $19,000 (Z = 0)]

Conclusion: If the household income lies between $18,216 and $19,784, then the students will attend the course (at 95% confidence).
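The two boundaries in R (values from the slide):

```r
# Two-tailed acceptance boundaries at alpha = 0.05
mu0 <- 19000; se <- 400
z_half <- qnorm(0.975)            # 1.96 for alpha/2 = 0.025
mu0 + c(-1, 1) * z_half * se      # 18,216 and 19,784
```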

Business Analytics
Predictive Modeling using Linear Regression



4. Correlation and Regression
I. Covariance and Correlation coefficient

II. Regression



4a. Correlation
I. Covariance and Correlation coefficient
i. Definition
ii. Sample and population correlation
iii. Illustrative example
iv. Statistical significance test for sample correlation coefficient



4a. Covariance and Correlation Coefficient
Covariance is a statistical measure of the degree to which the two variables move together.
The sample covariance is calculated as:
cov_xy = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)
Correlation coefficient
It is a measure of the strength of the linear relationship between two variables.
The correlation coefficient is given by:
ρ_xy = cov_xy / (σ_x σ_y)
Population correlation is denoted by ρ (rho).
Sample correlation is denoted by r. It is an estimate of ρ in the same way as
S² (sample variance) is an estimate of σ² (population variance) and
X̄ (sample mean) is an estimate of μ (population mean).
Features of ρ and r:
Unit free and ranges between −1 and 1
The closer to −1, the stronger the negative linear relationship
The closer to +1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
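In R, cov() and cor() compute both measures directly. The sketch below uses the S&P 500 and NASDAQ closing values from the example that follows:

```r
# Sample covariance and correlation of daily index returns
sp500  <- c(1244.28, 1257.08, 1261.01, 1234.35, 1255.19, 1236.47)
nasdaq <- c(2626.93, 2655.76, 2649.21, 2596.38, 2646.85, 2612.26)
x <- diff(sp500) / head(sp500, -1)    # S&P 500 daily returns
y <- diff(nasdaq) / head(nasdaq, -1)  # NASDAQ daily returns
cov(x, y)   # 0.000261013
cor(x, y)   # 0.979811179
```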



4a. Example: Covariance and Correlation of the S&P 500 and
NASDAQ Returns given a sample

Closing Index Value


Date S&P 500 NASDAQ
12/2/2011 1,244.28 2,626.93
12/5/2011 1,257.08 2,655.76
12/7/2011 1,261.01 2,649.21
12/8/2011 1,234.35 2,596.38
12/9/2011 1,255.19 2,646.85
12/12/2011 1,236.47 2,612.26



4a. Solution: Covariance and Correlation of the S&P 500 and
NASDAQ Returns given a sample

Returns and deviations (X = S&P 500 return, Y = NASDAQ return):

Date       | S&P 500  | NASDAQ   | Xi     | Yi     | Xi − X̄ | Yi − Ȳ | (Xi − X̄)(Yi − Ȳ)
12/2/2011  | 1,244.28 | 2,626.93 |        |        |        |        |
12/5/2011  | 1,257.08 | 2,655.76 | 1.03%  | 1.10%  | 1.14%  | 1.20%  | 0.0137%
12/7/2011  | 1,261.01 | 2,649.21 | 0.31%  | -0.25% | 0.43%  | -0.15% | -0.0006%
12/8/2011  | 1,234.35 | 2,596.38 | -2.11% | -1.99% | -2.00% | -1.89% | 0.0378%
12/9/2011  | 1,255.19 | 2,646.85 | 1.69%  | 1.94%  | 1.80%  | 2.05%  | 0.0369%
12/12/2011 | 1,236.47 | 2,612.26 | -1.49% | -1.31% | -1.38% | -1.21% | 0.0166%

Mean (X̄, Ȳ): -0.12%, -0.10%          Total: 0.1044%
Standard deviation (sx, sy): 0.01630504, 0.01633798
Covariance: 0.000261013
Correlation: 0.979811179



4a. Examples of Approximate r Values

[Scatter plots illustrating approximate r values: r = −1 (perfect negative), r = −0.6, r = 0 (no linear relationship), r = +0.3, r = +1 (perfect positive)]
4a. Testing the significance of the correlation coefficient

Test whether the correlation between the two variables in the population is equal to zero.
Null hypothesis, H0: ρ = 0
Assuming that the two populations are normally distributed, we can use a t-test to determine
whether the null hypothesis should be rejected.
The test statistic is computed using the sample correlation, r, with n − 2 degrees of freedom (df):
t = r·√(n − 2) / √(1 − r²)
Calculated test statistic is compared with the critical t-value for the appropriate degrees of
freedom and level of significance
Reject H0 if t > tcritical or t <-tcritical

Example: Correlation of the S&P 500 and NASDAQ Returns given a sample
n = 5, r = 0.979811179, df = 5 − 2 = 3
Calculated t = 8.4885
t_critical at 95% confidence (df = 3) = 2.3534
Hence, reject H0 at 95% confidence.
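The same test in R, reusing the return vectors x and y from the covariance sketch above (cor.test applies exactly this t-statistic):

```r
# t-test for the sample correlation coefficient
r <- 0.979811179; n <- 5
r * sqrt(n - 2) / sqrt(1 - r^2)   # t = 8.4885
qt(0.95, df = n - 2)              # 2.3534, the slide's critical value
cor.test(x, y)                    # built-in equivalent, with p-value
```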



4b. Regression

I. Explain what is meant by response and explanatory variables.


II. State the usual simple regression model (with a single explanatory variable).
III. State and explain the least squares estimates of the slope and intercept parameters in a simple
linear regression model.
IV. Calculate R2 (coefficient of determination) and describe its use to measure the goodness of fit of
a linear regression model.
V. Use a fitted linear relationship to predict a mean response or an individual response with
confidence limits.
VI. Use residuals to check the suitability and validity of a linear regression model.
VII. State the usual multiple linear regression model
VIII. Discuss issues in linear regression
i. Heteroskedasticity
ii. Multicollinearity
IX. Detailed case study on multivariate regression by using
I. MS Excel
II. R software



4.b. The Million Dollar Question

If I study for five more hours will it actually increase my marks?



4.b. The Population

Hours of Study Versus Marks
[Scatter plot: Hours of Study (0 to 90) on the x-axis versus Marks in Test (0 to 120) on the y-axis]

Can we draw a trend line to predict this relationship?



4.b. Introduction to Regression Analysis

Regression analysis is used to:


Predict the value of a dependent variable based on the value of at least one independent
variable
Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to explain, usually denoted by Y

Independent variable: the variable used to explain the dependent variable. Denoted by X



4.b. Simple Linear Regression Model

Only one independent variable, x


Relationship between x and y is described by a linear function
Changes in y are assumed to be caused by changes in x



4.b. Assumptions
1. A linear relationship exists between the dependent and the independent variable.

2. The independent variable is uncorrelated with the residuals.

3. The expected value of the residual term is zero:
E(ε) = 0
4. The variance of the residual term is constant for all observations (homoskedasticity):
E(εᵢ²) = σ²
5. The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation:
E(εᵢ·εⱼ) = 0 for j ≠ i
6. The residual term is normally distributed.



4.b. Types of Regression Models

[Scatter plots: positive linear relationship, negative linear relationship, relationship NOT linear, no relationship]



4.b. Population Linear Regression

Y = β0 + β1·X + u

[Diagram: population regression line with intercept β0 and slope β1; for an observation xᵢ, the random error uᵢ is the vertical distance between the individual observation (e.g., a person's marks) and the predicted value of Y for Xᵢ on the line]
4.b. Population Regression Function

Y = β0 + β1·X + u
where Y is the dependent variable, β0 the population y-intercept, β1 the population slope coefficient, X the independent variable, and u the random error term (residual).
β0 + β1·X is the linear component; u is the random error component.

But can we actually get this equation?
If yes, what information will we need?



4.b. Information that we actually have



4.b. Sample Regression Function

ŷ = b0 + b1·x + e

[Diagram: sample regression line with intercept b0 and slope b1; for xᵢ, the residual eᵢ is the difference between the observed value of y and the predicted value of y for xᵢ]
4.b. Sample Regression Function

yᵢ = b0 + b1·xᵢ + eᵢ
where b0 is the estimate of the regression intercept, b1 the estimate of the regression slope, x the independent variable, and e the error term.

Notice the similarity with the Population Regression Function

Can we do something about the error term?



4.b. The error term (residual)
Represents the influence of all the variables which we have not accounted for in the equation.
It represents the difference between the actual "y" values and the predicted y values from the Sample Regression Line.
Wouldn't it be good if we were able to reduce this error term?
What are we trying to achieve by Sample Regression?



4.b. Our Objective

PRF: Y = β0 + β1·X + u

To predict the Population Regression Line (PRL) from the Sample Regression Line (SRL):

SRF: ŷᵢ = b0 + b1·xᵢ



4.b. One method to find b0 and b1

Method of Ordinary Least Squares (OLS)


b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared
residuals


Σe² = Σ(y − ŷ)² = Σ(y − (b0 + b1·x))²

Are there any advantages of minimizing the squared errors?


Why don't we take the sum?
Why don't we take absolute values instead?
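Squared errors give a smooth objective with a unique minimum and closed-form estimates. A minimal R sketch (simulated data, added for illustration) showing the closed-form OLS solution agreeing with lm():

```r
# OLS by hand: b1 = cov(x, y) / var(x), b0 = ybar - b1 * xbar
set.seed(1)
x <- runif(30, 0, 80)                    # e.g. hours of study
y <- 20 + 0.9 * x + rnorm(30, sd = 10)   # e.g. marks in the test
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))                          # same estimates from lm()
```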



4.b. OLS Regression Properties

The sum of the residuals from the least squares regression line is 0:
Σ(y − ŷ) = 0
The sum of the squared residuals is a minimum:
minimize Σ(y − ŷ)²
The simple regression line always passes through the mean of the y variable and the mean of the x variable.
The least squares coefficients are unbiased estimates of β0 and β1.



4.b. Interpretation of the Slope and the Intercept

b0 is the estimated average value of y when the value of x is zero. More often than not it does
not have a physical interpretation
b1 is the estimated change in the average value of y as a result of a one-unit change in x

[Diagram: fitted line ŷ = b0 + b1·x, showing the intercept b0 and the slope of the line (b1)]



4.b. Hypothesis Testing: Two Variable Model
How do we know whether the values of b0 and b1 that we have found are actually meaningful?
Is it actually possible that our sample was a random sample and it has given us a totally wrong
regression line?
We do know a lot about the sample error term "e" but what do we know about the error terms
"u" of the Population Regression Function?
How do we proceed from here?



4.b. Assumptions about "u"
The underlying relationship between the X variable and the Y variable is linear.
The error term is uncorrelated with the explanatory variable X.
For a given value of Xᵢ, the expected value of the error terms is equal to 0.
Error values are normally distributed for any given value of X: the probability distribution of the errors for a given Xᵢ is normal.
The probability distribution of the errors for different Xᵢ has constant variance (homoscedasticity).
Error values u for given Xᵢ are statistically independent; their covariance is zero: Cov(u_x1, u_x2) = 0.

[Diagram: normal error distributions of equal variance centered on the regression line at x1 and x2]

Once we make these assumptions about "u", we are able to estimate the variance and standard errors of b0 and b1; this has been possible because of the properties of the OLS method (beyond the scope of this lecture).



4.b. Standard Error of Estimate (SEE)
The standard deviation of the variation of observations around the regression line is estimated
by:

s_u = √( RSS / (n − k − 1) )
where
RSS = Residual Sum of Squares (summation of e²)
n = sample size
k = number of independent variables in the model
Note: when k = 1,
s_u = √( RSS / (n − 2) ) = sample standard error of the estimate



4.b. Comparing Standard Errors
Variation of observed y values from the regression line vs. variation in the slope of regression lines from different possible samples:
[Diagrams: small s_u vs. large s_u (observations tightly vs. loosely scattered around the line); small s_b1 vs. large s_b1 (slopes varying little vs. a lot across samples)]


4.b. Inference about the Slope: t-Test
t-test for a population slope
Is there a linear relationship between x and y?
Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
Test statistic:
t = (b1 − β1) / s_b1, with d.f. = n − 2
The null hypothesis can be rejected if either of the following is true:
t > tc or t < −tc
where:
b1 = sample regression slope coefficient
β1 = hypothesized slope
s_b1 = estimator of the standard error of the slope
tc = the critical t value
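summary(lm()) reports exactly this test; a short sketch reusing the simulated x and y from the OLS example above:

```r
# Slope t-test as reported by R
fit <- lm(y ~ x)
summary(fit)$coefficients        # Estimate, Std. Error, t value, Pr(>|t|)
qt(0.975, df = length(x) - 2)    # two-tailed critical t at 5% significance
```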



4.b. Confidence Interval for 'y'
The confidence interval for the predicted value of Y is given by:

Ŷ ± (tc × s_f)
where:
Ŷ = predicted Y value (dependent variable)
n − 2 = degrees of freedom
tc = the critical t value
s_f = the standard error of the forecast

* The SE of the forecast is NOT the Standard Error of the Coefficient Estimate or the Standard Error of Estimate.



4.b. The Confidence Interval for a Regression Coefficient
The confidence interval for the regression coefficient, b1 is given by:

b1 ± (tc × s_b1)
where:
b1 = estimated slope coefficient
n − 2 = degrees of freedom
tc = the critical t value
s_b1 = the standard error of the regression coefficient



4.b. Explained and Unexplained Variation

[Diagram: for an observation (xᵢ, yᵢ) with mean ȳ and fitted value ŷᵢ:
SST = Total Sum of Squares = Σ(yᵢ − ȳ)²
SSE = Sum of Squared Errors = Σ(yᵢ − ŷᵢ)²
RSS = Regression Sum of Squares = Σ(ŷᵢ − ȳ)²]



4.b. Explained and Unexplained Variation (Cont)

SST = Total sum of squares


Measures the variation of the yᵢ values around their mean ȳ
SSE = Sum of squared errors
Variation attributable to factors other than the relationship between x and y
SSR = Regression sum of squares
Explained variation attributable to the relationship between x and y



4.b. Explained and Unexplained Variation (Cont)

Total variation is made up of two parts:
SST = SSE + RSS
SST = Total Sum of Squares = Σ(y − ȳ)²
SSE = Sum of Squared Errors = Σ(y − ŷ)²
RSS = Regression Sum of Squares (also known as SSR) = Σ(ŷ − ȳ)²
where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value



4.b. Coefficient of Determination, R2

The coefficient of determination is the portion of the total variation in the dependent variable
that is explained by variation in the independent variable
The coefficient of determination is also called R-squared and is denoted as R2

R² = SSR / SST, where 0 ≤ R² ≤ 1



4.b. Coefficient of Determination, R2 (Cont)

R² = SSR / SST = (sum of squares explained by regression) / (total sum of squares)

Note: in the single independent variable case, the coefficient of determination is
R² = r²
where:
R² = coefficient of determination
r = simple correlation coefficient



4.b. Examples of Approximate R2 Values

R² = 1: perfect linear relationship between x and y; 100% of the variation in y is explained by variation in x.
[Scatter plots: all points fall exactly on an upward- or downward-sloping line]



4.b. Examples of Approximate R2 Values (Cont)

0 < R² < 1: weaker linear relationship between x and y; some but not all of the variation in y is explained by variation in x.
[Scatter plots: points scattered around an upward- or downward-sloping line]



4.b. Examples of Approximate R2 Values (Cont)

R² = 0: no linear relationship between x and y; the value of y does not depend on x (none of the variation in y is explained by variation in x).
[Scatter plot: points scattered with a flat trend]



4.b. Limitations of Regression Analysis
Parameter Instability - This happens in situations where correlations change over a period of
time. This is very common in financial markets where economic, tax, regulatory, and political
factors change frequently.
Public knowledge of a specific regression relation may cause a large number of people to react in
a similar fashion towards the variables, negating its future usefulness.
If any regression assumptions are violated, predicted dependent variables and hypothesis tests will not be valid.



4.b. General Multiple Linear Regression Model
In simple linear regression, the dependent variable was assumed to be dependent on only one
variable (independent variable)
In the General Multiple Linear Regression model, the dependent variable derives its value from two or more variables.
The General Multiple Linear Regression model takes the following form:

Yᵢ = b0 + b1·X1ᵢ + b2·X2ᵢ + … + bk·Xkᵢ + εᵢ
where:
Yᵢ = ith observation of dependent variable Y
Xkᵢ = ith observation of kth independent variable X
b0 = intercept term
bk = slope coefficient of kth independent variable
εᵢ = error term of ith observation
n = number of observations
k = total number of independent variables
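In R the model is fitted exactly as in the simple case, with extra terms on the right-hand side (illustrative simulated data):

```r
# Multiple linear regression with two predictors
set.seed(2)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- 5 + 2 * x1 - 3 * x2 + rnorm(100)
fit <- lm(y ~ x1 + x2)
coef(fit)                        # b0, b1, b2
```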



4.b. Estimated Regression Equation
As we calculated the intercept and the slope coefficient in case of simple linear regression by
minimizing the sum of squared errors, similarly we estimate the intercept and slope coefficient in
multiple linear regression.
The Sum of Squared Errors, Σᵢ₌₁ⁿ εᵢ², is minimized and the slope coefficients are estimated.

The resultant estimated equation becomes:

Ŷᵢ = b0 + b1·X1ᵢ + b2·X2ᵢ + … + bk·Xkᵢ

Now the error in the ith observation can be written as:

εᵢ = Yᵢ − Ŷᵢ = Yᵢ − (b0 + b1·X1ᵢ + b2·X2ᵢ + … + bk·Xkᵢ)



4.b. Interpreting the Estimated Regression Equation
Intercept Term (b0): the value of the dependent variable when the value of all independent variables becomes zero:
b0 = value of Y when X1 = X2 = … = Xk = 0
Slope coefficient (bk): the change in the dependent variable from a unit change in the corresponding independent variable (Xk), keeping all other independent variables constant.
In reality, when the value of the independent variable changes by one unit, the change in the dependent variable is not equal to the slope coefficient but depends on the correlation among the independent variables as well.

Therefore, the slope coefficients are also called partial slope coefficients.



4.b. Assumptions of Multiple Regression Model
There exists a linear relationship between the dependent and independent variables.

The expected value of the error term, conditional on the independent variables is zero.

The error terms are homoskedastic, i.e. the variance of the error terms is constant for all the
observations.

The expected value of the product of error terms is always zero, which implies that the error
terms are uncorrelated with each other.

The error term is normally distributed.

The independent variables don't have any exact linear relationship with each other.



4.b. Hypothesis Testing of Coefficients
The values of the slope coefficients don't tell anything about their significance in explaining the dependent variable.
Even an unrelated variable, when regressed, would give some value of the slope coefficient.
To exclude the cases where the independent variables don't significantly explain the dependent variable, we need hypothesis testing of the coefficients, checking whether they contribute significantly to explaining the dependent variable.
The t-statistic is used to check the significance of the coefficients.
The t-statistic used is the same as that used in the hypothesis testing of the coefficient of a simple linear regression.
Following are the null and alternative hypotheses to check the statistical significance of bk:
Null Hypothesis (H0): bk = 0
Alternative Hypothesis (Ha): bk ≠ 0
The t-statistic, with (n − k − 1) degrees of freedom, for the hypothesis testing of the coefficient bk:

t = (b̂k − bk) / s_bk

If the value of t-statistic lies within the confidence interval, H0 can't be rejected



4.b. Confidence Interval for the Population Value
The confidence interval for a regression coefficient is given by:

bⱼ ± (tc × s_bⱼ)
where:
tc is the critical t-value, and
s_bⱼ is the standard error of the coefficient



4.b. Predicted Dependent Variable
The regression equation can be used for making predictions about the dependent variable by
using forecasted values of the independent variables.

Ŷᵢ = b0 + b1·X̂1ᵢ + b2·X̂2ᵢ + … + bk·X̂kᵢ
where:
Ŷᵢ is the predicted value of the dependent variable
bk is the estimated partial slope for the kth independent variable
X̂kᵢ is the forecasted ith value of the kth independent variable



4.b. Analysis of Variance (ANOVA)
Analysis of variance is a statistical method for analyzing the variability of the data by breaking the
variability into its constituents.
A typical ANOVA table looks like:

Source of Variability    | DoF       | Sum of Squares  | Mean Sum of Squares
Regression (Explained)   | k         | RSS             | MSR = RSS/k
Error (Unexplained)      | n − k − 1 | SSE             | MSE = SSE/(n − k − 1)
Total                    | n − 1     | SST = RSS + SSE |

From the above summary (ANOVA table) we can calculate:
Standard Error of Estimate (SEE) = √MSE
Coefficient of determination (R²) = (Total Variation (SST) − Unexplained Variation (SSE)) / Total Variation (SST)
= Explained Variation (RSS) / Total Variation (SST)
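These quantities can be pulled from an lm() fit in R (continuing the two-predictor fit from the earlier sketch):

```r
# ANOVA quantities from a fitted model
anova(fit)                          # Df, Sum Sq, Mean Sq, F value per source
sse <- deviance(fit)                # unexplained (residual) sum of squares
sst <- sum((y - mean(y))^2)         # total sum of squares
c(SEE = sqrt(sse / df.residual(fit)),   # standard error of estimate
  R2  = 1 - sse / sst)                  # coefficient of determination
```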



4.b. F-Statistic
An F-test explains how well the dependent variable is explained by the independent variables collectively.
In the case of multiple independent variables, the F-test tells us whether a single variable explains a significant part of the variation in the dependent variable or whether all the independent variables explain the variability collectively.

The F-statistic is given as:

F = MSR / MSE = (RSS / k) / (SSE / (n − k − 1))

where:
MSR: Mean Regression Sum of Squares
MSE: Mean Squared Error
n: number of observations
k: number of independent variables
4.b. F-statistic contd.
Decision rule for F-test: Reject H0 if the F-statistic > Fc (Critical Value)
The numerator of F-statistic has degrees of freedom of "k" and the denominator has the degrees
of freedom of "n-k-1"
If H0 is rejected, then at least one of the independent variables is significantly different from zero.
This implies that at least one of household income (independent variable) or household expenses (independent variable) explains the variation in the pocket money of Edward.

The F-test is always a one-tailed test when testing the hypothesis that the coefficients are simultaneously equal to zero.
4.b. Coefficient of determination (R2) and Adjusted R2
Coefficient of determination(R2) can also be used to test the significance of the coefficients
collectively apart from using F-test.
R² = (SST − SSE) / SST = RSS / SST = (Sum of Squares explained by regression) / (Total Sum of Squares)
The drawback of using the coefficient of determination is that its value always increases as the number of independent variables is increased, even if the marginal contribution of the incoming variable is statistically insignificant.
To take care of the above drawback, the coefficient of determination is adjusted for the number of independent variables taken. This adjusted measure is called adjusted R².
Adjusted R² is given by the following formula:

Ra² = 1 − [(n − 1) / (n − k − 1)] × (1 − R²)
where:
n = Number of Observations
k = Number of Independent Variables
Ra² = Adjusted R²
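A quick check of the formula against what R reports (same two-predictor fit as above, n = 100, k = 2):

```r
# Adjusted R-squared: manual formula vs. lm() output
r2 <- summary(fit)$r.squared
n <- length(y); k <- 2
1 - (1 - r2) * (n - 1) / (n - k - 1)   # manual adjusted R-squared
summary(fit)$adj.r.squared             # same value
```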



4.b. Representing Qualitative Factors
How can we represent Qualitative factors in a regression equation?
By using 'dummy variables'; variables that take values of either 1 or 0, depending whether it is
true or false.

If we wanted to consider the spike in soft drink sales in the summer, we may have a regression
equation:
Rev(t) = 10,000 + 2,000·t + 50,000·S

Here,
S = 1 if it's summer, 0 if it's not summer
If there are n mutually exclusive and exhaustive classes, they can be represented by n-1 dummy
variables. This is derived from the concept of degrees of freedom.
For example, to represent the 4 stages of the business cycle, we can use 3 dummy variables.
The fourth variable would be represented by zeros for all three dummy variables.
We do not use 4 variables as that would indicate a linear relationship between all 4 variables.
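A sketch of both ideas in R; the month/summer construction is illustrative, and R's factor handling creates the n − 1 dummies automatically:

```r
# A single 0/1 dummy for summer, as in the soft-drink example
set.seed(5)
month  <- 1:24
summer <- as.integer(month %% 12 %in% 6:8)    # 1 in summer months, else 0
revenue <- 10000 + 2000 * month + 50000 * summer + rnorm(24, sd = 5000)
lm(revenue ~ month + summer)                  # recovers the slide's equation

# A 4-level factor (e.g. business-cycle stage) expands to 3 dummy columns
stage <- factor(sample(c("peak", "contraction", "trough", "expansion"),
                       24, replace = TRUE))
head(model.matrix(~ stage))                   # intercept + 3 dummies
```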



4.b. Multicollinearity
Another significant problem faced in the Regression Analysis is when the independent variables or
the linear combinations of the independent variables are correlated with each other.
This correlation among the independent variables is called Multicollinearity which creates
problems in conducting t-statistic for statistical significance.
Multicollinearity is evident when the t-test concludes that the coefficients are not statistically
different from zero but the F-test is significant and the coefficient of determination (R2) is high.
High correlation among the independent variables suggests the presence of multicollinearity, but low values of correlation don't rule out its presence.
The most common method of correcting multicollinearity is by systematically removing the
independent variable until multicollinearity is minimized.

Presence of Multicollinearity leads to TYPE-II errors
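A minimal R illustration of the symptom (simulated data; for a formal check, variance inflation factors such as car::vif() are commonly used):

```r
# Two nearly identical predictors inflate the coefficient standard errors
set.seed(6)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)      # x2 almost duplicates x1
y  <- 1 + x1 + rnorm(100)
cor(x1, x2)                           # correlation close to 1
summary(lm(y ~ x1 + x2))              # large std. errors, insignificant t-stats
```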



4.b. Model Misspecifications
Apart from checking the previously discussed problems in the regression, we should check for the
correct form of the regression as well.
The following three misspecifications can be present in the regression model:
Functional form of regression is misspecified:
The important variables could have been omitted from the regression model
Some regression variables may need the transformation (like conversion to the logarithmic scale)
Pooling of data from incorrect pools
The variables can be correlated with the error term in time-series models:
Lagged dependent variables are used as independent variables with serially correlated errors
A function of dependent variables is used as an independent variable because of incorrect dating of the
variables
Independent variables are sometimes measured with error
Other Time-Series Misspecification which leads to the nonstationarity of the variables:
Existence of relationships in time-series that results in patterns
Random walk relationships among the time series
These misspecifications in the regression model result in biased and inconsistent regression coefficients, which further lead to incorrect confidence intervals, leading to TYPE-I or TYPE-II errors.

Nonstationarity means that the properties (like mean and variance) of the variables are not constant.



4.b. The Economic meaning of a Regression Model
Consider the equation:
Rev_Growth = 4% + 0.75 × GDP_Growth + 0.5 × WPI_Infl

The economic meaning for this equation is given by the partial slopes or coefficients of the
variables.
If the GDP Growth rate was 1% higher, it translates into a 0.75% higher Revenue growth.
Similarly, if the WPI Inflation figures were 1% higher, it translates into a 0.5% higher revenue
growth.



4.b. Case- Multivariate Linear Regression (Revisited)

Adam, an Analytics consultant works with First Auto Insurance company. His manager gave him data
having Loss amount and policy related information and asked him to identify and quantify the
factors responsible for losses in a multivariate fashion. Adam has no knowledge of running a
multivariate regression.
Now suppose he approaches you and requests your help to complete the assignment. Let's help Adam in carrying out the multivariate regression.



Case- Multivariate Linear Regression (Rules of Thumb)
In due course of helping Adam to complete his task, we will walk him through following steps:
Variable identification
Identifying the dependent (response) variable.
Identifying the independent (explanatory) variables.
Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous etc.)
Creation of Data Dictionary
Response variable exploration
Distribution analysis
Percentiles
Variance
Frequency distribution
Outlier treatment
Identify the outliers/threshold limit
Cap/floor the values at the thresholds
Independent variables analyses
Identify the prospective independent variables (that can explain response variable)
Bivariate analysis of response variable against independent variables
Variable treatment /transformation
Grouping of distinct values/levels
Mathematical transformation e.g. log, splines etc.



Case- Multivariate Linear Regression (Rules of Thumb)
Heteroskedasticity
Check in a univariate manner by individual variables
Easy for univariate linear regression. Can be done manually.
Too cumbersome to do manually for multivariate case
The tools (R, SAS etc.) have in-built features to tackle it.
Fitting the regression
Check for correlation between independent variables
This is to take care of Multicollinearity
Fix Heteroskedasticty
By suitable transformation of the response variable (a bit tricky).
Using inbuilt features of statistical packages like R
Variable selection
Check for the most suitable transformed variable
Select the transformation giving the best fit
Reject the statistically insignificant variables
Fitting the regression
Analysis of results
Model comparison
Model performance check
R2
Lift/Gains chart and Gini coefficient
Actual vs Predicted comparison
Multivariate Linear Regression- Data
Snapshot of the data
Data description (known facts):
Auto insurance policy data
Contains policy holders and loss amount
information (variables)
Policy Number
Age
Years of Driving Experience
Number of Vehicles
Gender
Married
Vehicle Age
Fuel Type
Losses (Dependent/Response Variable)
Next step
Create the Data Dictionary



Multivariate Linear Regression- Data Dictionary

Sl # | Variable Name               | Variable Description                              | Values Stored | Variable Type
1    | Policy Number               | Unique Policy Number                              | ?             | ?
2    | Age                         | Age of Policy holder                              | ?             | ?
3    | Years of Driving Experience | Years of Driving Experience of the Policy holder  | ?             | ?
4    | Number of Vehicles          | Number of Vehicles insured under the policy       | ?             | ?
5    | Gender                      | Gender of the Policy holder                       | ?             | ?
6    | Married                     | Marital status of the Policy holder               | ?             | ?
7    | Vehicle Age                 | Age of vehicle insured under the policy           | ?             | ?
8    | Fuel Type                   | Fuel type of the vehicle insured                  | ?             | ?
9    | Losses                      | Insurance amount claimed under the policy         | ?             | ?



Multivariate Linear Regression- Data Dictionary

Sl # | Variable Name               | Variable Description                              | Values Stored                        | Variable Type
1    | Policy Number               | Unique Policy Number                              | Unique value identifying the policy  | Identifier
2    | Age                         | Age of Policy holder                              | 16, 17, …, 70                        | Numerical (Discrete)
3    | Years of Driving Experience | Years of Driving Experience of the Policy holder  | 0, 1, …, 53                          | Numerical (Discrete)
4    | Number of Vehicles          | Number of Vehicles insured under the policy       | 1, 2, 3, 4                           | Numerical (Discrete)
5    | Gender                      | Gender of the Policy holder                       | F, M                                 | Categorical (Binary)
6    | Married                     | Marital status of the Policy holder               | Married, Single                      | Categorical (Binary)
7    | Vehicle Age                 | Age of vehicle insured under the policy           | 0, 1, …, 15                          | Numerical (Discrete)
8    | Fuel Type                   | Fuel type of the vehicle insured                  | D, P                                 | Categorical (Binary)
9    | Losses                      | Loss amount claimed under the policy              | Range: 13–3,500                      | Numerical (Continuous)



Multivariate Linear Regression- Response Variable (Losses)
Distribution of Losses (percentiles):

Percentile | Value
Min        | 13
5th        | 67
10th       | 122
25th       | 226
50th       | 355
75th       | 399
90th       | 685
95th       | 821
97.5th     | 981
99th       | 1,204
99.5th     | 1,366
Max        | 3,500
Mean       | 390
Stddev     | 254

[Losses scatter plot (y-axis 0–4,000, x-axis 0–14,000), with a few outliers above the bulk of the data circled]



Multivariate Linear Regression- Response Variable (Losses)
Distribution:

Percentile | Losses | Capped Losses
Min        | 13     | 13
5th        | 67     | 67
10th       | 122    | 122
25th       | 226    | 226
50th       | 355    | 355
75th       | 399    | 399
90th       | 685    | 685
95th       | 821    | 821
97.5th     | 981    | 981
99th       | 1,204  | 1,200
99.5th     | 1,366  | 1,200
Max        | 3,500  | 1,200
Mean       | 390    | 386
Stddev     | 254    | 229

[Capped Losses scatter plot (y-axis 0–1,200, x-axis 0–14,000) and grouped frequency distribution of Capped Losses]



Multivariate Linear Regression- Bivariate Profiling
Age:
[Chart: Average Loss and Average Capped Losses by Age (16–80), with % of policies on the secondary axis]
Age Band:
[Chart: Average Loss and Average Capped Loss by Age Band (16-25, 26-59, 60+), with % of policies on the secondary axis]
Multivariate Linear Regression- Bivariate Profiling

Years of Driving Experience:
[Chart: Average Loss and Average Capped Losses by Years of Driving Experience (0–52), with % of policies on the secondary axis]



Multivariate Linear Regression- Bivariate Profiling

Gender:
[Chart: Average Loss and Average Capped Losses for F vs. M, with % of policies on the secondary axis]
Marital Status:
[Chart: Average Loss and Average Capped Losses for Married vs. Single, with % of policies on the secondary axis]
Multivariate Linear Regression- Bivariate Profiling
Vehicle Age:
[Chart: Average Loss and Average Capped Losses by Vehicle Age (0–15), with % of policies on the secondary axis]
Vehicle Age Band:
[Chart: Average Loss and Average Capped Losses by Vehicle Age Band (0-5, 6-10, 11+), with % of policies on the secondary axis]
Multivariate Linear Regression- Bivariate Profiling
# Vehicles:
[Chart: Average Losses and Average Capped Losses by Number of Vehicles (1–4), with % of policies on the secondary axis]
Fuel Type:
[Chart: Average Loss and Average Capped Losses for Fuel Type D vs. P, with % of policies on the secondary axis]


Linear Regression- Preparing MS Excel
[Screenshots: numbered steps for preparing MS Excel]



Linear Regression- Using MS Excel (Demo.)
[Screenshots: numbered steps for running the regression in MS Excel]



Multivariate Linear Regression- Variable Selection
Variable selection to be done on the basis of
Multicollinearity (correlation between independent variables)
Banding of variables e.g. whether to use Age or Age Band (also called custom bands)
Statistical significance of variables tested after performing above two steps
List of independent variables:
1. Age
2. Age Band
3. Years of Driving Experience
4. Number of Vehicles
5. Gender
6. Married
7. Vehicle Age
8. Vehicle Age Band
9. Fuel Type



Multivariate Linear Regression- Variable Selection (Multicollinearity)
Age and Years of Driving Experience are highly correlated (correlation coefficient = 0.9972). We can use either of the variables in regression.
Q: Which one to use and which one to reject?
Sol: Fit two separate models using either of the variables, one at a time. Check for goodness of fit (R² in this case). The variable producing the higher R² gets accepted.

Regression Statistics (Age):                    Multiple R 0.475766, R Square 0.226354, Adjusted R Square 0.226303, Standard Error 201.2306, Observations 15290
Regression Statistics (Yrs Driving Experience): Multiple R 0.475273, R Square 0.225885, Adjusted R Square 0.225834, Standard Error 201.2916, Observations 15290

R² for Age > R² for Years of Driving Experience
=> Reject Years of Driving Experience



Multivariate Linear Regression- Custom Bands
Investigate whether to use Age or Age Band.
Fit the regression independently using Age and Age Band.
Before fitting the regression, Age Band needs to be converted to numerical form from categorical. Replace Age Band values with the Average Age for the particular band.

Age Band | Sum of Age | # Policies | Average Age
16-25    | 93,770.0   | 4,563.0    | 20.6
26-59    | 270,793.0  | 6,384.0    | 42.4
60+      | 282,636.0  | 4,343.0    | 65.1

Regression results using Age and Average Age:
Regression Statistics (Age):         Multiple R 0.475766, R Square 0.226354, Adjusted R Square 0.226303, Standard Error 201.2306, Observations 15290
Regression Statistics (Average Age): Multiple R 0.509969, R Square 0.260068, Adjusted R Square 0.26002, Standard Error 196.7971, Observations 15290

R² for Average Age > R² for Age
=> Select Average Age
Multivariate Linear Regression- Custom Bands
Investigate whether to use Vehicle Age or Vehicle Age Band.
Fit the regression independently using Vehicle Age and Vehicle Age Band.
Before fitting the regression, Vehicle Age Band needs to be converted to numerical form from categorical. Replace Vehicle Age Band values with the Average Vehicle Age for the particular band.

Vehicle Age Band | Sum of Vehicle Age | # Policies | Average Vehicle Age
0-5              | 9,229              | 3,688      | 2.50
6-10             | 44,298             | 5,523      | 8.02
11+              | 78,819             | 6,079      | 12.97

Regression results using Vehicle Age and Average Vehicle Age:
Regression Statistics (Vehicle Age):         Multiple R 0.289431325, R Square 0.083770492, Adjusted R Square 0.083710561, Standard Error 218.9903277, Observations 15290
Regression Statistics (Average Vehicle Age): Multiple R 0.303099405, R Square 0.09186925, Adjusted R Square 0.091809848, Standard Error 218.0203272, Observations 15290

R² for Average Vehicle Age > R² for Vehicle Age
=> Select Average Vehicle Age
Multivariate Linear Regression- Variable Selection
List of shortlisted variables:
1. Age Band in the form of Average Age of the band (selected out of Age and Age Band). Also got selected over
Years of Driving Experience.

2. Number of Vehicles
3. Gender

4. Married

5. Vehicle Age Band in the form of Average Vehicle Age of the band (selected out of Vehicle Age and Vehicle
Age Band).
6. Fuel Type

We will run regression in multivariate fashion and then select final list of variables by taking into
consideration statistical significance.



Multivariate Linear Regression- Categorical variable conversion
Categorical variables in Binary form need to be converted to their numerical equivalent (0, 1)
1. Gender (F = 0 and M = 1)

2. Married (Married = 0 and Single = 1)

3. Fuel Type (P = 0, D = 1)

Snapshot of the final data on which we will run the multivariate regression
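A sketch of the same fit in R. The column names and the synthetic rows below are assumptions so the snippet runs on its own; with Adam's actual file, only the lm() call matters:

```r
# Stand-in data frame with the dummy coding described above
set.seed(7)
n <- 500
policies <- data.frame(
  AvgAge           = sample(c(20.6, 42.4, 65.1), n, replace = TRUE),
  NumberOfVehicles = sample(1:4, n, replace = TRUE),
  GenderDummy      = rbinom(n, 1, 0.5),    # M = 1, F = 0
  MarriedDummy     = rbinom(n, 1, 0.5),    # Single = 1, Married = 0
  AvgVehicleAge    = sample(c(2.50, 8.02, 12.97), n, replace = TRUE),
  FuelDummy        = rbinom(n, 1, 0.2)     # D = 1, P = 0
)
policies$CappedLosses <- with(policies,
  625 - 5.56 * AvgAge + 50.9 * GenderDummy + 78.4 * MarriedDummy -
  15.1 * AvgVehicleAge + 267.9 * FuelDummy + rnorm(n, sd = 114))
fit <- lm(CappedLosses ~ AvgAge + NumberOfVehicles + GenderDummy +
            MarriedDummy + AvgVehicleAge + FuelDummy, data = policies)
summary(fit)    # analogous to the Excel SUMMARY OUTPUT on the next slide
```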



Multivariate Linear Regression- Output

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.865972274
R Square 0.749907979
Adjusted R Square 0.749809794
Standard Error 114.4310136
Observations 15290

ANOVA
df SS MS F Significance F
Regression 6 600073213.5 100012202.3 7637.751088 0
Residual 15283 200122584.4 13094.45688
Total 15289 800195798

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 624.56529 5.29192 118.02233 0.00000 614.19249 634.93809 614.19249 634.93809
Avg Age -5.55974 0.06546 -84.93889 0.00000 -5.68804 -5.43144 -5.68804 -5.43144
Number of Vehicles 0.17875 0.97039 0.18420 0.85386 -1.72333 2.08082 -1.72333 2.08082
Gender Dummy 50.88326 1.89081 26.91084 0.00000 47.17705 54.58947 47.17705 54.58947
Married Dummy 78.39837 1.92148 40.80106 0.00000 74.63204 82.16469 74.63204 82.16469
Avg Vehicle Age -15.14220 0.26734 -56.63987 0.00000 -15.66623 -14.61818 -15.66623 -14.61818
Fuel Type Dummy 267.93559 2.74845 97.48614 0.00000 262.54830 273.32287 262.54830 273.32287
Multivariate Linear Regression- Output
1. Coefficient table. For each variable the output reports: Coefficients (b), Standard Error (σ), t Stat (= b/σ), P-value (from the t-distribution table), and the Lower/Upper 95% bounds (= b − 1.96σ and b + 1.96σ):

Independent Vars           | Coefficients (b) | Std Error (σ) | t Stat (b/σ) | P-value | Lower 95% | Upper 95%
Intercept (a)              | 624.565          | 5.292         | 118.022      | 0.000   | 614.192   | 634.938
X1 Avg Age (b1)            | -5.560           | 0.065         | -84.939      | 0.000   | -5.688    | -5.431
X2 Number of Vehicles (b2) | 0.179            | 0.970         | 0.184        | 0.854   | -1.723    | 2.081    <- insignificant
X3 Gender Dummy (b3)       | 50.883           | 1.891         | 26.911       | 0.000   | 47.177    | 54.589
X4 Married Dummy (b4)      | 78.398           | 1.921         | 40.801       | 0.000   | 74.632    | 82.165
X5 Avg Vehicle Age (b5)    | -15.142          | 0.267         | -56.640      | 0.000   | -15.666   | -14.618
X6 Fuel Type Dummy (b6)    | 267.936          | 2.748         | 97.486       | 0.000   | 262.548   | 273.323

2. ANOVA.
Regression: SS = Σ(y_predicted − y_mean)² = 600,073,213.5, df = p = 6, MS = SS/df = 100,012,202.3, F = MS_Regression/MS_Residual = 7,637.75, Significance F = 0 (from the F-distribution table)
Residual: SS = Σ(y_actual − y_predicted)² = 200,122,584.4, df = n − p − 1 = 15,283, MS = 13,094.457
Total: SS = Σ(y_actual − y_mean)² = 800,195,798, df = n − 1 = 15,289

3. Regression Statistics.
Multiple R = √(R Square) = 0.8659723
R Square = SS Regression / SS Total = 0.7499080
Adjusted R Square = R² − (1 − R²)·p/(n − p − 1) = 0.7498098
Standard Error = √(SS Residual / (n − p − 1)) = 114.4310136
Observations: n = 15290
Multivariate Linear Regression- Output (Significance Test)
(Coefficient table repeated from the previous slide; Number of Vehicles is the insignificant variable.)

Significance test of coefficients based on the normal distribution:
H0: b is no different from 0 (i.e. 0 is the coefficient when the variable is not included in the regression)
H1: b is different from 0
Test statistic: Z = (b − 0)/σ (at 95% confidence, two-tailed, Z = 1.96)
Confidence interval = (b − 1.96σ, b + 1.96σ)
For the variable to be significant, the interval must not contain 0.

Example1: Avg Age.


Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431)
No zero in the interval. Hence significant.

Example2: Number of Vehicles


Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.080)
Zero is present in the interval. Hence insignificant.
Multivariate Linear Regression- Output (Significance Test)
(Coefficient table repeated from the previous slides.)

Significance test of coefficients based on the t-distribution:
b/StdErr(b) ~ t(n−2)
H0: b is no different from 0 (i.e. 0 is the coefficient when the variable is not included in the regression)
H1: b is different from 0
At 95% confidence, two-tailed, and df greater than 120, t ≈ 1.96
Confidence interval = (b − 1.96σ, b + 1.96σ)
For the variable to be significant, the interval must not contain 0.

(The worked examples for Avg Age and Number of Vehicles give the same confidence intervals and conclusions as under the normal-distribution test above.)
Multivariate Linear Regression- Output at 95% Confidence Interval
SUMMARY OUTPUT

Regression Statistics | Excluding "Num Vehicles" | Including "Num Vehicles"
Multiple R            | 0.865971953              | 0.865972274
R Square              | 0.749907424              | 0.749907979
Adjusted R Square     | 0.749825608              | 0.749809794   <- Adjusted R-square improved on exclusion
Standard Error        | 114.4273971              | 114.4310136
Observations          | 15290                    | 15290

ANOVA
Source     | df    | SS            | MS            | F         | Significance F
Regression | 5     | 600,072,769.2 | 120,014,553.8 | 9,165.874 | 0
Residual   | 15284 | 200,123,028.7 | 13,093.6292   |           |
Total      | 15289 | 800,195,798   |               |           |

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 625.005 4.723 132.333 0.00 615.7474 634.2625 615.7474 634.2625
Avg Age -5.560 0.065 -84.942 0.00 -5.6879 -5.4314 -5.6879 -5.4314
Gender Dummy 50.883 1.891 26.912 0.00 47.1768 54.5890 47.1768 54.5890
Married Dummy 78.402 1.921 40.806 0.00 74.6356 82.1677 74.6356 82.1677
Avg Vehicle Age -15.142 0.267 -56.641 0.00 -15.6660 -14.6180 -15.6660 -14.6180
Fuel Type Dummy 267.935 2.748 97.489 0.00 262.5480 273.3223 262.5480 273.3223



Multivariate Linear Regression- Regression Equation
Predicted Losses = 625.004932715948
− 5.5596551344537 × Avg Age + 50.8828923910091 × Gender Dummy
+ 78.4016899779131 × Married Dummy − 15.1420259903571 × Avg Vehicle Age
+ 267.935139741526 × Fuel Type Dummy
Interpretation:
                  Coefficients   Sign of Coefficient   Inference
Intercept         625.005
Avg Age           -5.560         -ve                   The higher the age, the lower the loss
Gender Dummy      50.883         +ve                   Average loss for Males is higher than for Females
Married Dummy     78.402         +ve                   Average loss for Singles is higher than for Married
Avg Vehicle Age   -15.142        -ve                   The older the vehicle, the lower the losses
Fuel Type Dummy   267.935        +ve                   Losses are higher for Fuel type D

Illustration of using the equation given in MS Excel
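
The same scoring can be done in R with predict() (a minimal sketch; the input values are illustrative and the column names follow the lin_r2 model fitted in the R demo later in this deck):

#Score a new policy with the fitted model
new_policy <- data.frame(Average_Age = 35, Gender_Dummy = 1, Married_Dummy = 0,
                         Avg_Veh_Age = 4, Fuel_Type_Dummy = 1)
predict(lin_r2, newdata = new_policy)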

EduPristine | Business Analytics 104


Multivariate Linear Regression- Residual Plot
Residual plot:
Residuals calculated as Actual Capped Losses - Predicted Capped Losses
Residuals should be uniformly scattered around zero; otherwise there's some bias in the model
Except for a few observations (circled in red), residuals are uniformly distributed (see the R sketch below)
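
A minimal R sketch of this diagnostic (assuming lin_r2 is the fitted lm() model from the R demo later in this deck):

#resid() returns actual minus predicted for a fitted lm model
plot(resid(lin_r2), xlab = "Observation", ylab = "Residual",
     main = "Capped Losses - Residual")
abline(h = 0, lty = 2)   #residuals should scatter evenly around this line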

[Scatter plot "Capped Losses - Residual": residuals vs. observation index (0 to 14000), ranging from about -400 to 1200, with a few outlying points circled in red]

EduPristine | Business Analytics 105


Multivariate Linear Regression- Gains Chart and Gini
Gains chart is used to represent the effectiveness of a model's predictions, which is quantified by means of
the Gini coefficient
Methodology illustrated using MS Excel
Equal Obs Bin   # Policies   Predicted Loss   Actual Loss   Cumulative Actual Loss   % Obs (Random)   Cumulative % Obs   % Cumulative Actual Loss   Area Under Gains Curve   Gini Coeff
0 0 0 0 0 0 0 0 0 0.27177
1 1528 1,167,070 1,230,474 1,230,474 10% 10% 20.87% 0.0104
2 1529 1,046,034 991,944 2,222,418 10% 20% 37.69% 0.0293
3 1529 757,330 746,854 2,969,272 10% 30% 50.36% 0.0440
4 1529 589,366 552,534 3,521,806 10% 40% 59.73% 0.0550
5 1529 531,160 553,919 4,075,725 10% 50% 69.12% 0.0644
6 1529 485,428 477,284 4,553,009 10% 60% 77.22% 0.0732
7 1529 432,934 385,411 4,938,420 10% 70% 83.75% 0.0805
8 1529 385,595 423,814 5,362,234 10% 80% 90.94% 0.0873
9 1529 308,050 310,846 5,673,081 10% 90% 96.21% 0.0936
10 1530 193,465 223,351 5,896,432 10% 100% 100.00% 0.0981
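
The Area Under Gains Curve column sums to about 0.6358, and Gini = 2 * (0.6358 - 0.5) ≈ 0.2717, matching the Gini Coeff shown. A minimal R sketch of the same calculation (hedged: actual and predicted are illustrative names for the vectors of actual and predicted capped losses):

#Sort policies by predicted loss, best predictions first
ord <- order(predicted, decreasing = TRUE)
cum_obs <- seq_along(actual) / length(actual)          #cumulative % obs
cum_loss <- cumsum(actual[ord]) / sum(actual)          #% cumulative actual loss
#Area under the gains curve by the trapezoidal rule, then Gini = 2 * (AUC - 0.5)
auc <- sum(diff(c(0, cum_obs)) * (cum_loss + c(0, head(cum_loss, -1))) / 2)
gini <- 2 * (auc - 0.5)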

[Two charts: "Actual vs Predicted Losses" - actual loss, predicted loss and # policies plotted against bins of equal # policies; "Gains Chart" - % cumulative actual loss vs. cumulative % obs, rising above the diagonal random line]

EduPristine | Business Analytics 106


Heteroskedasticity
When the requirement of a constant variance is violated, we have a condition of
heteroskedasticity.

[Scatter plot: error u vs. predicted y, with the spread of the errors changing across the range of predicted y]

We can diagnose heteroskedasticity by plotting the residual against the predicted y.

EduPristine | Business Analytics 107


Unconditional and Conditional Heteroskedasticity
Presence of heteroskedasticity in the data is the violation of the assumption about the constant
variance of the residual term.
Heteroskedasticity takes the following two forms, unconditional and conditional.
Unconditional Heteroskedasticity is present when the variance of the residual terms is not
related to the values of the independent variable.
Unconditional Heteroskedasticity doesn't pose any problem in the regression analysis as the variance
doesn't change systematically
Conditional Heteroskedasticity poses problems in regression analysis as the residuals are
systematically related to the independent variables

[Scatter plot: Y vs. X with fitted line Y = b0 + b1X; the residual terms have low variance at low values of X and high variance at high values of X]
EduPristine | Business Analytics 108
Detecting Heteroskedasticity
Heteroskedasticity can be detected either by viewing the scatter plots as discussed in the previous
case or by Breusch-Pagan chi-square test.
In Breusch-Pagan chi-square test, the residuals are regressed with the independent variables to
check whether the independent variable explains a significant proportion of the squared residual
or not.
If the independent variables explain a significant proportion of the squared residuals then we
conclude that the conditional heteroskedasticity is present otherwise not.
Breusch-Pagan test statistic follows a chi-square distribution with k degrees of freedom, where k is
the number of independent variables.
BP Chi-Square Test Statistic = n × R²resid

where:
n: number of observations
R²resid: coefficient of determination when the squared residuals are regressed on the independent variables

Conditional Heteroskedasticity can be corrected by using White-corrected standard errors


which are also called heteroskedasticity consistent standard errors
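
In R, this test is available in the lmtest package (a minimal sketch; lin_r2 stands for a fitted lm() model as in the R demo later in this deck):

#Breusch-Pagan test; studentize = FALSE gives the original n * R-squared form
library("lmtest")
bptest(lin_r2, studentize = FALSE)
#A small p-value rejects H0 of constant variance, i.e. conditional heteroskedasticity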
EduPristine | Business Analytics 109
Correcting for Heteroskedasticity

There are two methods for correcting the effects of conditional heteroskedasticity
Robust Standard Errors
Correct the standard errors of the linear regression model's estimated coefficients to account for
conditional heteroskedasticity
Generalized Least Squares
Modifies the original equation in an attempt to eliminate heteroskedasticity.
Statistical packages are available for computing robust standard errors.
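
As an illustration of the generalized least squares idea, a weighted least squares fit can be run in base R (a hedged sketch; seg_var is a hypothetical column holding an estimate of each observation's residual variance, e.g. the variance of its segment):

#Weighted least squares: observations with high residual variance get low weight
wls_fit <- lm(Capped_Losses ~ Average_Age + Gender_Dummy + Married_Dummy +
              Avg_Veh_Age + Fuel_Type_Dummy,
              data = DefaultData4, weights = 1 / DefaultData4$seg_var)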

EduPristine | Business Analytics 110


Multivariate Linear Regression- Heteroskedasticity (Age)
Age Band   Mean   Variance
16-25      514    72,460
26-59      414    24,652
60+        208    21,778

The variance differs across the three age bands, showing heteroskedasticity.

[Plots for age bands 16-25, 26-59 and 60+]

EduPristine | Business Analytics 111


Multivariate Linear Regression- Heteroskedasticity (Gender)
Gender   Mean   Variance
F        342    35,277
M        431    65,878

The variance differs by Gender, showing heteroskedasticity.

[Plots for Gender F and M]

EduPristine | Business Analytics 112


Multivariate Linear Regression- Heteroskedasticity (Married)
Married   Mean   Variance
Married   323    30,380
Single    451    66,779

The variance differs by marital status, showing heteroskedasticity.

[Plots for Married and Single]

EduPristine | Business Analytics 113


Multivariate Linear Regression- Heteroskedasticity (Vehicle Age)
Vehicle Age Band   Mean   Variance
0-5                509    65,688
6-10               369    39,155
11+                325    43,066

The variance differs across the three vehicle age bands, showing heteroskedasticity.

[Plots for vehicle age bands 0-5, 6-10 and 11+]

EduPristine | Business Analytics 114


Multivariate Linear Regression- Heteroskedasticity (Fuel Type)
Fuel   Mean   Variance
D      706    33,862
P      286    16,400

The variance differs by Fuel type, showing heteroskedasticity.

[Plots for Fuel types D and P]
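
These group means and variances can be reproduced in R with aggregate() (a minimal sketch, using column names from the R demo later in this deck):

#Mean and variance of capped losses by fuel type
aggregate(CappedLosses ~ Fuel.Type, data = DefaultData3, FUN = mean)
aggregate(CappedLosses ~ Fuel.Type, data = DefaultData3, FUN = var)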

EduPristine | Business Analytics 115


Multivariate Linear Regression- Fixing Heteroskedasticity
Univariate scenario:
Find the Standard Deviation of the response variable for the different levels of the independent variable
Divide the individual values of the response variable by the respective standard deviation
The scaled values become the new response variable (see the R sketch after this list)
E.g. if the variable is Fuel Type:
If Fuel Type is D, divide Capped Losses by SquareRoot(33862) = 184
If Fuel Type is P, divide Capped Losses by SquareRoot(16400) = 128
Multivariate scenario:
Create all possible unique combinations of independent variables
For each of the combinations, find Standard Deviations
Divide the individual values of the response variable by the respective standard deviation
Too cumbersome to do manually using MS Excel. Also the process is iterative.
More convenient to do using statistical packages like R.
Course approach:
First fit a multivariate regression without fixing heteroskedasticity to get a final set of significant variables
Then do manual adjustment and re-fit regression using MS Excel. This will be just for demonstration, as manual adjustment is always questionable.
Demonstrate linear regression using R
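
A minimal R sketch of the univariate scaling step (hedged: shown for Fuel Type only; ave() repeats the group standard deviation for every row):

#Standard deviation of capped losses within each fuel type, one value per row
sd_by_fuel <- ave(DefaultData3$CappedLosses, DefaultData3$Fuel.Type, FUN = sd)
#The scaled losses become the new response variable
DefaultData3$ScaledLosses <- DefaultData3$CappedLosses / sd_by_fuel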

EduPristine | Business Analytics 116


Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.)

1. Create unique combinations of the variables - Avg Age, Gender Dummy, Married Dummy, Avg Vehicle Age and Fuel Type Dummy
2. Find the Standard Deviation of capped losses for the segments. Detailed methodology explained in MS Excel.
3. Calculate Standardized Capped Losses as Capped Losses / Segment Std Dev. This becomes the new response variable.

Manually doing this kind of exercise can be flawed as some of the segments could be sparsely populated.
This demo is just to explain the underlying technique/methodology.
Statistical packages like SAS and R have in-built capability to take care of this.

EduPristine | Business Analytics 117


Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.)
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.359167467

R Square 0.129001269
Adjusted R Square 0.128716331
Standard Error 4.77078689
Observations 15290
Fuel Type Dummy is insignificant, which is questionable as D and P have significantly different mean losses

ANOVA
df SS MS F Significance F
Regression 5 51522.10 10304.42 452.73 0
Residual 15284 347870.07 22.76
Total 15289 399392.17

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 12.476 0.197 63.374 0.000 12.091 12.862 12.091 12.862
Avg Age -0.086 0.003 -31.554 0.000 -0.091 -0.081 -0.091 -0.081
Gender Dummy 0.213 0.079 2.702 0.007 0.058 0.368 0.058 0.368
Married Dummy -0.204 0.080 -2.552 0.011 -0.361 -0.047 -0.361 -0.047
Avg Vehicle Age -0.376 0.011 -33.770 0.000 -0.398 -0.354 -0.398 -0.354
Fuel Type Dummy 0.136 0.115 1.188 0.235 -0.088 0.361 -0.088 0.361

EduPristine | Business Analytics 118


Multivariate Linear Regression- Using R
Step 1: Download and install R software from http://www.r-project.org/
Step 2: Convert the data to an R-readable format, e.g. *.csv.
D:\Business Analytics\Linear Regression\Data.csv
Writing R code for
Reading the data
Fitting the Linear Regression

#Set Working Directory


setwd("D:/Business Analytics/Linear Regression")

#Read the data file


DefaultData<-read.csv("Data.csv")

EduPristine | Business Analytics 119


Code Output

#Check what is there in the file


DefaultData

EduPristine | Business Analytics 120


Code Output

#Check what is there in the file


View(DefaultData)

EduPristine | Business Analytics 121


Code Output

#Check if the data is populated/imported properly


head(DefaultData)
tail(DefaultData)

EduPristine | Business Analytics 122


Code Output

#Check the summary of the file


summary(DefaultData)

EduPristine | Business Analytics 123


Code Output

#Generate plot of Dependent variable (Losses)


plot(DefaultData$Losses)

EduPristine | Business Analytics 124


Code Output

#Check the quantile to find out the outlier limit


quantile(DefaultData$Losses, c(0,0.05,0.1,0.25,0.5,0.75,0.90,0.95,0.99,0.995,1))

#Creating the Capped Losses column with 1200 cap


DefaultData$CappedLosses<-ifelse(DefaultData$Losses>1200,1200,DefaultData$Losses)

#Check if Capped Losses column has been created properly or not


summary(DefaultData)

EduPristine | Business Analytics 125


Code Output

#Create new object deleting Losses


DefaultData3<-DefaultData[,-c(9)]

#Check the headings of the new object


names(DefaultData3)

EduPristine | Business Analytics 126


Code Output

#Generate plots to see the relation between the independent variables and the dependent variable
plot(DefaultData3$Age,DefaultData3$CappedLosses)
plot(DefaultData3$Years.of.Driving.Experience,DefaultData3$CappedLosses)
plot(DefaultData3$Number.of.Vehicles,DefaultData3$CappedLosses)
plot(DefaultData3$Gender,DefaultData3$CappedLosses)
plot(DefaultData3$Married,DefaultData3$CappedLosses)
plot(DefaultData3$Vehicle.Age,DefaultData3$CappedLosses)
plot(DefaultData3$Fuel.Type,DefaultData3$CappedLosses)

EduPristine | Business Analytics 127


Code Output

EduPristine | Business Analytics 128


Code Output

#Need to see Losses distribution among all independent variables. Pivot Table in R with melt and cast
install.packages("reshape")
library("reshape")

#First look at the Data names, by which we want to create pivot table
names(DefaultData3)
#Melt the data: melt will identify items to be summed/averaged (called measures) and separate out the id
#variables by which we want to sum/average
data.m<-melt(DefaultData3, id=c(1:8), measure=c(9))
#Let's look at the different values of our new object
head(data.m)
#CappedLosses have been melted

EduPristine | Business Analytics 129


Code Output

#Let's cast our data


cast(data.m, Age~variable,fun.aggregate=sum)
data.c<-cast(data.m, Age~variable,mean)
data.c
data.c<-cast(data.m, Age~variable,c(sum,mean))
data.c

EduPristine | Business Analytics 130


Code Output

#Create Age Bands from 16 to 25, 26 to 59 and 60+


DefaultData3$AgeBand<-ifelse(DefaultData3$Age<=25,"16-25",
ifelse(DefaultData3$Age>=60,"60+","26-59"))

#Let's see if we are able to do the conversion in a correct way or not


head(DefaultData3)
tail(DefaultData3)
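
An equivalent banding with base R's cut() (a sketch; the breaks reproduce the bands above, and cut() returns a factor rather than a character column):

#(-Inf,25] -> "16-25", (25,59] -> "26-59", (59,Inf] -> "60+"
DefaultData3$AgeBand <- cut(DefaultData3$Age, breaks = c(-Inf, 25, 59, Inf),
                            labels = c("16-25", "26-59", "60+"))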

EduPristine | Business Analytics 131


Code Output

#Create AgeBand - Average


data.ageband<-aggregate(Age~AgeBand,data=DefaultData3,mean)
data.ageband

#Merge two objects - just like VLOOKUP in Excel


DefaultData3<-merge(DefaultData3,data.ageband, by="AgeBand")

#View the data if it has been looked up correctly


View(DefaultData3)
#We can export data from R to excel
write.csv(DefaultData3,"Data1.csv")
#Similarly we can convert Vehicle Age to Vehicle Age Band

EduPristine | Business Analytics 132


Code Output

#Convert categorical variables into dummy variables


DefaultData3$GenderDummy<-ifelse(DefaultData3$Gender=="F",1,0)
#Similarly we can convert other categorical variables to dummy variables

#Check the headings and the summary


names(DefaultData3)
summary(DefaultData3)
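
Note that lm() creates such dummies automatically for factor columns; model.matrix() shows the coding it would use (a small sketch, assuming Gender was read in as a factor):

#Inspect the dummy coding R generates for the Gender factor (F/M)
head(model.matrix(~ Gender, data = DefaultData3))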

EduPristine | Business Analytics 133


Code Output

#We will use the data which has been converted into bands and dummy variables. Read the final Data
DefaultData4<-read.csv("Linear_Reg_Sample_Data.csv")

#Look at the column Headings


names(DefaultData4)

EduPristine | Business Analytics 134


Code Output

#Install the car package for vif (Multicollinearity)


install.packages("car")

#Look at the column Headings


names(DefaultData4)

#Load the car package


library("car")

EduPristine | Business Analytics 135


Code Output

#Create linear function for vif


vif_data <- lm(Capped_Losses ~ Years_Drv_Exp + Number_Vehicles + Average_Age +
    Gender_Dummy + Married_Dummy + Avg_Veh_Age + Fuel_Type_Dummy,
    data = DefaultData4)

#Check VIF; vif > 2 means presence of multicollinearity


vif(vif_data)
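
For intuition, the VIF of a single predictor equals 1 / (1 - R²) from regressing that predictor on all the others; this is essentially what vif() computes (a minimal sketch for Average_Age):

#Auxiliary regression of Average_Age on the remaining predictors
aux <- lm(Average_Age ~ Years_Drv_Exp + Number_Vehicles + Gender_Dummy +
    Married_Dummy + Avg_Veh_Age + Fuel_Type_Dummy, data = DefaultData4)
1 / (1 - summary(aux)$r.squared)   #VIF of Average_Age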

EduPristine | Business Analytics 136


Code Output
#Compare R-square of Average_Age and Years_Drv_Exp to check which performs better
age1<-lm(Capped_Losses~Average_Age,data=DefaultData4)
drv1<-lm(Capped_Losses~Years_Drv_Exp,data=DefaultData4)
summary(age1)
summary(drv1)
#keep Average_Age and remove Years_Drv_Exp

EduPristine | Business Analytics 137


Code Output
#Run Linear Regression w/o Years_Drv_Exp
lin_r1 <- lm(Capped_Losses ~ Number_Vehicles + Average_Age + Gender_Dummy +
    Married_Dummy + Avg_Veh_Age + Fuel_Type_Dummy, data = DefaultData4)

#Let's look at the results


summary(lin_r1)
#Remove Number_Vehicles

Number_Vehicles is not significant in the output

EduPristine | Business Analytics 138


Code Output
#Run Linear Regression w/o Number_Vehicles
lin_r2 <- lm(Capped_Losses ~ Average_Age + Gender_Dummy + Married_Dummy +
    Avg_Veh_Age + Fuel_Type_Dummy, data = DefaultData4)

#Let's look at the results


summary(lin_r2)

EduPristine | Business Analytics 139


Code Output
#Variance Covariance Matrix
install.packages("sandwich")
library("sandwich")
vcovHC(lin_r2,omega=NULL, type="HC4")

EduPristine | Business Analytics 140


Code Output
#Fixing Heteroskedasticity using Variance-Covariance matrix
install.packages("lmtest")
library("lmtest")
coeftest(lin_r2,df=Inf,vcov=vcovHC(lin_r2,type="HC4"))

Heteroskedasticity impacts std. Error


coeftest() fixes the std error by taking into account the Var-Covar matrix
There is negligible change in the coefficients

Illustration of using the equation given in MS Excel

EduPristine | Business Analytics 141


Thank you!

help@edupristine.com
www.edupristine.com/ca

EduPristine www.edupristine.com/ca
EduPristine | Business Analytics 142
