EduPristine www.edupristine.com
Agenda
Basic Statistics
The sampling distribution of the sample mean is a probability distribution consisting of all possible sample
means of a given sample size selected from a population.
For a population with mean μ and variance σ², the sampling distribution of the means of all possible
samples of size n generated from the population will be approximately normally distributed.
The mean of the sampling distribution is equal to μ and the variance is equal to σ²/n.
How is variance related to standard error?
For example, suppose a poll was conducted where the population included all students in the school and the
sample was one class. If the sample had a mean GPA of 3.4 and the population's mean GPA was 3.2, then the
sampling error was 0.2.
Standard Error of sample mean
It is the standard deviation of the distribution of the sample means
When the standard deviation of the population is known, the standard error of the sample mean is
calculated as:
σ_x̄ = σ / √n
Answer: Because the population standard deviation is known, the standard error of the sample mean is
σ_x̄ = $2.90 / √30 ≈ $0.53
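A quick R sketch of this calculation, reusing the $2.90 population standard deviation and n = 30 from the example:
# Standard error of the sample mean when the population standard deviation is known
sigma <- 2.90              # population standard deviation (from the example)
n <- 30                    # sample size
se_mean <- sigma / sqrt(n)
se_mean                    # roughly 0.53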
Point Estimate & Confidence Interval
Point estimates: These are the single (sample) values used to estimate population parameters
Confidence interval: It is a range of values in which the population parameter is expected to lie
The confidence interval takes on the following form (where n ≥ 30):
CI = μ ± Z·σ
True for a population distribution
Where, μ is the mean of the population and
σ is the standard deviation of the population
For a sample mean,
Point estimate ± (reliability factor × standard error)
CI = x̄ ± Z·(σ/√n)
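As an illustration, a small R sketch of the sample-mean interval; the 3.4 echoes the GPA example above, while sigma = 0.6 and n = 36 are made-up inputs used only for demonstration:
# 95% confidence interval for a sample mean (sigma known, n >= 30)
x_bar <- 3.4               # sample mean (hypothetical)
sigma <- 0.6               # population standard deviation (hypothetical)
n <- 36                    # sample size (hypothetical)
z <- qnorm(0.975)          # reliability factor for 95% confidence, about 1.96
c(x_bar - z * sigma / sqrt(n), x_bar + z * sigma / sqrt(n))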
A statistical hypothesis test is a method of making statistical decisions from and about
experimental data.
It answers the question: "How well do the findings fit the possibility that chance factors alone might be responsible?"
Null Hypothesis (H0): The hypothesis that the researcher wants to reject
Test Statistic
Rejection/Critical Region
Conclusion
To reach a final decision, Sam has to make a general inference (about the population) from the
sample data.
Criterion: Mean income across all households in the market area under consideration.
If the mean population household income is greater than $19,000, then PD should introduce the product
line into the new market.
The term one-tailed signifies that all z-values that would cause Sam to reject H0, are in just one
tail of the sampling distribution
μ -> Population Mean
H0: μ ≤ $19,000
Ha: μ > $19,000
Identifying the Critical Sample Mean Value
[Figure: sampling distribution centered on μ = $19,000, with the critical value (x̄c) marked in the right tail]
Sample mean values greater than $19,000, that is, x̄ values on the right-hand side of the sampling
distribution centered on μ = $19,000, suggest that H0 may be false.
More importantly, the farther to the right x̄ is, the stronger the evidence against H0.
Computing the Criterion Value
The standard deviation for the sample of 100 households is $4,000. The standard error of the mean (s_x̄)
is given by:
s_x̄ = s / √n = $4,000 / √100 = $400
The critical mean household income x̄c is found through the following two steps:
Determine the critical z-value, zc. For α = 0.05: zc = 1.645.
Substitute the values of zc, s_x̄, and μ (under the assumption that H0 is "just" true):
Critical value x̄c = μ + zc·s_x̄ = $19,000 + 1.645 × $400 = $19,658
In this case, since the observed sample mean ($20,000) is greater than the critical value ($19,658), the null
hypothesis is rejected.
Decision Rule
If the sample mean household income is greater than $19,658, reject the null hypothesis and
introduce the product line into the new market
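The same decision rule can be written as a short R sketch (all figures are the ones used in this example):
# Critical sample mean for the one-tailed test at alpha = 0.05
mu0 <- 19000                  # hypothesized population mean
s_xbar <- 4000 / sqrt(100)    # standard error of the mean = $400
z_c <- qnorm(0.95)            # about 1.645
x_c <- mu0 + z_c * s_xbar     # about $19,658
x_c
20000 > x_c                   # TRUE, so the observed sample mean leads us to reject H0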
Test Statistic
The value of the test statistic is simply the z-value corresponding to x̄ = $20,000:
Z = (x̄ - μ) / s_x̄ = ($20,000 - $19,000) / $400 = 2.5
[Figure: sampling distribution under H0 (μ = $19,000), with α = 0.05 in the right tail and the observed x̄ = $20,000 marked]
Since Z = 2.5 falls in the rejection region, there is a significant difference between the hypothesized
population parameter and the observed sample mean.
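The test statistic and its one-tailed p-value, as a sketch in R:
# z-statistic for the observed sample mean
x_bar <- 20000
mu0 <- 19000
s_xbar <- 400
z_stat <- (x_bar - mu0) / s_xbar   # 2.5
p_val <- 1 - pnorm(z_stat)         # about 0.006, well below alpha = 0.05
c(z_stat, p_val)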
Errors in Estimation
Please note: you are inferring for a population based only on a sample.
This is no proof that your decision is correct; it is just a hypothesis.
There is still a chance that your inference is wrong.
How do I quantify the probability of error in inference?
Type I and Type II Errors:
Type I error occurs if the null hypothesis is rejected when it is true.
Type II error occurs if the null hypothesis is not rejected when it is false.

                                  Actual: H0 is True               Actual: H0 is False
Inference: H0 is True             Correct Decision                 Type-II Error
(do not reject H0)                (Confidence Level = 1 - α)       (P(Type-II Error) = β)
Inference: H0 is False            Type-I Error                     Correct Decision
(reject H0)                       (Significance Level = α)         (Power = 1 - β)

Significance Level:
Significance level (α): the upper-bound probability of a Type I error.
Confidence level (1 - α): the complement of the significance level.
The power of a test (1 - β) is the probability of correctly rejecting the null.
Two-tailed test at the 95% confidence level:
μ ± Z(α/2) × s_x̄ = $19,000 ± 1.96 × $400 = $18,216 to $19,784
[Figure: two-tailed sampling distribution centered on Z = 0, with "Reject H0" regions in both tails and a "Do not reject H0" region between the critical values]
Conclusion: if the sample mean household income lies between $18,216 and $19,784, do not reject H0 at 95%
confidence.
Business Analytics
Predictive Modeling using Linear
Regression
Agenda
Basic Statistics
II. Regression
cov_xy = Σ_{i=1}^{n} (X_i - X̄)(Y_i - Ȳ) / (n - 1)
Correlation coefficient
It is a measure of the strength of the linear relationship between two variables
The correlation coefficient is given by:
ρ_xy = cov_xy / (σ_x · σ_y)
Population correlation is denoted by ρ (rho)
Sample correlation is denoted by r. It is an estimate of ρ in the same way as
S² (sample variance) is an estimate of σ² (population variance) and
X̄ (sample mean) is an estimate of μ (population mean)
Features of ρ and r
Unit free and ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
Date         S&P 500    NASDAQ     Xi (S&P 500 return)   Yi (NASDAQ return)   Xi - X̄    Yi - Ȳ    (Xi - X̄)·(Yi - Ȳ)
12/2/2011    1,244.28   2,626.93
12/5/2011    1,257.08   2,655.76    1.03%      1.10%      1.14%     1.20%      0.0137%
12/7/2011    1,261.01   2,649.21    0.31%     -0.25%      0.43%    -0.15%     -0.0006%
12/8/2011    1,234.35   2,596.38   -2.11%     -1.99%     -2.00%    -1.89%      0.0378%
12/9/2011    1,255.19   2,646.85    1.69%      1.94%      1.80%     2.05%      0.0369%
12/12/2011   1,236.47   2,612.26   -1.49%     -1.31%     -1.38%    -1.21%      0.0166%
[Figure: scatter plots illustrating r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]
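The covariance and correlation can be reproduced in R directly from the daily returns in the table above:
# Daily returns (in %) of the S&P 500 (x) and NASDAQ (y), taken from the table
x <- c(1.03, 0.31, -2.11, 1.69, -1.49)
y <- c(1.10, -0.25, -1.99, 1.94, -1.31)
cov(x, y)   # sample covariance (denominator n - 1), in %-squared units
cor(x, y)   # sample correlation r, close to the 0.9798 used in the example below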
4a. Testing the significance of the correlation coefficient
Test whether the correlation between the two variables in the population is equal to zero
Null hypothesis, H0: ρ = 0
Assuming that the two populations are normally distributed, we can use a t-test to determine
whether the null hypothesis should be rejected.
The test statistic is computed using the sample correlation, r, with n - 2 degrees of freedom (df):
t = r·√(n - 2) / √(1 - r²)
Calculated test statistic is compared with the critical t-value for the appropriate degrees of
freedom and level of significance
Reject H0 if t > +t_critical or t < -t_critical
Example: Correlation of the S&P 500 and NASDAQ Returns given a sample
n = 5, r = 0.979811179, df = 5 - 2 = 3
Calculated t = 8.4885
t_critical at the 95% confidence level (df = 3) = 2.3534
Hence, reject H0 at the 95% confidence level
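A short R sketch of this test; the return vectors are the ones from the covariance sketch above, redefined here so the snippet stands alone, and cor.test() is shown only as a built-in cross-check:
# t-test of H0: rho = 0 for the S&P 500 / NASDAQ return correlation
x <- c(1.03, 0.31, -2.11, 1.69, -1.49)
y <- c(1.10, -0.25, -1.99, 1.94, -1.31)
r <- cor(x, y); n <- length(x)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)   # about 8.49
t_crit <- qt(0.95, df = n - 2)              # about 2.3534
t_stat > t_crit                             # TRUE, so reject H0
cor.test(x, y)                              # built-in equivalent (two-tailed by default)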
[Figure: scatter plot of Marks in Test (0 to 100) against Hours of Study (0 to 90)]
Independent variable: the variable used to explain the dependent variable. Denoted by X
4. The variance of the residual term is constant for all observations [E(ε_i²) = σ²]
5. The residual term is independently distributed; that is, the residual for one observation is not
correlated with that of another observation [E(ε_i ε_j) = 0, j ≠ i]
6. The residual term is normally distributed.
(continued)
[Figure: population regression line Y = β0 + β1X + u, showing the intercept β0, the slope β1, and the residual u_i for an individual person's marks at x_i]
4.b. Population Regression Function
Y = β0 + β1X + u
where Y is the dependent variable, β0 the population y-intercept, β1 the population slope coefficient,
X the independent variable, and u the random error term (residual).
β0 + β1X is the linear component; u is the random error component.
(continued)
[Figure: sample regression line ŷ = b0 + b1x, showing the observed value of y for x_i, the residual e_i, the intercept b0, and the slope b1]
4.b. Sample Regression Function
Y = β0 + β1X + u   (population regression function)
ŷ_i = b0 + b1x_i   (sample regression function)
Least squares minimizes the sum of squared residuals:
Σe² = Σ(y - ŷ)² = Σ(y - (b0 + b1x))²
The sum of the residuals from the least squares regression line is 0:  Σ(y - ŷ) = 0
The sum of the squared residuals is a minimum:  minimize Σ(y - ŷ)²
The simple regression line always passes through the mean of the y variable and the mean of
the x variable
b0 is the estimated average value of y when the value of x is zero. More often than not it does
not have a physical interpretation
b1 is the estimated change in the average value of y as a result of a one-unit change in x
[Figure: fitted regression line Ŷ = b0 + b1X, showing the intercept b0 and the slope of the line (b1)]
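A minimal R sketch of fitting such a line with lm(); the hours-of-study and marks data below are simulated purely for illustration and are not the course data:
# Simple OLS fit on simulated hours-of-study vs. marks data
set.seed(1)
hours <- runif(30, 0, 90)                        # hypothetical hours of study
marks <- 20 + 0.7 * hours + rnorm(30, sd = 8)    # hypothetical marks in test
fit <- lm(marks ~ hours)
coef(fit)                                        # b0 (intercept) and b1 (slope)
sum(residuals(fit))                              # essentially 0: residuals sum to zero
predict(fit, data.frame(hours = mean(hours))) - mean(marks)  # ~0: line passes through (x-bar, y-bar)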
Once we make these assumptions about "u", we are able to estimate the variance and standard
errors of b0 and b1; this is possible because of the properties of the OLS method (beyond the scope
of this lecture).
s_u = √( RSS / (n - k - 1) )
Where
RSS = Residual Sum of Squares (summation of e²)
n = sample size
k = number of independent variables in the model
Note: when k = 1,
s_u = √( RSS / (n - 2) ) = sample standard error of the estimate
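A sketch of s_u computed from a fitted model (the same simulated data as above, redefined so this runs on its own):
# Sample standard error of the estimate: s_u = sqrt(RSS / (n - k - 1))
set.seed(1)
hours <- runif(30, 0, 90); marks <- 20 + 0.7 * hours + rnorm(30, sd = 8)
fit <- lm(marks ~ hours)
n <- 30; k <- 1
RSS <- sum(residuals(fit)^2)
s_u <- sqrt(RSS / (n - k - 1))
c(s_u, summary(fit)$sigma)       # the two values agree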
[Figure: a small s_u (tight scatter of y around the fitted line) goes together with a small s_b1 (a precisely estimated slope)]
Confidence interval for a predicted value:
Ŷ ± (tc × sf)
where:
Ŷ = predicted 'Y' value (dependent variable)
n - 2 = degrees of freedom
tc = the critical t value
sf = the standard error of the forecast
Confidence interval for the slope coefficient:
b1 ± (tc × s_b1)
where:
b1 = the estimated slope coefficient
n - 2 = degrees of freedom
tc = the critical t value
s_b1 = the standard error of the regression coefficient
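A sketch of the slope-coefficient interval on the same simulated data:
# Confidence interval for the slope: b1 +/- t_c * s_b1
set.seed(1)
hours <- runif(30, 0, 90); marks <- 20 + 0.7 * hours + rnorm(30, sd = 8)
fit <- lm(marks ~ hours)
b1 <- coef(fit)["hours"]
s_b1 <- summary(fit)$coefficients["hours", "Std. Error"]
t_c <- qt(0.975, df = 30 - 2)
c(b1 - t_c * s_b1, b1 + t_c * s_b1)   # matches confint(fit)["hours", ]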
[Figure: decomposition of the deviation of an observed y_i from ȳ at X_i into the part explained by the regression (ŷ - ȳ) and the error part (y_i - ŷ)]
SST = Σ(y - ȳ)²   (Total Sum of Squares)
SSE = Σ(y - ŷ)²   (Sum of Squared Errors)
SSR = Σ(ŷ - ȳ)²   (Regression Sum of Squares)
Where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value
The coefficient of determination is the portion of the total variation in the dependent variable
that is explained by variation in the independent variable
The coefficient of determination is also called R-squared and is denoted as R2
R² = SSR / SST, and 0 ≤ R² ≤ 1
Coefficient of determination
(continued)
R² = SSR / SST = (sum of squares explained by regression) / (total sum of squares)
Where: R² = r² (for simple linear regression)
R² = coefficient of determination
r = simple correlation coefficient
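The decomposition and R² can be checked in R on the same simulated data:
# SST, SSE, SSR and the coefficient of determination
set.seed(1)
hours <- runif(30, 0, 90); marks <- 20 + 0.7 * hours + rnorm(30, sd = 8)
fit <- lm(marks ~ hours)
SST <- sum((marks - mean(marks))^2)
SSE <- sum((marks - fitted(fit))^2)
SSR <- sum((fitted(fit) - mean(marks))^2)
c(R2 = SSR / SST, builtin = summary(fit)$r.squared, r_squared = cor(marks, hours)^2)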
[Figure: scatter plots illustrating a perfect fit (R² = 1), a partial fit (0 < R² < 1), and no fit (R² = 0)]
Y_i = b0 + b1·X_1i + b2·X_2i + … + bk·X_ki + ε_i
where:
Y_i = ith observation of the dependent variable Y
X_ki = ith observation of the kth independent variable X
b0 = intercept term
bk = slope coefficient of the kth independent variable
ε_i = error term of the ith observation
n = number of observations
k = total number of independent variables
The sum of squared errors, Σ_{i=1}^{n} ε_i², is minimized and the slope coefficients are estimated.
Intercept (b0): the value of Y when X_1 = X_2 = … = X_k = 0
Slope coefficient (bk): It's the change in the dependent variable from a unit change in the
corresponding independent (Xk) variable keeping all other independent variables constant.
In reality when the value of the independent variable changes by one unit, the change in the dependent
variable is not equal to the slope coefficient but depends on the correlation among the independent
variables as well.
Therefore, the slope coefficients are also called partial slope coefficients.
The expected value of the error term, conditional on the independent variables is zero.
The error terms are homoskedastic, i.e. the variance of the error terms is constant for all the
observations.
The expected value of the product of error terms is always zero, which implies that the error
terms are uncorrelated with each other.
The independent variables do not have any exact linear relationship with each other.
If the value of the t-statistic, t = bj / s_bj, lies within the confidence interval (between -tc and +tc), H0: bj = 0 can't be rejected,
Where,
tc is the critical t-value, and
s_bj is the standard error of the coefficient bj
The forecast of the dependent variable is Ŷ = b0 + b1·X̂_1 + b2·X̂_2 + … + bk·X̂_k,
Where,
Ŷ is the predicted value for the dependent variable,
bi is the estimated partial slope for the ith independent variable, and
X̂_i is the forecasted value for the ith independent variable
Standard Error of Estimate (SEE) = √( SSE / (n - 2) ) = √MSE
Coefficient of determination (R²) = [Total Variation (SST) - Unexplained Variation (SSE)] / Total Variation (SST)
                                  = Explained Variation (RSS) / Total Variation (SST)
Adjusted R²:
R²_a = 1 - [ (n - 1) / (n - k - 1) ] × (1 - R²)
where
n = number of observations
k = number of independent variables
R²_a = adjusted R²
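And the adjusted R² formula, checked against R's built-in value on the same simulated data:
# Adjusted R-squared
set.seed(1)
hours <- runif(30, 0, 90); marks <- 20 + 0.7 * hours + rnorm(30, sd = 8)
fit <- lm(marks ~ hours)
n <- 30; k <- 1
R2 <- summary(fit)$r.squared
Ra2 <- 1 - ((n - 1) / (n - k - 1)) * (1 - R2)
c(Ra2, summary(fit)$adj.r.squared)   # the two values agree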
If we wanted to consider the spike in soft drink sales in the summer, we may have a regression
equation:
Rev(t) = 10,000 + 2,000·t + 50,000·S
Here,
S = 1 if it's summer
S = 0 if it's not summer
If there are n mutually exclusive and exhaustive classes, they can be represented by n-1 dummy
variables. This is derived from the concept of degrees of freedom.
For example, to represent the 4 stages of the business cycle, we can use 3 dummy variables.
The fourth variable would be represented by zeros for all three dummy variables.
We do not use 4 dummy variables because the four would always sum to one for every observation, creating an exact linear relationship among them (the dummy variable trap); see the sketch below.
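A small R sketch of the n - 1 rule; the season labels are invented for illustration:
# Four mutually exclusive classes are represented by three dummy variables
season <- factor(c("Spring", "Summer", "Autumn", "Winter", "Summer"))
model.matrix(~ season)                # an intercept plus only 3 dummy columns
S <- as.integer(season == "Summer")   # the single summer dummy from the Rev(t) example
S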
Nonstationarity means that the properties (like mean and variance) of a variable are not constant over time.
The economic meaning for this equation is given by the partial slopes or coefficients of the
variables.
If the GDP growth rate were 1% higher, it would translate into 0.75% higher revenue growth.
Similarly, if the WPI inflation figure were 1% higher, it would translate into 0.5% higher revenue growth.
Adam, an analytics consultant, works with First Auto Insurance Company. His manager gave him data
containing loss amounts and policy-related information and asked him to identify and quantify the
factors responsible for losses in a multivariate fashion. Adam has no knowledge of running a
multivariate regression.
Now suppose he approaches you and requests your help to complete the assignment. Let's help
Adam carry out the multivariate regression.
6. Married: marital status of the policy holder (Married, Single); categorical (binary)
Distribution of losses:
Percentile   Loss
Min          13
5th          67
10th         122
25th         226
50th         355
75th         399
90th         685
99.5th       1,366
Max          3,500
Mean         390
Std dev      254
[Figure: loss distribution; values above the 99.5th percentile are flagged as outliers]
[Figure: % of policies (% Obs), average loss, and average capped losses by Age (0 to 80)]
Age Band
[Figure: % of policies, average loss, and average capped loss by Age Band (16-25, 26-59, 60+)]
Multivariate Linear Regression- Bivariate Profiling
[Figure: % of policies and average loss across x-axis values 0 to 52 (likely years of driving experience)]
Gender
[Figure: % of policies (% Obs), average loss, and average capped losses by Gender (F, M)]
Marital Status
[Figure: % of policies (% Obs), average loss, and average capped losses by Marital Status (Married, Single)]
Multivariate Linear Regression- Bivariate Profiling
Vehicle Age
[Figure: % of policies (% Obs), average loss, and average capped losses by Vehicle Age (0 to 15)]
[Figure: % of policies (% Obs), average loss, and average capped losses by Vehicle Age Band (0-5, 6-10, 11+)]
Multivariate Linear Regression- Bivariate Profiling
# Vehicles
[Figure: % of policies (% Obs), average losses, and average capped losses by # Vehicles (1 to 4)]
Fuel Type
[Figure: % of policies and average loss by Fuel Type (D, P)]
2. Number of Vehicles
3. Gender
4. Married
5. Vehicle Age Band in the form of Average Vehicle Age of the band (selected out of Vehicle Age and Vehicle
Age Band).
6. Fuel Type
We will run the regression in a multivariate fashion and then select the final list of variables,
taking statistical significance into consideration.
3. Fuel Type (P = 0, D = 1)
Snapshot of the final data on which we will run the multivariate regression
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.865972274
R Square 0.749907979
Adjusted R Square 0.749809794
Standard Error 114.4310136
Observations 15290
ANOVA
df SS MS F Significance F
Regression 6 600073213.5 100012202.3 7637.751088 0
Residual 15283 200122584.4 13094.45688
Total 15289 800195798
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 624.56529 5.29192 118.02233 0.00000 614.19249 634.93809 614.19249 634.93809
Avg Age -5.55974 0.06546 -84.93889 0.00000 -5.68804 -5.43144 -5.68804 -5.43144
Number of Vehicles 0.17875 0.97039 0.18420 0.85386 -1.72333 2.08082 -1.72333 2.08082
Gender Dummy 50.88326 1.89081 26.91084 0.00000 47.17705 54.58947 47.17705 54.58947
Married Dummy 78.39837 1.92148 40.80106 0.00000 74.63204 82.16469 74.63204 82.16469
Avg Vehicle Age -15.14220 0.26734 -56.63987 0.00000 -15.66623 -14.61818 -15.66623 -14.61818
Fuel Type Dummy 267.93559 2.74845 97.48614 0.00000 262.54830 273.32287 262.54830 273.32287
Multivariate Linear Regression- Output
1. Coefficient table: for each independent variable the output reports the coefficient (b), its standard error (σ),
the t Stat (b/σ), the P-value (from the t-dist table), and the Lower/Upper 95% bounds (b - 1.96·σ and b + 1.96·σ).
The Number of Vehicles coefficient is insignificant.

2. ANOVA:
Source       SS formula                     df               SS            MS (SS/df)     F (MSReg/MSRes)   Significance F (from F dist table)
Regression   Σ(y_predicted - y_mean)²       p = 6            600073213.5   100012202.3    7637.75           0
Residual     Σ(y_actual - y_predicted)²     n-p-1 = 15283    200122584.4   13094.457
Total        Σ(y_actual - y_mean)²          n-1 = 15289      800195798

3. Regression Statistics:
Multiple R          = SquareRoot(R Square)               = 0.8659723
R Square            = SS Regression / SS Total           = 0.7499080
Adjusted R Square   = R² - (1 - R²) × {p/(n-p-1)}        = 0.7498098
Standard Error      = SquareRoot{SS Residual/(n-p-1)}    = 114.4310136
Observations        = n                                  = 15290
Multivariate Linear Regression- Output (Significance Test)
The insignificant variable (Number of Vehicles) is dropped and the regression is re-run. The coefficient table of the refitted model:
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 625.005 4.723 132.333 0.00 615.7474 634.2625 615.7474 634.2625
Avg Age -5.560 0.065 -84.942 0.00 -5.6879 -5.4314 -5.6879 -5.4314
Gender Dummy 50.883 1.891 26.912 0.00 47.1768 54.5890 47.1768 54.5890
Married Dummy 78.402 1.921 40.806 0.00 74.6356 82.1677 74.6356 82.1677
Avg Vehicle Age -15.142 0.267 -56.641 0.00 -15.6660 -14.6180 -15.6660 -14.6180
Fuel Type Dummy 267.935 2.748 97.489 0.00 262.5480 273.3223 262.5480 273.3223
[Figure: residuals of the fitted model plotted against observation index (0 to 14,000), and average residuals across bins of equal # policies]
[Figure: heteroskedasticity; the spread of the error u around predicted y changes across X, from low variance of residual terms to high variance of residual terms]
Detecting Heteroskedasticity
Heteroskedasticity can be detected either by viewing scatter plots, as discussed in the previous
case, or by the Breusch-Pagan chi-square test.
In the Breusch-Pagan chi-square test, the squared residuals are regressed on the independent variables to
check whether the independent variables explain a significant proportion of the squared residuals or not.
If the independent variables explain a significant proportion of the squared residuals, then we
conclude that conditional heteroskedasticity is present; otherwise it is not.
Breusch-Pagan test statistic follows a chi-square distribution with k degrees of freedom, where k is
the number of independent variables.
BP chi-square test statistic = n × R²_resid
where:
n = number of observations
R²_resid = coefficient of determination when the squared residuals are regressed on the independent variables
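A hand-rolled version of the test in R on simulated heteroskedastic data (packages such as lmtest also provide this as bptest()):
# Breusch-Pagan statistic: n * R-squared from regressing squared residuals on X
set.seed(2)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = x)   # error variance grows with x (heteroskedastic)
fit <- lm(y ~ x)
aux <- lm(residuals(fit)^2 ~ x)       # auxiliary regression of squared residuals
bp <- length(y) * summary(aux)$r.squared
p_val <- 1 - pchisq(bp, df = 1)       # k = 1 independent variable
c(bp, p_val)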
There are two methods for correcting the effects of conditional heteroskedasticity
Robust Standard Errors
Correct the standard errors of the linear regression model's estimated coefficients to account for
conditional heteroskedasticity
Generalized Least Squares
Modifies the original equation in an attempt to eliminate heteroskedasticity.
Statistical packages are available for computing robust standard errors.
1. Segment the policies using the profiling bands: Age Band (16-25, 26-59, 60+), Marital Status (Married, Single), and Vehicle Age Band (0-5, 6-10, 11+).
2. Find the standard deviation of capped losses for the segments. Detailed methodology explained in MS Excel.
3. Calculate standardized capped losses as Capped Losses / Segment Std Dev. This becomes the new response variable.
Manually doing this kind of exercise can be flawed, as some of the segments could be sparsely populated.
This demo is just to explain the underlying technique/methodology.
Statistical packages like SAS and R have in-built capability to take care of this.
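A toy R sketch of steps 2 and 3; the losses and segment labels below are invented, not the course data:
# Standardize capped losses by segment standard deviation
losses <- c(120, 340, 560, 90, 410, 300, 700, 150)
segment <- c("A", "A", "B", "A", "B", "B", "B", "A")
seg_sd <- ave(losses, segment, FUN = sd)   # segment-level standard deviation
std_losses <- losses / seg_sd              # the new response variable
data.frame(segment, losses, seg_sd, std_losses)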
Regression Statistics
Multiple R 0.359167467
R Square 0.129001269
Adjusted R Square 0.128716331
Standard Error 4.77078689
Observations 15290
The Fuel Type dummy is insignificant, which is questionable as D and P have significantly different mean losses.
ANOVA
df SS MS F Significance F
Regression 5 51522.10 10304.42 452.73 0
Residual 15284 347870.07 22.76
Total 15289 399392.17
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 12.476 0.197 63.374 0.000 12.091 12.862 12.091 12.862
Avg Age -0.086 0.003 -31.554 0.000 -0.091 -0.081 -0.091 -0.081
Gender Dummy 0.213 0.079 2.702 0.007 0.058 0.368 0.058 0.368
Married Dummy -0.204 0.080 -2.552 0.011 -0.361 -0.047 -0.361 -0.047
Avg Vehicle Age -0.376 0.011 -33.770 0.000 -0.398 -0.354 -0.398 -0.354
Fuel Type Dummy 0.136 0.115 1.188 0.235 -0.088 0.361 -0.088 0.361
# Generate plots to see the relation between the independent variables and the dependent variable
plot(DefaultData3$Age, DefaultData3$CappedLosses)
plot(DefaultData3$Years.of.Driving.Experience, DefaultData3$CappedLosses)
plot(DefaultData3$Number.of.Vehicles, DefaultData3$CappedLosses)
plot(DefaultData3$Gender, DefaultData3$CappedLosses)
plot(DefaultData3$Married, DefaultData3$CappedLosses)
plot(DefaultData3$Vehicle.Age, DefaultData3$CappedLosses)
plot(DefaultData3$Fuel.Type, DefaultData3$CappedLosses)

# Need to see the losses distribution among all independent variables.
# Pivot table in R with melt and cast.
install.packages("reshape")
library("reshape")

# First look at the data names by which we want to create the pivot table
names(DefaultData3)

# Melt the data: melt identifies the items to be summed/averaged (called measures)
# and separates out the id variables by which we want to add/average
data.m <- melt(DefaultData3, id = c(1:8), measure = c(9))

# Let's look at the different values of our new object
head(data.m)
# CappedLosses have been melted

# We will use the data which has been converted into bands and dummy variables. Read the final data
DefaultData4 <- read.csv("Linear_Reg_Sample_Data.csv")
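A possible next step is the multivariate fit itself; the column names below are only assumed to exist in DefaultData4 and may need adjusting to the actual file:
# Fit the multivariate regression on the prepared data (column names assumed)
fit_mv <- lm(CappedLosses ~ Age + Number.of.Vehicles + Gender + Married +
               Vehicle.Age + Fuel.Type, data = DefaultData4)
summary(fit_mv)   # coefficients, standard errors, t stats, p-values, R-squared
confint(fit_mv)   # 95% confidence intervals for the coefficients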
help@edupristine.com
www.edupristine.com/ca