
Correlation and Regression

R. Venkatesakumar
Department of Management Studies
School of Management, Pondicherry University
Puducherry, INDIA
Correlation Coefficient

The correlation coefficient is a measure that quantifies the co-variation, or association, between two variables. For example:
Is the advertising budget associated with the overall target / sales?
Maintenance expenditure & mileage
Age of the vehicle & maintenance expenditure
Why we need to study correlation?

Number of Credit Cards (Y)   Family Size (V1)
 4                            2
 6                            2
 6                            4
 7                            4
 8                            5
 7                            5
 8                            6
10                            6

Check whether the Column 1 data and Column 2 data show any association / pattern.

Why we need to study correlation?
If the Column 1 data and Column 2 data show any association / pattern, then the relation / association can be explored.
On occasion, knowing the behaviour of one variable, you can trace the behaviour of the other.

The correlation coefficient for two variables X and Y is denoted by r (or r_xy).

a. r ranges from +1 to -1
b. r = +1 indicates a perfect positive linear relationship
c. r = -1 indicates a perfect negative linear relationship
d. r = 0 indicates no linear correlation
Simple Correlation Coefficient

r_xy = r_yx = Σ(Xi - X̄)(Yi - Ȳ) / √[ Σ(Xi - X̄)² · Σ(Yi - Ȳ)² ]

In deviation form, writing x = Xi - X̄ and y = Yi - Ȳ:

r_xy = r_yx = Σxy / √( Σx² · Σy² )
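The deviation-form formula above can be checked numerically. A minimal sketch in plain Python, using the credit-card data that appears later in these slides (the helper name `pearson_r` is introduced here for illustration):

```python
import math

# Credit-card data used later in these slides:
# Y = number of credit cards, V1 = family size
Y  = [4, 6, 6, 7, 8, 7, 8, 10]
V1 = [2, 2, 4, 4, 5, 5, 6, 6]

def pearson_r(x, y):
    # r = S_xy / sqrt(S_xx * S_yy), computed from mean deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(round(pearson_r(V1, Y), 4))   # 0.8664, as in the correlation matrix slide
```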
Correlation

A scatter plot of the data gives a visual interpretation of the data.

Correlation Patterns
[Scatter plot: Y against X, no correlation]

Correlation Patterns
[Scatter plot: Y against X, a high positive correlation, r = +.98]
Correlation Does Not Mean Causation

High correlation:
Roosters crow at the rising of the sun, but the rooster does not cause the sun to rise.
Teachers' salaries and the consumption of liquor co-vary because they are both influenced by a third variable.
An Illustration of Multiple Regression

Number of Credit Cards (Y)   Family Size (V1)   Family Income (V2)   Number of cars (V3)
 4                            2                  14                   1
 6                            2                  16                   2
 6                            4                  14                   2
 7                            4                  17                   1
 8                            5                  18                   3
 7                            5                  21                   2
 8                            6                  17                   1
10                            6                  25                   2

Mean representation

If the mean alone is used as the representative for the data:
Mean = Total / Number of observations = 56 / 8 = 7

Number of Credit Cards (Y)   Y - Ȳ   (Y - Ȳ)²
 4                           -3       9
 6                           -1       1
 6                           -1       1
 7                            0       0
 8                            1       1
 7                            0       0
 8                            1       1
10                            3       9
Total                         0      22

The total error due to mean representation of the data is 22.
How to reduce the error and get a better representation for the data?
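The slide's arithmetic can be reproduced directly; a quick sketch:

```python
# Number of credit cards, from the slides
Y = [4, 6, 6, 7, 8, 7, 8, 10]

mean = sum(Y) / len(Y)                        # 56 / 8 = 7
total_error = sum((y - mean) ** 2 for y in Y) # squared deviations from the mean

print(mean, total_error)   # 7.0 22.0
```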
How to reduce error?

The mean produces a lower (squared) error term than the median or the mode (if the variable is normal, all three will produce equal error terms).
This error term plays a significant role in all the statistical techniques (variance).
We need a new variable to reduce the error in the variable under consideration (the dependent variable).
Variables that are related (correlated) to the variable considered (the Y / dependent variable) may be very useful.

Correlation among variables

                             Credit Cards (Y)   Family Size (V1)   Family Income (V2)   Cars (V3)
Number of Credit Cards (Y)   1.0000
Family Size (V1)             0.8664             1.0000
Family Income (V2)           0.8290             0.6727             1.0000
Number of cars (V3)          0.3419             0.1917             0.3008               1.0000

The variable with the highest correlation with Y is considered first in the regression.
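The first column of the matrix above can be reproduced with a plain-Python sketch (variable names are the slides' own; only the helper `pearson_r` is introduced here):

```python
import math

# Data from the multiple-regression illustration
Y  = [4, 6, 6, 7, 8, 7, 8, 10]          # number of credit cards
V1 = [2, 2, 4, 4, 5, 5, 6, 6]           # family size
V2 = [14, 16, 14, 17, 18, 21, 17, 25]   # family income
V3 = [1, 2, 2, 1, 3, 2, 1, 2]           # number of cars

def pearson_r(x, y):
    # Pearson correlation from mean deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

for name, v in [("V1", V1), ("V2", V2), ("V3", V3)]:
    print(name, round(pearson_r(v, Y), 4))
# V1 0.8664, V2 0.829, V3 0.3419 -> V1 has the highest correlation with Y
```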
Prediction accuracy of regression
Prediction using Y = 2.871 + 0.971X

Number of Credit Cards (Y)   Family Size (V1)   Prediction   Y - Ŷ    (Y - Ŷ)²
 4                            2                  4.8143      -0.81     0.663
 6                            2                  4.8143       1.19     1.406
 6                            4                  6.7571      -0.76     0.573
 7                            4                  6.7571       0.24     0.059
 8                            5                  7.7286       0.27     0.074
 7                            5                  7.7286      -0.73     0.531
 8                            6                  8.7000      -0.70     0.490
10                            6                  8.7000       1.30     1.690
Total                                                         0.00     5.486

Prediction error due to regression (unexplained by regression): 5.486
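The fitted line Y = 2.871 + 0.971X and the residual sum of squares can be reproduced with the usual least-squares formulas; a sketch:

```python
Y  = [4, 6, 6, 7, 8, 7, 8, 10]   # number of credit cards
V1 = [2, 2, 4, 4, 5, 5, 6, 6]    # family size

n = len(Y)
my, mx = sum(Y) / n, sum(V1) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(V1, Y))
sxx = sum((x - mx) ** 2 for x in V1)

slope = sxy / sxx                 # least-squares slope
intercept = my - slope * mx       # least-squares intercept
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(V1, Y))

print(round(slope, 3), round(intercept, 3), round(sse, 3))
# 0.971 2.871 5.486 -- matching the slide
```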

Mathematics of regression

Total error (due to mean representation):   22
Error in regression prediction:             5.486
Error reduced by regression analysis:       16.514

Error Reduction Ability of regression = 16.514 / 22 = 0.7506
Correlation squared = 0.8664 × 0.8664 = 0.7506
Coefficient of determination (R²) = 0.7506
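That the error-reduction ratio equals the squared correlation can be verified numerically; a self-contained sketch on the same data:

```python
Y  = [4, 6, 6, 7, 8, 7, 8, 10]   # number of credit cards
V1 = [2, 2, 4, 4, 5, 5, 6, 6]    # family size

n = len(Y)
my, mx = sum(Y) / n, sum(V1) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(V1, Y))
sxx = sum((x - mx) ** 2 for x in V1)
syy = sum((y - my) ** 2 for y in Y)

r_squared = sxy ** 2 / (sxx * syy)   # squared correlation of V1 and Y
sst = syy                            # total error around the mean: 22
sse = sst - sxy ** 2 / sxx           # residual error after regression

print(round(r_squared, 4), round(1 - sse / sst, 4))   # both 0.7506
```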

Excel / SPSS output

Regression Statistics
Multiple R           0.8664
R Square             0.7506
Adjusted R Square    0.7091
Standard Error       0.9562
Observations         8

ANOVA
             df   SS          MS          F        Significance F
Regression   1    16.514286   16.514286   18.0625  0.00538014
Residual     6    5.4857143   0.9142857
Total        7    22

               Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
Intercept      2.8714         1.0286           2.7917    0.0315     0.3546      5.3883
X Variable 1   0.9714         0.2286           4.2500    0.0054     0.4121      1.5307
Reducing the error further

To reduce the error in the dependent variable further, we use one more independent variable.
Choose the variable with the next highest correlation with Y (preferably one with low correlation with the independent variable(s) already in the model).
Correlation among the independent variables is referred to as multicollinearity; it influences the further predictions.

Prediction using
Y = 0.482 + 0.63 V1 + 0.216 V2

Number of Credit Cards (Y)   Family Size (V1)   Family Income (V2)   Prediction   Y - Ŷ    (Y - Ŷ)²
 4                            2                  14                   4.77        -0.77     0.5868
 6                            2                  16                   5.20         0.80     0.6432
 6                            4                  14                   6.03        -0.03     0.0007
 7                            4                  17                   6.67         0.33     0.1063
 8                            5                  18                   7.52         0.48     0.2304
 7                            5                  21                   8.17        -1.17     1.3642
 8                            6                  17                   7.93         0.07     0.0044
10                            6                  25                   9.66         0.34     0.1142
Total                                                                              0.00     3.0501
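The two-predictor coefficients can be obtained by solving the 2×2 normal equations in deviation form (Cramer's rule, plain Python); a sketch:

```python
Y  = [4, 6, 6, 7, 8, 7, 8, 10]          # number of credit cards
V1 = [2, 2, 4, 4, 5, 5, 6, 6]           # family size
V2 = [14, 16, 14, 17, 18, 21, 17, 25]   # family income

n = len(Y)
my, m1, m2 = sum(Y) / n, sum(V1) / n, sum(V2) / n

# Sums of squares and cross-products in deviation form
s11 = sum((a - m1) ** 2 for a in V1)
s22 = sum((b - m2) ** 2 for b in V2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(V1, V2))
s1y = sum((a - m1) * (y - my) for a, y in zip(V1, Y))
s2y = sum((b - m2) * (y - my) for b, y in zip(V2, Y))

det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / det      # coefficient of V1
b2 = (s2y * s11 - s1y * s12) / det      # coefficient of V2
b0 = my - b1 * m1 - b2 * m2             # intercept

sse = sum((y - (b0 + b1 * a + b2 * b)) ** 2
          for a, b, y in zip(V1, V2, Y))
print(round(b0, 3), round(b1, 3), round(b2, 3), round(sse, 2))
# 0.482 0.632 0.216 3.05  (the slide rounds b1 to 0.63)
```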

Total error (due to mean representation):   22
Error in regression prediction:             3.050
Error reduced by regression analysis:       18.950

Error Reduction Ability of regression = 18.950 / 22 = 0.8614

The logic

The logic is slightly different: instead of correlation, we have to consider partial correlation.
Partial correlation measures the variation in the dependent variable not accounted for by the variable(s) already in the equation that can be accounted for by each of these additional variables.

Consider the correlation & partial correlation

Correlations (VAR00001 = Y, VAR00002 = V1, VAR00003 = V2, VAR00004 = V3)

Control variables: -none- (a)
                           VAR00001   VAR00003   VAR00004   VAR00002
VAR00001  Correlation       1.000      .829       .342       .866
          Sig. (2-tailed)     .        .011       .407       .005
          df                  0         6          6          6
VAR00003  Correlation        .829     1.000       .301       .673
          Sig. (2-tailed)   .011        .         .469       .068
          df                  6         0          6          6
VAR00004  Correlation        .342      .301      1.000       .192
          Sig. (2-tailed)   .407      .469         .         .649
          df                  6         6          0          6
VAR00002  Correlation        .866      .673       .192      1.000
          Sig. (2-tailed)   .005      .068       .649         .
          df                  6         6          6          0

Control variable: VAR00002
                           VAR00001   VAR00003   VAR00004
VAR00001  Correlation       1.000      .666       .359
          Sig. (2-tailed)     .        .102       .429
          df                  0         5          5
VAR00003  Correlation        .666     1.000       .237
          Sig. (2-tailed)   .102        .         .609
          df                  5         0          5
VAR00004  Correlation        .359      .237      1.000
          Sig. (2-tailed)   .429      .609         .
          df                  5         5          0

a. Cells contain zero-order (Pearson) correlations.

The logic

The highest partial correlation is between V2 & Y (0.666).

Variation accounted for by the fitted model:   0.7506
Remaining variance:                            0.2494
Partial correlation:                           0.666
Partial correlation squared:                   0.4436

The incremental variance accounted for by entering V2:
Error reduction = Remaining variance × Partial correlation squared
= 0.2494 × (0.666 × 0.666) = 0.11062
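The 0.666 partial correlation and the roughly 0.11 incremental variance can be reproduced from the zero-order correlations with the standard first-order partial-correlation formula; a sketch:

```python
import math

# Zero-order correlations from the earlier matrix
r_y1 = 0.8664   # Y with V1 (family size, already in the model)
r_y2 = 0.8290   # Y with V2 (family income, the candidate variable)
r_12 = 0.6727   # V1 with V2

# Partial correlation of Y and V2, controlling for V1
r_y2_1 = (r_y2 - r_y1 * r_12) / math.sqrt((1 - r_y1 ** 2) * (1 - r_12 ** 2))
print(round(r_y2_1, 3))          # 0.666

# Incremental variance explained when V2 enters the model
remaining = 1 - r_y1 ** 2        # about 0.2494
increment = remaining * r_y2_1 ** 2
print(round(increment, 2))       # 0.11
```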

Thank you
