
Correlation and Regression

R. Venkatesakumar
Department of Management Studies
School of Management, Pondicherry University
Puducherry, INDIA
Correlation Coefficient

The correlation coefficient is a measure that quantifies the co-variation, or association, between two variables. For example:
Is the advertising budget associated with the overall target / sales?
Maintenance expenditure & mileage
Age of the vehicle & maintenance expenditure
Why we need to study correlation?

Number of Credit Cards (Y)   Family Size (V1)
 4                            2
 6                            2
 6                            4
 7                            4
 8                            5
 7                            5
 8                            6
10                            6

Check whether the Column 1 data and Column 2 data show any association / pattern.

Why we need to study correlation?
If the Column 1 data and Column 2 data show any association / pattern, then the relation / association can be explored.
On occasion, knowing the behaviour of one variable, you can trace the behaviour of the other.

The correlation coefficient for two variables X and Y is denoted by r (or r_xy).

a. r ranges from +1 to -1
b. r = +1 indicates a perfect positive linear relationship
c. r = -1 indicates a perfect negative linear relationship
d. r = 0 indicates no linear correlation
Simple Correlation Coefficient

r_xy = r_yx = Σ(Xi - X̄)(Yi - Ȳ) / √[ Σ(Xi - X̄)² · Σ(Yi - Ȳ)² ]

In deviation form, writing x = Xi - X̄ and y = Yi - Ȳ:

r_xy = r_yx = Σxy / √( Σx² · Σy² )
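The deviation-form formula above can be checked numerically. A minimal sketch in plain Python, using the credit-card data that appears later in these slides (the helper name `pearson_r` is introduced here for illustration):

```python
import math

# Credit-card data used later in these slides:
# Y = number of credit cards, V1 = family size
Y  = [4, 6, 6, 7, 8, 7, 8, 10]
V1 = [2, 2, 4, 4, 5, 5, 6, 6]

def pearson_r(x, y):
    # r = S_xy / sqrt(S_xx * S_yy), computed from mean deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(round(pearson_r(V1, Y), 4))   # 0.8664, as in the correlation matrix slide
```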
Correlation

A scatter plot of the data gives a visual interpretation of the data.

Correlation Patterns
[Scatter plot: Y against X, no correlation]

Correlation Patterns
[Scatter plot: Y against X, a high positive correlation, r = +.98]
Correlation Does Not Mean Causation

High correlation:
Roosters crow at the rising of the sun, but the rooster does not cause the sun to rise.
Teachers' salaries and the consumption of liquor co-vary because they are both influenced by a third variable.
An Illustration of Multiple Regression

Number of Credit Cards (Y)   Family Size (V1)   Family Income (V2)   Number of cars (V3)
 4                            2                  14                   1
 6                            2                  16                   2
 6                            4                  14                   2
 7                            4                  17                   1
 8                            5                  18                   3
 7                            5                  21                   2
 8                            6                  17                   1
10                            6                  25                   2

Mean representation

If the mean alone is used as the representative for the data:
Mean = Total / Number of observations = 56 / 8 = 7

Number of Credit Cards (Y)   Y - Ȳ   (Y - Ȳ)²
 4                           -3       9
 6                           -1       1
 6                           -1       1
 7                            0       0
 8                            1       1
 7                            0       0
 8                            1       1
10                            3       9
Total                         0      22

The total error due to mean representation of the data is 22.
How to reduce the error and get a better representation for the data?
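The slide's arithmetic can be reproduced directly; a quick sketch:

```python
# Number of credit cards, from the slides
Y = [4, 6, 6, 7, 8, 7, 8, 10]

mean = sum(Y) / len(Y)                        # 56 / 8 = 7
total_error = sum((y - mean) ** 2 for y in Y) # squared deviations from the mean

print(mean, total_error)   # 7.0 22.0
```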
How to reduce error?

The mean produces a lower (squared) error term than the median or the mode (if the variable is normal, all three will produce equal error terms).
This error term plays a significant role in all the statistical techniques (variance).
We need a new variable to reduce the error in the variable under consideration (the dependent variable).
Variables that are related (correlated) to the variable considered (the Y / dependent variable) may be very useful.

Correlation among variables

                             Credit Cards (Y)   Family Size (V1)   Family Income (V2)   Cars (V3)
Number of Credit Cards (Y)   1.0000
Family Size (V1)             0.8664             1.0000
Family Income (V2)           0.8290             0.6727             1.0000
Number of cars (V3)          0.3419             0.1917             0.3008               1.0000

The variable with the highest correlation with Y is considered first in the regression.
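The first column of the matrix above can be reproduced with a plain-Python sketch (variable names are the slides' own; only the helper `pearson_r` is introduced here):

```python
import math

# Data from the multiple-regression illustration
Y  = [4, 6, 6, 7, 8, 7, 8, 10]          # number of credit cards
V1 = [2, 2, 4, 4, 5, 5, 6, 6]           # family size
V2 = [14, 16, 14, 17, 18, 21, 17, 25]   # family income
V3 = [1, 2, 2, 1, 3, 2, 1, 2]           # number of cars

def pearson_r(x, y):
    # Pearson correlation from mean deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

for name, v in [("V1", V1), ("V2", V2), ("V3", V3)]:
    print(name, round(pearson_r(v, Y), 4))
# V1 0.8664, V2 0.829, V3 0.3419 -> V1 has the highest correlation with Y
```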
Prediction accuracy of regression
Prediction using Y = 2.871 + 0.971X

Number of Credit Cards (Y)   Family Size (V1)   Prediction   Y - Ŷ    (Y - Ŷ)²
 4                            2                  4.8143      -0.81     0.663
 6                            2                  4.8143       1.19     1.406
 6                            4                  6.7571      -0.76     0.573
 7                            4                  6.7571       0.24     0.059
 8                            5                  7.7286       0.27     0.074
 7                            5                  7.7286      -0.73     0.531
 8                            6                  8.7000      -0.70     0.490
10                            6                  8.7000       1.30     1.690
Total                                                         0.00     5.486

Prediction error due to regression (unexplained by regression): 5.486
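The fitted line Y = 2.871 + 0.971X and the residual sum of squares can be reproduced with the usual least-squares formulas; a sketch:

```python
Y  = [4, 6, 6, 7, 8, 7, 8, 10]   # number of credit cards
V1 = [2, 2, 4, 4, 5, 5, 6, 6]    # family size

n = len(Y)
my, mx = sum(Y) / n, sum(V1) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(V1, Y))
sxx = sum((x - mx) ** 2 for x in V1)

slope = sxy / sxx                 # least-squares slope
intercept = my - slope * mx       # least-squares intercept
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(V1, Y))

print(round(slope, 3), round(intercept, 3), round(sse, 3))
# 0.971 2.871 5.486 -- matching the slide
```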

Mathematics of regression

Total error (due to mean representation):   22
Error in regression prediction:             5.486
Error reduced by regression analysis:       16.514

Error Reduction Ability of regression = 16.514 / 22 = 0.7506
Correlation squared = 0.8664 × 0.8664 = 0.7506
Coefficient of determination (R²) = 0.7506
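That the error-reduction ratio equals the squared correlation can be verified numerically; a self-contained sketch on the same data:

```python
Y  = [4, 6, 6, 7, 8, 7, 8, 10]   # number of credit cards
V1 = [2, 2, 4, 4, 5, 5, 6, 6]    # family size

n = len(Y)
my, mx = sum(Y) / n, sum(V1) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(V1, Y))
sxx = sum((x - mx) ** 2 for x in V1)
syy = sum((y - my) ** 2 for y in Y)

r_squared = sxy ** 2 / (sxx * syy)   # squared correlation of V1 and Y
sst = syy                            # total error around the mean: 22
sse = sst - sxy ** 2 / sxx           # residual error after regression

print(round(r_squared, 4), round(1 - sse / sst, 4))   # both 0.7506
```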

Excel / SPSS output

Regression Statistics
Multiple R           0.8664
R Square             0.7506
Adjusted R Square    0.7091
Standard Error       0.9562
Observations         8

ANOVA
             df   SS          MS          F        Significance F
Regression   1    16.514286   16.514286   18.0625  0.00538014
Residual     6    5.4857143   0.9142857
Total        7    22

               Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
Intercept      2.8714         1.0286           2.7917    0.0315     0.3546      5.3883
X Variable 1   0.9714         0.2286           4.2500    0.0054     0.4121      1.5307
Reducing the error further

To reduce the error in the dependent variable further, we use one more independent variable.
Choose the variable with the next highest correlation with Y (preferably one with low correlation with the independent variable(s) already in the model).
Correlation among the independent variables is referred to as multicollinearity; it influences the further predictions.

Prediction using
Y = 0.482 + 0.63 V1 + 0.216 V2

Number of Credit Cards (Y)   Family Size (V1)   Family Income (V2)   Prediction   Y - Ŷ    (Y - Ŷ)²
 4                            2                  14                   4.77        -0.77     0.5868
 6                            2                  16                   5.20         0.80     0.6432
 6                            4                  14                   6.03        -0.03     0.0007
 7                            4                  17                   6.67         0.33     0.1063
 8                            5                  18                   7.52         0.48     0.2304
 7                            5                  21                   8.17        -1.17     1.3642
 8                            6                  17                   7.93         0.07     0.0044
10                            6                  25                   9.66         0.34     0.1142
Total                                                                              0.00     3.0501
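The two-predictor coefficients can be obtained by solving the 2×2 normal equations in deviation form (Cramer's rule, plain Python); a sketch:

```python
Y  = [4, 6, 6, 7, 8, 7, 8, 10]          # number of credit cards
V1 = [2, 2, 4, 4, 5, 5, 6, 6]           # family size
V2 = [14, 16, 14, 17, 18, 21, 17, 25]   # family income

n = len(Y)
my, m1, m2 = sum(Y) / n, sum(V1) / n, sum(V2) / n

# Sums of squares and cross-products in deviation form
s11 = sum((a - m1) ** 2 for a in V1)
s22 = sum((b - m2) ** 2 for b in V2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(V1, V2))
s1y = sum((a - m1) * (y - my) for a, y in zip(V1, Y))
s2y = sum((b - m2) * (y - my) for b, y in zip(V2, Y))

det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / det      # coefficient of V1
b2 = (s2y * s11 - s1y * s12) / det      # coefficient of V2
b0 = my - b1 * m1 - b2 * m2             # intercept

sse = sum((y - (b0 + b1 * a + b2 * b)) ** 2
          for a, b, y in zip(V1, V2, Y))
print(round(b0, 3), round(b1, 3), round(b2, 3), round(sse, 2))
# 0.482 0.632 0.216 3.05  (the slide rounds b1 to 0.63)
```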

Total error (due to mean representation):   22
Error in regression prediction:             3.050
Error reduced by regression analysis:       18.950

Error Reduction Ability of regression = 18.950 / 22 = 0.8614

The logic

The logic is slightly different: instead of correlation, we have to consider partial correlation.
Partial correlation measures the variation in the dependent variable not accounted for by the variable(s) already in the equation that can be accounted for by each of these additional variables.

Consider the correlation & partial correlation

Correlations (VAR00001 = Y, VAR00002 = V1, VAR00003 = V2, VAR00004 = V3)

Control variables: -none- (a)
                           VAR00001   VAR00003   VAR00004   VAR00002
VAR00001  Correlation       1.000      .829       .342       .866
          Sig. (2-tailed)     .        .011       .407       .005
          df                  0         6          6          6
VAR00003  Correlation        .829     1.000       .301       .673
          Sig. (2-tailed)   .011        .         .469       .068
          df                  6         0          6          6
VAR00004  Correlation        .342      .301      1.000       .192
          Sig. (2-tailed)   .407      .469         .         .649
          df                  6         6          0          6
VAR00002  Correlation        .866      .673       .192      1.000
          Sig. (2-tailed)   .005      .068       .649         .
          df                  6         6          6          0

Control variable: VAR00002
                           VAR00001   VAR00003   VAR00004
VAR00001  Correlation       1.000      .666       .359
          Sig. (2-tailed)     .        .102       .429
          df                  0         5          5
VAR00003  Correlation        .666     1.000       .237
          Sig. (2-tailed)   .102        .         .609
          df                  5         0          5
VAR00004  Correlation        .359      .237      1.000
          Sig. (2-tailed)   .429      .609         .
          df                  5         5          0

a. Cells contain zero-order (Pearson) correlations.

The logic

The highest partial correlation is between V2 & Y (0.666).

Variation accounted for by the fitted model:   0.7506
Remaining variance:                            0.2494
Partial correlation:                           0.666
Partial correlation squared:                   0.4436

The incremental variance accounted for by entering V2:
Error reduction = Remaining variance × Partial correlation squared
= 0.2494 × (0.666 × 0.666) = 0.11062
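The 0.666 partial correlation and the roughly 0.11 incremental variance can be reproduced from the zero-order correlations with the standard first-order partial-correlation formula; a sketch:

```python
import math

# Zero-order correlations from the earlier matrix
r_y1 = 0.8664   # Y with V1 (family size, already in the model)
r_y2 = 0.8290   # Y with V2 (family income, the candidate variable)
r_12 = 0.6727   # V1 with V2

# Partial correlation of Y and V2, controlling for V1
r_y2_1 = (r_y2 - r_y1 * r_12) / math.sqrt((1 - r_y1 ** 2) * (1 - r_12 ** 2))
print(round(r_y2_1, 3))          # 0.666

# Incremental variance explained when V2 enters the model
remaining = 1 - r_y1 ** 2        # about 0.2494
increment = remaining * r_y2_1 ** 2
print(round(increment, 2))       # 0.11
```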

Thank you
