
Full-time MBA: Business Statistics (BST510)

Bivariate Regression
Analysis and Forecasting

Section three

Paul Bottomley
Bottomleypa@cardiff.ac.uk
Silver, Ch.7, pp.116-121.
Regression Analysis
Regression is an obvious extension to correlation. It is
useful for building statistical models and making forecasts.

The correlation coefficient (r) shows how closely the data points
on a scatter plot lie to a straight line.
But two TV brands can have different price sensitivities
(slopes) yet have the same correlation. Only regression
can identify the best-fitting lines.
Correlation also tells us nothing about causation.
But to undertake a regression, we must specify in
advance the direction of causation (X → Y).

Dependent and Independent Variables
To estimate a regression, we must decide first which
variable is independent and which variable is dependent.

X (independent) causes Y (dependent).

Consider the following examples and identify X and Y:


Car age and Car Price
Number of ice creams sold and Temperature
Customer satisfaction and Brand loyalty.
If the labels are the wrong way round, the calculation
will be wrong, yet it does not matter for correlation!
Dependent and Independent Variables:
Some More Examples
Think of some possible independent variables (Xs) that
might explain / predict:

Family spending on main annual holiday (Y).


Number of children
Length of holiday (1 or 2 weeks)
Home or abroad

A film's box-office revenue on its opening weekend (Y).


Number of cinema screens (distribution)
Oscar winning / nominated stars or director
Certification (U, PG, X-rated)
Equation of a Straight Line
Examples of straight lines include:
Y = 2X: if X is 1 then Y equals 2; if X = 2 then Y = 4, etc.
Y = 3 + 2X: if X is 1 then Y equals 5; if X = 2 then Y = 7.
All have the same general form: Yi = a + bXi, where a is the
intercept and b is the slope.
[Scatter plot of Sales (Y) against Price (X), with the fitted line marking the intercept a and the slope b]
Interpreting Regression Lines
Two illustrations, each plotting a fitted straight line:
Salary review: Salary (£) against Years of Service, with separate
lines for women and men; here a = starting salary and
b = annual increment.
Annual total production costs: Costs (£) against Quantity / Output;
here a = fixed costs and b = marginal costs.
What is the Best Straight Line?
We could draw a line on the scatter plot by hand, but
each person would draw a different best fitting line.
A more precise method is Ordinary Least Squares.
The vertical distances between the data points and the
line are called errors. With n data points, we have n
errors, denoted e1, e2, e3, …, en.
[Scatter plot of Sales (Y) against Price (X) with the fitted line; the vertical distances from the points to the line are the errors, e.g. ej and ek]
What is the Best Straight Line?
Obviously, a good-fitting line will have small errors.
For ej, the line under-predicts sales for that model of TV.
For ek, the line over-predicts sales for that model of TV.
Because errors can be positive or negative, we square each
error so that they don't cancel each other out.
Ordinary Least Squares (OLS) fits the line that minimises:
e1² + e2² + e3² + … + en² = Σ ei² → min   (the sum runs over i = 1, …, n)
A perfect line has zero error: every point lies on the line!
The best-fitting line becomes our model of the world.
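
To make the idea concrete, here is a minimal Python sketch (not from the slides) of the quantity OLS minimises; the data points and the two candidate lines are purely hypothetical.

```python
# Sum of squared errors (SSE): the quantity that Ordinary Least Squares minimises.
# Hypothetical data and candidate lines, for illustration only.

def sum_squared_errors(a, b, xs, ys):
    """SSE for the candidate line y = a + b*x over the data (xs, ys)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

prices = [100, 150, 200, 250]   # hypothetical prices (X)
sales  = [210, 160, 120,  70]   # hypothetical unit sales (Y)

print(sum_squared_errors(300, -0.9, prices, sales))   # 50.0   -> a close-fitting candidate line
print(sum_squared_errors(250, -0.5, prices, sales))   # 4250.0 -> a worse line has a much larger SSE
```

OLS searches over all possible (a, b) pairs for the one with the smallest SSE; the formulae on the next slide give that minimiser directly.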

The Regression Coefficients
Least squares estimates for the intercept (a) and slope (b)

b = (nΣXY − ΣX·ΣY) / (nΣX² − (ΣX)²) = Covariance(X,Y) / Variance(X)

a = Ȳ − bX̄,  where Ȳ = ΣY / n and X̄ = ΣX / n

These formulae give the best-fitting straight line. The calculations
are based on the same table used for the correlation (r).
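
As a sketch (not part of the slides), the same formulae translate directly into Python, working from the sums ΣX, ΣY, ΣXY and ΣX²; the toy data at the end are hypothetical.

```python
# Least-squares coefficients from the formulae above:
#   b = (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - sum(X)^2),   a = mean(Y) - b*mean(X)

def ols_coefficients(xs, ys):
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b

# Quick check on a perfectly linear toy data set (Y = 1 + 2X): returns a = 1.0, b = 2.0
print(ols_coefficients([1, 2, 3], [3, 5, 7]))
```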
Demand for Nikkai Televisions
Sales (Y)   Price (X)
213         132
192         181
168         200
160         149
119         191
96          163
79          220
74          186
68          260
[Scatter plot of Sales against Price (£)]

Nikkai targets a price-sensitive market segment. The
relationship between sales and price appears linear, with
no obvious outliers (TV models that don't fit the pattern).
Nikkai: Sales = f(Price);
Correlation Coefficient Revisited
Sales (Y)  Price (X)   XY       X²       Y²
213        132         28116    17424    45369
192        181         34752    32761    36864
168        200         33600    40000    28224
160        149         23840    22201    25600
119        191         22729    36481    14161
96         163         15648    26569     9216
79         220         17380    48400     6241
74         186         13764    34596     5476
68         260         17680    67600     4624
Totals     1169        1682     207509   326032   175775

Calculating Regression Coefficients
r = [(9 × 207509) − (1682 × 1169)] / √{[(9 × 326032) − 1682²] × [(9 × 175775) − 1169²]} = −0.656

Pearson's correlation: a stronger relationship than for B&O.

b = (nΣXY − ΣX·ΣY) / (nΣX² − (ΣX)²) = [(9 × 207509) − (1682 × 1169)] / [(9 × 326032) − 1682²]

b = −98677 / 105164 = −0.94 (the slope or gradient)

a = Ȳ − bX̄ = (1169 / 9) − (−0.94 × 1682 / 9) = 305.55

a = 305.55 (the intercept). Beware the double negative!
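
These figures can be checked in Python from the column totals on the previous slide (a sketch; note that the slide rounds b to −0.94 before computing a, quoting a = 305.55).

```python
import math

# Column totals from the Nikkai table (n = 9 TV models)
n, sum_x, sum_y = 9, 1682, 1169
sum_xy, sum_x2, sum_y2 = 207509, 326032, 175775

num   = n * sum_xy - sum_x * sum_y     # -98677
den_b = n * sum_x2 - sum_x ** 2        # 105164

r = num / math.sqrt(den_b * (n * sum_y2 - sum_y ** 2))
b = num / den_b
a = sum_y / n - b * (sum_x / n)

print(round(r, 3))   # -0.656 (Pearson's correlation)
print(round(b, 2))   # -0.94  (slope)
print(round(a, 1))   # about 305.3 with the unrounded slope; the slide quotes 305.55, computed with b = -0.94
```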
Nikkai: Sales = f(Price)
[Fitted regression line over the scatter of Unit Sales against Price (£): intercept = 305.55, slope = −0.94]

The regression of sales on price: Ŷ = 305.55 − 0.94 × Price
Note: the regression line dips below the horizontal X-axis when price exceeds about £325.
Interpreting the Regression Line
Nikkai: Ŷ = 305.55 − 0.94 × Price
Performing a similar analysis for B&O:
B&O: Ŷ = 207.57 − 0.09 × Price
The intercept (a) tells us the expected number of sales if the
price were zero: 306 units when giving TVs away!
The slope (b) tells us the impact of a one-unit change in X on Y.
If price rises by 1 unit, we expect sales to fall by 0.94 units.
But we must convert this into the actual values of X and Y:
if price rises by £1, we expect sales to fall by 0.94 TV sets.
Interpreting the Regression Line

Nikkai: Ŷ = 305.55 − 0.94 × Price
The intercept (a) tells us the value of Y when X = 0.
In this context, it shows the expected number of sales if the
price were zero (£0): 306 units when giving TVs away!
The slope (b) tells us the impact of a one-unit change in X on Y.
If price rises by 1 unit, we expect sales to fall by 0.94 units.
In this context, considering the actual units of X and Y:
if price rises by £1, we expect sales to fall by 0.94 TV sets.
B&O: Ŷ = 207.57 − 0.09 × Price
A More Complicated Example
Imagine a regression model of the relationship between
advertising expenditure (in £1,000s) and sales revenue (in £10,000s):
Ŷ = 1.20 + 0.54 × Ad_Spend
Intercept a: if ad_spend were zero, we would expect predicted
sales revenue of £12,000 (1.2 × £10,000).
Slope b: if ad_spend increases by 1 unit, we expect sales
revenue to increase by 0.54 units.
If ad_spend rises by £1,000, sales revenue will rise by £5,400
(0.54 × £10,000).
Always consider X and Y's units of measurement.
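
A tiny sketch of the unit conversion described above, using the values from this example:

```python
# Convert the regression, stated in model units, back into pounds.
# Model: revenue_hat = 1.20 + 0.54 * ad_spend,
# where ad_spend is measured in £1,000s and revenue in £10,000s.
intercept, slope = 1.20, 0.54

print(intercept * 10_000)   # 12000.0 -> expected revenue of £12,000 when ad spend is zero
print(slope * 10_000)       # 5400.0  -> each extra £1,000 of ad spend adds about £5,400 of revenue
```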


How Well Does the Regression Model
Predict / Forecast Sales?

We can use the regression to predict TV sales at different
prices by substituting values of X (prices) into the equation.

Q: What would expected sales be if the price were £200?
Ŷ = 305.55 − (0.94 × 200) = 117.55 units

Q: What would expected sales be if the price were £100?
Ŷ = 305.55 − (0.94 × 100) = 211.55 units
All forecasts / predictions lie on the regression line, but the
regression line doesn't necessarily pass through all the data points.
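
A minimal sketch of the same substitution, using the fitted Nikkai coefficients:

```python
# Predict unit sales from price with the fitted Nikkai line: sales_hat = 305.55 - 0.94 * price
def predict_sales(price, a=305.55, b=-0.94):
    return a + b * price

print(round(predict_sales(200), 2))   # 117.55 units at a price of £200
print(round(predict_sales(100), 2))   # 211.55 units at a price of £100
```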
How Well Does the Regression Model
Predict / Forecast Sales?

We can use the regression to predict TV sales at different
prices by substituting values of X (prices) into the equation.

Q: What price should we set if we want sales of 200 units?

Ŷ = 305.55 − 0.94X

Set 200 = 305.55 − 0.94X and solve for X:

0.94X = 105.55, so X = 105.55 / 0.94 = £112.29
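
The rearrangement is equally simple in code (a sketch, again using the fitted Nikkai coefficients):

```python
# Invert the fitted Nikkai line: the price that gives a target level of sales.
def price_for_target_sales(target, a=305.55, b=-0.94):
    return (target - a) / b

print(round(price_for_target_sales(200), 2))   # 112.29 -> a price of about £112.29 for sales of 200 units
```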

Nikkai: Actual vs. Predicted Sales
[Scatter of actual Unit Sales against Price (£) with the fitted regression line; predictions marked at prices P1 and P2]

It is risky to make predictions beyond the given data range (X =
£132 to £260), as we must assume the relationship still holds true.
But the line will always pass through the mean of X and the mean of Y.
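
The second point follows from a = Ȳ − bX̄, which forces the fitted value at X̄ to equal Ȳ; a quick numerical check on the Nikkai data (a sketch):

```python
# The OLS line always passes through (mean of X, mean of Y), because a = y_bar - b*x_bar.
prices = [132, 181, 200, 149, 191, 163, 220, 186, 260]   # X (Nikkai)
sales  = [213, 192, 168, 160, 119,  96,  79,  74,  68]   # Y

n = len(prices)
x_bar, y_bar = sum(prices) / n, sum(sales) / n
b = (n * sum(x * y for x, y in zip(prices, sales)) - sum(prices) * sum(sales)) / \
    (n * sum(x * x for x in prices) - sum(prices) ** 2)
a = y_bar - b * x_bar

print(round(a + b * x_bar, 4), round(y_bar, 4))   # both 129.8889 -> the fitted line hits the mean point
```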
Coefficient of Determination (R²)
R² equals the correlation squared (r²): an index ranging from 0 to 1.
The higher the R², the stronger the association between Y and X,
and the more confidence we have in our predictions.
R² shows the proportion of the variance in Y explained by the
regression line. The unexplained variation, between the data
points and the line, is what we don't know.
[Diagram (Nikkai regression line): the total deviation (Yi − Ȳ) splits into the explained deviation (Ŷ − Ȳ) and the unexplained deviation (Yi − Ŷ)]
Nikkai: R² = r² = (−0.656)² = 0.43
So 57% of the variance is unexplained. What's missing from our model?
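
A sketch of how R² can be computed from the unexplained and total variation on the Nikkai data (it comes out at roughly 0.43, matching r²):

```python
# R^2 = 1 - (unexplained variation) / (total variation) for the Nikkai regression.
prices = [132, 181, 200, 149, 191, 163, 220, 186, 260]   # X
sales  = [213, 192, 168, 160, 119,  96,  79,  74,  68]   # Y

n = len(prices)
x_bar, y_bar = sum(prices) / n, sum(sales) / n
b = (n * sum(x * y for x, y in zip(prices, sales)) - sum(prices) * sum(sales)) / \
    (n * sum(x * x for x in prices) - sum(prices) ** 2)
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(prices, sales))   # unexplained variation
sst = sum((y - y_bar) ** 2 for y in sales)                         # total variation

print(round(1 - sse / sst, 2))   # 0.43, i.e. r squared = (-0.656)^2
```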
Some Thoughts and Reflections
Guidelines: with cross-sectional data an R² of 0.4 is OK;
with time-series data the R² should approach 0.8 or 0.9.

But don't simply look at the fit (R²) of the model and the
impact of the variable(s) included in the model...
Ask yourself: has the market researcher included the main
drivers? What variables are missing?

We have assumed that TV sales depend only on price.
It would be more realistic to assume that sales depend on
price, advertising, the prices of competing products, etc.

Problems and Assumptions
Underlying Regression Analysis
Issues affecting the interpretation of regression coefficients:
Omitted variables
Use of dummy variables
Non-linear relationships
(Try asking an econometrician!)

Issues affecting the quality of estimates:
Multi-collinearity; serial correlation
Heteroscedasticity; measurement error

And the errors (ei) should be independent, normally
distributed, with a mean of zero, and constant variance.
Are the Errors Normally Distributed?
Why Not Use Tukey's Box-Plot?

Summary of the unstandardised residuals (N = 9):
* Mean = 0.000
* Min. = -359.24
* Max. = 496.94
* Median = -78.07
* Q1 = -174.15
* Q3 = 196.34
* IQR = 370.49
* Lower hinge = Q1 - (1.5*IQR) = -729.89
* Upper hinge = Q3 + (1.5*IQR) = 752.08
[Box-plot of the unstandardised residuals]
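
The hinge calculations quoted above can be reproduced directly (a sketch using the quartiles from this slide):

```python
# Tukey box-plot fences ("hinges" on the slide) from the residual quartiles.
q1, q3 = -174.15, 196.34
iqr = q3 - q1                              # 370.49

lower_hinge = q1 - 1.5 * iqr               # about -729.9
upper_hinge = q3 + 1.5 * iqr               # about  752.1

res_min, res_max = -359.24, 496.94         # smallest and largest residuals
print(round(lower_hinge, 1), round(upper_hinge, 1))
print(lower_hinge <= res_min <= res_max <= upper_hinge)   # True -> no residual is flagged as an outlier
```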

Dr Saeed Heravi (An Econometrician)!
MBA dissertations with
quantitative emphasis
Time series forecasting
Other multivariate models
Cluster analysis for market
segmentation
Perceptual maps for brand
positioning
DEA for store performance
(e.g. Ann Summers)

