BUSI 410 Business Analytics: Module 18: Identifying Drivers of Business Outcomes

Last lecture
• Confidence interval for population mean
• Approximate 95% CI for independent-data population mean difference
• CI for paired-data population mean difference
• Conservative, approximate 95% CI for population proportion

Business decision in action:
Identifying drivers of business outcomes
Starbucks currently owns 24,000 retail outlets in 72 countries.
When choosing a new location, Starbucks carefully examines the
profitability of each candidate location.
Sample factors affecting revenue:
(1) Demographics (population, age, income, etc.)
(2) Nearby Starbucks stores
(3) Nearby office buildings
(4) Nearby colleges

Regression is useful in explaining phenomena and forecasting
• Regression helps explain drivers of performance
– What drives defects in a factory?
– What drives profits of Starbucks stores?
– What determines employees’ pay?
• Regression helps forecast performance
– Profit, sales, etc.

Two types of data
• Cross-sectional data – collected on multiple subjects at the same time (e.g., October sales and street traffic density of 100 Starbucks stores)
• Time series data – collected on one subject over time (e.g., sales and street traffic density of a certain Starbucks store over 100 months)

Hybrid auto sales
We want to quantify the impact of gas price (independent variable) on hybrid sales (dependent variable).
Linear regression: finding the line that minimizes the mean squared vertical distance from the sample points to the line.

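The least-squares idea can be sketched in a few lines of Python. This is only an illustration: the six data points are hypothetical (they are not the lecture's data set), and `fit_line` and `mean_squared_error` are names made up for this sketch. It uses the standard closed-form solution for a line with one driver.

```python
from statistics import mean

# Hypothetical sample (NOT the lecture's data):
# gas price in $/gal, hybrid sales in thousands of units.
gas_price = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0]
hybrid_sales = [17.0, 19.5, 24.0, 24.5, 30.0, 31.0]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def mean_squared_error(x, y, intercept, slope):
    """Mean squared vertical distance from the points to the line."""
    return mean((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = fit_line(gas_price, hybrid_sales)
print(f"fitted line: sales = {b0:.2f} + {b1:.2f} * price")
print(f"MSE of fitted line: {mean_squared_error(gas_price, hybrid_sales, b0, b1):.3f}")
```

Shifting the intercept or tilting the slope away from the fitted values can only increase the mean squared error, which is what "best fit" means here.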
Hybrid auto sales:
Linear regression
From the Data tab, open the Data Analysis tools and select Regression. Then specify:
1. The Input Y Range (hybrid sales)
2. The Input X Range (gas price)
3. Check the “Labels” box if appropriate.
4. Designate where the output is to go.
5. Check the “Residuals” and “Residual Plots” boxes.
6. Click OK.

Hybrid auto sales:
Regression output
Predicted hybrid sales is the dependent variable, gas price is the independent variable, and the equation for the relationship between the two is:
hybrid sales (K) = –13.8 + 15.2 × gas price ($/gal)

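As a quick illustration of how the fitted equation is used, the sketch below plugs a few gas prices into it. The intercept and slope come from the regression output above; the function name `predict_hybrid_sales` and the chosen prices are only for this example.

```python
INTERCEPT = -13.8  # thousands of units, from the regression output
SLOPE = 15.2       # thousands of units per $1/gal, from the regression output

def predict_hybrid_sales(gas_price):
    """Predicted hybrid sales (K) for a gas price in $/gal."""
    return INTERCEPT + SLOPE * gas_price

for price in (2.00, 2.50, 3.00):
    print(f"gas price ${price:.2f}/gal -> predicted sales {predict_hybrid_sales(price):.1f}K")
```

At $3.00 per gallon, for instance, the equation gives –13.8 + 15.2 × 3.00 = 31.8K units.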
Hybrid auto sales:
Gas price drives hybrid sales
[Figure: scatter plot of hybrid sales (K), axis 10.0 to 50.0, against gas price ($ per gallon), axis 2.00 to 3.00]

Hybrid auto sales:
Predict hybrid sales using gas price
[Figure: hybrid sales (K) against gas price ($ per gallon), same axes as the previous slide]

Hybrid auto sales:
Residuals
[Figure: hybrid sales (K) against gas price ($ per gallon), same axes as the previous slide]
Residual: the difference between the actual value and the prediction

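A minimal sketch of the residual calculation, using the fitted equation from the regression output. The three observations are hypothetical and serve only to show a positive and a negative residual.

```python
INTERCEPT, SLOPE = -13.8, 15.2  # fitted equation from the regression output

# Hypothetical observations: (gas price in $/gal, actual hybrid sales in thousands).
observations = [(2.10, 20.0), (2.50, 22.0), (2.90, 33.0)]

def residual(gas_price, actual_sales):
    """Actual value minus the prediction from the fitted line."""
    predicted = INTERCEPT + SLOPE * gas_price
    return actual_sales - predicted

residuals = [residual(price, sales) for price, sales in observations]
for (price, sales), e in zip(observations, residuals):
    print(f"price {price:.2f}: actual {sales:.1f}K, residual {e:+.2f}K")
```

A positive residual means the point lies above the line (the model under-predicted); a negative residual means it lies below.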
What exactly did we do?
• Assumed gas price can explain hybrid sales in a linear way:
hybrid sales = 𝛽0 + 𝛽1 ∗ gas price + 𝜀
where the random variable 𝜀 represents the impact of all other factors
• Found the best-fitting (minimum mean square error, MMSE) line based on a sample
– The estimated intercept and coefficient are random variables: a different sample would give different estimates

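To see why the estimates are random variables, one can simulate many samples from an assumed model and refit the line each time. Everything here is an assumption made for illustration: the "true" coefficients simply borrow the lecture's estimates, the noise standard deviation of 5, the sample size of 30, and the price range are invented, and `fit_line` and `draw_sample` are hypothetical helper names.

```python
import random
from statistics import mean, stdev

# Assumed "true" model for the simulation (borrowing the lecture's estimates).
TRUE_BETA0, TRUE_BETA1 = -13.8, 15.2
NOISE_SD = 5.0  # assumed spread of epsilon, in thousands of units

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def draw_sample(rng, n):
    """Simulate n (gas price, hybrid sales) pairs from the assumed model."""
    x = [rng.uniform(2.0, 3.0) for _ in range(n)]
    y = [TRUE_BETA0 + TRUE_BETA1 * xi + rng.gauss(0.0, NOISE_SD) for xi in x]
    return x, y

rng = random.Random(410)  # fixed seed so the run is repeatable
slopes = [fit_line(*draw_sample(rng, 30))[1] for _ in range(200)]
print(f"mean of estimated slopes: {mean(slopes):.2f}")
print(f"std. dev. of estimated slopes: {stdev(slopes):.2f}")
```

The estimated slopes scatter around the assumed true value; that scatter is what the hypothesis test and confidence interval for 𝛽1 account for.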
What else can we do?
• If we assume 𝜀 is normal, has mean zero, is independent of the driver, and is independent over time, then we can use the sample to
– Test whether 𝛽1 is non-zero
– Construct confidence interval for 𝛽1
– Construct confidence intervals for our prediction of the dependent variable

Hypothesis test and CI for 𝜷𝟏
The p-value is for the hypothesis test:
H0: 𝛽1 = 0 vs. H1: 𝛽1 ≠ 0
(meaning this is a two-tail p-value)
The 95% confidence interval for 𝛽1 is included
(we can also specify a different confidence level)

Hypothesis test and CI for 𝜷𝟏
• If we reject 𝛽1 = 0: We can conclude that gas price drives hybrid sales (driver is significant)
• If we cannot reject 𝛽1 = 0: We cannot conclude that gas price drives hybrid sales (driver is insignificant)
• (Leave 𝛽0 alone!)
• 95% CI for 𝛽1 is [8.7, 21.7]: With 95% confidence, a $1 increase in gas price is associated with an increase in monthly hybrid sales of between 8.7k and 21.7k units

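The two-tail test and the confidence interval are two views of the same information: 𝛽1 = 0 is rejected at the .05 level exactly when the 95% CI excludes zero. The sketch below applies that rule to the interval reported above; the function name and the second, hypothetical interval are only for illustration.

```python
def slope_is_significant(ci_low, ci_high):
    """Reject H0: beta1 = 0 (at the matching significance level) if the CI excludes 0."""
    return not (ci_low <= 0.0 <= ci_high)

# Interval reported in the lecture.
ci_low, ci_high = 8.7, 21.7
midpoint = (ci_low + ci_high) / 2  # the interval is centered on the estimated slope

print(f"midpoint of CI: {midpoint:.1f}")
print("gas price significant at .05:", slope_is_significant(ci_low, ci_high))

# Hypothetical interval that straddles zero, for contrast.
print("hypothetical driver significant at .05:", slope_is_significant(-1.0, 4.0))
```

The midpoint of [8.7, 21.7] is 15.2, the estimated coefficient in the regression equation.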
Constructing prediction interval for the dependent variable
• For confidence level 1 − 𝛼, the following interval contains the actual value of 𝑦 (for a given 𝑥):
ŷ ± T.INV.2T(𝛼, 𝑁 − 𝑘 − 1) * SE
where
– ŷ is the (mean) prediction
– 𝑁 is the # of observations
– 𝑘 is the # of independent variables (= 1 for simple regression)
– SE is the std. error of the prediction
• Approximate 95% PI: ŷ ± 2 * SE

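A small sketch of the approximate 95% prediction interval, ŷ ± 2 * SE. The prediction uses the fitted equation from the regression output; the standard error of 4.0 is an assumed value chosen purely for illustration, since the lecture text does not state it.

```python
def approx_prediction_interval(y_hat, se, multiplier=2.0):
    """Approximate 95% prediction interval: y_hat +/- 2 * SE."""
    return y_hat - multiplier * se, y_hat + multiplier * se

y_hat = -13.8 + 15.2 * 3.00  # predicted hybrid sales (K) at $3.00/gal
ASSUMED_SE = 4.0             # hypothetical standard error of the prediction (K)

low, high = approx_prediction_interval(y_hat, ASSUMED_SE)
print(f"prediction: {y_hat:.1f}K, approximate 95% PI: [{low:.1f}K, {high:.1f}K]")
```

For the exact interval, the multiplier 2 would be replaced by the t critical value, which is what T.INV.2T(𝛼, 𝑁 − 𝑘 − 1) returns in Excel.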
R Square and Standard Error
The Regression Statistics section of the output tells us:
1. R Square – the proportion of variation in the dependent variable (y) explained by the variation in the independent variable (x)
2. Standard Error (of the estimate) – the standard deviation of the residuals of the regression, useful for constructing confidence intervals
3. Observations – the number of pairs of data (x, y) used to run this linear regression

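The three regression statistics can be reproduced by hand. The sketch below does so for a hypothetical six-point sample (not the lecture's data). It computes the standard error with N − k − 1 degrees of freedom, which is the usual definition of the standard error of the estimate.

```python
from math import sqrt
from statistics import mean

# Hypothetical sample (NOT the lecture's data).
gas_price = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0]
hybrid_sales = [17.0, 19.5, 24.0, 24.5, 30.0, 31.0]

x_bar, y_bar = mean(gas_price), mean(hybrid_sales)
sxx = sum((x - x_bar) ** 2 for x in gas_price)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gas_price, hybrid_sales))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(gas_price, hybrid_sales)]
sse = sum(e ** 2 for e in residuals)               # unexplained variation
sst = sum((y - y_bar) ** 2 for y in hybrid_sales)  # total variation in y

n, k = len(gas_price), 1                  # observations, number of drivers
r_square = 1 - sse / sst                  # share of variation explained
standard_error = sqrt(sse / (n - k - 1))  # standard deviation of the residuals

print(f"R Square: {r_square:.3f}")
print(f"Standard Error: {standard_error:.3f}")
print(f"Observations: {n}")
```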
“ANOVA” – Analysis of Variance
Significance F – the p-value for H0: all slopes are zero (R Square = 0) vs. H1: at least one slope is non-zero (R Square > 0). In simple linear regressions, Significance F = driver p-value.

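The reason Significance F equals the driver's p-value in a simple regression is that the F statistic is the square of the slope's t statistic. The sketch below checks this numerically on a hypothetical sample (not the lecture's data); it compares only the statistics and does not compute the p-values themselves.

```python
from math import sqrt
from statistics import mean

# Hypothetical sample (NOT the lecture's data).
gas_price = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0]
hybrid_sales = [17.0, 19.5, 24.0, 24.5, 30.0, 31.0]

n = len(gas_price)
x_bar, y_bar = mean(gas_price), mean(hybrid_sales)
sxx = sum((x - x_bar) ** 2 for x in gas_price)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gas_price, hybrid_sales))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(gas_price, hybrid_sales))
sst = sum((y - y_bar) ** 2 for y in hybrid_sales)
mse = sse / (n - 2)  # residual mean square, N - k - 1 with k = 1

f_stat = (sst - sse) / mse        # regression mean square (1 df) over residual mean square
t_stat = slope / sqrt(mse / sxx)  # slope divided by its standard error

print(f"F = {f_stat:.3f}, t^2 = {t_stat ** 2:.3f}")
```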
Regression equation in standard format
predicted hybrid sales (K) = −13.8 + 15.2* × gas price ($/gal)
R Square: .5*
* Significant at .05
What to include:
– Mark the prediction (hat over the dependent variable)
– Specify units
– Specify R Square
– Specify significance of each driver and of the model as a whole (R Square)
1. Use required or a common significance level
2. Use multiple significance levels if necessary, including insignificant

How it’s done in real life

Regression and correlation
• Correlation (correl(x,y) in Excel):
r_xy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √[ Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² ]
• In simple linear regressions, R Square = r_xy²
• Correlation ≠ regression coefficient
– Correlation measures how much two values tend to move together; bounded by [-1,1]
– Regression coefficient measures how much an outcome changes with a unit change of a driver; unbounded

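The difference between correlation and the regression coefficient shows up when the units change. In the sketch below, on a hypothetical sample (not the lecture's data), converting sales from thousands to single units multiplies the slope by 1,000 but leaves the correlation untouched. It also checks that R Square equals the squared correlation. The function names are made up for this example.

```python
from math import sqrt
from statistics import mean

def correlation(x, y):
    """Sample correlation r_xy, following the formula above."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def slope_and_r_square(x, y):
    """Least-squares slope and R Square of the simple regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return slope, 1 - sse / sst

# Hypothetical sample (NOT the lecture's data).
gas_price = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0]
sales_thousands = [17.0, 19.5, 24.0, 24.5, 30.0, 31.0]
sales_units = [1000 * s for s in sales_thousands]  # same data, different units

r = correlation(gas_price, sales_thousands)
slope_k, r_square = slope_and_r_square(gas_price, sales_thousands)
slope_units, _ = slope_and_r_square(gas_price, sales_units)

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, R Square = {r_square:.4f}")
print(f"slope in K per $/gal: {slope_k:.2f}; slope in units per $/gal: {slope_units:.0f}")
```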
Correlation ≠ causation
• A may cause B: Cold weather correlated with flu pandemic
• B may cause A: Flu pandemic correlated with cold weather
• C may cause A and B: Coke sales correlated with gas price
• Mixture of all: Poor education correlated with poverty
• Use intuition to build linear regression models; use caution when interpreting results

Confirming assumptions on residuals
We assumed 𝜀 is normal, has mean zero, is independent of the driver, and is independent over time. Is it so?
[Figure: residuals (−14 to 14) plotted against gas price ($ per gallon, 1.90 to 3.10); histogram of the residuals' distribution (0% to 60%) compared with the % expected if Normal]
Skew: 1.2

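Two of these checks are easy to run numerically: the residuals should average to zero, and their skewness should be near zero if they are roughly normal. The sketch below uses hypothetical residuals (not the lecture's) and the adjusted sample skewness formula, which to my understanding is the definition behind Excel's SKEW function; treat that correspondence as an assumption.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((v - mean) / s) ** 3)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)

# Hypothetical residuals in thousands of units (NOT the lecture's residuals).
residuals = [-3.1, -2.4, -1.8, -0.9, -0.2, 0.4, 1.1, 6.9]

print(f"mean of residuals: {mean(residuals):.3f}")
print(f"skewness of residuals: {sample_skewness(residuals):.2f}")
```

A clearly positive skew, as in this made-up example, means a few large positive residuals are stretching the right tail, which is a sign that the normality assumption deserves a closer look.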
Linear regression is like fitting a watermelon in a box…
You can always find the best fit, but sometimes even the best fit is a bad one

For next class
• Group Case Project due 11/7 prior to class
• Read Chapter 10 of the textbook
