Professional Documents
Culture Documents
7-2
Introduction
Scientists and engineers often collect data in order to
determine the nature of the relationship between two
quantities.
For example, a chemical engineer may run a chemical
process several times in order to study the
relationship between the concentration of a certain
catalyst and the yield of the process.
Each time the process is run, the concentration x and
the yield y are recorded.
McGraw-Hill
7-3
Introduction
In many cases, the ordered pairs of measurements fall
approximately along a straight line when plotted.
In those situations, the data can be used to compute
an equation for the line that best fits the data.
This line can be used for various things. In our
catalyst versus yield experiment, the line could be
used to predict the yield that will be obtained the next
time the process is run with a specific catalyst
concentration.
McGraw-Hill
7-4
McGraw-Hill
7-5
McGraw-Hill
7-6
Example
Correlation Coefficient
7-7
McGraw-Hill
7-8
Computing r
Let (x1, y1),,(xn, yn) represent n points on a scatterplot.
( xi x ) sx
McGraw-Hill
( yi y ) s y
7-9
Computing r
1 n xi x yi y
r
n 1 i 1 sx s y
McGraw-Hill
7-10
Computational Formula
This formula is easier for calculations by
hand:
n
x y
i 1
x
i 1
McGraw-Hill
2
i
nx
nx y
n
y
i 1
2
i
ny
7-11
Comments
In principle, the correlation coefficient can be
calculated for any set of points.
In many cases, the points constitute a random sample
from a population of points.
In this case, the correlation coefficient is called the
sample correlation, and it is an estimate of the
population correlation.
McGraw-Hill
7-12
Properties of r
It is a fact that r is always between 1 and 1.
Positive values of r indicate that the least squares line
has a positive slope. The greater values of one
variable are associated with greater values of the
other.
Negative values of r indicate that the least squares
line has a negative slope. The greater values of one
variable are associated with lesser values of the other.
McGraw-Hill
7-13
More Comments
Values of r close to 1 or 1 indicate a strong linear
relationship.
Values of r close to 0 indicate a weak linear
relationship.
When r is equal to 1 or 1, then all the points on the
scatterplot lie exactly on a straight line.
McGraw-Hill
7-14
More Comments
If the points lie exactly on a horizontal or vertical
line, then r is undefined.
If r 0, then x and y are said to be correlated. If
r = 0, then x and y are uncorrelated.
For the scatterplot of height vs. forearm length,
r = 0.80.
McGraw-Hill
More Properties of r
7-15
More Properties of r
7-16
McGraw-Hill
7-17
Example 1
An environmental scientist is studying the rate of absorption of a
certain chemical into skin. She places differing volumes of the
chemical on different pieces of skin and allows the skin to remain
in contact with the chemical for varying lengths of time. She
then measures the volume of chemical absorbed into each piece
of skin. The scientist plots the percent absorbed against both
volume and time. She calculates the correlation between volume
and absorption and obtains r = 0.988. She concludes that
increasing the volume of the chemical causes the percentage
absorbed to increase. She then calculates the correlation between
time and absorption, obtaining r = 0.987. She concludes that
increasing the time that the skin is in contact with the chemical
causes the percentage absorbed to increase as well. Are these
conclusions justified?
McGraw-Hill
7-18
Example 1
McGraw-Hill
7-19
McGraw-Hill
7-20
Distribution
Let (x1, y1),, (xn, yn) be a random sample from the
joint distribution of X and Y and let r be the sample
correlation of the n points. Then the quantity
1 1 r
W ln
2 1 r
and variance
1
.
n3
2
W
McGraw-Hill
7-21
Example 2
In a study of reaction times, the time to respond to a
visual stimulus (x) and the time to respond to an
auditory stimulus (y) were recorded for each of 10
subjects. Times were measured in ms.
x 161 203 235 176 201 188 228 211 191 178
y 159 206 241 163 197 193 209 189 169 201
Find a 95% confidence interval for the correlation
between the two reaction times.
McGraw-Hill
7-22
Example 2 cont.
Find the P-value for testing H0: 0.3 versus
H1: > 0.3.
McGraw-Hill
7-23
7-24
The quantities
coefficients.
0 and 1
Residuals
7-25
For each data point, (xi, yi) the vertical distance to the
point (xi, y i ) on the least squares line is ei = yi y i . The
y i
quantity is called the fitted value and the quantity ei
is called the residual associated with the point (xi, yi).
The residual ei is the difference between the value yi
observed in the data and the fitted value predicted by the
least-squares line.
McGraw-Hill
7-26
Residuals
Points above the least-squares line have positive
residuals, and points below the line have negative
residuals.
McGraw-Hill
7-27
(x
i 1
x )( yi y )
and
n
2
(
x
x
)
i
i
0 y . 1 x
i 1
7-28
i 1
i 1
2
2
(
x
x
)
x
i
i nx
n
2
(
y
y
)
y
i
i ny
i 1
(x
i 1
McGraw-Hill
i 1
x )( yi y ) xi yi nx y
i 1
7-29
Example 3
Using the data in the table below, compute the leastsquares estimates of the spring constant and the
unloaded length of the spring. Write the equation of the
least-squares line. Estimate the length of the spring
under a load of 1.3 lb.
McGraw-Hill
7-30
Cautions
Do not extrapolate the fitted line (such as the leastsquares line) outside the range of the data. The linear
relationship may not hold there.
We learned that we should not use the correlation
coefficient when the relationship between x and y is
not linear. The same holds for the least-squares line.
When the scatterplot follows a curved pattern, it does
not make sense to summarize it with a straight line.
If the relationship is curved, then we would want to
fit a regression line that contain squared terms.
McGraw-Hill
7-31
1 r
The units of
sy
sx
McGraw-Hill
7-32
McGraw-Hill
7-33
r
2
2
2
(
y
y
)
(
y
y
)
i
i i
i 1
i 1
2
(
y
y
)
i
i 1
7-34
Sums of Squares
2
(
y
y
)
i 1 i
is the total sums of squares and measures
the overall spread of the points around the line y y .
n
(
y
y
)
(
y
y
)
i 1 i
The difference i 1 i
is called
n
McGraw-Hill
Sums of Squares
7-35
McGraw-Hill
7-36
2.
3.
4.
McGraw-Hill
Distribution
7-37
y 0 1 xi
i
2
yi
More Distributions
7-38
Under assumptions 1 4:
The quantities 0 and 1 are normally distributed
random variables.
The means of 0 and 1 are the true values 0 and 1,
respectively.
McGraw-Hill
7-39
x
)
i
s
1
i 1
where
(1 r )
2
(
x
x
)
i
i 1
( y
i 1
y )2
n2
is an estimate of the
7-40
Example 2 cont.
Using the data in Example 2, calculate s,
s , s .
1
McGraw-Hill
7-41
Notes
McGraw-Hill
Confidence Intervals
7-42
s
1
n 2, / 2 y , where
is given by 0
1
( x x )2
s y
n
.
2
n
(
x
x
)
i
i 1
McGraw-Hill
7-43
Prediction Interval
A level 100(1 )% prediction interval for
0+1x is given by 0 1 x tn 2, / 2 s pred , where
s pred
1
( x x )2
1 n
.
2
n
( xi x )
i 1
McGraw-Hill
7-44
Example 2 cont.
For the data in Example 2, compute a 95% confidence
interval for the slope of the regression line.
McGraw-Hill
7-45
McGraw-Hill
7-46
McGraw-Hill
7-47
McGraw-Hill
7-48
7-49
McGraw-Hill
7-50
McGraw-Hill
7-51
McGraw-Hill
7-52
Residual Plots
7-53
A: No noticeable pattern
B: Heteroscedastic
C: Trend
D: Outlier
McGraw-Hill
7-54
McGraw-Hill
7-55
7-56
Dont Forget
Once the transformation has been completed, then
you must inspect the residual plot again to see if that
model is a good fit.
It is fine to proceed through transformations by trial
and error.
It is important to remember that power
transformations dont always work.
McGraw-Hill
7-57
Caution
When there are only a few points in a residual plot, it
can be hard to determine whether the assumptions of
the linear model is met.
When one is faced with a sparse residual plot that is
hard to interpret, a reasonable thing to do is to fit a
linear model, but to consider the results tentative,
with the understanding that the appropriateness of the
model has not been established.
McGraw-Hill
7-58
Outliers
Outliers are points that are detached from the bulk of
the data.
Both the scatter plot and the residual plot should be
examined for outliers.
The first thing to do with an outlier is to determine
why it is different from the rest of the points.
McGraw-Hill
7-59
Outliers
Sometimes outliers are caused by data-recording
errors or equipment malfunction. In this case, the
outlier can be deleted from the data set. In this case,
you may present results that do not include the
outlier.
If it cannot be determined why there is an outlier,
then it is not wise to delete it. Here the results
presented, should be the ones from analysis with the
outlier included in the data set.
McGraw-Hill
Influential Point
7-60
McGraw-Hill
Influential Point
7-61
McGraw-Hill
Influential Point
7-62
McGraw-Hill
Comments
7-63
Comments
7-64
McGraw-Hill
7-65
Independence of Observations
If the plot of residuals versus fitted values looks
good, then further diagnostics may be used to further
check the fit of the linear model.
A time order plot of the residuals versus order in
which observations were made.
If there are trends in this plot, then x and y may be
varying with time. In this case, adding a time term to
the model as an additional independent variable.
McGraw-Hill
7-66
Normality Assumption
To check that the errors are normally distributed, a
normal probability plot of the residuals can be made.
If the plot looks like it follows a rough straight line,
then we can conclude that the residuals are
approximately normally distributed.
McGraw-Hill
7-67
Comments
Physical laws are applicable to all future
observations.
An empirical model is valid only for the data to
which it is fit. It may or may not be useful in
predicting outcomes for subsequent observations.
Determining whether to apply an empirical model
to a future observation requires scientific
judgment rather that statistical analysis.
McGraw-Hill
Summary
7-68
We discussed
correlation
least-squares line / regression
uncertainties in the least-squares coefficients
confidence intervals and hypothesis tests for leastsquares coefficients
checking assumptions
residuals
McGraw-Hill