
STATISTICS 13

Lecture 9
Apr 16, 2010
Review
 Scatterplot
Pattern, direction
Strength
Unusual observations?
 Correlation coefficient r
--properties
--use when there is a linear association
Linear Regression
 Sometimes, the value of Y may be thought of as
being dependent on the value of X
 Interest is in describing how Y depends on X
 Example: data for the heights of 1078 fathers
and sons.
X—height of father; Y—height of son. Y is
dependent on X in some way
 If the relationship between X and Y is linear
(scatter plot is football shaped), we can fit a
straight line to the data
 The straight line that “best fits” the data is
called the regression line of Y on X:
y = a + b x (a = intercept, b = slope)
Example of Football Shaped Scatterplot
Fitting the Regression Line
 To find the slope and the intercept of the
regression line that best fits the data use :

b = r (s_y / s_x)
a = ȳ − b x̄
 This is the line that minimizes the sum of squared
distances (in the y-direction) from the points to the line, so
the regression line is also called the least-squares line
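The two formulas for the slope and intercept can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function names and the toy data are hypothetical, and the SDs use the n − 1 divisor (an assumption; the divisor cancels in b as long as r, s_x and s_y are computed consistently). The last step nudges a and b slightly to illustrate that the fitted line has the smallest sum of squared vertical distances.

```python
from math import sqrt

def fit_regression_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared deviations and of cross-products.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_x = sqrt(sxx / (n - 1))   # sample SD of x (n - 1 divisor: an assumption)
    s_y = sqrt(syy / (n - 1))   # sample SD of y
    r = sxy / sqrt(sxx * syy)   # correlation coefficient
    b = r * s_y / s_x           # slope: b = r * s_y / s_x
    a = y_bar - b * x_bar       # intercept: a = y_bar - b * x_bar
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Toy data (hypothetical, for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_regression_line(xs, ys)
best = sum_squared_errors(xs, ys, a, b)

# Smallest sum of squares among four slightly nudged lines.
nudged = min(
    sum_squared_errors(xs, ys, a + da, b + db)
    for da in (-0.1, 0.1)
    for db in (-0.1, 0.1)
)
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {best:.4f}, best nudged SSE = {nudged:.4f}")
```

Checking four nudged lines is only a spot check, not a proof that the line is optimal.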
Example: Airline Passenger Booking vs. Hotel Occupancy
 Data on the airline passenger booking and hotel occupancy
rate near Orlando, Florida
 X = thousands of passengers booked for airline flights by
Eastern Airlines to Orlando International Airport
 Y = occupancy rate for Walt Disney World area hotels (in %)

X 65.7 71.6 53.7 70.2 75.0 85.6 84.6 58.0 72.8 87.6 85.4 50.6

Y 40 41 48 49 73 74 68 51 63 75 70 38

(Source: Florida Department of Business Regulation, Orlando Area Chamber of
Commerce, and Finance Dept. of Orlando International Airport)
Scatter Plot
[Figure: scatterplot of occupancy (%) vs booking (thousands people)]
Example (cont.)
x̄ = 71.73, s_x = 12.82
ȳ = 57.5, s_y = 14.39
r = .819

b = r (s_y / s_x) = (.819)(14.39 / 12.82) = .9198
a = ȳ − b x̄ = 57.5 − .9198(71.73) = −8.48

Regression Line: y = −8.48 + .9198 x

[Figure: scatterplot of occupancy (%) vs booking (thousands people), with the point (71.73, 57.5) marked]

Note: the regression line of y on x is generally different from
the regression line of x on y!
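The numbers in this example can be checked from the twelve data pairs listed earlier. The sketch below is an illustration, not the lecture's own code: the helper names are hypothetical, and the SDs use the n − 1 divisor (what `statistics.stdev` computes), which reproduces the SDs quoted in the example. It also fits the regression of x on y and rewrites it in the form y = ... so the two lines can be compared.

```python
from math import sqrt
from statistics import mean, stdev

# Data from the example: bookings (thousands) and hotel occupancy (%).
booking = [65.7, 71.6, 53.7, 70.2, 75.0, 85.6, 84.6, 58.0, 72.8, 87.6, 85.4, 50.6]
occupancy = [40, 41, 48, 49, 73, 74, 68, 51, 63, 75, 70, 38]

def correlation(xs, ys):
    """Correlation coefficient r of two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def regression_line(xs, ys):
    """Intercept a and slope b for the regression of ys on xs."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

r = correlation(booking, occupancy)
a_yx, b_yx = regression_line(booking, occupancy)   # y on x
a_xy, b_xy = regression_line(occupancy, booking)   # x on y

# The x-on-y line is x = a_xy + b_xy * y; solve for y to compare slopes.
slope_xy_as_y = 1 / b_xy
intercept_xy_as_y = -a_xy / b_xy

print(f"r = {r:.3f}")
print(f"y on x: y = {a_yx:.2f} + {b_yx:.4f} x")
print(f"x on y, solved for y: y = {intercept_xy_as_y:.2f} + {slope_xy_as_y:.4f} x")
```

Solved for y, the x-on-y line has slope s_y / (r s_x), which is steeper than r s_y / s_x whenever 0 < |r| < 1.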
Interpretation of Regression Results
 The regression line for y on x estimates the
average value for y corresponding to each value
of x and can be used for prediction
 With each increase of one SD in x there is an
increase of only r SDs in y, on the average.
 The point of averages (x̄, ȳ) is always on the fitted
regression line
 Example: “Booking vs. Hotel Occupancy”. You
are told that 70 thousand passengers are booked
and asked to predict the occupancy rate.
 Answer:
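One way to work this out, as a sketch, is to substitute x = 70 into the regression line y = −8.48 + .9198 x fitted in this example (rounding to one decimal place is a choice made here):

```latex
\hat{y} = -8.48 + 0.9198 \times 70 = -8.48 + 64.386 \approx 55.9
```

So the predicted occupancy rate is about 55.9%. This is an estimate of the average occupancy for months with 70 thousand bookings, not an exact value for any one month.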
Example: HANES
 Data: Health and Nutrition Examination Survey
(from Statistics by Freedman, Pisani and Purves):
heights and weights of 988 men age 18-24
 The scatterplot is
football shaped.
Example: HANES (cont.)
 Summary of Data:
-average height = 70 in, SD = 3 in
-average weight = 162 lb, SD = 30 lb
-r = 0.47
 Question: suppose one of these men is picked at random, and
you have to guess his weight without being told anything
about him. What would your guess be? What if you are told
that the man’s height is 73 in; what would your guess be then?
 Answer:
 without any information, the best guess is
 with the height 73 in, the best guess is
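Both guesses can be sketched from the summary statistics alone. The code below is an illustration under stated assumptions: the function name `predict_y` is hypothetical, and it relies on the scatterplot being football shaped so that the regression line applies. Without any information the natural guess is the average weight; given the height, the guess moves r SDs of weight for each SD of height.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Regression estimate of y at x, from summary statistics only."""
    z_x = (x - mean_x) / sd_x        # x in standard units
    return mean_y + r * z_x * sd_y   # y moves only r SDs per SD of x

# HANES summary statistics from the slide.
MEAN_HEIGHT, SD_HEIGHT = 70, 3     # inches
MEAN_WEIGHT, SD_WEIGHT = 162, 30   # weight units as given on the slide
R = 0.47

guess_without_height = MEAN_WEIGHT
guess_with_height = predict_y(73, MEAN_HEIGHT, SD_HEIGHT, MEAN_WEIGHT, SD_WEIGHT, R)

print(guess_without_height, round(guess_with_height, 1))
```

A height of 73 in is 1 SD above average, so the estimate is 0.47 SDs of weight above the average weight: 162 + 0.47 × 30, or about 176.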
Regression Effect
 Example from the first day: A preschool program attempts to
boost children’s IQ. The children are tested when they
enter the program, and again when they leave. On both
occasions, the average score is nearly 100. The program
seems to have no effect. However, a closer look at the data
shows that the children who were below average on the
pre-test had an average gain of about 5 points, and those
children who were above average on the pre-test had an
average loss of about 5 points. Does the program operate to
equalize intelligence?
 Answer:

 In most test-retest situations, the bottom group on the first
test will on average show some improvement on the second
test, and the top group will on average fall back. This is the
regression effect (from “Statistics” by Freedman, Pisani and Purves)
Regression Effect (Cont.)
 The regression effect was first noticed by Sir Francis
Galton (1822-1911) in his study of family
resemblances
 In a study of heights of 1078 pairs of fathers and
sons, the summary statistics are:
-average height of fathers=68in, SD=2.7in
-average height of sons=69in, SD=2.7in
-r=0.5
 The sons average 1 inch taller than the fathers. It
may be natural to guess that a 72-in father should have
a 73-in son, and a 64-in father should have a 65-in
son. Is this true?
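Applying the regression method to these summary statistics suggests that the answer is no. The worked estimate below is a sketch that assumes the father-son scatterplot is football shaped, so that the regression line of son's height on father's height applies:

```latex
\begin{aligned}
\text{72-in father:}\quad & \hat{y} = 69 + 0.5 \cdot \frac{72 - 68}{2.7} \cdot 2.7 = 69 + 2 = 71 \text{ in} \\
\text{64-in father:}\quad & \hat{y} = 69 + 0.5 \cdot \frac{64 - 68}{2.7} \cdot 2.7 = 69 - 2 = 67 \text{ in}
\end{aligned}
```

On average, sons of 72-in fathers are about 71 in tall rather than 73 in, and sons of 64-in fathers are about 67 in rather than 65 in. Both groups are pulled back toward the sons' average of 69 in.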
Explanation of Regression Effect
 With each increase of one SD in x there is an increase of
only r SDs in y, on the average; and note that |r| ≤ 1
 Therefore, suppose a measurement x is 3 SDs above the
average of the x values, and r = 0.5; then on average, the
corresponding value of y will be only 1.5 SDs above the
average of the y values.
 This happens because the points in the scatterplot do not lie
exactly on a straight line (which would give |r| = 1) but
scatter around the regression line (so |r| < 1)
 A crude model: observed test score = true score + chance
error. If a person scores very high on the first test,
then this person was probably lucky (observed score > true
score), and the score on the second test will probably be
lower (you cannot always be lucky)
