WELCOME TO MY CLASS!
Two statisticians were traveling in an airplane from Los Angeles to
New York City. About an hour into the flight, the pilot announced
that although they had lost an engine, there was no need for worry
as the plane had three engines left. However, instead of 5 hours
travel time it would now take them 7 hours to get to New York. A
short while later, the pilot announced that a second engine failed.
They still had two left, but it would take 10 hours to get to New
York. Somewhat later, the pilot announced that a third engine had
died. Never fear, he announced, because the plane could fly on a
single engine. However, it would now take 18 hours to get to New
York. At this point, one statistician turned to the other and said,
"Gee, I hope we don't lose that last engine, or we'll be up here
forever!"
A. Sources of Data
1. Primary data: Data compiled by the researcher.
Surveys, experiments, depth interviews, observation, focus groups.
Much data is obtained via surveys (uses a questionnaire).
Types of surveys:
Mail: lowest rate of response; usually the lowest cost
Personally administered: can probe; most costly; interviewer effects
Telephone: fastest
Web
2. Secondary data: Data compiled or published elsewhere, e.g., census data.
Secondary data are data that were developed for some purpose other than helping to
solve the problem at hand.
Advantages: It can be gathered quickly and inexpensively. Allows researchers to
build on past research.
Problems: Data may be outdated. Variation in definition of terms. Different units of
measurement. May not be accurate (e.g., census undercount).
Typical Objectives for secondary data research designs:
(1) Fact Finding, e.g., amount spent by industry and competition on advertising; market share;
number of computers with modems in the U.S., Japan, etc.
(2) Model Building - specify relationships between two or more variables, often
using descriptive or predictive equations. Used, e.g., to measure market potential from per
capita income and the number of cars bought in various countries.
B. Survey Errors
I. Response Errors
a) subject lies - the question may be too personal, or the subject tries to give the
socially acceptable response
b) subject makes a mistake - the subject may not remember the answer
c) interviewer makes a mistake in recording or understanding the subject's
response
d) interviewer cheating
e) interviewer effects - vocal intonation, age, sex, race, clothing, and
mannerisms of the interviewer may influence the response
II. Nonresponse Error
If the rate of response is low, the sample may not be representative. The people
who respond may be different from the rest of the population. Usually,
respondents are more educated and more interested in the topic of the survey.
Thus, it is important to achieve a reasonably high rate of response. Use follow-ups.
Which is better?
Sample 1
n = 2,000
rate of response = 90%
Sample 2
n = 1,000,000
rate of response = 20%
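One way to explore the question above is a small simulation. The sketch below assumes a made-up population in which 30% of people hold some opinion, and in which people who hold it are more likely to return the questionnaire. The response probabilities are illustrative assumptions, chosen only so that the overall response rates come out near 90% and 20%; they are not taken from any real survey.

```python
import random

TRUE_SHARE = 0.30  # assumed true share of the population holding the opinion


def survey_estimate(n, p_respond_yes, p_respond_no, rng):
    """Send out n questionnaires; return (estimated share, response rate).

    p_respond_yes / p_respond_no are the assumed chances that a person
    who does / does not hold the opinion returns the questionnaire.
    """
    yes_returned = 0
    total_returned = 0
    for _ in range(n):
        holds_opinion = rng.random() < TRUE_SHARE
        p_respond = p_respond_yes if holds_opinion else p_respond_no
        if rng.random() < p_respond:
            total_returned += 1
            yes_returned += holds_opinion
    return yes_returned / total_returned, total_returned / n


rng = random.Random(2024)  # fixed seed so the run is repeatable
est1, rate1 = survey_estimate(2_000, 0.98, 0.866, rng)       # Sample 1
est2, rate2 = survey_estimate(1_000_000, 0.40, 0.1143, rng)  # Sample 2
print(f"Sample 1: response rate {rate1:.0%}, estimate {est1:.3f}")
print(f"Sample 2: response rate {rate2:.0%}, estimate {est2:.3f}")
```

Under these assumptions the smaller sample with the high response rate tends to land much closer to the true 30%. The extra size of Sample 2 does nothing to remove the nonresponse bias.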
C. Types of Samples
I. Nonprobability Samples - based on convenience or judgment
1. Convenience (or chunk) sample
students in a class, mall intercept
2. Judgment sample
based on the researcher's judgment as to what constitutes representativeness,
e.g., he/she might say these 20 stores are representative of the whole chain.
3. Quota sample
interviewers are given quotas based on demographics. For instance, they are each
told to interview 100 subjects - 50 males and 50 females. Of the 50, say, 10
nonwhite and 40 white.
The problem with a nonprobability sample is we do not know how representative
our sample is of the population.
II. Probability Samples
Probability Sample: A sample collected in such a way that every element in the
population has a known chance of being selected.
1. Simple Random Sample: A sample collected in such a way that every element
in the population has an equal chance of being selected.
2. Systematic Random Sample
3. Stratified Sample
4. Cluster Sample
-------------------------------------------------------------------------------------------
N= population size
n= sample size
Suppose N = 800 and n = 80.
First, number all the elements in the population from 001 to 800. Then go to the
random number table or, more likely, use a random number generator, and
select 3-digit random numbers until you have 80 usable ones. Discard 000 and numbers
greater than 800 (and, when sampling without replacement, any repeats).
Note that every element in the population has an equal chance of being selected for
the sample: n/N or 80/800 = 10%.
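The selection procedure above can be sketched in Python. This is a minimal illustration, not the course's own code: the function names are hypothetical, the first function relies on the standard library's `random.sample`, and the second mimics the random-number-table method under the assumption that sampling is done without replacement.

```python
import random


def simple_random_sample(N, n, seed=None):
    """Pick n distinct element numbers from 1..N, each equally likely."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, N + 1), n))


def sample_by_three_digit_draws(N, n, seed=None):
    """Mimic the table method for N <= 999: draw 3-digit numbers (000-999)
    and discard 000, numbers above N, and repeats until n numbers remain."""
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < n:
        number = rng.randint(0, 999)  # one 3-digit random number
        if 1 <= number <= N and number not in chosen:
            chosen.append(number)
    return chosen


sample = simple_random_sample(N=800, n=80, seed=1)
table_style = sample_by_three_digit_draws(N=800, n=80, seed=1)
print(sample[:10])
print(f"chance of selection: {80 / 800:.0%}")
```

Both functions return 80 distinct element numbers between 1 and 800. The seed is fixed here only so that the run can be repeated.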
                              Column
     00000  00001  11111  11112  22222  22223  33333  33334
Row  12345  67890  12345  67890  12345  67890  12345  67890
01   66194  28926  99547  16625  45515  12108  57846  67953
02   78240  43195  24837  32511  70880  52622  61881  22070
03   00833  88000  67299  68215  11274  32991  17436  55624
04   12111  86683  61270  58036  64192  15145  01748  90611
05   47189  99951  05755  03834  43782  40282  51417  90599
06   76396  72486  62423  27618  84184  73561  52818  78922
07   46409  17469  32483  09083  76175  26309  91536  19985
08   74626  22111  87286  46772  42243  44250  42439  68046
09   34450  81974  93723  49023  58432  36876  93391  67083
10   36327  72135  33005  28701  34710  50693  89311  49359
11   74185  77536  84825  09934  99103  67389  45869  09325
12   12296  41623  62873  37943  25584  63360  47270  09609
13   90822  60280  88925  99610  42772  76873  04117  60561
14   72121  79152  96591  90305  10189  68016  13747  79778
15   95268  41377  25684  08151  61816  54305  86189  58555
16   92603  09091  75884  93424  72586  30061  14457  88903
17   18813  90291  05275  01223  79607  34900  09778  95426
18   38840  26903  28624  67157  51986  14508  49315  42865
19   05959  33836  53758  16562  41081  41230  20528  38012
20   85141  21155  99212  32685  51403  69813  58781  31926
21   75047  59643  31074  38172  03718  69506  67143  32119
22   30752  95260  68032  62871  58781  68790  69766  34143
23   22986  82575  42187  62295  84295  66562  31442  30634
24   99439  86692  90348  66036  48399  26698  39437  73451
25   20389  93029  11881  71685  65452  63669  02656  89047
26   39249  05173  68256  36359  20250  05947  09335  68686
27   96777  33605  29481  20063  09398  35139  61344  01843
28   04860  32918  10798  50492  52655  94713  28393  33359
29   41613  42375  00403  03656  77580  86877  57085  87772
30   17930  00794  53836  53692  67135  61912  11246  98102
31   24649  31845  25736  75231  83808  93829  99430  98917
32   79899  34061  54308  59358  56462  97302  86828  58166
33   76801  49594  81002  30397  52728  72070  33706  15101
34   36239  63636  38140  65731  39788  38971  53363  06872
35   07392  64449  17886  63632  53995  22247  62607  17574
36   67133  04181  33874  98835  67453  76381  63455  59734
37   77759  31504  32832  70861  15152  75371  39174  29733
38   85992  72268  42920  20810  29361  90306  73574  51423
39   79553  75952  54116  65553  47139  09165  85490  60579
40   41101  17336  48951  53674  17880  08575  49321  45260
41   36191  17095  32123  91576  84221  82010  30847  78902
42   62329  63898  23268  74283  26091  69704  82267  68409
43   14751  13151  93115  01437  56945  67680  79790  89661
44   48462  59278  44185  29616  76537  83139  28454  19589
45   29435  88105  59651  44391  74588  80834  85686  55114
46   28340  29285  12965  14821  80425  44653  70467  16602
47   02167  58940  27149  80242  10587  34959  75339  79786
48   17864  00991  39557  54981  23588  37609  13128  81914
49   79675  80605  60059  35862  00254  21545  78179  36546
50   72335  82037  92003  34100  29879  89720  13274  46613
Note: The formulas for (1) and (3) are the same. The formula for (2) requires a
finite population correction factor.
D. Data
1. Types of Data
Qualitative data result in categorical responses.
Also called nominal or categorical data.
Example:
Sex
MALE
FEMALE
2. Levels of Data
NOMINAL
ORDINAL
INTERVAL
RATIO
NOMINAL
same as Qualitative
Classification categories
When objects are measured on a nominal scale, we can only say that one is
different from the other
Examples: sex, occupation, ethnicity, marital status, etc.
[Question: What is the average SEX in this room? What is the
average RELIGION?]
Appropriate statistics: mode, frequency
We cannot get an average. The average sex in this class makes no sense!
Example:
Say we have 20 males and 30 females. The mode - the data value that occurs
most frequently - is female.
Frequencies: 60% are female.
Say we code the data 1 for male and 2 for female:
(20 x 1 + 30 x 2) / 50 = 1.6
Is the average sex = 1.6? What are the units? 1.6 what? What does 1.6 mean?
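The class example above can be reproduced in a few lines of Python. This is a minimal sketch using the standard library; the variable names are illustrative. It shows that the mode and the frequencies are meaningful for nominal data, while the mean of the arbitrary codes can be computed but has no interpretation.

```python
from collections import Counter
from statistics import mean, mode

# 20 males and 30 females, as in the example
sexes = ["male"] * 20 + ["female"] * 30

counts = Counter(sexes)                        # frequency of each category
most_common = mode(sexes)                      # the mode: "female"
share_female = counts["female"] / len(sexes)   # proportion female

# Coding the categories as numbers does not make them numeric:
codes = {"male": 1, "female": 2}
coded_mean = mean([codes[s] for s in sexes])   # computable, but meaningless

print(counts)
print(most_common, share_female, coded_mean)
```

Recoding male as 2 and female as 1 would change the "average" to 1.4 without changing anything about the class, which is another way to see that the number carries no meaning.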
Note: Ebert and Roeper's thumbs up / thumbs down is a nominal type of data. Nominal
data is weak. Better if we use all five fingers.
ORDINAL
Ranking, but the intervals between the points are not equal
We can say that an object has more or less of the characteristic than another
object
Examples: social class, hardness of minerals scale, income as categories, class
standing, rankings of football teams, military rank (general, colonel, major,
lieutenant, sergeant, etc.), hurricane rankings (category 1, 2, ..., category 5)
Example:
Income
Under $20,000 (checked by, say, John Smith)
$20,000 - $49,999 (checked by, say, Jane Doe)
$50,000 and over (checked by, say, Bill Gates)
Bill Gates checks box 3 even though he earns several billion dollars. The distance
between Gates and Doe is not the same as the distance between Doe and Smith.
Appropriate statistics
same as those for nominal data, plus
median, but not mean
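As a small illustration of why the median works for ordinal data, the sketch below uses the three income boxes and the three respondents from the example above. Only the order of the categories is used; no numeric income is assigned to a box. The variable names are illustrative.

```python
from statistics import median_low

# Income categories in their natural order (box 1, box 2, box 3)
CATEGORIES = ["Under $20,000", "$20,000 - $49,999", "$50,000 and over"]

responses = {
    "John Smith": "Under $20,000",
    "Jane Doe": "$20,000 - $49,999",
    "Bill Gates": "$50,000 and over",
}

# Replace each answer with its position in the ordering
ranks = [CATEGORIES.index(answer) for answer in responses.values()]

# median_low always returns one of the observed ranks, so it maps
# back to a real category even with an even number of respondents
median_category = CATEGORIES[median_low(ranks)]
print(median_category)
```

The median category stays the same whether Bill Gates earns $60,000 or several billion dollars, which is why it is appropriate here while the mean is not.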
Ranking scales are obviously ordinal. There is nothing absolute here. Just
because someone chooses a top choice does not mean it is really a top choice.
Example:
Please rank from 1 to 4 each of the following:
__ being hit in the face with a dead rat
__ being buried up to your neck in cow manure
__ failing this course
__ having nothing to eat except for chopped liver for a month
INTERVAL
Equal intervals, but no true zero.
Examples: IQ, temperature, GPA.
Since there is no true zero - the complete absence of the characteristic you are
measuring - you cannot speak about ratios.
Example:
Suppose
New York temperature = 40 degrees
Buffalo temperature = 20 degrees
Does that mean it is twice as cold in Buffalo as in NY? No.
Appropriate statistics
- same as for nominal
- same as for ordinal
plus,
- mean
RATIO
Equal intervals and a true zero.
Examples: height, weight, length, units sold
100 lbs is double 50 lbs (same for kilograms)
$100 is half as much as $200
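The difference between interval and ratio scales can be checked numerically. The sketch below assumes the temperatures in the New York / Buffalo example are in degrees Fahrenheit (the notes do not state the unit). It converts them to Celsius, then converts the weights from pounds to kilograms.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9


LB_TO_KG = 0.45359237  # kilograms per pound

# Interval scale: the ratio depends on the arbitrary zero point
ny_f, buffalo_f = 40, 20  # assumed to be degrees Fahrenheit
ratio_f = ny_f / buffalo_f
ratio_c = fahrenheit_to_celsius(ny_f) / fahrenheit_to_celsius(buffalo_f)

# Ratio scale: the ratio survives a change of units
ratio_lb = 100 / 50
ratio_kg = (100 * LB_TO_KG) / (50 * LB_TO_KG)

print(f"temperature ratio: {ratio_f:.2f} in F, {ratio_c:.2f} in C")
print(f"weight ratio: {ratio_lb:.2f} in lb, {ratio_kg:.2f} in kg")
```

The temperature "ratio" changes, and even turns negative, when the unit changes, so it says nothing about how cold Buffalo really is relative to New York. The weight ratio is 2 in either unit because zero pounds and zero kilograms both mean no weight at all.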
(1) excellent   (3) good   (4) fair   (5) poor   (7) awful
This scale is almost interval and is usually treated as such, so means are computed.
This course will help you learn to think for yourself. Knowledge of statistics will
allow you to see the difference between junk science and real science.
An interesting article you may want to read is by John Tierney (NY Times, Science,
October 9, 2007, pp. F1-F2, "Diet and Fat: A Severe Case of Mistaken Consensus").
Apparently, the research indicating that a diet rich in fatty foods leads to a
shortened life because it causes heart disease and other ailments is wrong. Doctors
have been recommending low-fat diets for many years and fell into the trap of what
is known as an informational cascade. If one person states something with a great
deal of confidence, and then another agrees with this information, then the third
person will probably go along with this since s/he will assume the first two must
know what they are talking about. By the time the cascade has run its course,
everyone will assume that the information is correct. The doctor who started the
informational cascade with fatty foods was Dr. Ancel Keys in the 1950s. He made
several blunders in his research:
(1) It is not clear that traditional diets were very lean. Ancient man did a great deal
of hunting and had considerably more fat in his diet than we do today.
(2) There are more cases of heart disease being reported today but it is due to the
fact that people live longer and are more likely to see a doctor; it is not due to worse
health.
(3) Keys correlated diet and heart disease in 6 countries (one was the US) and
found a relationship between fat in the diet and heart disease. Had Dr. Keys studied
22 countries for which data was available, he would have found no correlation.
The moral of the above is that you have to learn to think for yourself and not believe
information that may just be the view of one person and a case of mistaken cascade.