
# Statistics in Survey Analysis

1
Ric Coe
ICRAF, Nairobi, Kenya
Contents
Introduction
Preliminaries
Descriptive Statistics
2. Two variables
Descriptive statistics - common problems
Confirmatory analysis: estimation and hypothesis testing
The problem
Estimates, standard errors and confidence intervals
Hypothesis tests: The logic
Examples of calculations
Limitations
What should you do
Confirmatory Analysis - Regression
Starting Regression
Fitting the regression line
Check the fit
Interpretation
Adding more variables - Multiple regression
Interpretation
References
Introduction
This guide summarises the use of simple statistical analyses in the interpretation of survey data. It is aimed at the typical small surveys (up to a few hundred respondents) carried out by researchers looking at the role and uptake of new agricultural technologies.
There are several common problems in the approaches to survey analysis used by many researchers, probably a result of the research methods courses followed during training. One is to concentrate attention on a few well known statistical techniques, such as chi-squared tests in 2-way tables and regression analysis, and to place a naively simplistic reliance on the results. This is the topic of this guide. A second problem is to treat statistical analysis as a recipe that can be followed to a successful conclusion without much thought or understanding along the way. This is the topic of a companion guide "Steps in survey analysis" (Coe 2002). A third problem is to ignore the context in which the survey was carried out, so ignoring many of the possibilities and limitations of the statistical analysis. This is the topic of the guide "Approaches to analysis of survey data" (SSC, 2001).
1 Modified from input to a course "Formal data analysis for bean researchers" organised by CIAT at CMRT, Egerton University, February 1996. Thanks to Soniia David for permission to quote the example.
Example
The example used in this guide was a survey of farmers in two districts of Uganda. It aimed to characterize the pattern of bean growing and understand the role of new bean varieties in the household economy of new farmers. A few of the stated objectives were:
Overall: Provide a baseline against which to measure adoption and impact of improved bean varieties.
Hypotheses:
a. There is no relationship between adoption of new varieties and wealth.
b. The rate of adoption for MCM5001 will be higher in Mbale than Mukono, due to strong non-appreciation of small seeded varieties in Mukono.
2. Impact.
a. Adoption of new varieties will result in an increase in absolute quantities and proportion of beans sold, hence increasing household income from beans.
b. Adoption of new varieties will not result in increased sales of fresh beans.
c. Adoption of new varieties will not change the amount of income from beans controlled by women.
d. ...
The examples are based on a subset of just 50 households from the whole survey of 179.
The variables used in the example have been labeled so should be self-explanatory.
In this guide SPSS has been used for the statistical analysis. General points appear in normal text. Computer output and other items relating specifically to the example are boxed.
Preliminaries
Before starting analysis:
1. Make sure you are familiar with the data source and collection methods.
For example:
Was a random sampling scheme used?
Were individual questionnaires completed during a group meeting?
Who was the data collected by? Why and when?
2. Clarify objectives.
These should have been listed in detail when the survey was planned. If they were not, or have changed, they must be listed now. It is impossible to analyze a survey if you do not know what you are trying to find out.
3. Coding and data entry.
4. Make sure you understand the data. You must understand the exact meaning of every number and code.
Data that needs clarifying:
Variable WIVES (Question 3): Does "1" mean 1 wife or 2 wives? (conflict between questionnaire and code book).
Variable ARRANGE (Question 4): Does "NA" mean there are no bean plots or no husband/wife?
Variables OCCUPHV1 and OCCUPHV2 (Question 8): Why are two occupations given when the question asks for the main occupation?
Variable KAW94A (Question 21): What is the difference between "na" and "No"?
Variable AMKW94A (Question 21): What are the units?
Descriptive Statistics
1. Summarizing Single Variables
Qualitative ("Coded") variables.
Useful summaries are just frequencies and percentages.
MATOKE Grows matoke
Valid Cum
Value Label Value Frequency Percent Percent Percent
Yes 1 42 84.0 84.0 84.0
No 2 8 16.0 16.0 100.0
------- ------- -------
Total 50 100.0 100.0
Valid cases 50 Missing cases 0
HHTYPE Household type
Valid Cum
Value Label Value Frequency Percent Percent Percent
Male headed one wife 1 27 54.0 54.0 54.0
Male headed more tha 2 4 8.0 8.0 62.0
Female headed absent 3 3 6.0 6.0 68.0
Female headed, no hu 4 13 26.0 26.0 94.0
Single man 5 2 4.0 4.0 98.0
Other 7 1 2.0 2.0 100.0
------- ------- -------
Total 50 100.0 100.0
Valid cases 50 Missing cases 0
Note the different emphasis of frequencies and percentages. Frequencies emphasize the sample, percentages emphasize the population. Give the total sample size with percentages.
Take care with percentages: make sure you are using an appropriate baseline (what is 100%?) and remember that percentages might not have to add to 100, as in the example below.
Edit the computer output for presentation!
Crop          % growing
Cassava       100
Beans          98
Matoke         84
Maize          78
Yams           20
Sample size    50
Look carefully at and identify rare cases. Such data points may be errors, or may need special treatment.
What is the 1 "other" household type in question 2?
One farmer does not grow beans. Should this case be deleted from all analyses?
Bar charts are most appropriate when the categories can be ordered in some useful way.
Quantitative Variables
In summarizing quantitative variables the most interesting things are:
o Location (What is a typical value?)
o Spread (How much variation is there?)
o Odd values (What is their source and interpretation?)
Location is measured by mean or median (not usefully the mode).
Spread is measured by standard deviation or distance between quartiles.
Quantities such as the 10% and 90% point are useful in some situations.
Use histograms and boxplots.
2. Two variables.
Two qualitative variables = cross tabulation
Interpretation can be helped by careful layout.
Percentages may be calculated of row totals, column totals or overall totals. Not all of them will make sense!
Amount of beans harvested in 94a
Mean                    15.9
Standard deviation      34.2
Median                   4.0
25% point                0
75% point               14.0
Mean (ignoring 200)     10.1
[Figure: histogram of total beans harvested 94a. Std. Dev = 34.21, Mean = 16.0, N = 47.]
Crop earning highest income, by household type
Crop          Male   Female   Single male   Total
Coffee          19        7             1      27
Groundnut        2        4             0       6
Bogoya           1        3             0       4
Cassava          1        0             1       2
Matoke           2        0             0       2
Beans            1        0             0       1
Other            5        0             0       5
No sales         0        2             0       2
Total                                          49
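The point about row, column and overall percentages can be illustrated with the cross-tabulation above. The following is a minimal sketch in Python, assuming the counts have been typed in by hand; the function names are illustrative and are not part of any survey package.

```python
# Counts copied from the cross-tabulation above:
# crop earning the highest income, by household type.
counts = {
    "Coffee":    {"Male": 19, "Female": 7, "Single male": 1},
    "Groundnut": {"Male": 2,  "Female": 4, "Single male": 0},
    "Bogoya":    {"Male": 1,  "Female": 3, "Single male": 0},
    "Cassava":   {"Male": 1,  "Female": 0, "Single male": 1},
    "Matoke":    {"Male": 2,  "Female": 0, "Single male": 0},
    "Beans":     {"Male": 1,  "Female": 0, "Single male": 0},
    "Other":     {"Male": 5,  "Female": 0, "Single male": 0},
    "No sales":  {"Male": 0,  "Female": 2, "Single male": 0},
}


def row_percent(crop, household):
    """Percent of households with this top-earning crop that are of this type."""
    row_total = sum(counts[crop].values())
    return 100 * counts[crop][household] / row_total


def column_percent(crop, household):
    """Percent of households of this type for which this crop earns most."""
    column_total = sum(row[household] for row in counts.values())
    return 100 * counts[crop][household] / column_total


def overall_percent(crop, household):
    """Percent of all households falling in this cell."""
    grand_total = sum(sum(row.values()) for row in counts.values())
    return 100 * counts[crop][household] / grand_total


for label, fn in (("row", row_percent),
                  ("column", column_percent),
                  ("overall", overall_percent)):
    print(f"Coffee / Male, {label} percent: {fn('Coffee', 'Male'):.0f}%")
```

The same cell (19 male headed households whose top-earning crop is coffee) gives about 70%, 61% or 39% depending on the baseline, so the baseline should always be stated alongside the percentage.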
One qualitative and one quantitative variable = group comparison
Total beans harvested in 94a
Household type    Male    Female
Mean              31.3       5.9
Median            10.0       0
25% point          0         0
Number            31        16
[Figure: box plots of total beans harvested 94a by simplified household type. N = 15 female, 31 male, 1 missing.]
Two quantitative variables
A scatter diagram is the only really useful way to summarize two quantitative variables and their relationship.
The correlation coefficient is a summary of the strength of linear relationship between variables. It should NOT be quoted unless the data have first been looked at in a scatter diagram.
If there appears to be a relationship between variables the points to look for are:
1. Is the relationship monotonic?
2. Are the variables negatively or positively related?
3. Can the relationship be summarized by a straight line?
4. How much effect does X have on Y?
5. How highly clustered are points around a line?
6. Are there any gaps in the plot or do we have data values covering the whole range of X or Y?
7. Are there any outliers or odd observations?
[Figure: scatter diagram of total beans harvested 94a against total amount beans planted 94a, with points marked by simplified household type (female, male).]
Three or more variables
When three or more variables are being investigated, cross tabulations become sparse and difficult to interpret and clear graphs difficult to construct.
A simple example of the need for not always considering just two variables at a time is given. In both Region 1 and Region 2 it is clear adoption is not related to income (67% adopt in both high and low income groups in Region 1 and 33% in Region 2) but if the sum of the two regions is studied there appears to be higher adoption in the high income group.
Exactly the same thing occurs with continuous variables where spurious correlation (or lack of it) can be due to a third variable which has not been allowed for. More advanced graphical (e.g. small multiple pictures) and numerical (regression and log-linear modeling, multivariate methods such as principal components) methods exist to help there.
Artificial Example
Region 1        -     +
Income   L     10    20
         H     20    40
Region 2        -     +
Income   L     40    20
         H     20    10
Overall         -     +
Income   L     50    40
         H     40    50
[Figure: scatter diagram matrix of planted 94a, planted 94b, harvested 94a and harvested 94b.]
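The arithmetic behind the artificial example can be checked directly. This is a minimal sketch, assuming the "-" and "+" columns are non-adopters and adopters respectively (an interpretation inferred from the 67% and 33% quoted in the text) and that L and H are the low and high income groups.

```python
# Each entry is (non-adopters, adopters), read from the "-" and "+" columns.
# The meaning of the columns is an assumption inferred from the text.
regions = {
    "Region 1": {"L": (10, 20), "H": (20, 40)},
    "Region 2": {"L": (40, 20), "H": (20, 10)},
}


def adoption_rate(non_adopters, adopters):
    """Proportion adopting within one income group."""
    return adopters / (non_adopters + adopters)


def pooled_counts(income):
    """Add the two regions together for one income group."""
    non_adopters = sum(regions[r][income][0] for r in regions)
    adopters = sum(regions[r][income][1] for r in regions)
    return non_adopters, adopters


for name, table in regions.items():
    for income in ("L", "H"):
        print(f"{name}, income {income}: {adoption_rate(*table[income]):.0%} adopt")

for income in ("L", "H"):
    print(f"Overall, income {income}: {adoption_rate(*pooled_counts(income)):.0%} adopt")
```

Within each region the two income groups adopt at the same rate, yet the pooled table shows 44% adoption for low income and 56% for high income. The apparent income effect comes entirely from the high income group being concentrated in the high-adoption region.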
Descriptive statistics - common problems
Use of standard techniques rather than the most appropriate.
An example is the histogram to show the distribution of a continuous variable. The histogram shows features such as location and skewness. However, other possibilities are cumulative histograms (which show % points), boxplots (good for comparing, and showing outliers), q-q or normal probability plots (to check if the variable has a normal distribution) or stem-and-leaf plots (to look at individual values).
Be imaginative - find the best way to display the information you want.
[Figures: four displays of the variable AMPLT94A. (1) Histogram, number of observations by class (<= 0, (0-5], (5-10], ..., (35-40], > 40). (2) Cumulative histogram over the same classes. (3) Box plot: median = 1.75, 25% = 0, 75% = 3, non-outlier min = 0, non-outlier max = 7, with outliers and extremes marked. (4) Normal quantile-quantile plot of observed value against theoretical quantile.]
Use of techniques you can get your computer to do.
Much statistics software is very flexible. If you learn enough about it you can get it to do most things, but not everything.
Be prepared to do some analysis, including drawing of graphs or tables, by hand.
Concentration on means when variation is important.
Cases which deviate from the mean, contributing to variability, are probably just as important as the average values.
Make sure you understand whether variation is important, and if so, describe it.
Limited use of derived quantities.
It is unlikely that each substantive question can be answered from columns of raw data alone. Calculation of new variables is certain to be important.
Calculate new variables that are needed to answer the questions.
Confusion over the unit of analysis.
Many datasets contain data collected at more than 1 level (e.g. plot, person, household, community). Analyses must use the relevant level. Mixed levels are almost always wrong.
Even in surveys with data collected at one level there is room for confusion regarding, for example, calculations of percentages.
Variety             Number of farmers     Average of those
                    planting in 94A       farmers who planted
Kawanda                    11                   2.45
Manyigamulimi              21                  10.53
Kanyebwa                    0                   -
White haricot               0                   -
All others                 14                   2.04
No beans planted           18                   -
The various interesting percentages are:
Percent of all farmers planting Kawanda = 11/50 = 22%
Percent of all farmers who planted in 94A who planted Kawanda = 11/(50-18) = 34%
Percent of amount planted that was planted to Kawanda = (11 x 2.45) / (11 x 2.45 + 21 x 10.53 + 14 x 2.04) = 26.95/276.64 = 9.7%
Not working with relevant subsets of the data
Should the farmer who never grows beans be deleted from the dataset? Should cases for whom farming is not the main occupation be omitted when analyzing economic activity?
Make sure all relevant data, but no irrelevant data, is being used.
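The three Kawanda percentages worked out under "Confusion over the unit of analysis" differ only in their denominator, which is the same question as which subset of the data is relevant. A minimal sketch of the calculation follows, using the figures from the variety table; the constant and function names are illustrative only.

```python
N_FARMERS = 50    # farmers in the subset used for the examples
N_NO_BEANS = 18   # farmers who planted no beans in 94A

# variety: (number of farmers planting, average amount among those farmers)
varieties = {
    "Kawanda": (11, 2.45),
    "Manyigamulimi": (21, 10.53),
    "All others": (14, 2.04),
}


def pct_of_all_farmers(variety):
    """Baseline: every farmer in the sample."""
    return 100 * varieties[variety][0] / N_FARMERS


def pct_of_bean_planters(variety):
    """Baseline: only farmers who planted beans in 94A."""
    return 100 * varieties[variety][0] / (N_FARMERS - N_NO_BEANS)


def pct_of_amount_planted(variety):
    """Baseline: total amount of beans planted, not number of farmers."""
    total = sum(n * average for n, average in varieties.values())
    n, average = varieties[variety]
    return 100 * n * average / total


print(f"{pct_of_all_farmers('Kawanda'):.0f}% of all farmers")
print(f"{pct_of_bean_planters('Kawanda'):.0f}% of farmers who planted beans")
print(f"{pct_of_amount_planted('Kawanda'):.1f}% of the amount planted")
```

The first two functions count farmers and the third counts amount planted, so the unit of analysis changes as well as the subset. Naming each baseline explicitly, as here, makes it harder to quote a percentage without saying what it is a percentage of.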
Poor handling of outliers.
Be on the look out for all odd observations, which might represent mistakes or unusual cases. Mistakes must be corrected. Treatment of unusual cases depends on context. Including them can distort the picture. Omitting them can induce bias.
Balance between Exploratory analysis and Data Dredging
Exploratory analysis means looking for interesting patterns in the data without focusing on a specific question (e.g. "Who are the farmers who have heard of the new variety?"). This can be valuable, and show up facts which had not been thought of or hypothesized.
Data dredging means searching through many statistics until "something turns up". For example, doing a cross-tabulation of "Heard of new varieties" with every other qualitative variable. The results will be spurious (if you search through enough columns of random numbers you will eventually find "interesting" correlations).
The distinction between the two approaches is fine!
Confirmatory analysis: estimation and hypothesis testing
The problem
A. Household Type
Labour                      Male    Female    Total
Never hire or exchange        23        13       36
Hire or exchange              10         3       13
Total                         33        16       49
In Table A we can see:
33% of the households are female headed.
30% of male headed households hire labour, but only 19% of female headed households do.
B. Farmers who planted beans in 94a
Amount planted    Male    Female    Overall
Mean               6.5       2.9        5.8
s.d.               9.5       1.3        8.6
n                   24         6         30
In Table B we can see:
The mean amount of beans planted in 94a by farmers who grew beans that season is 5.8 kg.
The amount planted by males was 6.5 kg, but only 2.9 kg by females.
All these results are based on data from a sample of just 50 farmers in the district. How reliable are they? If we had measured a different 50 how similar would the results have been? If we had measured 500, or the whole population, would the conclusions have been much the same?
The results differ from the "true" answer for two reasons:
Non-sampling errors - incorrect responses, mistakes in coding and data entry, poor recall, biased selection of respondents.
Sampling errors - those due to the fact that we have measured only some (a sample) of the population.
The non-sampling errors can not usually be measured, but can be minimized by good survey practice. Sampling errors can be measured, and that is the purpose of much confirmatory statistics.
Estimates, standard errors and confidence intervals.
Proportions
The proportion of female headed households in the population is P. P is unknown. The sample value is p = 0.33 (= 16/49). The uncertainty due to sampling errors in this is measured by the standard error. The standard error is
$$\mathrm{se}(p) = \sqrt{\frac{p(1-p)}{n}},$$
where n = sample size.
se(p) is estimated by
$$\sqrt{\frac{0.33\,(1-0.33)}{49}} = 0.07$$
This is the standard deviation of possible estimates that could be produced by different simple random samples of the same size.
The standard error is best interpreted via a confidence interval. A 95% confidence interval for p is
p ± 2 x se(p)
= 0.33 ± 2 x 0.07
= (0.19, 0.47)
This is interpreted as "We are 95% confident that the true percentage of female headed households is between 19% and 47%". Hence the uncertainty in results due to sampling error is quantified.
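The standard error and approximate 95% interval for a proportion can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function name and the default multiplier of 2 (the rule of thumb used in the text) are choices made for illustration.

```python
import math


def proportion_ci(count, n, multiplier=2.0):
    """Sample proportion, its standard error and an approximate interval.

    Uses se(p) = sqrt(p(1-p)/n) and p +/- multiplier * se(p), which assumes
    a simple random sample, as the text does.
    """
    p = count / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, (p - multiplier * se, p + multiplier * se)


# 16 female headed households out of 49 (Table A).
p, se, (lower, upper) = proportion_ci(16, 49)
print(f"p = {p:.2f}, se = {se:.3f}, interval = ({lower:.2f}, {upper:.2f})")
```

Working from the unrounded values the upper limit comes out at about 0.46 rather than 0.47. The text rounds p to 0.33 and se(p) to 0.07 before forming the interval, and differences of this size have no practical meaning.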
Means
The mean amount of beans planted in 94a is 5.8 kg. The standard error of this is
$$\mathrm{se}(\text{mean}) = \sqrt{\frac{s^2}{n}},$$
where $s^2$ is the variance in amount of beans and n the sample size.
$$\mathrm{se}(\text{mean}) = \sqrt{\frac{8.6^2}{30}} = 1.6$$
The 95% confidence interval is
mean ± 2 x se(mean)
= 5.8 ± 2 x 1.6
= (2.6, 9.0)
The mean amount of beans planted is between 2.6 and 9.0 kg.
Differences
If interested in differences between subgroups we can similarly estimate the difference and find a standard error of the estimate.
Difference in mean amount of beans planted by males and females = 6.5 - 2.9 = 3.6 kg.
$$\mathrm{se}(\text{difference}) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{9.5^2}{24} + \frac{1.3^2}{6}} = 2.0$$
The 95% confidence interval for the difference is
3.6 ± 2 x 2.0
= (-0.4, 7.6)
The mean difference between amounts planted by males and females could be anything between -0.4 kg and 7.6 kg.
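The calculations for a mean and for a difference between two means follow the same pattern as for a proportion. A minimal sketch, assuming the summary values from Table B and the same "± 2 standard errors" rule; the function names are illustrative.

```python
import math


def mean_ci(mean, sd, n, multiplier=2.0):
    """Standard error of a mean and an approximate interval for it."""
    se = sd / math.sqrt(n)
    return se, (mean - multiplier * se, mean + multiplier * se)


def difference_ci(mean1, sd1, n1, mean2, sd2, n2, multiplier=2.0):
    """Difference between two group means, its standard error and interval."""
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return diff, se, (diff - multiplier * se, diff + multiplier * se)


# Overall: mean 5.8, s.d. 8.6, n 30 (Table B).
se_all, ci_all = mean_ci(5.8, 8.6, 30)
print(f"se(mean) = {se_all:.2f}, interval = ({ci_all[0]:.1f}, {ci_all[1]:.1f})")

# Male (6.5, 9.5, 24) against female (2.9, 1.3, 6).
diff, se_diff, ci_diff = difference_ci(6.5, 9.5, 24, 2.9, 1.3, 6)
print(f"difference = {diff:.1f}, se = {se_diff:.2f}, "
      f"interval = ({ci_diff[0]:.1f}, {ci_diff[1]:.1f})")
```

The interval for the difference includes zero, which is the interval-based counterpart of the non-significant t-test reported later in the guide. Small discrepancies from the figures in the text arise because the text rounds the standard error before multiplying.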
Hypothesis tests: The logic
The logic of all the tests commonly used depends on the fact that random samples from a population behave in a predictable way. The mean amount of beans planted by female households, 2.9 kg, is not the actual mean of all households in the districts where the study took place. If a different sample had been randomly selected the mean would have been different. The question is "How different?". If all households are very similar (low variation between households) then it really does not matter which sample is selected. On the other hand, high variation in the population will lead to very different sample means, and hence less certainty in the results obtained. The mathematics of statistics allows quantification of these ideas, and hence answers to the question of how certain we are of the results.
The logic of the hypothesis tests is as follows:
1. Assume some fact is true - the null hypothesis (e.g. There is no difference in mean amount of beans planted by male and female headed households).
2. Deduce how the sample would behave if (1) is true (e.g. How big could the sample differences between male and female headed households be?)
3. Compare the actual sample with the predictions in (2).
4. If (2) and (3) do not agree then (1) must be untrue - the null hypothesis is rejected. If (2) and (3) do agree then there is no reason, in this data, not to believe (1).
The level of agreement is measured by the "significance level", explained in the examples below.
Examples of calculations
Chi-squared test for no association in a 2 x 2 table.
Taking Table A as an example, we want to test whether the proportion of households hiring labour is the same in male and female headed households. The steps are:
1. Formulate the null hypothesis: the proportion is equal for both male and female households.
2. If (1) is true, then the common proportion never hiring is estimated by 36/49 (and the proportion hiring by 13/49). Hence we would expect numbers in each category to be:
               Male                   Female
Never hire     33 x 36/49 = 24.2      16 x 36/49 = 11.8
Hire           33 x 13/49 = 8.8       16 x 13/49 = 4.2
3. The difference between observed and expected frequencies is summarised as
$$\chi^2 = \frac{(23-24.2)^2}{24.2} + \frac{(13-11.8)^2}{11.8} + \frac{(10-8.8)^2}{8.8} + \frac{(3-4.2)^2}{4.2} = 0.74$$
(The value 0.74 is obtained when the expected frequencies are not rounded.)
4. If (1) is valid then the value of $\chi^2$ should be an observation from a $\chi^2_1$ distribution. Comparison with tables shows that 0.74 is not an extreme observation. A number at least as big as this would occur 39% of the time. The significance level is p = 0.39. Hence there is no strong reason not to believe the null hypothesis.
t-test to compare two means
In example B the steps needed are:
1. Formulate the null hypothesis: the difference in mean amount of beans planted for male and female households is zero.
2, 3. If (1) is true, then the difference in means of 3.6 kg, scaled by its standard error (= 2.0),
$$t = \frac{3.6}{2.0} = 1.8,$$
is an observation from a $t_{28}$ distribution.
4. Comparison with tables shows that 1.8 is not an extreme observation. A difference as big as this would occur 8% of the time if (1) is true. The significance level is p = 0.08. Hence there is not much reason not to believe the null hypothesis.
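Both worked examples can be reproduced without statistical tables. The sketch below uses only the Python standard library. The chi-squared p-value relies on the identity P(χ²₁ > x) = erfc(√(x/2)), which holds for 1 degree of freedom only, and the t-distribution p-value is obtained by numerically integrating the t density with Simpson's rule. Both shortcuts are choices made for this illustration, not the method used to produce the figures in the text.

```python
import math

# Table A: rows = never hire / hire, columns = male / female headed.
observed = [[23, 13], [10, 3]]


def chi_squared_2x2(table):
    """Chi-squared statistic and p-value for a 2 x 2 table (1 d.f.)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * column_totals[j] / n
            statistic += (obs - expected) ** 2 / expected
    # Valid for 1 degree of freedom only.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


def t_two_sided_p(t, df, steps=1000):
    """Two-sided p-value for a t statistic, by Simpson's rule on the density."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * density(k * h)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t


statistic, p_chi = chi_squared_2x2(observed)
print(f"chi-squared = {statistic:.2f}, p = {p_chi:.2f}")

t_stat = 3.6 / 2.0  # difference and standard error as rounded in the text
p_t = t_two_sided_p(t_stat, df=28)
print(f"t = {t_stat:.1f}, p = {p_t:.2f}")
```

The results agree with the values quoted above (0.74 with p = 0.39, and 1.8 with p = 0.08). In practice a statistics package would be used, but doing the calculation once by hand or in a short script helps make clear what the package output means.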
Limitations
Assumptions.
The calculations in both examples above are based on a series of assumptions. The key ones are:
Independence. In both examples A and B we assume observations are independent. Lack of independence is caused by:
(i) non-simple random samples. In this case we have used a stratified sample.
(ii) interference between observations. This would be the case if individuals within the same household responded, or if data were collected at a group meeting.
Lack of bias due to non-response, interviewer effects, attempts to "please" the researcher etc.
Equality of variance and normal distribution (t-test). These assumptions can be checked. In example B the data is clearly not normally distributed.
Limits to interpretation.
(1) If the result is "significant" we can reject the null hypothesis, and conclude that there is a real difference in the population. If the result is "not significant" we have not proved there is no difference. It is never possible to prove the null hypothesis is true (it almost never will be!). All we can say is this study has not produced evidence to make us disbelieve the null hypothesis.
(2) At what level of significance should the null hypothesis be rejected? 5% is commonly used but there is absolutely no reason why it should be treated as a rigid cut off. 6% and 4% significance levels are, for all real purposes, equivalent.
(3) Whether the null hypothesis is rejected depends as much on the sample size and precision of the study, as on the "truth" of the null hypothesis. A small, imprecise survey will not detect a difference that could be picked up by a larger study. Maybe we just did not collect enough data!
(4) The whole logic of significance testing and the p-value rests on what would happen in repeated surveys of the same design, using new randomisations. Is this sensible, when we know the survey would not and can not ever be repeated?
(5) In most analysis exercises, differences which "look interesting" at the exploratory stage are investigated further in the confirmatory analysis. If the tests to perform have been selected because differences look large, all significance levels are invalid.
(6) If a large number of tests are performed, as is often the case in analysis of a study with many variables, then we would expect 5% of the tests to give "significant" results at the p = 0.05 level even if all null hypotheses were true. Hence it can be difficult to interpret the results of multiple tests.
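The effect of running many tests, point (6) above, is easy to quantify. The following is a minimal sketch which assumes the tests are independent of each other, something that will rarely be exactly true for variables from the same survey; the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of 'significant' results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of at least one 'significant' result among independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 50):
    print(f"{n_tests:3d} tests: expect {expected_false_positives(n_tests):.2f} "
          f"false positives, P(at least one) = {prob_at_least_one(n_tests):.2f}")
```

With 20 tests the chance of at least one spurious "significant" result is already about 64%, which is why a single small p-value found among many tests is weak evidence.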
What should you do
(1) Treat the significance level p as an indication of "strength of evidence" against the null hypothesis, not as a Yes/No decision maker.
(2) Concentrate on estimating the size of differences, rather than just testing whether they exist. Confidence intervals for differences will be much more useful than hypothesis tests.
At the end of every significance test apply the SO WHAT? test. Ask yourself "So what?". Has the significance test really improved your understanding of the situation and helped you take a rational decision for future action? If not, forget it, and get on with something more useful.
Confirmatory Analysis - Regression
Starting Regression
- Beware! Even "simple" regression is not simple!
- Start by considering types of relationship that might exist. The most useful regression analysis will be one that starts from understanding of the theory behind the process being studied.
The example used here is rather artificial. It examines the proposition that the amount of beans harvested in 94a depends only on land area.
- Plot the data to see if there is any evidence of the relationship.
[Figure: scatter diagram of HVTOT94A (total beans harvested 94a) against LANDAREA.]
Fitting the regression line
- Software is widely available to do this.
- Understand the output!
* * * *  M U L T I P L E   R E G R E S S I O N  * * * *
Listwise Deletion of Missing Data
Equation Number 1    Dependent Variable..   HVTOT94A   total beans harvested 94a
Block Number 1.  Method:  Enter      LANDAREA
Variable(s) Entered on Step Number
   1..    LANDAREA
Multiple R           .54425
R Square             .29621
Standard Error     29.01659
Analysis of Variance
                    DF      Sum of Squares      Mean Square
Regression           1         15946.10384      15946.10384
Residual            45         37888.31105        841.96247
F =       18.93921       Signif F =  .0001
------------------ Variables in the Equation ------------------
Variable              B        SE B       Beta         T   Sig T
LANDAREA       8.200238    1.884280    .544249     4.352   .0001
(Constant)    -2.863844    6.051297                -.473   .6383
End Block Number 1   All requested variables entered.
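One way to check that the output is understood is to rebuild its summary figures from the sums of squares and coefficients it reports. A minimal sketch follows, with the numbers typed in from the output above. The approximate interval for the slope uses the "± 2 standard errors" rule from earlier in the guide and is not part of the printed output.

```python
import math

# Values copied from the regression output above.
ss_regression, df_regression = 15946.10384, 1
ss_residual, df_residual = 37888.31105, 45
slope, se_slope = 8.200238, 1.884280

# R Square: share of the total sum of squares accounted for by LANDAREA.
r_squared = ss_regression / (ss_regression + ss_residual)

# F: regression mean square divided by residual mean square.
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)

# Standard Error: square root of the residual mean square.
residual_se = math.sqrt(ss_residual / df_residual)

# T for LANDAREA: coefficient divided by its standard error.
t_slope = slope / se_slope

# Approximate 95% interval for the slope (rule of thumb, not in the output).
slope_interval = (slope - 2 * se_slope, slope + 2 * se_slope)

print(f"R Square       = {r_squared:.5f}")
print(f"F              = {f_statistic:.5f}")
print(f"Standard Error = {residual_se:.5f}")
print(f"T (LANDAREA)   = {t_slope:.3f}")
print(f"slope interval = ({slope_interval[0]:.1f}, {slope_interval[1]:.1f})")
```

Each recomputed figure matches the corresponding line of the output. The interval for the slope, roughly 4.4 to 12.0 harvest units per unit of land area, conveys how imprecisely the relationship is estimated, which the "Sig T" column alone does not.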
Check the fit
- Look for any unusual points or outliers. They could represent mistakes or cases that require special treatment. They certainly require explanation.
- Look for influential points, which largely determine results. They are not a bad thing, but you must be aware if your conclusions depend critically on one or two observations.
- Look at the residuals to determine:
1. Whether they satisfy the main assumptions that validate the analysis (constant variance, independence, roughly normally distributed).
2. Whether they show patterns according to the value of other variables, indicating that those other variables should be allowed for in the analysis.
Interpretation
"Significance" does not tell you whether the fitted model is logically sound or if it fits the data well.
"Significance" does not tell you whether the model is useful in explaining or describing a relationship, or if the relationship has much predictive power.
A regression model derived from survey data can not tell you what would happen when an "x-variable" is changed. For example we can not use it to predict the bean harvest of a farmer whose land holding changes.
Existence of a regression relationship between two variables does not mean there is a causal relationship.
Regression relationships become useful when similar relationships are found in a number of different conditions. Look for "significant sameness" between regions, crops, farm types, etc.
Adding more variables - Multiple regression
Multiple regression is a powerful tool for understanding the relationship of one variable to several others. BUT.....
All the limitations to interpretation above apply, and are compounded by the existence of several "x-variables".
It is hard to draw graphs that show the relationships and the way data depart from them, so the analyst must rely more on numerical indicators of lack of fit, outliers, and influential points. Multiple regression analysis will not be successful if these are not understood.
"Stepwise" and similar variable selection techniques, so loved by social scientists, have little theoretical basis and can produce answers which are very poor. Regression modeling will be most successful if understanding of the underlying processes is used to choose possible models, rather than relying on computer algorithms.
The sample size required for multiple regression analysis depends on the "configuration" of the data (in particular the range of the x-variables and correlations among them). The required sample size quickly becomes large as the number of x-variables increases. If regression analysis is part of the principal objectives of the survey, it might be possible to select the sample in a way that makes the analysis more efficient.
[Figure: raw residuals vs. HHTYPE2, plotted for household types 1 and 2.]
Interpretation
Interpret results. This does not mean "understand which effects are significant" but "understand and communicate what you now know about the problem". You should be able to:
Meet the objectives of the study.
Clearly state what is the substantive new knowledge which has been generated.
Show how this new information and understanding builds on what was there before. Does it:
o add more examples of something previously known?
o mean that general rules or principles can be stated with more confidence?
o allow predictions to be made for new and important situations?
o mean that current understanding or theory has to be substantially modified?
Use the quantitative information you have generated to make quantitative