Statistics in Survey Analysis

Ric Coe
ICRAF, Nairobi, Kenya
Contents
Introduction
Preliminaries
Descriptive Statistics
  2. Two variables
Descriptive statistics - common problems
Confirmatory analysis: estimation and hypothesis testing
  The problem
  Estimates, standard errors and confidence intervals
  Hypothesis tests: The logic
  Examples of calculations
  Limitations
  What should you do
Confirmatory Analysis - Regression
  Starting Regression
  Fitting the regression line
  Check the fit
  Interpretation
  Adding more variables - Multiple regression
Interpretation
References
Introduction
This guide summarises the use of simple statistical analyses in the interpretation of survey data. It is aimed at the typical small surveys (up to a few hundred respondents) carried out by researchers looking at the role and uptake of new agricultural technologies.
There are several common problems in the approaches to survey analysis used by many researchers, probably a result of the research methods courses followed during training. One is to concentrate attention on a few well known statistical techniques, such as chi-squared tests in 2-way tables and regression analysis, and to place a naively simplistic reliance on the results. This is the topic of this guide. A second problem is to treat statistical analysis as a recipe that can be followed to a successful conclusion without much thought or understanding along the way. This is the topic of a companion guide "Steps in survey analysis" (Coe 2002). A third problem is to ignore the context in which the survey was carried out, so ignoring many of the possibilities and limitations of the statistical analysis. This is the topic of the guide "Approaches to analysis of survey data" (SSC, 2001).
Note: Modified from input to a course "Formal data analysis for bean researchers" organised by CIAT at CMRT, Egerton University, February 1996. Thanks to Soniia David for permission to quote the example.
Example
The example used in this guide was a survey of farmers in two districts of Uganda. It aimed to characterize the pattern of bean growing and understand the role of new bean varieties in the household economy of new farmers. A few of the stated objectives were:
Overall: Provide a baseline against which to measure adoption and impact of improved bean varieties.
Hypotheses:
1. Adoption.
a. There is no relationship between adoption of new varieties and wealth.
b. The rate of adoption for MCM5001 will be higher in Mbale than Mukono, due to strong non-appreciation of small seeded varieties in Mukono.
2. Impact.
a. Adoption of new varieties will result in an increase in absolute quantities and proportion of beans sold, hence increasing household income from beans.
b. Adoption of new varieties will not result in increased sales of fresh beans.
c. Adoption of new varieties will not change the amount of income from beans controlled by women.
d. ...
The examples are based on a subset of just 50 households from the whole survey of 179. The variables used in the example have been labeled so should be self-explanatory.
In this guide SPSS has been used for the statistical analysis. General points appear in normal text. Computer output and other items relating specifically to the example are boxed.
Preliminaries
Before starting analysis:
1. Make sure you are familiar with the data source and collection methods.
For example:
Was a random sampling scheme used?
Were individual questionnaires completed during a group meeting?
Who was the data collected by? Why and when?
2. Clarify objectives.
These should have been listed in detail when the survey was planned. If they were not, or have changed, they must be listed now. It is impossible to analyze a survey if you do not know what you are trying to find out.
3. Coding and Data entry.
4. Make sure you understand the data. You must understand the exact meaning of every number and code.
Data that needs clarifying.
Variable WIFE (Question 3): Does "1" mean 1 wife or 2 wives? (conflict between questionnaire and code book).
Variable ARRANGE (Question 4): Does "NA" mean there are no bean plots or no husband/wife?
Variables OCCUPHVI and OCCUPHV2 (Question 8): Why are two occupations given when the question asks for the main occupation?
Variable KAW94A (Question 21): What is the difference between "na" and "No"?
Variable AMKW94A (Question 21): What are the units?
Descriptive Statistics
1. Summarizing Single Variables
Qualitative ("Coded") variables.
Useful summaries are just frequencies and percentages.
MATOKE Grows matoke
Valid Cum
Value Label Value Frequency Percent Percent Percent
Yes 1 42 84.0 84.0 84.0
No 2 8 16.0 16.0 100.0
------- ------- -------
Total 50 100.0 100.0
Valid cases 50 Missing cases 0
HHTYPE Household type
Valid Cum
Value Label Value Frequency Percent Percent Percent
Male headed one wife 1 27 54.0 54.0 54.0
Male headed more tha 2 4 8.0 8.0 62.0
Female headed absent 3 3 6.0 6.0 68.0
Female headed, no hu 4 13 26.0 26.0 94.0
Single man 5 2 4.0 4.0 98.0
Other 7 1 2.0 2.0 100.0
------- ------- -------
Total 50 100.0 100.0
Valid cases 50 Missing cases 0
Note different emphasis of frequencies and percentages. Frequencies emphasize the sample, percentages emphasize the population. Give total sample size with percentages.
Take care with percentages: make sure you are using an appropriate baseline (what is 100%) and remember that percentages might not have to add to 100, as in the example below.
Edit the computer output for presentation!
Crop           % growing
Cassava        100
Beans          98
Matoke         84
Maize          78
Yams           20
Sample size    50
Look carefully at and identify rare cases. Such data points may be errors, or may need special treatment.
What is the 1 "other" household type in question 2?

One farmer does not grow beans. Should this case be deleted from all analyses?
Bar charts are most appropriate when the categories can be ordered in some useful way.
Quantitative Variables
In summarizing quantitative variables the most interesting things are:
o Location (What is a typical value?)
o Spread (How much variation is there?)
o Odd values (What is their source and interpretation?)
Location is measured by mean or median (not usefully the mode).
Spread is measured by standard deviation or distance between quartiles.
Quantities such as the 10% and 90% point are useful in some situations.
Use histograms and boxplots.
2. Two variables.
Two qualitative variables = cross tabulation
Interpretation can be helped by careful layout.
Percentages may be calculated of row totals, column totals or overall totals. Not all of them will make sense!
Amount of beans harvested in 94a
Mean                   15.9
Standard deviation     34.2
Median                 4.0
25% point              0
75% point              14.0
Mean (ignoring 200)    10.1
[Figure: histogram of total beans harvested 94a. Std. Dev = 34.21, Mean = 16.0, N = 47.]
Crop earning highest income, by household type
                Male      Female    Single
Crop            headed    headed    male      Total
Coffee          19        7         1         27
Groundnut       2         4         0         6
Bogoya          1         3         0         4
Cassava         1         0         1         2
Matoke          2         0         0         2
Beans           1         0         0         1
Other           5         0         0         5
No sales        0         2         0         2
Total                                         49
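The choice between row and column percentages can be illustrated with the counts in this table. The following is a minimal sketch in Python (the guide itself uses SPSS); the variable names are illustrative only, and the column totals are computed from the cells because the table prints only the grand total.

```python
# Counts from the cross tabulation of main income crop by household type.
# Column order: male headed, female headed, single male.
counts = {
    "Coffee":    [19, 7, 1],
    "Groundnut": [2, 4, 0],
    "Bogoya":    [1, 3, 0],
    "Cassava":   [1, 0, 1],
    "Matoke":    [2, 0, 0],
    "Beans":     [1, 0, 0],
    "Other":     [5, 0, 0],
    "No sales":  [0, 2, 0],
}
columns = ["Male headed", "Female headed", "Single male"]


def percent(part, whole):
    """Percentage of `whole` made up by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


column_totals = [sum(row[j] for row in counts.values()) for j in range(len(columns))]
grand_total = sum(column_totals)

# Column percentages: within each household type, which crop earns most?
column_percent = {
    crop: [percent(n, total) for n, total in zip(row, column_totals)]
    for crop, row in counts.items()
}

# Row percentages: for each crop, which household types depend on it?
row_percent = {
    crop: [percent(n, sum(row)) for n in row]
    for crop, row in counts.items()
}

print("Column totals:", dict(zip(columns, column_totals)), "overall:", grand_total)
print("Coffee, % of each household type:", column_percent["Coffee"])
print("Coffee, % of coffee earners by household type:", row_percent["Coffee"])
```

Column percentages answer "what do male headed households rely on?", while row percentages answer "who are the coffee earners?". With only 2 single men in the sample, any percentage in that column should be quoted together with its base.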
One qualitative and one quantitative variable = group comparison
Total beans harvested in 94a, by household type
             Male     Female
Mean         31.3     5.9
Median       10.0     0
25% point    0        0
Number       31       16
[Figure: boxplots of total beans harvested 94a by simplified household type (female N = 15, male N = 31, missing N = 1).]
Two quantitative variables
A scatter diagram is the only really useful way to summarize two quantitative variables and their relationship.
The correlation coefficient is a summary of the strength of linear relationship between variables. It should NOT be quoted unless the data have first been looked at in a scatter diagram.
If there appears to be a relationship between variables the points to look for are:
1. Is the relationship monotonic?
2. Are the variables negatively or positively related?
3. Can the relationship be summarized by a straight line?
4. How much effect does X have on Y?
5. How highly clustered are points around a line?
6. Are there any gaps in the plot or do we have data values covering the whole range of X or Y?
7. Are there any outliers or odd observations?
[Figure: scatter diagram of total beans harvested 94a against total amount beans planted 94a, with points marked by simplified household type (female, male).]
Three or more variables
When three or more variables are being investigated, cross tabulations become sparse and difficult to interpret and clear graphs difficult to construct.
A simple example of the need for not always considering just two variables at a time is given. In both Region 1 and Region 2 it is clear adoption is not related to income (67% adopt in both high and low income groups in Region 1 and 33% in Region 2) but if the sum of the two regions is studied there appears to be higher adoption in the high income group.
Exactly the same thing occurs with continuous variables where spurious correlation (or lack of it) can be due to a third variable which has not been allowed for. More advanced graphical (e.g. small multiple pictures) and numerical (regression and log-linear modeling, multivariate methods such as principal components) methods exist to help there.
Artificial Example
Region 1
                 Adoption
                 -       +
Income    L      10      20
          H      20      40
Region 2
                 Adoption
                 -       +
Income    L      40      20
          H      20      10
Overall
                 Adoption
                 -       +
Income    L      50      40
          H      40      50
[Figure: small multiple scatter diagrams of planted 94a, planted 94b, harvested 94a and harvested 94b.]
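The artificial example can be checked with a few lines of arithmetic. The sketch below, in Python, uses only the counts from the three tables; the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
# Artificial example: (non-adopters, adopters) by income group within each region.
regions = {
    "Region 1": {"Low": (10, 20), "High": (20, 40)},
    "Region 2": {"Low": (40, 20), "High": (20, 10)},
}


def adoption_rate(non_adopters, adopters):
    """Percentage of households in a group that adopted."""
    return 100.0 * adopters / (non_adopters + adopters)


# Adoption rate by income group, looking at each region separately.
within = {
    region: {income: round(adoption_rate(*cells)) for income, cells in groups.items()}
    for region, groups in regions.items()
}

# Adoption rate by income group after adding the two regions together.
pooled = {}
for income in ("Low", "High"):
    non_adopters = sum(regions[r][income][0] for r in regions)
    adopters = sum(regions[r][income][1] for r in regions)
    pooled[income] = round(adoption_rate(non_adopters, adopters))

print("Within regions:", within)   # 67% and 67% in Region 1, 33% and 33% in Region 2
print("Regions pooled:", pooled)   # 44% for low income, 56% for high income
```

The apparent income effect in the pooled table arises because most high income households are in Region 1, where adoption is high for everyone. Region is the third variable that has not been allowed for.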
Descriptive statistics - common problems
Use of standard techniques rather than the most appropriate.
An example is the histogram to show the distribution of a continuous variable. The histogram shows features such as location and skewness. However, other possibilities are cumulative histograms (which show % points), boxplots (good for comparing, and showing outliers), q-q or normal probability plots (to check if the variable has a normal distribution) or stem-and-leaf plots (to look at individual values).
Be imaginative - find the best way to display the information you want.
[Figure: four displays of the variable AMPLT94A. (1) Histogram: number of observations in the classes <= 0, (0,5], (5,10], ..., (35,40], > 40. (2) Cumulative histogram over the same classes. (3) Box plot: median = 1.75, 25% = 0, 75% = 3, non-outlier min = 0, non-outlier max = 7, with outliers and extremes marked. (4) Normal quantile-quantile plot: observed value against theoretical quantile.]
Use of techniques you can get your computer to do.
Much statistics software is very flexible. If you learn enough about it you can get it to do most things, but not everything.
Be prepared to do some analysis, including drawing of graphs or tables, by hand.
Concentration on means when variation is important.
Cases which deviate from the mean, contributing to variability, are probably just as important as the average values.
Make sure you understand whether variation is important, and if so, describe it.
Limited use of derived quantities.
It is unlikely that each substantive question can be answered from columns of raw data alone. Calculation of new variables is certain to be important.
Calculate new variables that are needed to answer the questions.
Confusion over the unit of analysis.
Many datasets contain data collected at more than 1 level (e.g. plot, person, household, community). Analyses must use the relevant level. Mixed levels are almost always wrong.
Even in surveys with data collected at one level there is room for confusion regarding, for example, calculations of percentages.
Variety             Number of farmers    Average of those
                    planting in 94A      farmers who planted
Kawanda             11                   2.45
Manyigamulimi       21                   10.53
Kanyebwa            0                    -
White haricot       0                    -
All others          14                   2.04
No beans planted    18                   -
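The counts and averages in this table support more than one "percent Kawanda", depending on the unit of analysis chosen. A minimal Python sketch, using only the figures in the table and the subset size of 50 households (variable names are illustrative):

```python
n_farmers = 50      # households in the subset used for the examples
n_no_beans = 18     # farmers who planted no beans in 94A

# variety: (number of farmers planting in 94A, average planted by those farmers)
planting = {
    "Kawanda":       (11, 2.45),
    "Manyigamulimi": (21, 10.53),
    "All others":    (14, 2.04),
}

kawanda_n, kawanda_avg = planting["Kawanda"]

# Unit = all farmers in the sample.
pct_all = 100 * kawanda_n / n_farmers

# Unit = farmers who planted any beans in 94A.
pct_planters = 100 * kawanda_n / (n_farmers - n_no_beans)

# Unit = amount of seed planted, not farmers.
total_amount = sum(n * avg for n, avg in planting.values())
pct_amount = 100 * kawanda_n * kawanda_avg / total_amount

print(f"Of all farmers:          {pct_all:.0f}%")
print(f"Of farmers who planted:  {pct_planters:.0f}%")
print(f"Of the amount planted:   {pct_amount:.1f}%")
```

The three answers (about 22%, 34% and 10%) are all correct. They differ because each uses a different baseline for "100%", so a report must say which one is being quoted.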
The various interesting percentages are:
Percent of all farmers planting Kawanda = 11/50 = 22%
Percent of all farmers who planted in 94A who planted Kawanda = 11/(50-18) = 34%
Percent of amount planted that was planted to Kawanda = (11 x 2.45) / (11 x 2.45 + 21 x 10.53 + 14 x 2.04) = 26.95/276.64 = 9.7%
Not working with relevant subsets of the data
Should the farmer who never grows beans be deleted from the dataset? Should cases for whom farming is not the main occupation be omitted when analyzing economic activity?
Make sure all relevant data, but no irrelevant data, is being used.
Poor handling of outliers.
Be on the look out for all odd observations, which might represent mistakes or unusual cases. Mistakes must be corrected. Treatment of unusual cases depends on context. Including them can distort the picture. Omitting them can induce bias.
Balance between Exploratory analysis and Data Dredging
Exploratory analysis means looking for interesting patterns in the data without focusing on a specific question (e.g. "Who are the farmers who have heard of the new variety?"). This can be valuable, and show up facts which had not been thought of or hypothesized.
Data dredging means searching through many statistics until "something turns up". For example, doing a cross-tabulation of "Heard of new varieties" with every other qualitative variable. The results will be spurious (if you search through enough columns of random numbers you will eventually find "interesting" correlations).
The distinction between the two approaches is fine!
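The remark about columns of random numbers is easy to try out. The sketch below is a simulation in Python under stated assumptions: 50 households, one outcome and 200 other columns, all pure random noise, and a cut-off of |r| > 0.28, which is roughly the two-sided 5% critical value of a correlation coefficient for a sample of 50. None of these numbers come from the survey.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)   # fixed seed so the run is repeatable
n_households = 50
n_columns = 200          # hypothetical number of variables "dredged"
critical_r = 0.28        # approximate 5% two-sided cut-off for n = 50

# An outcome that is pure noise, unrelated to anything.
outcome = [rng.gauss(0, 1) for _ in range(n_households)]

hits = 0
for _ in range(n_columns):
    noise = [rng.gauss(0, 1) for _ in range(n_households)]
    if abs(correlation(outcome, noise)) > critical_r:
        hits += 1

print(f"{hits} of {n_columns} random columns look 'significantly' correlated")
```

About 5% of the columns (around 10 of 200) would be expected to pass the cut-off even though no real relationship exists. That is the sense in which dredged results are spurious.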
Confirmatory analysis: estimation and hypothesis testing
The problem
A. Household Type
                          Male    Female    Total
Labour
Never hire or exchange    23      13        36
Hire or exchange          10      3         13
Total                     33      16        49
In Table A we can see:
33% of the households are female headed.
30% of male headed households hire labour, but only 19% of female headed households do.
B. Farmers who planted beans in 94a
                    Male    Female    Overall
Amount     Mean     6.5     2.9       5.8
planted    s.d.     9.5     1.3       8.6
           n        24      6         30
In Table B we can see:
The mean amount of beans planted in 94a by farmers who grew beans that season is 5.8 kg.
The amount planted by males was 6.5 kg, but only 2.9 kg by females.
All these results are based on data from a sample of just 50 farmers in the district. How reliable are they? If we had measured a different 50 how similar would the results have been? If we had measured 500, or the whole population, would the conclusions have been much the same?
The results differ from the "true" answer for two reasons:
Non-sampling errors - incorrect responses, mistakes in coding and data entry, poor recall, biased selection of respondents.
Sampling errors - those due to the fact that we have measured only some (a sample) of the population.
The non-sampling errors can not usually be measured, but can be minimized by good survey practice. Sampling errors can be measured, and that is the purpose of much confirmatory statistics.
Estimates, standard errors and confidence intervals.
Proportions
The proportion of female headed households in the population is P. P is unknown. The sample value is p = 0.33 (= 16/49). The uncertainty due to sampling errors in this is measured by the standard error. The standard error is
$$ \mathrm{se}(p) = \sqrt{\frac{p(1-p)}{n}}, $$
where n = sample size.
se(p) is estimated by
$$ \sqrt{\frac{0.33\,(1-0.33)}{49}} = 0.07 $$
This is the standard deviation of possible estimates that could be produced by different simple random samples of the same size.
The standard error is best interpreted via a confidence interval. A 95% confidence interval for p is
p ± 2 x se(p)
= 0.33 ± 2 x 0.07
= (0.19, 0.47)
This is interpreted as "We are 95% confident that the true percentage of female headed households is between 19% and 47%". Hence the uncertainty in results due to sampling error is quantified.
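The same calculation can be written as a short function. This is a sketch in Python using the approximate "estimate plus or minus 2 standard errors" interval described above; the function name is illustrative only.

```python
import math


def proportion_interval(successes, n, multiplier=2.0):
    """Sample proportion, its standard error and an approximate 95% interval."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, (p - multiplier * se, p + multiplier * se)


# 16 female headed households out of 49 (Table A).
p, se, (low, high) = proportion_interval(16, 49)
print(f"p = {p:.2f}, se = {se:.2f}, interval = ({low:.2f}, {high:.2f})")
```

Working from unrounded values the upper limit comes out near 0.46 rather than 0.47. The difference is only rounding in the hand calculation and does not change the interpretation.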
Means
The mean amount of beans planted in 94a is 5.8 kg. The standard error of this is
$$ \mathrm{se}(\text{mean}) = \sqrt{\frac{s^2}{n}}, $$
where s^2 is the variance in amount of beans and n the sample size.
$$ \mathrm{se}(\text{mean}) = \sqrt{\frac{8.6^2}{30}} = 1.6 $$
The 95% confidence interval is
mean ± 2 x se(mean)
= 5.8 ± 2 x 1.6
= (2.6, 9.0)
The mean amount of beans planted is between 2.6 and 9.0 kg.
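A sketch of the same calculation in Python, starting from the summary statistics in Table B (mean 5.8, s.d. 8.6, n = 30). The function name is an illustrative choice.

```python
import math


def mean_interval(mean, sd, n, multiplier=2.0):
    """Standard error of a mean and an approximate 95% interval."""
    se = sd / math.sqrt(n)   # the same as sqrt(s^2 / n)
    return se, (mean - multiplier * se, mean + multiplier * se)


se, (low, high) = mean_interval(mean=5.8, sd=8.6, n=30)
print(f"se = {se:.2f}, interval = ({low:.1f}, {high:.1f})")
```

The limits agree with the hand calculation to within rounding. Note how wide the interval is compared with the mean itself, a consequence of the large standard deviation and the small sample.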
Differences
If interested in differences between subgroups we can similarly estimate the difference and find a standard error of the estimate.
Difference in mean amount of beans planted by males and females
= 6.5 - 2.9
= 3.6 kg.
$$ \mathrm{se}(\text{difference}) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{9.5^2}{24} + \frac{1.3^2}{6}} = 2.0 $$
95% confidence interval for the difference is
3.6 ± 2 x 2.0
= (-0.4, 7.6)
The mean difference between amounts planted by males and females could be anything between -0.4 kg and 7.6 kg.
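The calculation for a difference follows the same pattern. A minimal Python sketch using the group summaries from Table B (the function name is illustrative only):

```python
import math


def difference_interval(mean1, sd1, n1, mean2, sd2, n2, multiplier=2.0):
    """Difference of two means, its standard error and an approximate 95% interval."""
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return diff, se, (diff - multiplier * se, diff + multiplier * se)


# Males: mean 6.5, s.d. 9.5, n = 24. Females: mean 2.9, s.d. 1.3, n = 6.
diff, se, (low, high) = difference_interval(6.5, 9.5, 24, 2.9, 1.3, 6)
print(f"difference = {diff:.1f} kg, se = {se:.1f}, interval = ({low:.1f}, {high:.1f})")
```

Because the interval includes zero, the data are consistent with no difference between male and female headed households, but also with a difference of several kilograms.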
Hypothesis tests: The logic
The logic of all the tests commonly used depends on the fact that random samples from a population behave in a predictable way. The mean amount of beans planted by female headed households, 2.9 kg, is not the actual mean of all female headed households in the districts where the study took place. If a different sample had been randomly selected the mean would have been different. The question is "How different?". If all households are very similar (low variation between households) then it really does not matter which sample is selected. On the other hand, high variation in the population will lead to very different sample means, and hence less certainty in the results obtained. The mathematics of statistics allows quantification of these ideas, and hence answers to the question of how certain we are of the results.
The logic of the hypothesis tests is as follows:
1. Assume some fact is true - the null hypothesis (e.g. There is no difference in mean amount of beans planted by male and female headed households).
2. Deduce how the sample would behave if (1) is true (e.g. How big could the sample differences between male and female headed households be?)
3. Compare the actual sample with the predictions in (2).
4. If (2) and (3) do not agree then (1) must be untrue - the null hypothesis is rejected. If (2) and (3) do agree then there is no reason, in this data, not to believe (1).
The level of agreement is measured by the "significance level", explained in the examples below.
Examples of calculations
Chi-squared test for no association in a 2 x 2 table.
Taking Table A as an example, we want to test whether the proportion of households hiring labour is the same in male and female headed households. The steps are:
1. Formulate the null hypothesis: the proportion is equal for both male and female households.
2. If (1) is true, then the proportion never hiring is estimated by 36/49 (and the proportion hiring by 13/49). Hence we would expect numbers in each category to be:
              Male                   Female
Never hire    33 x 36/49 = 24.2      16 x 36/49 = 11.8
Hire          33 x 13/49 = 8.8       16 x 13/49 = 4.2
3. The difference between observed and expected frequencies is summarised as
$$ X^2 = \frac{(24.2-23)^2}{24.2} + \frac{(11.8-13)^2}{11.8} + \frac{(8.8-10)^2}{8.8} + \frac{(4.2-3)^2}{4.2} = 0.74 $$
4. If (1) is valid then the value of $X^2$ should be an observation from a $\chi^2_1$ distribution. Comparison with tables shows that 0.74 is not an extreme observation. A number at least as big as this would occur 39% of the time. The significance level is p = 0.39. Hence there is no strong reason not to believe the null hypothesis.
t-test to compare two means
In example B the steps needed are:
1. Formulate the null hypothesis: the difference in mean amount of beans planted for male and female households is zero.
2, 3. If (1) is true, then the difference in means of 3.6 kg, scaled by its standard error (= 2.0),
$$ t = \frac{3.6}{2.0} = 1.8, $$
is an observation from a $t_{28}$ distribution.
4. Comparison with tables shows that 1.8 is not an extreme observation. A difference as big as this would occur 8% of the time if (1) is true. The significance level is p = 0.08. Hence there is not much reason not to believe the null hypothesis.
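Both calculations can be reproduced without statistical tables. The Python sketch below uses only the standard library. The chi-squared p-value uses the fact that, with 1 degree of freedom, the upper tail probability is erfc(sqrt(x/2)). The t-test p-value is found by numerically integrating the t density with Simpson's rule, which is one convenient choice made here rather than the method a statistics package would use.

```python
import math

# --- Chi-squared test for Table A ---
# Rows: never hire, hire. Columns: male headed, female headed.
observed = [[23, 13], [10, 3]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected counts if hiring is unrelated to household type.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
# For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
chi_p = math.erfc(math.sqrt(chi_squared / 2))


# --- t-test for Table B ---
def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus twice the area between 0 and |t| (Simpson's rule)."""
    upper = abs(t)
    h = upper / steps
    area = t_density(0, df) + t_density(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_density(i * h, df)
    area *= h / 3
    return 1 - 2 * area


t_stat = 3.6 / 2.0            # difference divided by its standard error
t_p = t_two_sided_p(t_stat, df=24 + 6 - 2)

print(f"chi-squared = {chi_squared:.2f}, p = {chi_p:.2f}")   # guide: 0.74, p = 0.39
print(f"t = {t_stat:.1f}, p = {t_p:.2f}")                    # guide: 1.8, p = 0.08
```

Statistics packages often apply a continuity correction to 2 x 2 tables by default, which would give a smaller chi-squared value than the uncorrected 0.74 calculated here.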
Limitations
Assumptions.
The calculations in both the chi-squared test and the t-test above are based on a series of assumptions. The key ones are:
Independence. In both examples A and B we assume observations are independent. Lack of independence is caused by:
(i) non-simple random samples. In this case we have used a stratified sample.
(ii) interference between observations. This would be the case if several individuals within the same household responded, or if data were collected at a group meeting.
Lack of bias due to non-response, interviewer effects, attempts to "please" the researcher etc.
Equality of variance and normal distribution (t-test). These assumptions can be checked. In example B the data is clearly not normally distributed.
Limits to interpretation.
(1) If the result is "significant" we can reject the null hypothesis, and conclude that there is a real difference in the population. If the result is "not significant" we have not proved there is no difference. It is never possible to prove the null hypothesis is true (it almost never will be!). All we can say is this study has not produced evidence to make us disbelieve the null hypothesis.
(2) At what level of significance should the null hypothesis be rejected? 5% is commonly used but there is absolutely no reason why it should be treated as a rigid cut off. 6% and 4% significance levels are, for all real purposes, equivalent.
(3) Whether the null hypothesis is rejected depends as much on the sample size and precision of the study, as on the "truth" of the null hypothesis. A small, imprecise survey will not detect a difference that could be picked up by a larger study. Maybe we just did not collect enough data!
(4) The whole logic of significance testing and the p-value rests on what would happen in repeated surveys of the same design, using new randomisations. Is this sensible, when we know the survey would not and can not ever be repeated?
(5) In most analysis exercises, differences which "look interesting" at the exploratory stage are investigated further in the confirmatory analysis. If the tests to perform have been selected because differences look large, all significance levels are invalid.
(6) If a large number of tests are performed, as is often the case in analysis of a study with many variables, then we would expect 5% of the tests to give "significant" results at the p = 0.05 level even if all null hypotheses were true. Hence it can be difficult to interpret the results of multiple tests.
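The size of the multiple testing problem in point (6) can be put into numbers. The Python sketch below assumes, for simplicity, that the tests are independent of each other, which will rarely be exactly true for variables from the same survey. The function name is illustrative only.

```python
def false_positive_summary(n_tests, alpha=0.05):
    """Expected number of false 'significant' results, and the chance of at
    least one, when every null hypothesis is true and tests are independent."""
    expected = alpha * n_tests
    p_at_least_one = 1 - (1 - alpha) ** n_tests
    return expected, p_at_least_one


for n_tests in (1, 10, 20, 50):
    expected, p_any = false_positive_summary(n_tests)
    print(f"{n_tests:3d} tests: expect {expected:.1f} false positives, "
          f"P(at least one) = {p_any:.2f}")
```

With 20 tests the chance of at least one spuriously "significant" result is already about 64%, so an isolated small p-value among many tests is weak evidence on its own.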
What should you do
(1) Treat the significance level p as an indication of "strength of evidence" against the null hypothesis, not as a Yes/No decision maker.
(2) Concentrate on estimating the size of differences, rather than just testing whether they exist. Confidence intervals for differences will be much more useful than hypothesis tests.
At the end of every significance test apply the SO WHAT? test. Ask yourself "So what?". Has the significance test really improved your understanding of the situation and helped you take a rational decision for future action? If not, forget it, and get on with something more useful.
Confirmatory Analysis - Regression
Starting Regression
- Beware! Even "simple" regression is not simple!
- Start by considering types of relationship that might exist. The most useful regression analysis will be one that starts from understanding of the theory behind the process being studied.
The example used here is rather artificial. It examines the proposition that the amount of beans harvested in 94a depends only on land area.
- Plot the data to see if there is any evidence of the relationship.
[Figure: scatter diagram of HVTOT94A (total beans harvested 94a) against LANDAREA.]
Fitting the regression line
- Software is widely available to do this
- Understand the output!
* * * * M U L T I P L E   R E G R E S S I O N * * * *
Listwise Deletion of Missing Data
Equation Number 1 Dependent Variable.. HVTOT94A total beans harvested 94a
Block Number 1. Method: Enter LANDAREA
Variable(s) Entered on Step Number
1.. LANDAREA
Multiple R .54425
R Square .29621
Adjusted R Square .28057
Standard Error 29.01659
Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     1   15946.10384      15946.10384
Residual      45   37888.31105      841.96247
F = 18.93921 Signif F = .0001
------------------ Variables in the Equation ------------------
Variable      B           SE B       Beta      T       Sig T
LANDAREA      8.200238    1.884280   .544249   4.352   .0001
(Constant)    -2.863844   6.051297             -.473   .6383
End Block Number 1 All requested variables entered.
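One way to understand the output is to see how its numbers are related to each other. The Python sketch below recomputes several entries from others in the listing. It uses only values printed above; the variable names are illustrative.

```python
import math

# Values copied from the regression output.
ss_regression = 15946.10384
ss_residual = 37888.31105
df_regression = 1
df_residual = 45
slope, se_slope = 8.200238, 1.884280

n = df_regression + df_residual + 1   # number of cases used: 47

# R Square: share of the total variation accounted for by LANDAREA.
r_squared = ss_regression / (ss_regression + ss_residual)

# Adjusted R Square allows for the number of fitted parameters.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual

# Residual mean square, and the "Standard Error" line (its square root).
ms_residual = ss_residual / df_residual
residual_se = math.sqrt(ms_residual)

# F is the ratio of the two mean squares; T is the slope over its standard error.
f_ratio = (ss_regression / df_regression) / ms_residual
t_slope = slope / se_slope

# Approximate 95% interval for the slope, as for any other estimate.
slope_interval = (slope - 2 * se_slope, slope + 2 * se_slope)

print(f"R Square = {r_squared:.5f}, adjusted = {adjusted_r_squared:.5f}")
print(f"Standard Error = {residual_se:.3f}, F = {f_ratio:.3f}, T = {t_slope:.3f}")
print(f"slope interval = ({slope_interval[0]:.1f}, {slope_interval[1]:.1f})")
```

The recomputed values match the listing, which shows that much of the output is the same information in different forms. The interval for the slope (roughly 4 to 12 units of harvest per unit of land area) says more than "Sig T = .0001": the relationship is clearly not zero, but its size is only loosely determined, and R Square shows that land area accounts for less than a third of the variation in harvest.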
Check the fit
- Look for any unusual points or outliers. They could represent mistakes or cases that require special treatment. They certainly require explanation.
- Look for influential points, which largely determine results. They are not a bad thing, but you must be aware if your conclusions depend critically on one or two observations.
- Look at the residuals to determine:
1. Whether they satisfy the main assumptions that validate the analysis (constant variance, independence, roughly normally distributed)
2. Whether they show patterns according to the value of other variables, indicating that those other variables should be allowed for in the analysis.
Interpretation
"Significance" does not tell you whether the fitted model is logically sound or if it fits the data well.
"Significance" does not tell you whether the model is useful in explaining or describing a relationship, or if the relationship has much predictive power.
A regression model derived from survey data can not tell you what would happen when an "x-variable" is changed. For example we can not use it to predict the bean harvest of a farmer whose land holding changes.
Existence of a regression relationship between two variables does not mean there is a causal relationship.
Regression relationships become useful when similar relationships are found in a number of different conditions. Look for "significant sameness" between regions, crops, farm types, etc.
Adding more variables - Multiple regression
Multiple regression is a powerful tool for understanding the relationship of one variable to several others. BUT.....
All the limitations to interpretation above apply, and are compounded by the existence of several "x-variables".
It is hard to draw graphs that show the relationships and the way data depart from them, so the analyst must rely more on numerical indicators of lack of fit, outliers, and influential points. Multiple regression analysis will not be successful if these are not understood.
"Stepwise" and similar variable selection techniques, so loved by social scientists, have little theoretical basis and can produce answers which are very poor. Regression modeling will be most successful if understanding of the underlying processes is used to choose possible models, rather than relying on computer algorithms.
The sample size required for multiple regression analysis depends on the "configuration" of the data (in particular the range of the x-variables and correlations among them). The required sample size quickly becomes large as the number of x-variables increases. If regression analysis is part of the principal objectives of the survey, it might be possible to select the sample in a way that makes the analysis more efficient.
[Figure: raw residuals plotted against HHTYPE2 (simplified household type, codes 1 and 2).]
Interpretation
Interpret results. This does not mean "understand which effects are significant" but "understand and communicate what you now know about the problem". You should be able to:
Meet the objectives of the study.
Clearly state what is the substantive new knowledge which has been generated.
Show how this new information and understanding builds on what was there before. Does it:
o add more examples of something previously known?
o mean that general rules or principles can be stated with more confidence?
o allow predictions to be made for new and important situations?
o mean that current understanding or theory has to be substantially modified?
Use the quantitative information you have generated to make quantitative predictions about the larger picture.
The ultimate goal of the research is a development objective. Explain how your results help you towards that objective, and what the next steps will be.
Your survey and its analysis cost thousands of dollars. Explain why this was a good investment.
Answer the "So what?" question. What can we now do which we could not do before you did your survey?
References
Coe R (2002) Steps in Survey Analysis. Nairobi: ICRAF. 15 pp.
SSC (2001) Approaches to analysis of survey data. Reading: Statistical Services Centre. 28 pp.
