STANDARDS for Educational and Psychological Testing

2. RELIABILITY/PRECISION AND ERRORS OF MEASUREMENT

BACKGROUND

A test, broadly defined, is a set of tasks or stimuli designed to elicit responses that provide a sample of an examinee's behavior or performance in a specified domain. Coupled with the test is a scoring procedure that enables the scorer to evaluate the behavior or work samples and generate a score. In interpreting and using test scores, it is important to have some indication of their reliability.

The term reliability has been used in two ways in the measurement literature. First, the term has been used to refer to the reliability coefficients of classical test theory, defined as the correlation between scores on two equivalent forms of the test, presuming that taking one form has no effect on performance on the second form. Second, the term has been used in a more general sense, to refer to the consistency of scores across replications of a testing procedure, regardless of how this consistency is estimated or reported (e.g., in terms of standard errors, reliability coefficients per se, generalizability coefficients, error/tolerance ratios, item response theory (IRT) information functions, or various indices of classification consistency). To maintain a link to the traditional notions of reliability while avoiding the ambiguity inherent in using a single, familiar term to refer to a wide range of concepts and indices, we use the term reliability/precision to denote the more general notion of consistency of the scores across instances of the testing procedure, and the term reliability coefficient to refer to the reliability coefficients of classical test theory.

The reliability/precision of measurement is always important. However, the need for precision increases as the consequences of decisions and interpretations grow in importance. If a test score leads to a decision that is not easily reversed, such as rejection or admission of a candidate to a professional school, or a score-based clinical judgment (e.g., in a legal context) that a serious cognitive injury was sustained, a higher degree of reliability/precision is warranted. If a decision can and will be corroborated by information from other sources or if an erroneous initial decision can be easily corrected, scores with more modest reliability/precision may suffice.

Interpretations of test scores generally depend on assumptions that individuals and groups exhibit some degree of consistency in their scores across independent administrations of the testing procedure. However, different samples of performance from the same person are rarely identical. An individual's performances, products, and responses to sets of tasks or test questions vary in quality or character from one sample of tasks to another, and from one occasion to another, even under strictly controlled conditions. Different raters may award different scores to a specific performance. All of these sources of variation are reflected in the examinees' scores, which will vary across instances of a measurement procedure.

The reliability/precision of the scores depends on how much the scores vary across replications of the testing procedure, and analyses of reliability/precision depend on the kinds of variability allowed in the testing procedure (e.g., over tasks, contexts, raters) and the proposed interpretation of the test scores. For example, if the interpretation of the scores assumes that the construct being assessed does not vary over occasions, the variability over occasions is a potential source of measurement error. If the test tasks vary over alternate forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the random variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, the variability in scores over qualified raters is a source of error. Variations in a test taker's scores that are not consistent with the definition of the construct being assessed are attributed to errors of measurement.

A very basic way to evaluate the consistency of scores involves an analysis of the variation in each test taker's scores across replications of the testing procedure.
The test is administered and then, after a brief period during which the examinee's standing on the variable being measured would not be expected to change, the test (or a distinct but equivalent form of the test) is administered a second time; it is assumed that the first administration has no influence on the second administration. Given that the attribute being measured is assumed to remain the same for each test taker over the two administrations and that the test administrations are independent of each other, more variation across the two administrations indicates more error in the test scores and therefore lower reliability/precision.

The impact of such measurement errors can be summarized in a number of ways, but typically, in educational and psychological measurement, it is conceptualized in terms of the standard deviation in the scores for a person over replications of the testing procedure. In most testing contexts, it is not possible to replicate the testing procedure repeatedly, and therefore it is not possible to estimate the standard error for each person's score via repeated measurement. Instead, using model-based assumptions, the average error of measurement is estimated over some population, and this average is referred to as the standard error of measurement (SEM). The SEM is an indicator of a lack of consistency in the scores generated by the testing procedure for some population. A relatively large SEM indicates relatively low reliability/precision. The conditional standard error of measurement for a score level is the standard error of measurement at that score level.

To say that a score includes error implies that there is a hypothetical error-free value that characterizes the variable being assessed. In classical test theory this error-free value is referred to as the person's true score for the test procedure. It is conceptualized as the hypothetical average score over an infinite set of replications of the testing procedure.
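The two-administration design and the SEM described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a procedure given in the Standards: the function names, the normal distributions, and the chosen standard deviations are all hypothetical. It relies on the usual classical test theory relation SEM = SD × sqrt(1 − reliability), which the text above does not state explicitly, and on the assumption that errors on the two administrations are independent with equal variance.

```python
import math
import random
import statistics


def simulate_two_administrations(n_people=5000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate observed scores on two independent replications.

    Each person has a fixed true score; each administration adds an
    independent random error. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]
    first = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    second = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return first, second


def pearson_r(x, y):
    """Pearson correlation, used here as the reliability coefficient."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sem_from_reliability(observed_sd, reliability):
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    return observed_sd * math.sqrt(1.0 - reliability)


def sem_from_differences(first, second):
    """Alternative estimate from score differences between replications.

    With independent, equal-variance errors, Var(first - second) is twice
    the error variance, so SEM = SD(differences) / sqrt(2).
    """
    diffs = [a - b for a, b in zip(first, second)]
    return statistics.stdev(diffs) / math.sqrt(2.0)


if __name__ == "__main__":
    first, second = simulate_two_administrations()
    r = pearson_r(first, second)
    print(f"reliability coefficient: {r:.3f}")
    print(f"SEM from reliability:    {sem_from_reliability(statistics.stdev(first), r):.2f}")
    print(f"SEM from differences:    {sem_from_differences(first, second):.2f}")
```

With the illustrative values used here (true-score SD of 10, error SD of 5), the expected reliability coefficient is 100 / (100 + 25) = 0.8, and both SEM estimates should land close to the simulated error SD of 5. Raising `error_sd` lowers the correlation and raises the SEM, which mirrors the statement that a relatively large SEM indicates relatively low reliability/precision.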
In statistical terms, a person's true score is an unknown parameter, or constant, and the observed score for the person is a random variable that fluctuates around the true score for the person.

Generalizability theory provides a different framework for estimating reliability/precision. While classical test theory assumes a single distribution for the errors in a test taker's scores, generalizability theory seeks to evaluate the contributions of different sources of error (e.g., items, occasions, raters) to the overall error. The universe score for a person is defined as the expected value over a universe of all possible replications of the testing procedure for the test taker. The universe score of generalizability theory plays a role that is similar to the role of true scores in classical test theory.

Item response theory (IRT) addresses the basic issue of reliability/precision using information functions, which indicate the precision with which observed task/item performances can be used to estimate the value of a latent trait for each test taker. Using IRT, indices analogous to traditional reliability coefficients can be estimated from the item information functions and distributions of the latent trait in some population.

In practice, the reliability/precision of the scores is typically evaluated in terms of various coefficients, including reliability coefficients, generalizability coefficients, and IRT information functions, depending on the focus of the analysis and the measurement model being used. The coefficients tend to have high values when the variability associated with the error is small compared with the observed variation in the scores (or score differences) to be estimated.

Implications for Validity

Although reliability/precision is discussed here as an independent characteristic of test scores, it should be recognized that the level of reliability/precision of scores has implications for validity. Reliability/precision of data ultimately bears on the generalizability or dependability of the scores and/or the consistency of classifications of individuals derived from the scores. To the extent that scores are not consistent across replications of the testing procedure (i.e., to the extent that they reflect random errors of measurement), their potential for accurate prediction of criteria, for beneficial examinee diagnosis, and for wise decision making is limited.

Specifications for Replications of the Testing Procedure

As indicated earlier, the general notion of reliability/precision is defined in terms of consistency over replications of the testing procedure. Reliability/precision is high if the scores for each person are consistent over replications of the testing procedure and is low if the scores are not consistent over replications. Therefore, in evaluating reliability/precision, it is important to be clear about what constitutes a replication of the testing procedure.

Replications involve independent administrations of the testing procedure, such that the attribute being measured would not be expected to change over the series of administrations. For example, in assessing an attribute that is not expected to change over an extended period of time (e.g., in measuring a trait), scores generated on two successive days (using different test forms if appropriate) would be considered replications. For a state variable (e.g., mood or hunger), where fairly rapid changes are common, scores generated on two successive days would not be considered replications; the scores obtained on each occasion would be interpreted in terms of the value of the state variable on that occasion. For many tests of knowledge or skill, the administration of alternate forms of a test with different samples of items would be considered replications of the test; for survey instruments and some personality measures, it is expected that the same questions will be used every time the test is administered, and any substantial changes in wording would constitute a different test form.

Standardized tests present the same or very similar test materials to all test takers, maintain close adherence to stipulated procedures for test administration, and employ prescribed scoring rules that can be applied with a high degree of consistency. Administering the same questions or commonly scaled questions to all test takers under the same conditions promotes fairness and facilitates comparisons of scores across individuals. Conditions of observation that are fixed or standardized for the testing procedure remain the same across replications. However, some aspects of any standardized testing procedure will be allowed to vary. The time and place of testing, as well as the persons administering the test, are generally allowed to vary to some extent. The particular tasks included in the test may be allowed to vary (as samples from a common content domain), and the persons who score the results can vary over some set of qualified scorers.

Alternate forms (or parallel forms) of a standardized test are designed to have the same general distribution of content and item formats (as described, for example, in detailed test specifications), the same administrative procedures, and at least approximately the same score means and standard deviations in some specified population or populations. Alternate forms of a test are considered interchangeable, in the sense that they are built to the same specifications, and are interpreted as measures of the same construct.

In classical test theory, strictly parallel tests are assumed to measure the same construct and
