Journal of Geochemical Exploration, 3(1974)129--149
Elsevier Scientific Publishing Company, Amsterdam -- Printed in The Netherlands

SELECTION OF THRESHOLD VALUES IN GEOCHEMICAL DATA USING PROBABILITY GRAPHS

A.J. SINCLAIR

Department of Geological Sciences, University of British Columbia, Vancouver, B.C. (Canada)

(Accepted for publication November 7, 1973)

ABSTRACT

Sinclair, A.J., 1974. Selection of threshold values in geochemical data using probability graphs. J. Geochem. Explor., 3: 129--149.

A method of choosing threshold values between anomalous and background geochemical data, based on partitioning a cumulative probability plot of the data, is described. The procedure is somewhat arbitrary but provides a fundamental grouping of data values. Several practical examples of real data sets that range in complexity from a single population to four populations are discussed in detail to illustrate the procedure.

The method is not restricted to the choice of thresholds between anomalous and background populations but is much more general in nature. It can be applied to any polymodal distribution containing adequate values and populations with appropriate density distribution. As a rule such distributions for geochemical data closely approach a lognormal model. Two examples of the more general application of the method are described.

INTRODUCTION

Tennant and White (1959) were among the first to recognize the usefulness of probability graph paper for concise visual representation of geochemical data. Since the appearance of their publication probability paper has been used somewhat spasmodically, but with increasing regularity for graphical representation and analysis of many types of geochemical data. In particular, Williams (1967) and Lepeltier (1969) have emphasized the ease with which such plots can be used for rapid, graphical analysis of large quantities of data.
Bolviken (1971) states that probability graphs are now used routinely by the Norwegian Geological Survey as an aid in interpreting geochemical analytical results. Woodsworth (1972) makes extensive use of probability plots as the basis for a thorough statistical analysis of about 2000 reconnaissance stream sediment analyses from an exploration program in central British Columbia. Numerous other examples could be cited. None of these papers, however, treats in detail the problem of useful and efficient selection of threshold values.

Threshold is a term used throughout the mineral exploration industry to signify a specific value that effectively separates high and low data values of fundamentally different character that reflect different causes. Commonly, the term is applied to a value that distinguishes an upper or anomalous data set from a lower or background set. For many types of data, particularly those of a geochemical nature, anomalous values are related to mineralized rock. Consequently, the choice of a threshold value has considerable importance in directing exploration to specific anomalous sample sites where the chances of discovery of an economic mineral deposit are greatly enhanced.

Thresholds in geochemical data are chosen in a variety of ways. A method recommended in several publications involves the estimation of the mean and standard deviation of a data set with an arbitrary choice of a threshold at a value corresponding to the mean plus two standard deviations (see Hawkes and Webb, 1962; Lepeltier, 1969). In some cases this procedure might be adequate but it ignores the fact that no a priori reason exists for exactly the upper 2½% of every data set being anomalous.
Furthermore, the method does not take into account adequately the fact that anomalous and background populations have fairly extensive ranges of overlap in some cases, and as they are two populations the mean and standard deviation derived from the whole data set really have no statistical validity and are just numbers. These failings are recognized by many field practitioners who rely on subjective visual examination of histograms of data sets to choose threshold values.

A third approach is to define thresholds at points of maximum curvature in cumulative probability plots (e.g. Woodsworth, 1972). The procedure entails approximating segments of a probability curve by straight lines and picking threshold values at ordinate levels that correspond to intersections of these "linear" segments. At best, this method is approximate; at worst it can result in a high proportion of anomalous values going unrecognized.

Obviously, a procedure is desirable for choosing threshold values that maximizes the likelihood of recognition of anomalous values and minimizes the number of background values included with anomalous data. Cumulative probability plots provide an effective graphical means of meeting these ends.

PROBABILITY PAPER

Arithmetic probability paper is a special kind of commercially available graph paper generally designed with an arithmetic ordinate scale and an unusual abscissa scale of probability (or cumulative frequency percent) arranged such that a normal (gaussian) cumulative distribution plots as a straight line. Lognormal probability paper differs only in that the ordinate scale is logarithmic. Arithmetic values of a single lognormal distribution, grouped in exactly the same manner as required for the construction of a cumulative histogram, plot as a straight line on log probability paper.
A bimodal distribution consisting of two lognormal populations plots as a curve. Examples of a single lognormal distribution and bimodal lognormal distributions are shown in Fig.1. In these examples, and throughout the remainder of this paper, values are cumulated for plotting by starting at the upper or high value end (cf. Lepeltier, 1969). The probability scale is taken as the abscissa because most commercially available probability paper in North America is arranged in this manner.

Fig.1. Examples of unimodal and bimodal real distributions plotted on logarithmic probability paper. (Abscissa: probability, cum. %.)

There are numerous advantages to probability plots that are worth noting here:
(1) The form of density distribution of a data set can be examined.
(2) Parameters of normal and lognormal populations can be estimated rapidly and with adequate accuracy for most sets of geochemical data.
(3) Several data sets can be represented on a single graph with much greater clarity than multiple histograms.
(4) Plots of several data sets can be compared visually for rapid recognition of similarities or differences.
Additional advantages resulting from the ability to partition polymodal distributions into their individual populations will become apparent in examples presented later.

Of course, there are limitations to these plots as well, that must be recognized:
(1) data might not have normal or lognormal distributions;
(2) construction of a probability graph normally requires a minimum of about 100 values, although techniques are available for dealing with fewer data (see Koch and Link, 1970);
(3) scatter of data on a probability plot can be too great to permit a confident analysis of the data.
Despite these limitations a high proportion of geochemical data sets can be analysed usefully and confidently on probability graph paper.
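The plotting convention described in this section (cumulating from the high-value end and placing each cumulative percentage on a probability scale) can be sketched numerically. The following is a minimal illustration, not the author's procedure: the function name, the class limits and the sample values are hypothetical, and the probability scale is emulated with the inverse of the standard normal distribution. For log probability paper the ordinate would be the logarithm of each class limit.

```python
from statistics import NormalDist


def probability_plot_points(values, class_limits):
    """Cumulate from the high-value end and return (limit, cum_pct, probit) tuples."""
    n = len(values)
    points = []
    for limit in sorted(class_limits, reverse=True):
        # Percentage of all values at or above this class limit.
        cum_pct = 100.0 * sum(v >= limit for v in values) / n
        # 0% and 100% have no finite position on a probability scale.
        if 0.0 < cum_pct < 100.0:
            probit = NormalDist().inv_cdf(cum_pct / 100.0)
            points.append((limit, cum_pct, probit))
    return points


# Made-up values for illustration only (not data from the paper).
sample = [12, 15, 18, 22, 25, 30, 34, 41, 55, 80]
points = probability_plot_points(sample, class_limits=[20, 40, 60])
for limit, cum_pct, probit in points:
    print(limit, cum_pct, round(probit, 2))
```

A single normal (or, with a logarithmic ordinate, lognormal) population would give points that fall close to a straight line when the ordinate is plotted against the probit value.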

PARTITIONING OF POLYMODAL DISTRIBUTIONS

Partitioning refers to methods used to extract individual populations from a polymodal distribution consisting of a combination of two or more populations. The methods are not well described in the literature but are referred to, or implied by various writers (e.g., Harding, 1949; Bolviken, 1971). Cassie (1954) and Williams (1967) describe partitioning procedures briefly but their publications are not widely available.

Consider the case of a bimodal distribution: providing that populations in the data set have normal or lognormal density distributions and are plotted on appropriate probability paper, an estimate of their proportions is given by an inflection point or change in direction of curvature on the probability curve (Harding, 1949). For example, in Fig.2, an inflection point at the 20 cumulative percentile, indicated by an arrow, shows the presence of 20% of a higher population A, and 80% of a lower population B. The form of the curve is characteristic of two overlapping populations, a relatively gently sloping central segment indicating considerable overlap of the two.

Fig.2. Two idealized hypothetical populations A and B are combined in the proportions A/B = 20/80 to produce the intermediate curved distribution drawn through calculated points shown as solid dots. An inflection point is shown by the arrowhead. Arbitrary thresholds at the 1% level of B population and the 99% level of A population correspond to 78 and 44 ppm, respectively.

The uppermost plotted point on the curve at the 180-ppm ordinate level represents 1% of the total data.
However, it also represents (1/20 × 100) = 5 cumulative percent of population A because at this extremity of the data set there is no effective contribution from population B. Consequently, a point on A population is defined at 5 cumulative percent on the 180-ppm level. In the same manner, the point plotted on the curve at the 150 ordinate level represents (2.6/20 × 100) = 13 cumulative percent of population A and a second point on population A is obtained. This procedure is repeated until sufficient points are obtained to define population A by a straight line or until the replotted points begin to depart from a linear pattern indicating that population B is present in significant amounts. When sufficient points are obtained, a line is drawn through them as an estimate of population A. Population B can be obtained in precisely the same way, providing the probability scale is read as complementary values, e.g., 90 cumulative percent is read as (100 - 90) = 10 cumulative percent. Calculated points for both A and B populations are shown as open circles in Fig.2.

Validity of the two-population model can be checked by combining them in the proportions 20% A and 80% B at various ordinate levels. In this hypothetical example, check points are not shown because it has been constructed ideally. Throughout the remainder of the paper, however, check calculations are indicated by open triangles. The checking procedure involves the calculation of ideal combinations of the partitioned populations at various ordinate levels using the relationship

    P_M = f_A P_A + f_B P_B

where P_M, the probability of the "mixture", is to be calculated (see Bolviken, 1971); P_A and P_B are cumulative probabilities of populations A and B read from the graph at a specified ordinate level; f_A is the proportion of population A, and f_B = 1 - f_A is the proportion of population B.
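The rescaling of plotted points and the mixture check described above can be written out as a short sketch. The function names are hypothetical, and the treatment of the lower population is one reading of the instruction to read the probability scale "as complementary values"; it is offered as an assumption rather than as the author's exact bookkeeping.

```python
def upper_population_percent(cum_pct_total, fraction_upper):
    """Cumulative % of the total data -> cumulative % of the upper population alone.

    Valid only near the upper extremity, where the lower population
    contributes effectively nothing.
    """
    return cum_pct_total / fraction_upper


def lower_population_percent(cum_pct_total, fraction_lower):
    """Same idea for the lower population, using complementary percentages.

    Assumed reading: the complement (100 - cum_pct_total) is rescaled by the
    lower population's proportion, then converted back to a percentage
    cumulated from the high end.
    """
    from_bottom = (100.0 - cum_pct_total) / fraction_lower
    return 100.0 - from_bottom


def mixture_percent(p_a, p_b, f_a):
    """Check value P_M = f_A * P_A + f_B * P_B, with f_B = 1 - f_A."""
    return f_a * p_a + (1.0 - f_a) * p_b


# Worked values from the text: A makes up 20% of the data.
print(upper_population_percent(1.0, 0.20))   # point at the 180-ppm level
print(upper_population_percent(2.6, 0.20))   # point at the 150 ordinate level
# Recombining: 5% of A and effectively 0% of B returns 1% of the total data.
print(mixture_percent(5.0, 0.0, 0.20))
```

Because the rescaling and the mixture relationship are inverses of one another at the extremities, the check is informative mainly in the central range where both populations contribute.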
In practice, several trials might be necessary to obtain a good fit of the ideal mixture with the real data because of the difficulty in defining the inflection point accurately.

In most cases, the partitioning procedure is as straightforward as outlined. In other cases, a slight modification is necessary when dealing with real data as will become apparent in some of the examples that follow. Partitioning of polymodal curves containing three or more populations is somewhat more complex but is done in an analogous way, proceeding in stages. Generally, partitioning begins with the populations represented by the extremities of the probability curve, followed by partitioning of more centrally located populations.

Note that in this idealized example, parameters of the individual partitioned populations can be estimated. The geometric mean of each can be read at the 50 percentile and the range including 68% of the values can be determined at the 84 and 16 cumulative percentiles. This range encompassing 2 standard deviations is asymmetric about the geometric mean. The method of representation adopted here is to quote the geometric mean, followed in brackets by the range that includes 68% of the values. These parameters for the partitioned populations A and B are 100 (144, 71) and 42 (55, 33), respectively. Estimates of the arithmetic mean and variance can be obtained from this information as described by Krumbein and Graybill (1965), but normally are not required.

CHOICE OF THRESHOLDS

The hypothetical example in Fig.2 illustrates a common general situation of high and low populations with an effective range of overlap. If no significant overlap of values existed, the central moderately steep segment of the curve would be nearly vertical and a single threshold could be chosen rapidly at its mid-point. In the general case, however, choice of thresholds is more complex.
Consider 2 thresholds chosen arbitrarily at the 99 and 1 cumulative percentiles of the partitioned populations A and B, respectively, of Fig.2 (recall that A and B are present in the ratio A/B = 20/80). These percentiles divide the data into 3 groups at the 44- and 78-ppm ordinate levels. 16% of the total data is above the upper threshold of 78 ppm. In a hypothetical sample of 100 values, this upper group would consist approximately of 15 values from A population and 1 value from B population. The lower group below 44 ppm contains 46% of the total data. It consists of 1% of population A (at most, 1 value in this case) and 57% of population B (about 46 values). The intermediate group between the two thresholds contains about 38% of the total data consisting of 42% of the B population and 33% of the A population. In our hypothetical sample this corresponds to about 6 or 7 A values and 33 or 34 B values (Table I).

TABLE I

             Total data      A population     B population
             %      No.*     %       No.*     %      No.*
Group I      16     16       76      15.2     1      0.8
Group II     38     38       23      4.6      42     33.6
Group III    46     46       1       0.2      57     45.6
             100    100      100.0   20.0     100    80.0

*Sample = 100 of which 20 are A and 80 are B population.

The procedure, although arbitrary, has thus divided the data rather effectively into three groups, two of which contain significant proportions of the upper A population and a third that almost exclusively represents the lower B population. Let us assume for the moment that A and B represent anomalous and background populations, respectively. The upper group above the upper threshold can be considered top priority for follow up examination because practically all values are anomalous.
Lower priority can be attached to values in the intermediate group because although it contains virtually all remaining anomalous values, an increased amount of exploration manpower per anomalous sample is required to check them and sort them out from background values in the same range.

There is nothing sacrosanct about the percentiles used to define thresholds. In this case, values were chosen that corresponded with 99 and 1 cumulative percentiles of the A and B populations, respectively. Thresholds could equally well have been defined by the 98 and 2 cumulative percentiles of the appropriate partitioned populations. Whatever choice is made, it is possible to determine estimates of the proportions of each population occurring in the groups thus delimited. In the writer's experience, the two sets of figures mentioned above have proved most useful but different values could be chosen depending on the nature of the data and the required probability that all anomalous values be retained in the upper two groups.

Note that in this hypothetical but typical case the choice of a threshold at the mean plus two standard deviations would have placed most of the anomalous values with background. The same effect would be obtained with a common variation of this procedure, the assumption that the upper 2½% of values are anomalous. Were the probability curve approximated by three linear segments, their intersections would have provided thresholds at approximately 103 and 55 ppm. The common procedure of adopting the upper value as threshold would result in rejection of more than 50% of the anomalous values. Even the choice of the lower value would result in rejection of about 5% of anomalous values.
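The percentile thresholds and group proportions discussed in this section can be approximated numerically from the quoted parameters of the hypothetical populations, A = 100 (144, 71) and B = 42 (55, 33), combined as 20/80. The sketch below assumes ideal lognormal populations and takes the logarithmic standard deviation as half the logarithmic range between the 16 and 84 cumulative percentiles; the function names are hypothetical, and graphical readings in the text will differ slightly from computed values.

```python
from math import exp, log
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def log_sd(p16, p84):
    """Natural-log standard deviation from values read at the 16 and 84 cum. percentiles."""
    return log(p16 / p84) / 2.0


def value_at(cum_pct_from_top, gmean, sd):
    """Value exceeded by cum_pct_from_top percent of a lognormal population."""
    z = STD.inv_cdf(1.0 - cum_pct_from_top / 100.0)
    return gmean * exp(z * sd)


def pct_above(value, gmean, sd):
    """Percentage of a lognormal population lying above a given value."""
    return 100.0 * (1.0 - STD.cdf(log(value / gmean) / sd))


f_a, f_b = 0.20, 0.80
sd_a = log_sd(144.0, 71.0)
sd_b = log_sd(55.0, 33.0)

lower = value_at(99.0, 100.0, sd_a)  # 99 cumulative percentile of A
upper = value_at(1.0, 42.0, sd_b)    # 1 cumulative percentile of B

# Share of the total data above the upper and below the lower threshold.
total_above = f_a * pct_above(upper, 100.0, sd_a) + f_b * pct_above(upper, 42.0, sd_b)
total_below = (f_a * (100.0 - pct_above(lower, 100.0, sd_a))
               + f_b * (100.0 - pct_above(lower, 42.0, sd_b)))

print(round(lower, 1), round(upper, 1))
print(round(total_above, 1), round(total_below, 1))
```

Under these assumptions the computed thresholds and group totals fall close to the 44 and 78 ppm levels and the 16% and 46% proportions quoted for Fig.2 and Table I.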

Zn IN SOILS, TCHENTLO LAKE AREA (CENTRAL BRITISH COLUMBIA)

Fig.3 is a probability graph of 173 zinc analyses of B horizon soils taken on a grid pattern in an area of known Mo--Cu mineralization near Tchentlo Lake in central British Columbia. Underlying rock is a texturally and mineralogically uniform, well-jointed diorite. Joints are mineralized, principally with quartz and pyrite, but in some places molybdenite is abundant and small amounts of chalcopyrite occur. A thin layer of overburden covers the area except for sporadic outcrop knolls.

Fig.3. Probability plot of 173 values of Zn in B zone soils, Tchentlo Lake, B.C. Listed parameters of the distribution were obtained from the straight line drawn through original data points (solid dots). 95% confidence limits are shown after Lepeltier (1969). (Parameters listed on the figure: N = 173; b = 87; b + sL = 140; b - sL = 55.)

The probability plot is linear if one neglects slight divergences at the extremities, that commonly result from sampling error. Consequently, an estimate of the distribution can be obtained by a straight line through the plotted points. 95% confidence limits of the population were determined graphically (cf. Lepeltier, 1969). Woodsworth (1972) suggests that a useful procedure for recognizing significant curvature in a probability graph is to assume the presence of a single population and construct its 95% confidence belt. Significant curvature to the plot is assumed at points that plot outside the zone of 95% confidence. None of the plotted points for Tchentlo Lake data lie outside the band defined by the 95% confidence limit, suggesting that only a single population is present.

In this case, the range of values and the form of the probability graph suggest that the data represent a single background population.
A wise procedure, however, is to assume that the few highest values are anomalous until proven otherwise. This is a convenient safety precaution in cases where anomalous values are present in too low proportion to define a second population. To standardize a procedure for dealing with such data, it is convenient to pick an arbitrary threshold at an ordinate level corresponding to the mean plus 2 standard deviations as recommended by Hawkes and Webb (1962). This procedure assumes that approximately the upper 2½% of values are anomalous until shown otherwise, and should be applied only when a single population is indicated from an examination of the probability graph. In this example, the upper 5 zinc values were found to plot on a plan of the grid, sporadically, but away from known mineralized areas.

Cu IN STREAM SEDIMENTS, MT. NANSEN AREA (YUKON TERRITORY)

Copper analyses for 158 stream sediment samples from the Mt. Nansen area, Yukon Territory, are shown as a probability plot in Fig.4 (see Bianconi and Saagar, 1971). A smooth curve through the data points has the form of a bimodal density distribution with an inflection point at the 15 cumulative percentile. The curve was partitioned using the method described previously to obtain populations A and B whose estimated parameters are given in Table II. The partitioning procedure was checked at various Cu ppm levels by combining the two partitioned populations in the proportion of 15% A and 85% B. Check points are shown as open triangles on the Figure and are seen to coincide with the real data curve. In this case, some high values are associated with known Cu--Mo mineralization related to porphyritic intrusions and it seems reasonable to interpret the two populations as anomalous (A) and background (B).

Two arbitrary threshold values can be determined readily from the graph at the 1.0 and 99 cumulative percentiles of the B and A populations, respectively.
These percentiles coincide with 70 and 37 ppm Cu, respectively. Hence, the data are divided into 3 groups, an upper group of predominantly anomalous values, a lower group of predominantly background values, and an intermediate group containing both anomalous and background values.

Fig.4. Bimodal probability plot of 158 Cu's in stream sediments, Mt. Nansen, Yukon. Open circles are partitioning points used to establish populations A and B. Open triangles are check points obtained by combining A and B in the ratio 15/85.

TABLE II
Estimated parameters of partitioned populations, Cu in stream sediments, Mt. Nansen area (Yukon Territory)

Population        Proportion   No. of     Values in ppm Cu
                  (%)          samples    b       b + sL    b - sL
A: anomalous      15           24         101     155       63
B: background     85           134        14.7    28.5      7.4
A + B             100          158

Of the 158 values, about 23 are anomalous, and 135 are background. 80% or about 18 of the anomalous values are above the 70-ppm threshold; and 5 are below it, for all practical purposes, in the intermediate range. Of the 135 background values, 91.5% or 124 values, are below the lower threshold, the remaining 11 background values are above the lower threshold in the intermediate range.

Consequently, anomalous values occur in only two ppm intervals to which priorities can be assigned for follow up exploration. Virtually all values above 70 ppm are anomalous and have top priority. Second priority is assigned to the 16 values in the intermediate range, about 5 of which are anomalous.
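The counts quoted above follow from simple arithmetic on the partitioned proportions. The worked block below only restates figures given in the text (rounded as the text rounds them); the symbols N_A and N_B are introduced here for convenience and do not appear in the original.

```latex
\begin{align*}
N_A &\approx 0.15 \times 158 \approx 23.7 \quad (\text{about } 23), &
N_B &\approx 0.85 \times 158 \approx 134.3 \quad (\text{about } 135),\\
\text{A above 70 ppm} &\approx 0.80 \times 23 \approx 18, &
\text{A in intermediate range} &\approx 23 - 18 = 5,\\
\text{B below 37 ppm} &\approx 0.915 \times 135 \approx 124, &
\text{B in intermediate range} &\approx 135 - 124 = 11,\\
\text{intermediate total} &= 5 + 11 = 16, &
\text{anomalous fraction} &= 5/16 \approx 0.31 .
\end{align*}
```

The last ratio is the basis for treating roughly one in three intermediate-range values as anomalous.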

Theoretically, individual values that lie between the two thresholds cannot be assigned to either A or B populations. Therefore, since only about 1 in 3 is anomalous in this range, about three times as much work is required to check each anomalous sample as is required for values above 70 ppm Cu; hence, the reason for assigning priorities to the two groups. In practice, some of the anomalous values in this central range can be recognized with a fair degree of certainty. For example, a number of them might be expected to occur down stream from top priority anomalous samples. This sort of geographic relationship stands out particularly well if samples are colour-coded as to group, on a plan of the sampled streams. In many cases, virtually all samples in the intermediate range can be identified in this manner with a fair degree of certainty.

A comparable procedure can be used when dealing with soil or whole rock analyses for which two thresholds are determined. Those intermediate range samples that group geographically with known anomalous samples commonly can also be considered anomalous. In this way, follow-up examination of second priority anomalies can be cut to a minimum and in many cases avoided completely.

Ni IN SOILS, HOPE AREA (SOUTHERN BRITISH COLUMBIA)

Fig.5 is a log probability graph of 166 Ni analyses of soils obtained from a grid superimposed on a known Cu--Ni mineralized zone. The mineral showing is associated with ultramafic rocks enclosed in regionally metamorphosed fine-grained clastic sedimentary rocks, near Hope in southern British Columbia.

A smooth curve drawn through the data points has the form of at least three populations based on inflection points at 5.5 and 25 cumulative percentiles. The A and C populations were partitioned using the method described in a previous section.
Population B was then estimated using the relationship:

    P_M = f_A P_A + f_B P_B + f_C P_C

In this equation: f_A = 0.055, f_B = 0.195, f_C = 0.75 and P_M, P_A, P_C can be read from the graph for any ordinate level. Hence, P_B is the only unknown and can be estimated for various ordinate levels, plotted, and an estimate of population B determined by passing a straight line through the calculated points. The three partitioned populations A, B and C were then combined ideally in the proportion 5.5/19.5/75 for a number of ordinate values, to check the partitioning procedure. These check values are shown in Fig.5 as open triangles that almost coincide with the smooth curve through the original data.

Fig.5. Probability plot of 166 Ni's in soils, Hope, B.C., with 2 inflection points (indicated by arrowheads) suggesting it results from the combination of three lognormal populations in the ratio 5.5/19.5/75. A, B and C are the three partitioned populations estimated by lines through the calculated points (open circles). Parameters of each population are listed. Open triangles are check points that agree well with the original data (black dots).

Population A is obviously not well defined as indicated by the scatter of points about its linear estimator. The reason is that only a small proportion of the total data represents population A, thus its estimation by partitioning is based on very few data points (four in this case). Populations B and C appear well defined, principally because their ideal combination in the ratio 19.5/75.0 agrees with the real data curve. Estimated parameters of the three populations are given in Table III.
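The three-population relationship used above for estimating population B can be rearranged to give P_B directly. The sketch below is a minimal illustration with a hypothetical function name; the graph readings in the example are invented for demonstration and are not values from Fig.5.

```python
def middle_population_percent(p_m, p_a, p_c, f_a, f_b, f_c):
    """Solve P_M = f_A*P_A + f_B*P_B + f_C*P_C for P_B (all in cumulative percent)."""
    return (p_m - f_a * p_a - f_c * p_c) / f_b


# Proportions quoted in the text for the Hope Ni data.
f_a, f_b, f_c = 0.055, 0.195, 0.75

# Hypothetical readings at one ordinate level: all of A lies above it,
# 2% of C lies above it, and the data curve reads 16.75 cumulative percent.
p_b = middle_population_percent(p_m=16.75, p_a=100.0, p_c=2.0,
                                f_a=f_a, f_b=f_b, f_c=f_c)
print(round(p_b, 2))

# Recombining the three populations should return the reading on the data curve.
check = f_a * 100.0 + f_b * p_b + f_c * 2.0
print(round(check, 2))
```

Repeating the calculation at several ordinate levels would give the points through which a straight-line estimate of population B is drawn.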
On the basis of the partitioned populations, a single threshold at 780 ppm Ni can be chosen to distinguish effectively between populations A and B. Populations B and C overlap somewhat and two thresholds must be chosen. These thresholds are arbitrarily taken at the 2 cumulative percentile of population C (i.e. 236 ppm Ni) and the 98 cumulative percentile of population B (i.e. 170 ppm Ni).

TABLE III
Estimated parameters of partitioned populations, Ni in soils, Hope area (southern British Columbia)

Population                    Proportion   No. of     Values in ppm Ni
                              (%)          samples    b       b + sL    b - sL
A: anomalous                  5.5          9          1170    1380      980
B: background (ultramafic)    19.5         32         356     515       248
C: background (metaseds)      75           125        52      108       24.5
A + B + C                     100          166

These three threshold values divide the data into 4 groups, 3 of which each consist principally of a single population and a fourth containing values from two populations (Table IV). The thresholds can now be used as contour values on a plan of the grid, or can be used to code data on a plan using colour or symbols, to aid in interpreting the significance of each population. In this case, population A is related to Ni mineralization and is therefore interpreted as an anomalous population. Population B corresponds to areas underlain by ultramafic rocks, and population C occurs in areas underlain by metasedimentary rocks.

TABLE IV
Estimated thresholds, Ni in soils, Hope area (southern British Columbia)

Threshold (ppm Ni)    Principal content of group
                      almost exclusively population A
780
                      almost exclusively population B
236
                      combination of populations B and C
170
                      almost exclusively population C

The choice of thresholds is arbitrary. For example, one could equally well have chosen the two thresholds for the B and C populations at the 1 and 99 cumulative percentile of the C and B populations respectively, or the 2.5 and 97.5 cumulative percentile and so on. A choice should be made with the idea of defining a short range of overlap of the two populations, and, at the same time, producing adjacent ranges that to all intents and purposes contain values of a single population, with negligible or minor amounts of other populations.

Cu IN SOILS, SMITHERS AREA (BRITISH COLUMBIA)

A probability plot of 795 soil copper analyses is shown in Fig.6. The sinuous character of the plot is probably real because of the large number of values in the data set. This type of data is characteristic of the sort obtained from reconnaissance surveys where large quantities of information are obtained in a relatively short time. The area sampled is underlain predominantly by acid to intermediate intrusive bodies that cut a thick monotonous sequence of volcanic rocks.

Inflection points are evident at approximately the 1, 2 and 32 cumulative percentiles indicating the presence of at least four populations. These populations can be estimated by partitioning the curve in stages. In this case, it is most convenient to begin with the population C for which most data points are available. Once C has been defined, population D can be estimated using C and the original data curve. These two populations can be specified reasonably well. The upper two populations A and B can be approximated roughly but cannot be delineated with much precision because of the small percentage of total data that each represents and hence the small number of points available for partitioning.

Fig.6. Probability plot of 795 Cu's in B-horizon soils, Smithers area, B.C. Symbols are as defined for Fig.5.
Crude estimates of populations A and B are shown based on the limited data available.

A number of check points, shown as open triangles on the curve, were calculated for the partitioned populations A, B, C and D, combined in the ratio 1/1/30/68. These points agree almost perfectly with the smooth curve describing the data, suggesting that the partitioning represents a plausible model for the data. Estimated parameters of partitioned populations are listed in Table V.

TABLE V
Estimated parameters of partitioned populations, Cu in soils, Smithers area (central British Columbia)

Population       Proportion   No. of     Values in ppm Cu
                 (%)          samples    b       b + sL    b - sL
A                1            8          135     145       128
B                1            8          100     108       93
C                30           239        42.8    57.2      32.1
D                68           540        14.8    21.8      9.6
A + B + C + D    100          795

Comparison of the data with a geological map of the sampled area suggested that populations C and D represent background Cu in soils over volcanic and plutonic rocks, respectively. By the same means, it was concluded that populations A and B are anomalous populations in areas underlain by volcanic and plutonic rocks, respectively.

In choosing thresholds for distinction between anomalous and background values there is no need to consider either population A or D. The critical part of the graph is the range of overlap of populations B and C. We know that about 2% of the data, or about 16 values, are anomalous. Of these, 11 are above 100 ppm Cu as is 1 value of C population. Hence, one of 12 values above 100 ppm Cu is not anomalous and 100 can be chosen as an arbitrary upper threshold. Virtually all of the anomalous population is above 85 ppm Cu. Thus, the range 85--100 ppm Cu contains the remaining 5 anomalous values. This range also contains about 1.0% of the C background population, about 2 values.
Two thresholds, 85 and 100 ppm Cu, are thus delimited that for all practical purposes define all anomalous values with a minimum of background values represented.

This example illustrates several important points in procedure:
(1) It is wise to carry through with a complete partitioning procedure in examining complex distributions in order to check the realism of the interpretation.
(2) Even when individual populations cannot be defined particularly accurately, thresholds can commonly be determined with adequate accuracy.
(3) Inflection points in a probability curve based on abundant data are probably real and should form a basis for interpretation.
(4) An alternative approach would have been to group the data into two subclasses based on presence of underlying volcanic or plutonic rock. This procedure was not used here only because adequate thresholds could be obtained without spending additional manpower in carrying out a more detailed analysis.
(5) The bottom population, D, is reasonably well known despite the fact that its partitioning was based on only two points.

The foregoing examples show that the major advantage of probability plots is to provide a useful grouping of data. Commonly, this grouping is not simply for the purpose of obtaining thresholds between anomalous and background populations, but more generally to derive thresholds between populations that aid in a general interpretation of the significance of the data.

pH MEASUREMENTS OF STREAMS

pH measurements are commonly an integral part of stream sediment surveys. A probability plot of pH values from one such survey in southern British Columbia is shown in Fig.7. The plot is on arithmetic probability paper, a logarithmic transform being incorporated in the original data because of the very nature of pH values.

Fig.7. Probability plot of pH values obtained from a stream sediment survey in southern British Columbia. Horizontal axis: probability (cum. %). Plot annotations: A:B:C = 16:69:15; population means 7.20, 6.69 and 5.88 with s = 0.10, 0.15 and 0.21, respectively. Symbols are as defined for Fig.5.

A smooth curve through the data has the form of a trimodal distribution with inflection points at the 16 and 85 cumulative percentiles. The curve has been partitioned using the method described previously to obtain populations A, B and C. Check points based on ideal mixtures of the three populations in the proportion 16/69/15 agree remarkably well with the real data curve. Thresholds arbitrarily chosen at the 99 cumulative percentiles of the A and B populations, and the 1 cumulative percentiles of the B and C populations, provide the information in Table VI.

TABLE VI
Estimated thresholds, pH values (southern British Columbia)

Group                        % of total data    pH
principally population A         15
                                                7.00
populations A + B                 4.5
                                                6.93
principally population B         64.5
                                                6.37, 6.36, 6.35
principally population C         16

Thus, the data can be divided into four groups on the basis of pH measurements, prior to further analysis and interpretation. Such a grouping could have fundamental significance in interpretation of trace element data because of the effect of pH on metal dispersion.
WHOLE ROCK Cu, GUICHON BATHOLITH (CENTRAL BRITISH COLUMBIA)

The Guichon batholith has long been known as an important Cu-rich pluton in central British Columbia, with several large porphyry-type deposits either producing or nearing production at the present time. An investigation of the whole rock Cu content of unmineralized samples scattered over the batholith was undertaken by Brabec and involved an analysis of the data using probability graphs (Brabec and White, 1971). A probability plot of the total data, some 330 analyses, could not be interpreted with confidence. However, when data were grouped on the basis of relative age and lithology and each such group plotted separately, a realistic interpretation became possible.

Fig.8 contains probability graphs of each of the three groups, replotted from data of Brabec and White (1971). The general similarity of shape of the three curves suggests that the grouping has fundamental significance. Each curve has the form of a bimodal distribution. In each case, however, the bottom part of the bimodal curve is partly missing due to the bar interval chosen for construction of the probability plots (15 ppm Cu). Assuming that all distributions are lognormal, it is possible to partition each curve using a modification of the procedure described earlier. The upper population can be determined in the normal manner.

Fig.8. Probability plots of whole rock Cu's for 3 rock groups of the Guichon batholith, central British Columbia. Horizontal axis: probability (cum. %). Group I = youngest age, Group II = intermediate age (N = 116) and Group III = oldest age (N = 119) (after Brabec and White, 1971).
Points on the lower population are then calculated using the expression:

$P_M = f_A P_A + f_B P_B$

$P_M$ is read from the data curve, $f_A$ and $f_B$ are known from the position of the inflection point, and $P_A$ is read from the partitioned population A. $P_B$ is the only unknown and can be calculated and plotted for various ordinate levels. A line can then be passed through these calculated points to estimate population B.

One example is described in detail. The probability plot for group II rocks is reproduced in Fig.9. Some difficulties were encountered in specifying an inflection point precisely, because the two populations overlap to a considerable extent. However, a series of trial values were used until the upper population plotted as a straight line, leading to an inflection being assigned at the 80 cumulative percentile.

One additional problem with the data is a flattening at the upper end of the curve. In fact, this flattening is present to some extent in plots for each of the 3 groups and is a characteristic pattern obtained when a symmetric population has been top-truncated. Brabec and White (1971) arbitrarily rejected a small proportion of high values from their analysis to impose this artificial top truncation on their data. Since the truncated values account for only about 2% of the data, no effort was made to correct for their absence. The upper extremities of all curves, however, were ignored during the partitioning.

Fig.9. Probability plot of 116 whole rock Cu's in Group II rocks (intermediate age) of the Guichon batholith, central British Columbia, showing partitioned populations and their parameters. Horizontal axis: probability (cum. %). Symbols are those defined for Fig.5.
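The rearrangement of the mixture expression for the unknown lower population can be sketched in Python. This is a minimal illustration rather than the author's procedure as implemented: the function name `lower_population_percent` and the sample readings are hypothetical, all percentages are cumulative percentages read at a single ordinate level, and the clipping step is an added safeguard against graphical reading error.

```python
def lower_population_percent(p_m, p_a, f_a):
    """Solve P_M = f_A * P_A + f_B * P_B for P_B, with f_B = 1 - f_A.

    p_m : cumulative percentage read from the data curve
    p_a : cumulative percentage read from partitioned population A
    f_a : proportion of population A (0 < f_a < 1)
    """
    if not 0.0 < f_a < 1.0:
        raise ValueError("f_a must lie strictly between 0 and 1")
    f_b = 1.0 - f_a
    p_b = (p_m - f_a * p_a) / f_b
    # Graphical readings carry error, so clip to the valid 0-100 % range.
    return min(100.0, max(0.0, p_b))


if __name__ == "__main__":
    # Hypothetical readings at one ordinate level for an 80/20 mixture.
    f_a = 0.80
    p_a = 96.0   # % of population A at or above the chosen value
    p_m = 86.8   # % of all data at or above the same value
    print(round(lower_population_percent(p_m, p_a, f_a), 1))
```

Repeating the calculation at several ordinate levels yields the points through which the line for population B is drawn.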
Once the upper population A is defined, population B can be estimated using the relationship $P_M = f_A P_A + f_B P_B$, as described earlier. Check points of ideal mixtures of partitioned populations A and B, shown as open triangles in Fig.9, coincide with the real data curve except at the upper truncated end. Parameters of the partitioned populations for each of the 3 groups are given in Table VII.

TABLE VII
Estimated parameters, whole rock Cu, Guichon batholith (central British Columbia)

Lithologic   Population   Proportion   No. of     Values in ppm Cu
group                     (%)          samples    b       b + sL   b - sL
I            A             60           56        98      142      68
             B             40           39        26.7     46.4    15.2
             A + B        100           95
II           A             80           93        69      139      34.5
             B             20           23        10.9     20       5.9
             A + B        100          116
III          A             40           28        54       85      34.5
             B             60           91        10.3     20.2     5.1
             A + B        100          119

For group II data, thresholds can be chosen arbitrarily as the 98 and 2 cumulative percentiles of populations A and B. These percentiles correspond to 16.5 and 39 ppm Cu, respectively, and divide the data into 3 groups. An upper group, above 39 ppm Cu, consists of 63% of the total data and is essentially only A population. A lower group, below 16.5 ppm Cu, consists of about 16% of the data and for all practical purposes contains only B population. The remaining 21% of the data is a mixture of A and B populations in the range between the two thresholds. In this case, considerable overlap exists between the two populations.
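The thresholds and group sizes quoted for Group II can be reproduced approximately from the Table VII parameters. The Python sketch below assumes exactly lognormal populations with logarithmic standard deviation log10((b + sL)/b), and assumes that cumulative percentages are counted from the highest values downward; that convention is inferred here because it reproduces the quoted figures. The function names are hypothetical.

```python
from math import log10
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Table VII, Group II: (proportion, b, b + sL), values in ppm Cu.
POP_A = (0.80, 69.0, 139.0)
POP_B = (0.20, 10.9, 20.0)


def value_at_cum_percent(pop, cum_percent):
    """Value exceeded by `cum_percent` % of a lognormal population."""
    _, b, b_plus_s = pop
    s_log = log10(b_plus_s / b)
    z = STD_NORMAL.inv_cdf(1.0 - cum_percent / 100.0)
    return 10.0 ** (log10(b) + z * s_log)


def mixture_fraction_above(value):
    """Fraction of the total data (A and B combined) above `value`."""
    total = 0.0
    for proportion, b, b_plus_s in (POP_A, POP_B):
        z = (log10(value) - log10(b)) / log10(b_plus_s / b)
        total += proportion * (1.0 - STD_NORMAL.cdf(z))
    return total


if __name__ == "__main__":
    lower = value_at_cum_percent(POP_A, 98.0)   # 98 cum. percentile of A
    upper = value_at_cum_percent(POP_B, 2.0)    # 2 cum. percentile of B
    print("thresholds (ppm Cu): %.1f and %.1f" % (lower, upper))
    # Group sizes using the thresholds quoted in the text.
    above = mixture_fraction_above(39.0)
    below = 1.0 - mixture_fraction_above(16.5)
    print("above 39: %.0f%%, below 16.5: %.0f%%, between: %.0f%%"
          % (100 * above, 100 * below, 100 * (1 - above - below)))
```

With these inputs the computed thresholds fall within about one ppm of the quoted 16.5 and 39 ppm Cu, and the group proportions within a percent or two of those in the text. Small differences are expected because the published values were read from a graph.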
Despite this overlap, it is possible to identify the population to which most of the individual values belong, and this grouping could aid considerably in interpretation of the significance of each population.

THE IMPORTANCE OF ANALYTICAL PRECISION

Thus far, an implicit assumption in the procedure for estimating thresholds is that analytical values are known precisely. In practice, of course, recorded values include a combined sampling and analytical error. Consequently, some values above the threshold actually belong below it and vice versa. Normally this confusion affects only a small proportion of the data, but it becomes more and more pronounced as the precision becomes poorer.

In some cases the confusion is minimal relative to the problem on hand and can be ignored. More generally, however, the sampling and analytical error should be taken into account in defining thresholds. A convenient procedure to achieve this end is to consider the threshold a range of values centred about the single threshold obtained by assuming that values are perfectly known. The threshold range is a confidence belt based on the precision of the data. Average precision is normally adequate for defining such threshold ranges. Precision, however, does vary with absolute amount of the variable being estimated (e.g., Bolviken and Sinding-Larsen, 1973) and this can be taken into account where adequate data are available. Such threshold ranges define narrow bands on contour maps.
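One simple way to express a threshold as a range is sketched below in Python. The text does not specify how the confidence belt is computed, so this sketch assumes a single average relative precision applied symmetrically about the threshold. The function names and the 10% figure are hypothetical.

```python
def threshold_range(threshold, relative_precision):
    """Return (low, high) bounds of a band centred on `threshold`.

    relative_precision is a fraction, e.g. 0.10 for +/-10 %.
    """
    if threshold <= 0 or relative_precision < 0:
        raise ValueError("threshold must be positive and precision non-negative")
    half_width = threshold * relative_precision
    return threshold - half_width, threshold + half_width


def classify(value, threshold, relative_precision):
    """Label a value relative to the threshold band."""
    low, high = threshold_range(threshold, relative_precision)
    if value > high:
        return "above threshold range"
    if value < low:
        return "below threshold range"
    return "within threshold range"


if __name__ == "__main__":
    # Hypothetical example: 100 ppm threshold, +/-10 % average precision.
    for ppm in (80.0, 95.0, 120.0):
        print(ppm, classify(ppm, 100.0, 0.10))
```

Where precision is known to vary with concentration, `relative_precision` could be supplied per threshold rather than as one average value.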
Using a threshold range increases the number of potentially anomalous samples and therefore involves additional time and money in checking such added samples. These efforts can be minimized by examining the geographic positions of the additional samples relative to known anomalous samples.

DISCUSSION

The method for choosing thresholds described here is a standardized technique applicable to the vast quantity of geochemical data. It can be used for any polymodal distribution if sufficient data of adequate quality are present so that partitioning is feasible. A grouping of the data values is obtained that can be invaluable in interpretation. For this reason, the method is more fundamental and potentially more useful than other methods in common use. In particular, the method outlined here stresses the concept that both background and anomalous values represent populations that in many cases overlap (see Bolviken, 1971).

The procedure is not restricted to the choice of thresholds between anomalous and background populations. It is much more general in nature, permitting grouping of many types of data with appropriate density distributions. In addition, probability graph analysis of data is simple, rapid and amenable to use in the field (see Lepeltier, 1969).

Examples used to illustrate the selection of thresholds give ample evidence of the general usefulness of probability plots in dealing with geochemical data. This is true even if three or four populations are represented in the data, although, in general, simpler interpretations result if data are first grouped on the basis of some fundamental physical or geological criterion.

SUMMARY AND CONCLUSIONS

(1) Geochemical analyses commonly approximate a lognormal density distribution sufficiently closely that the distributions can be represented usefully on lognormal probability paper.
(2) Providing a data set contains adequate values, normally a minimum of about 100, a polymodal cumulative probability plot can be partitioned to produce estimates of the individual populations that make up the total distribution.
(3) The partitioned populations can be used to define arbitrary but meaningful thresholds that divide the data into groups that have fundamental significance.
(4) In the special case of no effective overlap between anomalous and background populations, a single threshold can be defined. In the common simple case of two overlapping anomalous and background populations, two thresholds are obtained that divide the data into three groups: an upper group of predominantly anomalous values, a central group of anomalous and background values, and a third group of background values.
(5) Polymodal distributions of geochemical data consisting of more than two populations can commonly be treated in the same way as bimodal distributions to obtain useful threshold values. In some cases, however, the procedure can be simplified by grouping data on the basis of some fundamental characteristic (e.g., pH, underlying rock type) to produce simpler probability plots that permit greater confidence in partitioning and interpreting.
(6) The method described for choosing thresholds is not confined to the distinction between anomalous and background values but has general application to any type of data, providing the individual populations approximate a lognormal (or normal) density distribution. Fortunately, this criterion is met in the bulk of geochemical data.

ACKNOWLEDGEMENTS

This paper is an outgrowth of a more extensive study of the use of probability paper in dealing with various kinds of data obtained from mineral exploration programs.
The study is supported by a grant from the Department of Energy, Mines and Resources of Canada. Technical assistance through much of the study was given by Mr. A.C.L. Fox. Examples are drawn entirely from real problems encountered in industry and in university research projects. Appreciation is expressed to the numerous individuals and companies involved for permission to publish them. Dr. W.K. Fletcher offered constructive criticism of an earlier draft of the paper.

REFERENCES

Bianconi, F. and Saagar, R., 1971. Reconnaissance mineral exploration in the Yukon Territory, Canada. Schweiz. Mineral. Pet. Mitt., 51: 139--154.
Bolviken, B., 1971. A statistical approach to the problem of interpretation in geochemical prospecting. Can. Inst. Min. Metall., Spec. Vol., 11: 564--567.
Bolviken, B. and Sinding-Larsen, R., 1973. Total error and other criteria in the interpretation of stream sediment data. In: M.L. Jones (Editor), Geochemical Exploration, 1972. Institute for Mining and Metallurgy, London, pp. 285--295.
Brabec, D. and White, W.H., 1971. Distribution of copper and zinc in rocks of the Guichon Creek batholith, British Columbia. Can. Inst. Min. Metall., Spec. Vol., 11: 291--297.
Cassie, R.M., 1954. Some uses of probability paper in the analysis of size frequency distributions. Aust. J. Mar. Freshwater Res., 5: 513--523.
Harding, J.P., 1949. The use of probability paper for the graphical analysis of polymodal frequency distributions. J. Mar. Biol. Assoc., U.K., 28: 141--153.
Hawkes, H.E. and Webb, J.S., 1962. Geochemistry in Mineral Exploration. Harper and Row, New York, N.Y., 415 pp.
Koch, G.S., Jr. and Link, R.F., 1970. Statistical Analysis of Geological Data, Vol. I. John Wiley and Sons, New York, N.Y., 375 pp.
Krumbein, W.C. and Graybill, F.A., 1965. An Introduction to Statistical Models in Geology. McGraw-Hill, New York, N.Y., 475 pp.
Lepeltier, C., 1969. A simplified statistical treatment of geochemical data by graphical representation. Econ. Geol., 64: 538--550.
Tennant, C.B. and White, M.L., 1959. Study of the distribution of some geochemical data. Econ. Geol., 54: 1281--1290.
Williams, X.K., 1967. Statistics in the interpretation of geochemical data. N. Z. J. Geol. Geophys., 10: 771--797.
Woodsworth, G.J., 1972. A geochemical drainage survey and its implications for metallogenesis, central Coast Mountains, British Columbia. Econ. Geol., 68: 1104--1120.