
Ophthal. Physiol. Opt. Vol. 20, No. 3, pp. 235–241, 2000. © 2000 The College of Optometrists.

Published by Elsevier Science Ltd. All rights reserved. Printed in Great Britain. 0275-5408/00/$20.00 + 0.00. www.elsevier.com/locate/ophopt

PII: S0275-5408(99)00064-2

Statistical Review

An introduction to analysis of variance (ANOVA) with special reference to data from clinical experiments in optometry
R. A. Armstrong, S. V. Slade and F. Eperjesi
Vision Sciences, Aston University, Birmingham B4 7ET, UK

Summary This article is aimed primarily at eye care practitioners who are undertaking advanced clinical research, and who wish to apply analysis of variance (ANOVA) to their data. ANOVA is a data analysis method of great utility and flexibility. This article describes why and how ANOVA was developed, the basic logic which underlies the method and the assumptions that the method makes for it to be validly applied to data from clinical experiments in optometry. The application of the method to the analysis of a simple data set is then described. In addition, the methods available for making planned comparisons between treatment means and for making post hoc tests are evaluated. The problem of determining the number of replicates or patients required in a given experimental situation is also discussed. © 2000 The College of Optometrists. Published by Elsevier Science Ltd.

Introduction
This article is aimed primarily at eye care practitioners who are undertaking advanced clinical research, and who require a basic knowledge of the methods of analysis of variance (ANOVA) to analyse their experimental data. ANOVA is a data analysis method of great elegance, utility and flexibility. It is the most effective method available for analysing the data from experiments. Computer software employing the methods of ANOVA is widely available to experimental scientists. The availability of this software, however, makes it essential that optometrists understand the basic principles of ANOVA. ANOVA is a method of great complexity and subtlety with many different
Received: 15 February 1999 Revised form: 6 September 1999 Correspondence and reprint requests to: R.A. Armstrong.

variations, each of which applies in a particular experimental context. Hence, it is possible to apply the wrong type of ANOVA and to draw the wrong conclusions from an experiment. This article describes first, the origin of ANOVA, the logic which underlies the method and the assumptions necessary to apply it to data from experiments. Second, the application of the method to the analysis of a simple data set drawn from a clinical experiment in optometry is described. Third, the various methods available for making planned comparisons between the treatment means and post hoc tests are evaluated. Fourth, the problem of determining the number of replicates or patients in a given experimental context is discussed.

The origin of ANOVA


Consider an experiment involving two groups of subjects. One group may comprise a control or



untreated subject group (h0), while the other (h1) represents either a patient group representative of a clinical condition, or a group of subjects treated in a specific way, e.g., by being given a drug. In subsequent discussions these groups will be described as treatments. At the end of the experiment, a measurement (x) is taken from each subject. To test the null hypothesis that there is no difference between the two means, i.e., h0 − h1 = 0, a Student's t-test could be employed (Snedecor and Cochran, 1980). The statistic `t' is the ratio of the difference between the two means to a measure of the variation between the individual subjects pooled from both patient groups. A significant value of Student's t indicates that the null hypothesis should be rejected and, therefore, that there is a significant difference between the treatment means. In theory, this method of analysis could be extended to the analysis of three or more different groups of subjects. An example of an experiment which employs three treatment groups, i.e., h0, h1, h2, is shown in Table 1. The objectives of this experiment might be to test the null hypotheses: (1) that the two treatment means (h1, h2) do not differ from the control (h0); and (2) that h1 and h2 do not differ from each other. To make these tests, three t-tests would be necessary (i.e., h0 vs h1, h0 vs h2 and h1 vs h2). However, there is a problem in making multiple comparisons between the means because not all of these comparisons can be made independently; e.g., if h0 > h1 and h0 = h2, then it follows that h2 > h1. Hence, the comparison h2 vs h1 follows from the previous two tests and is not tested independently. To overcome this problem, ANOVA was developed by Sir Ronald Fisher in the 1920s. ANOVA provides a single statistical test of the null hypothesis that the means of the three treatments are identical.
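The multiplicity problem can also be illustrated numerically. As a rough sketch in Python (assuming, for simplicity, that the comparisons were independent, which, as noted above, they are not), the chance of at least one false rejection grows quickly with the number of tests:

```python
# Familywise Type 1 error rate when making m comparisons, each at
# significance level alpha. Illustrative only: the three pairwise
# t-tests discussed in the text are not actually independent.
def familywise_error(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false rejection in m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

# Three pairwise comparisons among three group means (h0 vs h1, h0 vs h2, h1 vs h2)
print(round(familywise_error(3), 3))   # 0.143, well above the nominal 0.05
```

Even with only three groups, the overall error rate is almost three times the nominal 5% level, which is one reason a single overall test is preferable.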

The logic of ANOVA


Consider the experiment with three treatment groups illustrated in Table 1. There are three subjects within each group, randomly selected from the subject populations, i.e., the experiment is replicated three times. This type of experimental design is often described as a one-way ANOVA in a randomised design. The xij are the results of the experiment, Sh0, Sh1 and Sh2 are the sums of the xij in each column (the treatment totals), and h0, h1 and h2 the treatment means. In an ANOVA, the total variation between the xij is calculated and then partitioned into portions associated with differences between the treatment means and with the variation between the patients within treatment groups. The sum of squares of the deviations of the xij from their mean (the sums of squares) is used as a measure of the total variation, i.e.,

Total sums of squares = Σ(xij − X)²

where `X' is the overall mean. If there is no significant difference between the treatment means, there will be no significant variation of these means from their overall mean. Hence, the sums of squares of the three treatment means from their overall mean is a measure of the treatment effect. This variation, the sums of squares of treatments, is usually calculated from the column totals, i.e.,

Between treatments sums of squares = Σ(Shi − H)²/N

where H is the mean of the treatment totals, N is the number of replicates per treatment and Shi a treatment

Table 1. Calculation of sums of squares in a one-way analysis of variance (ANOVA) in a randomised design with three treatment groups(a)

          Treatment groups
          Control (h0)   Treatment 1 (h1)   Treatment 2 (h2)
1         x11            x12                x13
2         x21            x22                x23
3         x31            x32                x33
Totals    Sh0            Sh1                Sh2
Means     h0             h1                 h2

Total sums of squares = Σ(xij − X)² = Σxij² − (Σxij)²/n
Treatments sums of squares = Σ(Shi − H)²/N = ΣShi²/N − (Σxij)²/n
Error sums of squares = Total sums of squares − Treatments sums of squares

(a) N = number of replicates (observations) per treatment; n = total number of observations; X = mean of the xij; H = mean of the three treatment totals; Shi = individual treatment total.

total. In addition, there is variation between the replicate subjects within each treatment group. This variation is often called the residual or error variation because it describes the natural variation between experimental subjects. Variation between replicates within a treatment group is calculated as the sums of squares of the xij in each column from their column mean. The sums of squares calculated from each column are then added together to give the error sums of squares. In this simple case, however, there are only two sources of variation present, i.e., between treatments and error, and the error sums of squares can be calculated by subtraction:

Error sums of squares = Total sums of squares − Between treatments sums of squares

If there are no significant differences between the means of the three treatments, the nine observations are distributed about a common population mean `m'. If this is the case, then the variance (also called the mean square) calculated from the between treatments sums of squares and that calculated from the error sums of squares should be estimates of the same quantity. Testing the difference between these two mean squares is the basis of an ANOVA. The statistics are set out in an ANOVA table (Table 2). To compare the between treatments and error mean squares, the sums of squares are divided by the appropriate degrees of freedom (DF). Although difficult to prove, the DF of a quantity is often considered to be the number of observations minus the number of parameters estimated from the data which are required to calculate the quantity. Hence, the total and between treatments SS have 8 and 2 DF respectively, one less than the number of observations or groups. This is because the mean of the `xij' values and the mean of the three treatment totals were calculated from the data to obtain the sums of squares.
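The partition described above can be sketched in a few lines of Python (the data and function name are illustrative, not taken from the article; three treatments with three replicates each):

```python
# A minimal sketch of the sums-of-squares partition for a one-way ANOVA
# in a randomised design.
def one_way_anova(groups):
    """Partition total variation into between-treatments and error sums of squares."""
    all_obs = [x for g in groups for x in g]
    n = len(all_obs)
    correction = sum(all_obs) ** 2 / n                       # (sum of xij)^2 / n
    ss_total = sum(x * x for x in all_obs) - correction
    # Treatments SS from the column (treatment) totals
    ss_treat = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_error = ss_total - ss_treat                           # by subtraction
    df_treat, df_error = len(groups) - 1, n - len(groups)
    ms_treat, ms_error = ss_treat / df_treat, ss_error / df_error
    return ss_total, ss_treat, ss_error, ms_treat / ms_error  # last value = F

# Hypothetical data: three treatments, three replicates each
groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
ss_total, ss_treat, ss_error, F = one_way_anova(groups)
print(ss_total, ss_treat, ss_error, F)   # 48.0 42.0 6.0 21.0
```

Note that the between-treatments and error sums of squares add up to the total, which is the partition the text describes.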
The error sums of squares has 6 DF because the column means are used to calculate the sums of squares, i.e., there are 2 DF in each of the three columns making 6 in total. The between treatments mean square is then divided by the error mean square to
obtain the variance ratio. This statistic was named `F' (hence, `F-test') in honour of Fisher by G. W. Snedecor (Snedecor and Cochran, 1980). The value of `F' indicates the number of times the between groups mean square exceeds the error mean square. This value is referred to a table of the F-ratio to determine the probability of obtaining an `F' of this magnitude by chance, i.e., from data with no significant differences between the treatment means. The values given in an F-table are one-tail probabilities because an F-test determines whether the between treatments mean square is greater than the error mean square. If the value of `F' is equal to or greater than the value tabulated at the 5% level of probability, then the null hypothesis that the three treatment means are identical is rejected.

Table 2. Analysis of variance (ANOVA) table for a one-way randomised design(a)

Variation     SS                        DF   MS     F
Total         Σ(xij − X)²               8
Treatments    Σ(Shi − H)²/N             2    SS/2   MS(Trts)/MS(Error)
Error         Total SS − Trts SS        6    SS/6

(a) SS = Sums of squares; DF = Degrees of freedom; MS = Mean square; F = Variance ratio; Trts = Treatments; other abbreviations as in Table 1.

Table 3. Examples of planned comparisons between the treatment means with three treatment groups(a)

Variation                                              DF
Example 1: A control (h0) and two treatments (h1, h2)
  Total treatments                                     (2)
  h0 vs (h1 + h2)/2                                    1
  h1 vs h2                                             1
  Error                                                6
Example 2: Treatments ordered in sequence
  Total treatments                                     (2)
  Linear effect                                        1
  Quadratic effect                                     1
  Error                                                6

(a) Abbreviations as in Tables 1 and 2.

Comparison of group means

Planned comparisons

The F-test of the treatment means is only the first stage of the data analysis. The next step involves the examination of the differences between the treatment means. Specific comparisons may have been planned before the experiment was carried out, decided after the data have been collected (`post hoc'), or comparisons between all possible combinations of the treatment means may be made. In some circumstances, the experiment is designed to test specific differences between the treatment means. This can be achieved by dividing the between treatments sums of squares into portions associated with specific comparisons among the treatment means. In general, the number of planned comparisons that can be made is one less than the number of treatments in the experiment. The mean square for each planned


comparison can then be tested against the error mean square using an F-test. Two examples of planned comparisons between three treatment means are shown in Table 3. If the three treatments comprise a control and two treatment groups, the between treatments sums of squares can be partitioned into: (1) a comparison of the control with the average of the two treatment means; and (2) a comparison between the two treatment means themselves. Alternatively, if the three treatments can be ordered in a sequence, e.g., measurements may be made at three sequential sample times or the treatments may represent dosages of a drug, the treatments sums of squares can be divided into portions associated with the `response curve'. For example, if there are significant differences between the treatment means, then either the three means may lie on or close to a straight line (a significant linear effect) or there is a significant deviation from a linear effect (a significant quadratic effect). The sums of squares of the linear and quadratic effects can be calculated and each tested against the error. A similar approach can be adopted with four or more means.

Post-hoc tests

There may be circumstances in which tests between the treatment means are carried out post hoc or where multiple comparisons between the treatment means may be required. A variety of methods are available for making post-hoc tests. The most commonly used tests, based on a table given in the SuperANOVA manual (Abacus Concepts, 1989), are listed in Table 4. These tests determine the critical differences that have to be exceeded by a pair of treatment means to be significant. However, the individual tests vary in how effectively they address a particular statistical problem and in their sensitivity to violations of the assumptions of ANOVA. The most critical problem is the possibility of making a Type 1 error, i.e., rejecting the null hypothesis when it is true. By contrast, a Type 2 error is accepting the null hypothesis when a real difference is present. The post-hoc tests listed in Table 4 give varying degrees of protection against making a Type 1 error.

Fisher's protected least significant difference (Fisher's PLSD) is the most liberal of the methods discussed and the most likely to result in a Type 1 error. All possible pairwise comparisons are evaluated and the method uses Student's t to determine the critical value to be exceeded for any pair of means, based on the maximum number of steps between the smallest and largest mean. The Tukey–Kramer honestly significant difference (Tukey–Kramer HSD) is similar to the Fisher PLSD but is less liable to result in a Type 1 error (Keselman and Rogan, 1978). In addition, the method uses the more conservative `Studentised range' rather than Student's t to determine a single critical value that all comparisons must exceed for significance. This method can be used for experiments which have equal numbers of observations (r) in each group or in cases where `r' varies significantly between groups. However, with modest variations in `r', the Spjotvoll–Stoline modification of the above method can be used (Spjotvoll and Stoline, 1973; Dunnett, 1980a).

The Student–Newman–Keuls (SNK) method makes all pairwise comparisons of the means ordered from the smallest to the largest using a stepwise procedure. First, the means furthest apart, i.e., `a' steps apart in the range, are tested. If this mean difference is significant, the means a−2, a−3, etc., steps apart are tested until a test produces a non-significant mean difference, after which the analysis is terminated. The SNK test is more liable to make a Type 2 than a Type 1 error. By contrast, the Tukey compromise method employs the average of the HSD and SNK critical values. Duncan's multiple range test is very similar to the SNK method but is more liberal than SNK, the probability of making a Type 1 error increasing with the number of means analysed.

One of the most popular methods used by experimenters is Scheffé's `S' test. This method makes all pairwise comparisons between the means and is a very robust procedure to violations of the assumptions associated with ANOVA. It is also the most conservative of the methods discussed, giving maximum protection against making a Type 1 error. The Games–Howell method (Games and Howell, 1976) is one of the most robust of the newer methods available. It can be used in circumstances where `r' varies between groups, with heterogeneous variances and when normality cannot be assumed. This method defines a different critical value for each pairwise comparison, determined by the variances and numbers of observations in each group under comparison. Dunnett's test is used when several treatment means are each compared with a control mean. Equal or unequal `r' can be analysed and the method is not sensitive to heterogeneous variances (Dunnett, 1980b). An alternative to this test is the Bonferroni/Dunn method, which can also be employed to test multiple comparisons between treatment means, especially when a large number of treatments is present.

Which of the above tests is actually used is a matter of personal taste. However, optometrists should be aware that each test addresses the statistical problems in a unique way. The present authors would recommend careful application of Scheffé's method for multiple comparisons between treatment means and the use of Dunnett's method when several treatments are being compared with a control mean. However, none of these methods is an effective substitute for an experiment designed specifically to make planned comparisons between the treatment means.

Table 4. Methods of making `post-hoc' multiple comparisons between means(a)

Method                        Equal r  F    Equal MS  N    Use                                      Error control
Fisher PLSD                   Yes      Yes  Yes       Yes  All pairwise comparisons                 Most sensitive to Type 1
Tukey–Kramer HSD              No       No   Yes       Yes  All pairwise comparisons                 Less sensitive to Type 1 than Fisher PLSD
Spjotvoll–Stoline             No       No   Yes       Yes  All pairwise comparisons                 As Tukey–Kramer
Student–Newman–Keuls (SNK)    Yes      Yes  Yes       Yes  All pairwise comparisons                 Sensitive to Type 2
Tukey Compromise              No       No   Yes       Yes  All pairwise comparisons                 Average of Tukey and SNK
Duncan's Multiple Range Test  No       No   Yes       Yes  All pairwise comparisons                 More sensitive to Type 1 than SNK
Scheffé's S                   No       Yes  No        No   All pairwise comparisons                 Most conservative
Games/Howell                  No       Yes  No        No   All pairwise comparisons                 More conservative than majority
Dunnett's test                No       No   No        Yes  Compare treatments with control          More conservative than majority
Bonferroni                    Yes      No   Yes       Yes  All pairwise and treatments vs control   Conservative

(a) PLSD = Protected least significant difference; HSD = Honestly significant difference. Column 2 indicates whether equal numbers of replicates (r) in each treatment group are required or whether the method can be applied to cases with unequal `r'. Column 3 indicates whether a significant between treatments F ratio is required, and columns 4 and 5 whether the method assumes equal variances in the different patient groups and normality of errors respectively. The final column indicates the degree of protection against Type 1 and Type 2 errors.

A worked example

An example of a 1-way ANOVA utilising Scheffé's post-hoc test is shown in Table 5. The data are drawn from an optometric experiment designed to compare the reading rates of young normal subjects, elderly normal subjects and subjects with age-related macular degeneration (ARMD). The data (xij) comprise the number of correct words read in a minute by each experimental subject. The F-test is significant at the P < 0.01 level of probability, suggesting that the null hypothesis that there are no differences between the treatment means should be rejected. Application of Scheffé's test to all pairwise comparisons between the means suggests that the young and elderly normals each have significantly higher reading rates than the ARMD group. In addition, reading rates in the young and elderly normal patients are similar.

Assumptions of ANOVA

ANOVA makes assumptions about the nature of the experimental data which have to be at least approximately true before the method can be validly applied. An observed value xij can be considered to be the sum of three parts: (1) the overall mean of the observations (m); (2) a treatment or class deviation; and (3) a random element drawn from a normally distributed population. The random element reflects the combined effects of natural variation between subjects and errors of measurement. ANOVA assumes first, that these errors are normally distributed with a zero mean and standard deviation `s' and second, that although the means may vary from group to group, the variance is constant in all groups. Failure of an assumption affects both the significance levels and the sensitivity of the F-tests. Experiments are usually too small to test whether these assumptions are likely to be true. In many biological and medical applications, in which a quantity is being measured, however, the assumptions hold well (Cochran and Cox, 1957; Ridgman, 1975). If there is doubt about the validity of the assumptions, significance levels and confidence limits must be considered to be approximate rather than exact. If the data are recorded to only a single significant figure the assumptions are more doubtful, and if the data are small whole numbers then the assumptions are unlikely to hold. If the assumptions do not hold, then a transformation of the `xij' into another scale will often allow an ANOVA to be carried out. For example, in many instances, a logarithmic transformation of the data will restore equal variances (Snedecor and Cochran, 1980).
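The effect of a logarithmic transformation on heterogeneous variances can be sketched as follows (hypothetical data in which the spread increases with the mean; the function and variable names are illustrative):

```python
import math

# Sketch: checking the equal-variance assumption and applying the
# logarithmic transformation mentioned in the text.
def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical groups whose standard deviation grows with the mean
groups = [[1, 2, 4], [10, 20, 40], [100, 200, 400]]

raw_vars = [sample_variance(g) for g in groups]
log_vars = [sample_variance([math.log10(x) for x in g]) for g in groups]

print(raw_vars)   # variances differ by orders of magnitude
print(log_vars)   # essentially identical after the transformation
```

When the raw variances differ grossly, as here, an ANOVA on the log-transformed data is far more defensible than one on the raw scale.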

The number of replications in an experiment

Increasing replication of an experiment decreases the error associated with the difference between two treatment means, provided that replicates are drawn at random. This is because the errors associated with any treatment tend to cancel out as the number of replicates is increased. A question frequently asked is how many replicates or patients would be appropriate in a given experimental situation? It is difficult to provide a precise answer to this question. The probability of obtaining a significant result in an experiment depends on the standard error per experimental unit, the number of replicates and the number of DF available for testing the treatment effects. In practice, however, the size of an experiment is often restricted by lack of time and resources. In such circumstances, it is good practice to estimate the degree of precision that is attained in a particular experimental context. A useful approximate formula is given by Ridgman (1975), i.e.,

R = 2C√(2/r)

where R is the percentage difference detectable in an experiment, i.e., the difference between a treatment and a control mean expressed as a percentage of the mean of the whole experiment, `C' is the coefficient of variation (the standard deviation as a percentage of the mean) and `r' is the number of replicate patients in each group. R is a useful quantity to calculate after the experiment because it indicates the ability of the experiment to detect a difference between two treatment means. For example, if R = 10%, then a true difference between the treatments smaller than 10% is unlikely to be detected by the experiment and a Type 2 error could result. Another advantage of this formula is that `C' is related fairly closely to the experimental context and can therefore be estimated from similar experiments in the literature. For example, if r = 9 and C = 5%, the experiment would have a high probability of detecting a 5% difference between the means. If this is the magnitude of difference the experimenter wishes to detect, then it would be worth carrying out the experiment with 9 replicates per group. If a difference smaller than 5% is expected, then the replication would have to be increased. To estimate the number of replicates required in a given experiment to have a high probability of detecting a particular percentage difference between the means `R', with a coefficient of variation of the experimental material `C', the equation can be rearranged as follows:

r = 8C²/R²

Application of this formula will demonstrate that detection of small differences requires considerable replication, while small numbers of replicates detect only large differences.

Table 5. A worked example of a 1-way ANOVA with comparisons between the means using Scheffé's test. Data are the reading rates (number of correct words per minute) of three groups of patients(a)

                 Young normal   Elderly normal   Age-related macular degeneration (ARMD)
                 65.0           50.8             47.5
                 44.6           46.2             26.9
                 83.8           79.6             39.9
                 73.4           112.4            48.8
                 76.4           75.3             53.3
                 100.0          85.6             46.1
                 91.2           84.6             58.6
                 78.7           98.9             52.8
                 75.7           38.2             41.1
                 55.7           77.3             75.0
Mean             74.45          74.89            49.00
Standard error   5.14           7.43             4.01

Analysis of variance
Source of variation    Sums of squares   DF   Mean square   F-test
Total                  13186.815         29
Between treatments     4393.961          2    2196.98       6.746 (P < 0.01)
Error                  8792.854          27   325.661

(a) Scheffé's test: Young normals vs Elderly normals S = 0.001 (P > 0.05); Young normals vs ARMD S = 4.97 (P < 0.05); Elderly normals vs ARMD S = 5.15 (P < 0.05).
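The analysis in Table 5 can be reproduced in a short Python sketch using the reading-rate data as given. The form of the Scheffé statistic used here, S = (difference in means)² / [(k − 1) × MS(error) × (1/ni + 1/nj)], is an assumption on our part, but it agrees with the S values reported in the table footnote:

```python
# Reproducing the Table 5 one-way ANOVA and Scheffe comparisons.
young   = [65.0, 44.6, 83.8, 73.4, 76.4, 100.0, 91.2, 78.7, 75.7, 55.7]
elderly = [50.8, 46.2, 79.6, 112.4, 75.3, 85.6, 84.6, 98.9, 38.2, 77.3]
armd    = [47.5, 26.9, 39.9, 48.8, 53.3, 46.1, 58.6, 52.8, 41.1, 75.0]
groups  = [young, elderly, armd]

obs = [x for g in groups for x in g]
n, k = len(obs), len(groups)
cf = sum(obs) ** 2 / n                                   # correction factor
ss_total = sum(x * x for x in obs) - cf
ss_treat = sum(sum(g) ** 2 / len(g) for g in groups) - cf
ss_error = ss_total - ss_treat
ms_treat, ms_error = ss_treat / (k - 1), ss_error / (n - k)
F = ms_treat / ms_error
print(round(F, 3))                       # 6.746, as in Table 5

def scheffe(gi, gj):
    """Scheffe statistic for a pairwise comparison (assumed form; see lead-in)."""
    diff = sum(gi) / len(gi) - sum(gj) / len(gj)
    return diff ** 2 / ((k - 1) * ms_error * (1 / len(gi) + 1 / len(gj)))

# Each S is compared with the F(2, 27) critical value (about 3.35 at P = 0.05)
print(round(scheffe(young, armd), 2))     # 4.97
print(round(scheffe(elderly, armd), 2))   # 5.15
print(round(scheffe(young, elderly), 3))  # 0.001
```

Only the two comparisons involving the ARMD group exceed the critical value, matching the conclusion drawn in the worked example.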

An experiment with insufficient replication may be useless because it may fail to support the experimental hypothesis even when that hypothesis is true.
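The two precision formulas, as reconstructed above, can be sketched as follows (function names are illustrative):

```python
import math

# Sketch of Ridgman's (1975) approximate precision formulas as given in
# the text: R = 2C*sqrt(2/r) and, rearranged, r = 8C^2/R^2.
def detectable_difference(C: float, r: int) -> float:
    """Approximate % difference detectable with r replicates per group."""
    return 2 * C * math.sqrt(2 / r)

def replicates_needed(C: float, R: float) -> int:
    """Replicates per group needed to detect a % difference of about R."""
    return math.ceil(8 * C ** 2 / R ** 2)

# With 9 replicates and a 5% coefficient of variation, differences of
# roughly 5% of the overall mean should be detectable, as in the text.
print(round(detectable_difference(5, 9), 2))   # 4.71
print(replicates_needed(5, 5))                 # 8
```

Halving the difference to be detected quadruples the replication required, which is why small effects demand large experiments.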


Conclusions
The key to the correct application of ANOVA in optometric research is careful experimental design. The following points should, therefore, be considered before designing any experiment (Cochran and Cox, 1957; Jeffers, 1978):
1. The objectives of the experiment should be explicitly stated and translated into precise questions or hypotheses that the experiment can be expected to answer.
2. The experimental treatments need to be defined precisely so that they can be applied correctly by the experimenter or by those wishing to repeat the experiment.
3. If the `treatments' consist of subjects or patients, they must be representative of a particular population, i.e., chosen at random.
4. The inclusion of a naturally defined control treatment or subject group should be considered.
5. If there are preliminary estimates (R) of the precision likely to be achieved by the experiment and of the level of natural variation (C) present, then this information should be used to estimate the level of replication (r) needed in the experiment.
6. Where possible, planned comparisons between the treatment means should be defined in advance of the inspection of the first results of the experiment. Where this is not possible, the advantages and disadvantages of the available `post-hoc' tests should be carefully considered.
7. If there is any doubt about the above issues, the optometrist should seek advice from a qualified statistician with experience of optometric research before carrying out the experiment. Once committed to a particular design, there may be little a statistician can do to help.

References
Abacus Concepts (1989). SuperANOVA. Abacus Concepts Inc., Berkeley, CA.
Cochran, W. G. and Cox, G. M. (1957). Experimental Designs, 2nd ed. Wiley, New York.
Dunnett, C. W. (1980a). Pairwise multiple comparisons in the homogeneous variance, unequal sample size case. J. Am. Stat. Assoc. 75, 789–795.
Dunnett, C. W. (1980b). Pairwise multiple comparisons in the unequal variance case. J. Am. Stat. Assoc. 75, 796–800.
Games, P. A. and Howell, J. F. (1976). Pairwise multiple comparison procedures with unequal n's and/or variances: a Monte Carlo study. J. Educ. Stat. 1, 113–125.
Jeffers, J. N. R. (1978). Design of experiments. In: Statistical Checklist 1. Institute of Terrestrial Ecology, Southampton.
Keselman, H. J. and Rogan, J. C. (1978). A comparison of the modified Tukey and Scheffé methods of multiple comparisons for pairwise contrasts. J. Am. Stat. Assoc. 73, 47–51.
Ridgman, W. J. (1975). Experimentation in Biology. Blackie, London.
Snedecor, G. W. and Cochran, W. G. (1980). Statistical Methods, 7th ed. Iowa State University Press, Ames.
Spjotvoll, E. and Stoline, M. R. (1973). An extension of the T-method of multiple comparisons to include cases with unequal sample sizes. J. Am. Stat. Assoc. 69, 975–979.
