Published by Elsevier Science Ltd All rights reserved. Printed in Great Britain 0275-5408/00/$20.00 + 0.00 www.elsevier.com/locate/ophopt
PII: S0275-5408(99)00064-2
Statistical Review

An introduction to analysis of variance (ANOVA) with special reference to data from clinical experiments in optometry
R. A. Armstrong, S. V. Slade and F. Eperjesi
Vision Sciences, Aston University, Birmingham B4 7ET, UK
Summary This article is aimed primarily at eye care practitioners who are undertaking advanced clinical research, and who wish to apply analysis of variance (ANOVA) to their data. ANOVA is a data analysis method of great utility and flexibility. This article describes why and how ANOVA was developed, the basic logic which underlies the method and the assumptions that the method makes for it to be validly applied to data from clinical experiments in optometry. The application of the method to the analysis of a simple data set is then described. In addition, the methods available for making planned comparisons between treatment means and for making post hoc tests are evaluated. The problem of determining the number of replicates or patients required in a given experimental situation is also discussed. © 2000 The College of Optometrists. Published by Elsevier Science Ltd.
Introduction
This article is aimed primarily at eye care practitioners who are undertaking advanced clinical research, and who require a basic knowledge of the methods of analysis of variance (ANOVA) to analyse their experimental data. ANOVA is a data analysis method of great elegance, utility and flexibility. It is the most effective method available for analysing the data from experiments. Computer software employing the methods of ANOVA is widely available to experimental scientists. The availability of this software, however, makes it essential that optometrists understand the basic principles of ANOVA. ANOVA is a method of great complexity and subtlety with many different
Received: 15 February 1999 Revised form: 6 September 1999 Correspondence and reprint requests to: R.A. Armstrong.
variations, each of which applies in a particular experimental context. Hence, it is possible to apply the wrong type of ANOVA and to draw the wrong conclusions from an experiment. This article describes first, the origin of ANOVA, the logic which underlies the method and the assumptions necessary to apply it to data from experiments. Second, the application of the method to the analysis of a simple data set drawn from a clinical experiment in optometry is described. Third, the various methods available for making planned comparisons between the treatment means and post hoc tests are evaluated. Fourth, the problem of determining the number of replicates or patients in a given experimental context is discussed.
Ophthal. Physiol. Opt. 2000 20: No 3

Consider an experiment in which subjects are divided into two groups: one represents an untreated subject group (h0) while the other (h1) represents either a patient group representative of a clinical condition, or a group of subjects treated in a specific way, e.g., by being given a drug. In subsequent discussions these groups will be described as treatments. At the end of the experiment, a measurement (x) is taken from each subject. To test the null hypothesis that there is no difference between the two means, i.e., h0 - h1 = 0, a Student's t-test could be employed (Snedecor and Cochran, 1980). The statistic 't' is the ratio of the difference between the two means to a measure of the variation between the individual subjects pooled from both groups. A significant value of Student's t indicates that the null hypothesis should be rejected and therefore, that there is a significant difference between the treatment means. In theory, this method of analysis could be extended to the analysis of three or more different groups of subjects. An example of an experiment which employs three treatment groups, i.e., h0, h1, h2, is shown in Table 1. The objectives of this experiment might be to test the null hypotheses: (1) that the two treatment means (h1, h2) do not differ from the control (h0); and (2) that h1 and h2 do not differ from each other. To make these tests, three t-tests would be necessary (i.e., h0 vs h1, h0 vs h2 and h1 vs h2). However, there is a problem in making multiple comparisons between the means because not all of these comparisons can be made independently, e.g., if h0 > h1 and h0 = h2 then it follows that h2 > h1. Hence, the comparison h2 vs h1 follows from the previous two tests and is not being tested independently. To overcome this problem, ANOVA was developed by Sir Ronald Fisher in the 1920s. ANOVA provides a single statistical test of the null hypothesis that the means of the three treatments are identical.
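The multiple-comparison problem described above can also be illustrated numerically. A minimal sketch (treating the three tests as independent for simplicity, which the pairwise tests are not; the point is that the familywise error rate grows well beyond the nominal 5% level):

```python
# Familywise error rate for k independent tests at significance level alpha:
# P(at least one Type 1 error) = 1 - (1 - alpha)**k
def familywise_error(alpha: float, k: int) -> float:
    return 1.0 - (1.0 - alpha) ** k

# Three pairwise t-tests (h0 vs h1, h0 vs h2, h1 vs h2) at alpha = 0.05:
fw = familywise_error(0.05, 3)
print(round(fw, 4))  # 0.1426, i.e. about a 14% chance of a false rejection
```

ANOVA avoids this inflation by replacing the three t-tests with a single F-test of the null hypothesis that all three means are identical.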
Table 1. Calculation of sums of squares in a one-way analysis of variance (ANOVA) in a randomised design with three treatment groups(a)

Replicate   Control (h0)   Treatment 1 (h1)   Treatment 2 (h2)
1           x11            x12                x13
2           x21            x22                x23
3           x31            x32                x33
Totals      h0             h1                 h2
Means       h0/r           h1/r               h2/r

Total sums of squares = Σ(xij - X)² = Σxij² - (Σxij)²/n
Treatments sums of squares = Σ(hi - H)²/r = Σhi²/r - (Σxij)²/n
Error sums of squares = Total sums of squares - Treatments sums of squares

(a) N = number of treatments; n = total number of observations; r = n/N = number of replicates per treatment; X = mean of the xij; H = mean of the three treatment totals; hi = individual treatment total.
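The identity in Table 1 can be checked numerically. A minimal sketch with invented data (three treatments with three replicates each, so n = 9 and r = 3):

```python
# One-way ANOVA sums of squares for a balanced design (invented data).
groups = [
    [10.0, 12.0, 11.0],   # control (h0)
    [14.0, 15.0, 16.0],   # treatment 1 (h1)
    [13.0, 15.0, 14.0],   # treatment 2 (h2)
]
all_x = [x for g in groups for x in g]
n = len(all_x)                 # total number of observations
r = len(groups[0])             # replicates per treatment
cf = sum(all_x) ** 2 / n       # correction factor, (sum of xij)^2 / n

total_ss = sum(x * x for x in all_x) - cf
treat_ss = sum(sum(g) ** 2 for g in groups) / r - cf
error_ss = total_ss - treat_ss  # by subtraction, as in Table 1

print(total_ss, treat_ss, error_ss)  # 32.0 26.0 6.0
```

The error sums of squares obtained by subtraction agrees with the sum of the within-group sums of squares computed column by column, which is the check the text describes.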
ANOVA - clinical experiments in optometry: R. A. Armstrong et al.

In addition, there is variation between the replicate subjects within each treatment group. This variation is often called the residual or error variation because it describes the natural variation between experimental subjects. Variation between replicates within a treatment group is calculated as the sums of squares of the xij in each column from their column mean. The sums of squares calculated from each column are then added together to give the error sums of squares. In this simple case, however, there are only two sources of variation present, i.e., between treatments and error, and the error sums of squares can be calculated by subtraction: Error sums of squares = Total sums of squares - Between treatments sums of squares. If there are no significant differences between the means of the three treatments, the 9 observations are distributed about a common population mean 'm'. If this is the case, then the variance (also called the mean square) calculated from the between treatments sums of squares and the error sums of squares should be estimates of the same quantity. Testing the difference between these two mean squares is the basis of an ANOVA. The statistics are set out in an ANOVA table (Table 2). To compare the between treatments and error mean squares, the sums of squares are divided by the appropriate degrees of freedom (DF). Although difficult to prove, the DF of a quantity is often considered to be the number of observations minus the number of parameters estimated from the data which are required to calculate the quantity. Hence, the total and between treatments SS have 8 and 2 DF respectively, one less than the number of observations or groups. This is because the mean of the 'xij' values and the mean of the three treatment totals were calculated from the data to obtain the sums of squares.
The error sums of squares has 6 DF because the column means are used to calculate the sums of squares, i.e., there are 2 DF in each of the three columns making 6 in total. The between treatments mean square is then divided by the error mean square to obtain the variance ratio. This statistic was named 'F' (hence, 'F-test') in honour of Fisher by G.W. Snedecor (Snedecor and Cochran, 1980). The value of 'F' indicates the number of times the between groups mean square exceeds that of the error mean square. This value is taken to a table of the F-ratio to determine the probability of obtaining an 'F' of this magnitude by chance, i.e., from data with no significant differences between the treatment means. The values given in an F-table are one-tail probabilities because an F-test determines whether the between treatments mean square is greater than the error mean square. If the value of 'F' is equal to or greater than the value tabulated at the 5% level of probability, then the null hypothesis that the three treatment means are identical is rejected.

Table 2. Analysis of variance (ANOVA) table for a one-way randomised design(a)

Variation    DF   MS     F
Total        8
Treatments   2    SS/2   MSTrts/MSError
Error        6    SS/6

(a) SS = sums of squares; DF = degrees of freedom; MS = mean square; F = variance ratio; Trts = treatments; other abbreviations as in Table 1.

Table 3. Examples of planned comparisons between three treatment means

Example 1: A control (h0) and two treatments (h1, h2)
Variation           DF
Treatments          (2)
  h0 vs (h1+h2)/2   1
  h1 vs h2          1
Error               6

Example 2: Treatments ordered in sequence
Variation           DF
Treatments          (2)
  Linear effect     1
  Quadratic effect  1
Error               6
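The partition in Example 1 of Table 3 can be sketched numerically. The data below are invented; the contrast coefficients (-2, 1, 1) compare the control with the average of the two treatments, and (0, 1, -1) compare the two treatments with each other. For a single-DF comparison with coefficients ci applied to the treatment totals hi, the comparison sums of squares is (Σci·hi)²/(r·Σci²):

```python
# Partition of the between treatments SS into two orthogonal planned
# comparisons, as in Table 3, Example 1. Data are invented.
groups = [
    [10.0, 12.0, 11.0],   # control (h0), total 33
    [14.0, 15.0, 16.0],   # treatment 1 (h1), total 45
    [13.0, 15.0, 14.0],   # treatment 2 (h2), total 42
]
r = len(groups[0])                  # replicates per treatment
totals = [sum(g) for g in groups]
n = r * len(groups)
cf = sum(totals) ** 2 / n
treat_ss = sum(t * t for t in totals) / r - cf

def contrast_ss(coeffs, totals, r):
    # SS for a single-DF planned comparison: (sum ci*hi)^2 / (r * sum ci^2)
    num = sum(c * t for c, t in zip(coeffs, totals)) ** 2
    return num / (r * sum(c * c for c in coeffs))

ss1 = contrast_ss([-2, 1, 1], totals, r)   # h0 vs (h1 + h2)/2
ss2 = contrast_ss([0, 1, -1], totals, r)   # h1 vs h2
print(treat_ss, ss1, ss2)  # 26.0 24.5 1.5
```

Because the two contrasts are orthogonal, their single-DF sums of squares add up to the between treatments sums of squares; each would then be divided by the error mean square to give an F with 1 and 6 DF.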
Each comparison can then be tested against the error mean square using an F-test. Two examples of planned comparisons between three treatment means are shown in Table 3. If the three treatments comprise a control and two treatment groups, the between treatments sums of squares can be partitioned into: (1) a comparison of the control with the average of the two treatment means; and (2) a comparison between the two treatment means themselves. Alternatively, if the three treatments can be ordered in a sequence, e.g., measurements may be made at three sequential sample times or the treatments may represent dosages of a drug, the treatments sums of squares can be divided into portions associated with the 'response curve'. For example, if there are significant differences between the treatment means then either the three means may lie on or close to a straight line (a significant linear effect) or there is a significant deviation from a linear effect (significant quadratic effect). The sums of squares of the linear and quadratic effects can be calculated and each tested against the error. A similar approach can be adopted with four or more means.

Post-hoc tests

There may be circumstances in which tests between the treatment means are carried out post hoc or where multiple comparisons between the treatment means may be required. A variety of methods are available for making post-hoc tests. The most commonly used tests, based on a table given in the SuperANOVA manual (Abacus Concepts, 1989), are listed in Table 4. These tests determine the critical differences that have to be exceeded by a pair of treatment means to be significant. However, the individual tests vary in how effectively they address a particular statistical problem and in their sensitivity to violations of the assumptions of ANOVA. The most critical problem is the possibility of making a Type 1 error, i.e., rejecting the null hypothesis when it is true. By contrast, a Type 2 error is accepting the null hypothesis when a real difference is present. The post-hoc tests listed in Table 4 give varying degrees of protection against making a Type 1 error. Fisher's protected least significant difference (Fisher's PLSD) is the most liberal of the methods discussed and the most likely to result in a Type 1 error. All possible pairwise comparisons are evaluated and the method uses Student's t to determine the critical value to be exceeded for any pair of means based on the maximum number of steps between the smallest and largest mean. The Tukey-Kramer honestly significant difference (Tukey-Kramer HSD) is similar to the Fisher LSD but is less liable to result in a Type 1 error (Keselman and Rogan, 1978). In addition, the method uses the more conservative 'Studentised range' rather than Student's t to determine a single critical value that all comparisons must exceed for significance. This method can be used for experiments which have equal numbers of observations (r) in each group or in cases where 'r' varies significantly between groups. However, with modest variations in 'r', the Spjotvoll-Stoline modification of the above method can be used (Spjotvoll and Stoline, 1973; Dunnett, 1980a). The Student-Newman-Keuls (SNK) method makes all pairwise comparisons of the means ordered from the smallest to the largest using a stepwise procedure. First, the means furthest apart, i.e., 'a' steps apart in the range, are tested. If this mean difference is significant, the means a-2, a-3, etc., steps apart are tested until a test produces a non-significant mean difference, after which the analysis is terminated. The SNK test is more liable to make a Type 2 rather than a Type 1 error. By contrast, the Tukey compromise method employs the average of the HSD and SNK critical values. Duncan's multiple range test is very similar to the SNK method, but is more liberal than SNK, the probability of making a Type 1 error increasing with the number of means analysed. One of the most popular methods used by experimenters is Scheffé's 'S' test. This method makes all pairwise comparisons between the means and is very robust to violations of the assumptions associated with ANOVA. It is also the most conservative of the methods discussed, giving maximum protection against making a Type 1 error. The Games-Howell method (Games and Howell, 1976) is one of the most robust of the newer methods available. It can be used in circumstances where 'r' varies between groups, with heterogeneous variances and when normality cannot be assumed. This method defines a different critical value for each pairwise comparison, determined by the variances and numbers of observations in the two groups under comparison. Dunnett's test is used when several treatment means are each compared to a control mean. Equal or unequal 'r' can be analysed and the method is not sensitive to heterogeneous variances (Dunnett, 1980b). An alternative to this test is the Bonferroni/Dunn method, which can also be employed to test multiple comparisons between treatment means, especially when a large number of treatments is present. Which of the above tests is actually used is a matter of personal taste. However, optometrists should be aware that each test addresses the statistical problems in a unique way. The present authors would recommend careful application of Scheffé's method for multiple comparisons between treatment means and the use of Dunnett's method when several treatments are being compared with a control mean. However, none of these methods is an effective substitute for an experiment designed specifically to make planned comparisons between the treatment means.

Table 4. Post-hoc tests for making comparisons between treatment means (based on Abacus Concepts, 1989)

Test                          Comparisons made                           Error control
Fisher PLSD                   All pairwise                               Most sensitive to Type 1
Tukey-Kramer HSD              All pairwise                               Less sensitive to Type 1 than Fisher PLSD
Spjotvoll-Stoline             All pairwise                               As Tukey-Kramer
Student-Newman-Keuls (SNK)    All pairwise                               Sensitive to Type 2
Tukey Compromise              All pairwise                               Average of Tukey and SNK
Duncan's Multiple Range Test  All pairwise                               More sensitive to Type 1 than SNK
Scheffé's S                   All pairwise                               Most conservative
Games-Howell                  All pairwise                               More conservative than majority
Dunnett's test                Treatments with control                    More conservative than majority
Bonferroni                    All pairwise and treatments with control   Conservative

A worked example

An example of a 1-way ANOVA utilising Scheffé's post-hoc test is shown in Table 5. The data are drawn from an optometric experiment designed to compare the reading rates of young normal subjects, elderly normal subjects and subjects with age-related macular degeneration (ARMD). The data (xij) comprise the number of correct words read in a minute by each experimental subject. The F-test is significant at the P < 0.01 level of probability, suggesting that the null hypothesis that there are no differences between the treatment means should be rejected. Application of Scheffé's test to all pairwise comparisons between the means suggests that the young and elderly normals each have significantly higher reading rates than the ARMD group. In addition, reading rates in the young and elderly normal patients are similar.

Assumptions of ANOVA

ANOVA makes assumptions about the nature of the experimental data which have to be at least approximately true before the method can be validly applied. An observed value xij can be considered to be the sum of three parts: (1) the overall mean of the observations (m); (2) a treatment or class deviation; and (3) a random element drawn from a normally distributed population. The random element reflects the combined effects of natural variation between subjects and errors of measurement. ANOVA assumes first, that these errors are normally distributed with a zero mean and standard deviation 's' and second, that although the means may vary from group to group, the variance is constant in all groups. Failure of an assumption affects both the significance levels and the sensitivity of the F-tests. Experiments are usually too small to test whether these assumptions are likely to be true. In many biological and medical applications, in which a quantity is being measured, however, the assumptions hold well (Cochran and Cox, 1957; Ridgman, 1975). If there is doubt about the validity of the assumptions, significance levels and confidence limits must be considered to be approximate rather than exact. If the data are recorded to only a single significant figure the assumptions are more doubtful, and if the data are small whole numbers then the assumptions are unlikely to hold. If the assumptions do not hold, then a transformation of the 'xij' into another scale will often allow an ANOVA to be carried out. For example, in many instances, a logarithmic transformation of the data will restore equal variances (Snedecor and Cochran, 1980).
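The equal-variance assumption can be examined with a formal test. A minimal sketch using scipy's Levene test (the data are invented so that one group is a multiplicative scaling of the other, the situation a logarithmic transformation corrects):

```python
import math
from scipy.stats import levene

# Invented data: group_b is group_a scaled by 10, so the groups share a
# coefficient of variation but have very different absolute variances.
group_a = [1.0, 1.2, 0.8, 1.5, 0.7]
group_b = [10.0, 12.0, 8.0, 15.0, 7.0]

w_raw, p_raw = levene(group_a, group_b, center='median')
print(p_raw)   # small p: variances differ on the raw scale

log_a = [math.log(x) for x in group_a]
log_b = [math.log(x) for x in group_b]
w_log, p_log = levene(log_a, log_b, center='median')
print(p_log)   # large p: the log transform has equalised the variances
```

The test itself makes no claim about which transformation is appropriate; it simply indicates whether the constant-variance assumption is tenable on a given scale.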
R is a useful quantity to calculate after the experiment because it indicates the ability of the experiment to detect a difference between two treatment means. For example, if R = 10%, then a true difference between the treatments smaller than 10% is unlikely to be detected by the experiment and a Type 2 error could result. Another advantage of this formula is that 'C' is related fairly closely to the experimental context and can therefore be estimated from similar experiments in the literature. For example, if r = 9 and C = 5%, the experiment would have a high probability of detecting a 5% difference between the means. If this is the magnitude of difference the experimenter wishes to detect then it would be worth carrying out the experiment with 9 replicates per group. If a difference smaller than 5% is expected then the replication would have to be increased. To estimate the number of replicates required in a given experiment to have a high probability of detecting a particular percentage difference between the means 'R' with a coefficient of variation of the experimental material 'C', the equation can be rearranged as follows: r = 2t²C²/R², where t is Student's t at the chosen level of probability. Application of this formula will demonstrate that detection of small differences requires considerable replication, while small numbers of replicates detect only large differences. An experiment with insufficient
Table 5. A worked example of a 1-way ANOVA with comparisons between the means using Scheffé's test. Data are the reading rates (number of correct words per minute) of three groups of patients(a)

                 Young normal   Elderly normal   Age-related macular degeneration (ARMD)
                 65.0           50.8             47.5
                 44.6           46.2             26.9
                 83.8           79.6             39.9
                 73.4           112.4            48.8
                 76.4           75.3             53.3
                 100.0          85.6             46.1
                 91.2           84.6             58.6
                 78.7           98.9             52.8
                 75.7           38.2             41.1
                 55.7           77.3             75.0
Mean             74.45          74.89            49.00
Standard error   5.14           7.43             4.01

Analysis of variance
Source of variation    DF
Total                  29
Between treatments     2
Error                  27

(a) Scheffé's test: Young normals vs Elderly normals S = 0.001 (P > 0.05); Young normals vs ARMD S = 4.97 (P < 0.05); Elderly normals vs ARMD S = 5.15 (P < 0.05).
replication may be useless because it is unlikely to detect a genuine difference between the treatment means, i.e., the null hypothesis may be accepted even when it is false.
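The analysis in Table 5 can be reproduced with standard software. A sketch using scipy (scipy provides no Scheffé procedure, so the Tukey-Kramer HSD is shown here as a stand-in for the pairwise comparisons; `tukey_hsd` needs a recent scipy, and the qualitative conclusions match the table):

```python
from scipy.stats import f_oneway, tukey_hsd

# Reading rates (correct words per minute) from Table 5.
young = [65.0, 44.6, 83.8, 73.4, 76.4, 100.0, 91.2, 78.7, 75.7, 55.7]
elderly = [50.8, 46.2, 79.6, 112.4, 75.3, 85.6, 84.6, 98.9, 38.2, 77.3]
armd = [47.5, 26.9, 39.9, 48.8, 53.3, 46.1, 58.6, 52.8, 41.1, 75.0]

# One-way ANOVA: F with 2 and 27 DF, significant at P < 0.01
f_stat, p_value = f_oneway(young, elderly, armd)
print(round(f_stat, 2), p_value)  # F is about 6.7

# Pairwise comparisons: young vs elderly n.s.; each normal group vs ARMD significant
res = tukey_hsd(young, elderly, armd)
print(res.pvalue)
```

As in the published analysis, the two normal groups do not differ from each other, while both read significantly faster than the ARMD group.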
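The rearranged replicate formula discussed above can be sketched as follows. This is a minimal illustration, not a substitute for a proper power calculation; the value t ≈ 2.1 (roughly Student's t at the 5% level for the error DF of a small experiment) is an assumption chosen for illustration:

```python
import math

def replicates_needed(C: float, R: float, t: float = 2.1) -> int:
    # r = 2 t^2 C^2 / R^2: replicates per group needed to detect a
    # difference of R% between two means, given a coefficient of
    # variation C%. t (assumed ~2.1 here) is Student's t at the
    # chosen level of probability.
    return math.ceil(2 * t * t * C * C / (R * R))

print(replicates_needed(C=5, R=5))    # 9
print(replicates_needed(C=5, R=2.5))  # 36: halving R quadruples r
```

The inverse-square dependence on R is the point of the text: halving the difference to be detected quadruples the required replication.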
Conclusions
The key to the correct application of ANOVA in optometric research is careful experimental design. The following points should therefore be considered before designing any experiment (Cochran and Cox, 1957; Jeffers, 1978):
1. The objectives of the experiment should be explicitly stated and translated into precise questions or hypotheses that the experiment can be expected to answer.
2. The experimental treatments need to be defined precisely so that they can be applied correctly by the experimenter or by those wishing to repeat the experiment.
3. If the 'treatments' consist of subjects or patients, they must be representative of a particular population, i.e., chosen at random.
4. The inclusion of a naturally defined control treatment or subject group should be considered.
5. If there are preliminary estimates of the precision likely to be achieved by the experiment (R) and the level of natural variation present (C), then this information should be used to estimate the level of replication (r) needed in the experiment.
6. Where possible, planned comparisons between the treatment means should be defined in advance of the inspection of the first results of the experiment. Where this is not possible, the advantages and disadvantages of the available 'post-hoc' tests should be carefully considered.
7. If there is any doubt about the above issues, the optometrist should seek advice from a qualified statistician with experience of optometric research before carrying out the experiment. Once committed to a particular design, there may be little a statistician can do to help.
References
Abacus Concepts (1989). SuperANOVA. Abacus Concepts Inc, Berkeley, CA.
Cochran, W. G. and Cox, G. M. (1957). Experimental Designs, 2nd ed. Wiley, New York.
Dunnett, C. W. (1980a). Pairwise multiple comparisons in the homogeneous variance, unequal sample size case. J. Am. Stat. Assoc. 75, 789-795.
Dunnett, C. W. (1980b). Pairwise multiple comparisons in the unequal variance case. J. Am. Stat. Assoc. 75, 796-800.
Games, P. A. and Howell, J. F. (1976). Pairwise multiple comparison procedures with unequal n's and/or variances: A Monte Carlo study. J. Educ. Stat. 1, 113-125.
Jeffers, J. N. R. (1978). Design of experiments. In: Statistical Checklist 1, Institute of Terrestrial Ecology, Southampton.
Keselman, H. J. and Rogan, J. C. (1978). A comparison of the modified Tukey and Scheffé methods of multiple comparisons for pairwise contrasts. J. Am. Stat. Assoc. 73, 47-51.
Ridgman, W. J. (1975). Experimentation in Biology. Blackie, London.
Snedecor, G. W. and Cochran, W. G. (1980). Statistical Methods, 7th ed. Iowa State University Press, Ames.
Spjotvoll, E. and Stoline, M. R. (1973). An extension of the T-method of multiple comparisons to include cases with unequal sample sizes. J. Am. Stat. Assoc. 69, 975-979.