
A Beginner’s Guide to Statistical Testing

© Denton Bramwell, Nov 15, 1999


Promontory Management Group Inc., denton@pmg.cc

Statistics was invented as an occupation in order to provide employment for those who do
not have enough personality to go into accounting. I guess that explains why I took up
Six Sigma, with its concomitant involvement in statistics.

In this short paper, I’ll explain in the simplest way I can the practical fundamentals of a
few key statistical tests. The statistical software that I use is Minitab version 12. [Note:
Version 13 has since been released.]

Data Types

There are four types of data. It is important to know which kind of data you are dealing
with, since this determines the types of tests you can run. The four types are
• Nominal or categorical: Classes things into categories, such as pass/fail, old/new,
red/blue/green, or died/survived.
• Ordinal: Ranks things. Team B is stronger than C, which is, in turn, stronger than
A.
• Interval: Things you can add and subtract, like degrees Celsius or Fahrenheit.
Interval data can be continuous or discrete. Continuous data can take on any
value, such as the actual air pressure inside a tire. Discrete data comes in steps,
like money. In the US, the smallest increment of money is one cent, so
expressions of money normally come in steps of one cent.
• Ratio: Similar to interval data, but zero indicates a total absence of a property.
Temperature expressed in Kelvins is ratio data. Like interval data, ratio data may
be continuous or discrete.

Nominal data yields the least information. Ordinal data is better. Interval or ratio data is
best.

Suppose you have been told that you will be taking a long trip, to 10 different
destinations, somewhere in the world. If you are told only that five are in the northern
hemisphere, and that five are in the southern hemisphere, you have no idea what kind of
clothes to pack. That is because you have been given categorical data, and categorical
data conveys the least information of any of the data types. On the other hand, if you are
given the exact coordinates of your 10 destinations, you will know exactly what to pack.
Expressing positions on the Earth’s surface requires ratio data, and ratio data conveys the
most information. If possible, use interval or ratio data in conducting your investigations.

Basics of Hypothesis Testing

The null hypothesis, Ho, is the dull hypothesis: Nothing interesting is going on. For
example, it makes no difference whether people receive an antibiotic or a placebo.

The alternate hypothesis, Ha, is that something interesting is happening: People who
receive vitamin C live longer than those who do not.

When setting up a test, it helps to formally write down your null and alternate
hypotheses. You may even want to use the same batch of data to test more than one
alternative hypothesis.

You actually never prove or disprove either the null or alternative hypothesis. You just
accept one and reject the other, based on what is most likely.

For industrial situations, and for exploratory medical testing, we generally look for a P
value of .05 or less in order to reject the null hypothesis. The rule is that if P is low, Ho
must go. If we obtain a P value of .03, we would say that we would have a 3% chance of
being wrong if we reject the null hypothesis, or say that there is a difference between
treatments. For tests where we want to do something that will affect the health of many
people, we would want a much smaller (more demanding) P value.

You can really only make two decision errors. These are imaginatively named Type 1
and Type 2 errors. The risk of making a Type 1 error is designated α, and the risk of
making a Type 2 error is designated β.

To make a Type 1 error, you say that a difference exists when none does, that is, interpret
noise as signal. To make a Type 2 error, you say that no difference exists when, in fact,
one does, that is, interpret signal as noise.

In medical testing, a Type 1 error would be claiming that vitamin C cures baldness, when,
in fact, it does not, at least as far as we know. A Type 2 error would be failing to detect
that minoxidil will enhance hair growth. The P value most statistical tests report is your
alpha risk, or probability of making a Type 1 error.

Of course, there are errors you can make other than decision errors. One important one to
avoid is sampling error. This comes about when our test is performed on subjects that are
not representative of the total population. Experimenting on med students is one good
example. Space prohibits a full exploration of this topic.

You should also understand and apply the idea of “inference space”. Inference space is
the mathematical region in which you have tested the variables, and know how they
behave. This is the region you can legitimately make statements about. Everything else,
I call “outer space”.

Another thing well worth remembering is that randomization is the best insurance you
can get for the money. It tends to protect you against hidden variables.

The Chi Square Test

This is a test using nominal data. It makes very few assumptions, so it can be used where
other tests are useless. The bad news is that it can require a lot of data, and the results are
sometimes hard to interpret.

This test is performed on data organized into rows and columns. For example, we might
want to test survival of patients who are given one of three types of care in a life
threatening situation. The data array might look like

            Treatment 1   Treatment 2   Treatment 3
Survived        21            40            19
Died             9             2            11

The null hypothesis of the Chi Square test is that the rows and columns are statistically
independent. That is, if I know which column a case is in, I cannot make much better
than a random guess as to which row it is from, or vice versa.

Let’s now go to Minitab and perform the test. Enter the data table, and choose
STAT|TABLES|CHI SQUARE. You will obtain this report:

1        21       40       19       80
      23.53    32.94    23.53

2         9        2       11       22
       6.47     9.06     6.47

Total    30       42       30      102

Chi-Sq = 0.272 + 1.513 + 0.872 +
         0.989 + 5.500 + 3.171 = 12.316
DF = 2, P-Value = 0.002

You can see that our original input is printed, with row and column sums. Beneath each
of the original entries is an “expected value” for each cell. These are the numbers 23.53,
32.94, and so on. If any of these values is below 5, it means that you have not collected
enough data to run this test, and cannot depend on the result. In this case, our lowest cell
has an expected value of 6.47, so we are “good to go”.

The interpretation of this test lies in the P value, which is our α risk, or probability of
being wrong if we assert there is a difference. It says that we would run a .2% chance of
being wrong if we asserted that the treatments produced different survival rates. That’s a
strong outcome. The weakness of the test is that all you really statistically know is that
they are not the same. You don’t officially know which treatment is better or worse.
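
If you prefer to check this result outside Minitab, here is a minimal sketch in Python
using scipy.stats (my own translation, not part of the original paper). It reproduces the
chi-square statistic, the degrees of freedom, the P value, and the expected counts for the
table above.

from scipy.stats import chi2_contingency

# Observed counts from the table above: rows are Survived/Died,
# columns are Treatment 1, 2, and 3.
observed = [[21, 40, 19],
            [ 9,  2, 11]]

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, dof, p)   # about 12.3 with 2 degrees of freedom, P about 0.002
print(expected)       # check that every expected count is at least 5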

The Tukey Tail Count

The B vs. C, or Tukey Tail Test, is extremely easy to use, and reasonably powerful. It is
nonparametric. That means that it does not depend on the data being normally
distributed, so you can use this test when you have really “ugly” data, and still get
“beautiful” results. Another good feature of the B vs. C test is that strong statistical
indications can be found with very small amounts of data. This test uses ordinal data.

C represents one process that we want to evaluate, usually the “current process” that we
want to replace. B represents the other process, usually the “better process” that we want
to put in place. John Tukey is the statistician who developed and popularized the test that
B vs. C is based on. The data analysis is based on counting B’s and C’s on the ends, or
“tails” of a distribution. That’s where the name Tukey Tail Test comes from.

This test absolutely requires randomization and/or blocking. It is tempting to just run a
few C’s, then switch over and run a few B’s, and make a decision. This is an invitation to
error. You must either block or randomize such variables as age, gender, or other factors
that might influence the outcome.

Suppose that you randomly select just 3 items from your C process, and 3 items from
your B process, and that all 3 of your B’s were better than your C’s. Intuitively, you’d
probably feel that you were on to something, and that your B process was indeed better.
Statistics support your intuition.

There are just 20 different orders you can put 3 B’s and 3 C’s into (Try it!). Only one of
those is 3 B’s above 3 C’s. You have just one chance in 20, or .05 probability of arriving
at such an arrangement by chance. So if all 3 B’s are better than all 3 C’s, you can
be 95% confident that B is indeed better than C, and you have arrived at this with only
six samples.
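
If you want to verify the counting argument, this little Python sketch (mine, not from the
paper) does the arithmetic.

from math import comb

arrangements = comb(6, 3)       # ways to interleave 3 B's and 3 C's: 20
p_by_chance = 1 / arrangements  # probability all 3 B's land on top: .05
print(arrangements, p_by_chance)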

One other nicety is that this works for cosmetic issues. You don’t really have to be able
to measure “goodness”. You just have to be able to rank your B’s and C’s from highest to
lowest. So if you want to test the attractiveness of men/women in the fashion modeling
business vs. the attractiveness of men/women in med school, this is your test, assuming
you can put all of them in rank order of attractiveness.

The following table shows the number of B’s and C’s you need, for a given level of α
risk. Note that for each level of α, there are several acceptable combinations of B and C
sample sizes. Remember, you must randomize your selections. Also remember, all your
B’s must outrank all your C’s. Note that the table includes our 3 B vs. 3 C example.

α Risk   Number of B samples   Number of C samples
.001             2                     43
                 3                     16
                 4                     10
                 5                      8
                 6                      6
.01              2                     13
                 3                      7
                 4                      5
                 5                      4
.05              1                     19
                 2                      5
                 3                      3
                 4                      3
.1               1                      9
                 2                      3
                 3                      2
There is another variation on this test that is useful, and in many cases preferred to the
system already shown. In this system, all the B’s do not need to outrank all the C’s. It
does require that the number of B samples and the number of C samples must be
approximately equal. If the ratio between the size of the B sample and the C sample falls
in the range of 3:4 to 4:3, the sample sizes are near enough to being equal. This test also
absolutely requires randomization.

Suppose that you draw a random sample of 6 B’s and 6 C’s, and that you rank them in
order of “goodness”. Your distribution might look like this:

Best     B
         B     Pure B’s. This is your “B end count.” (4 B’s)
         B
         B
         C
         B     Mixed B’s and C’s. Disregard.
         C
         C
         B
         C
         C     Pure C’s. This is your “C end count.” (3 C’s)
         C
Worst

Now just add your B end count to your C end count, and refer to this chart for your level
of significance.

α Risk   B+C End Count (at least)
.1              6
.05             7
.01            10
.001           13

Since our total end count is 7, we can be 95% confident that B is better than C.
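
As a sketch of the bookkeeping, here is a small helper of my own (not part of the paper)
that computes the end count from two samples ranked on a common “goodness” scale.

def end_count(b_values, c_values):
    # Higher value = better. Rank everything on one scale, then count the
    # pure-B run at the top and the pure-C run at the bottom.
    ranked = sorted([(v, 'B') for v in b_values] + [(v, 'C') for v in c_values],
                    reverse=True)
    labels = [label for _, label in ranked]
    if labels[0] == labels[-1]:
        return len(labels)          # one group dominates completely
    top = 0
    while labels[top] == labels[0]:
        top += 1
    bottom = 0
    while labels[-1 - bottom] == labels[-1]:
        bottom += 1
    return top + bottom

# Example matching the picture above: B end count 4, C end count 3, total 7.
print(end_count([10, 9, 8, 7, 5.5, 3.5], [6, 5, 4, 3, 2, 1]))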

The Two Sample T Test

This is a powerful test, using interval or ratio data. Its null hypothesis is that the means
of two normally distributed populations are the same. The alternate hypothesis can be
that they are not equal, or that the mean of A is greater/less than the mean of B.

This test requires that two fundamental assumptions be met:

• The data are normally distributed. A normal distribution is also called a Gaussian
distribution. Actually, the test is pretty robust, and will still give good results with
pretty non-normal data.
• The data are stable during the sampling period. This assumption is very
important.

You must demonstrate that you meet these assumptions if you want to cruise down the
Gaussian superhighway. Fortunately, this is easy.

[I Chart for DATA1: individuals control chart, center line X = 100.1, control limits
-3.0SL = 82.61 and 3.0SL = 117.6.]

First, let’s check stability for DATA1. There are several ways to do this, but my favorite
is the Individuals Control Chart. This is run by clicking STAT|CONTROL
CHARTS|INDIVIDUALS. The basic thing we’re doing is verifying that there are no
trends or sudden, major shifts in the data (stability). This chart spreads the data out in
time, and allows us to inspect it. DATA1 is just fine.
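
The arithmetic behind an individuals chart is simple enough to sketch in a few lines of
Python. The data values below are made up for illustration; only the method follows the
chart above.

import numpy as np

data1 = np.array([102, 98, 105, 95, 100, 99, 103, 97, 101, 104,
                  96, 100, 98, 102, 99, 101, 97, 103, 100, 98,
                  105, 95, 99, 102], dtype=float)

center = data1.mean()                   # the X center line
mr_bar = np.abs(np.diff(data1)).mean()  # average moving range
# 2.66 is the standard individuals-chart constant (3 divided by d2 = 1.128).
ucl = center + 2.66 * mr_bar            # 3.0SL upper control limit
lcl = center - 2.66 * mr_bar            # -3.0SL lower control limit
print(lcl, center, ucl)                 # then look for points outside, or trends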

DATA2 is not so fine. I see an upward trend in the first dozen data points, followed by a
sudden downward shift, then a sudden upward shift at about data point 19. This data
does not meet our assumption of stability.

[I Chart for DATA2: individuals control chart, center line X = 106.6, control limits
-3.0SL = 84.29 and 3.0SL = 129.0.]

[I Chart for DATA3: individuals control chart, center line X = 0.6667, control limits
-3.0SL = -24.03 and 3.0SL = 25.36.]

DATA3 has a discrimination problem. The only values that the data can take on are –10,
-5, 0, 5, and 10. Since the data can only exist in one of five states, it will not pass a
normality test. I’ve researched the matter, and have not been able to get an accurate
statement of how much discrimination is enough. My opinion is that you should see at
least 10 different values in the data, and there is some reasoning behind this opinion,
which we don’t have space to discuss here.

We have now concluded that DATA2 and DATA3 are not prime candidates for the
Gaussian/normal model. This does not mean that we absolutely cannot apply the model,
but it does mean that if we do apply it, we use the results with appropriate caution.

DATA1 has met both of our tests so far, and requires only one more test. That is for
normality. This test is done by clicking STAT|BASIC STATS|NORMALITY TEST.
DATA1 produces this result:
[Normal probability plot for DATA1. Average: 100.082, StDev: 5.26480, N: 24.
Anderson-Darling Normality Test: A-Squared = 0.152, P-Value = 0.953.]

The first test is to simply look at the data points and see if they fall close to the red line.
How close is close enough? We generally use the rule that if it can be covered by a fat
pencil, it is good enough. In fact, if you have fewer than 15-20 data points, this is the
only test you need apply. P values are not too revealing for small samples. However, if
you have many points, pay close attention to the P value. If it is .05 or more, you have no
reason to believe the data is non-normal.
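
A Python sketch of the same kind of check follows (an addition of mine, not from the
paper). Note that scipy’s Anderson-Darling routine reports critical values rather than a P
value, so a Shapiro-Wilk test, a different normality test, is shown as an easy way to get a
P value.

import numpy as np
from scipy.stats import anderson, shapiro

rng = np.random.default_rng(0)
sample = rng.normal(100, 5, size=24)     # made-up stand-in for DATA1

ad = anderson(sample, dist='norm')       # Anderson-Darling statistic
print(ad.statistic, ad.critical_values)  # compare the statistic to the critical values

stat, p = shapiro(sample)                # Shapiro-Wilk normality test
print(p)   # P of .05 or more: no reason to believe the data is non-normal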

Another way to look for normality is to do STAT|BASIC STATS|DISPLAY
DESCRIPTIVE STATISTICS, and under GRAPHS, check GRAPHICAL SUMMARY.
That will produce this result:

Descriptive Statistics
Variable: DATA1

Anderson-Darling Normality Test
A-Squared: 0.152
P-Value:   0.953

Mean        100.082        Minimum        90.854
StDev         5.265        1st Quartile   96.666
Variance     27.7181       Median        100.344
Skewness      0.341344     3rd Quartile  103.928
Kurtosis      9.98E-04     Maximum       112.583
N            24

95% Confidence Interval for Mu:      97.859 to 102.306
95% Confidence Interval for Sigma:    4.092 to 7.385
95% Confidence Interval for Median:  97.551 to 101.980

[The graphical summary also shows a histogram of DATA1 with a fitted normal curve.]

Note that we get our same normality P value, .953, and that the computer draws its best
estimate of a normal curve that fits the data histogram. Don’t be shocked if your data
looks a lot more ragged than this, but still tests normal. Small samples can look pretty
Raggedy Andy, and still truly be normal.

Using the normal distribution superhighway has its distinct advantages. The toll is that
you should give at least a little attention to checking the assumptions. In practice, you
need not be overly concerned about normality if your samples are reasonably large. It is
not necessary for the “distribution of data” to be normal. It is only necessary that the
“distribution of differences” be normal, and, with decent size samples, that will be the
case.

Let us assume that we have been collecting data on a weight loss product. Two randomly
selected groups of volunteers are to be tested. Group A receives a placebo, and group B
receives Factor L. Both groups are weighed at the beginning of the study, and 90 days
later, and the change in weight is calculated. For each volunteer we have a number that
represents weight gain. A “gain” of –8 would be a loss of 8 pounds. We have tested the
data for normality, and stability, and it is satisfactory. We select as our null hypothesis:
Weight gain does not depend on whether the volunteer received the placebo or Factor L.
Our alternate hypothesis is that the mean gain for the control group is greater than the
mean gain for the test group.

The test is performed by invoking STAT|BASIC STATS|2-SAMPLE T.

At this point, we must confess that we have slipped a new idea into the mix. We have
developed a lot of our ideas based on the Gaussian distribution. It turns out there is
another distribution that is practically identical to the Gaussian distribution for samples
larger than 30. However, this other distribution is more accurate than the Gaussian for
smaller sample sizes. This is the Student’s T distribution. We almost always use it, since
it applies everywhere the Gaussian distribution does, and some places that it does not.

Assume that my control group is in the column CONTROL, and that the test group is in
the column FACTORL. You would then choose STAT|BASIC STATS|2-SAMPLE T.
From the dialog window choose SAMPLES IN DIFFERENT COLUMNS, and indicate
CONTROL as the first column and FACTORL as the second column. As your
ALTERNATIVE, indicate GREATER THAN. This may seem a little backwards, but we
are hoping the weight gains in the first column are greater than the weight gains in the
second column. Minitab is not specific about telling you how to do this, so you must
know that it assumes the form COLUMN 1 [>,<, NOT EQUAL] COLUMN 2.

From this, we obtain the following:

Two Sample T-Test and Confidence Interval

Two sample T for CONTROL vs FACTORL

           N   Mean  StDev  SE Mean
CONTROL   24   1.01   3.19     0.65
FACTORL   24  -2.61   4.36     0.89

95% CI for mu CONTROL - mu FACTORL: ( 1.40, 5.85)
T-Test mu CONTROL = mu FACTORL (vs >): T = 3.29  P = 0.0010  DF = 42

[Boxplots of CONTROL and FACTORL; means are indicated by solid circles.]

The interpretation of this is that our control group of 24 volunteers gained an average of
1.01 pounds, and that the standard deviation of the group’s gain was 3.19 pounds. Our
test volunteers gained an average of –2.61 pounds (2.61 pound loss), and the standard
deviation of the group’s gain was 4.36 pounds. If we assert that the gain of the control
group is greater than the gain of the test group, there is a .0010 chance that we will be
wrong. We therefore reject the null hypothesis, and accept the alternative hypothesis.
The box plot gives a nice visual to demonstrate what the statistics tell us.

It is by no means necessary to have equal sample sizes.

We did not check the ASSUME EQUAL VARIANCES box, because we have not taken
the step of demonstrating that the standard deviations (and hence the variances) of the
two groups are approximately equal. If you want to run TEST OF EQUAL
VARIANCES on your data, you can earn the right to check this box. However, doing
this essentially relieves the computer of some of its calculation duties, at some expense
of your time, with negligible difference in the results. If you are feeling compassionate
toward your overworked computer, you may want to take this route. I suggest just
leaving the box unchecked.

The default alternative hypothesis is NOT EQUAL. We chose to do a one sided test,
GREATER THAN, because it always gives the test more power. If we had chosen NOT
EQUAL, our P value would have been “only” .0020. In this case, we would still make
the same decision, but, still, we have doubled our P value. Frequently, this will be of
great importance. If the test results are interesting only if the results come out in one
particular direction, then use the one sided test. That is the case most of the time.
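
For readers working outside Minitab, here is a sketch of the same comparison in Python
with scipy (version 1.6 or later for the alternative argument). The data below are
simulated stand-ins, and Welch’s form (equal_var=False) corresponds to leaving the
ASSUME EQUAL VARIANCES box unchecked.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
control = rng.normal(1.0, 3.2, size=24)    # placebo group weight gains
factor_l = rng.normal(-2.6, 4.4, size=24)  # Factor L group weight gains

# One-sided test: is the control mean greater than the Factor L mean?
t, p = ttest_ind(control, factor_l, equal_var=False, alternative='greater')
print(t, p)   # a small one-sided P lets us reject the null hypothesis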

ANOVA

ANOVA stands for Analysis of Variance. It is very much like the 2 Sample T Test, but
instead of having just two groups, you can have multiple groups. This is extremely
handy. The null hypothesis of ANOVA is that all the groups have the same mean. Also,
it assumes that all the groups have roughly the same standard deviation. Before running
ANOVA, you should do Test of Equal Variances to ensure this. If Levene’s test comes
out with a P of .05 or more, you’re generally “good to go”.
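
Here is a minimal Python sketch of that two-step routine, with made-up groups; scipy’s
levene and f_oneway stand in for Minitab’s Test of Equal Variances and one-way
ANOVA.

import numpy as np
from scipy.stats import levene, f_oneway

rng = np.random.default_rng(3)
group_a = rng.normal(100, 5, size=20)
group_b = rng.normal(104, 5, size=20)
group_c = rng.normal(99, 5, size=20)

w, p_levene = levene(group_a, group_b, group_c)
print(p_levene)    # .05 or more: the spreads look comparable, so proceed

f, p_anova = f_oneway(group_a, group_b, group_c)
print(p_anova)     # a small P says at least one group mean differs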

Giving ANOVA such a short discussion shouldn’t be interpreted as indicating that it is
less useful. It is an extremely useful tool, but beyond the scope of a short paper. If you
understand the T Test, you should be able to move into simple ANOVA without much
difficulty.

Power and Sample Size

DO NOT BEGIN AN EXPERIMENT THAT YOU ARE GOING TO ANALYZE WITH
A T TEST OR A 2^K FACTORIAL WITHOUT UNDERSTANDING THIS SECTION!

Risk, sample size, and the size of change you are trying to detect are three factors that are
eternally at odds. It is rather like the old engineering maxim, “Good, fast, cheap. Pick
any two.” You can run a test for a small change, using a small sample size, but your risk
of error will be high. You can run a test with low risk, trying to detect a small change,
but your sample size will be large.

A powerful test is one that is not highly influenced by random factors. The mathematical
definition of power is 1 - β. It is the probability that you will detect a difference, if, in fact, it exists.

If you properly use this function of Minitab, you will be able to predict the approximate
chances of success of your experiment. You would be amazed how many tests are run
that have too few samples to have any reasonable chance of success. Conversely, you
would also be amazed at how many tests are much more expensive than necessary,
because they have far too many samples.

If someone wants you to run a test that you can demonstrate has a power of .5 or less, I
suggest that you give them a quarter, and tell them to make their decision based on
whether it comes up heads or tails. If your test is no better than a quarter, don’t bother
running it. Get on to something that does have a good chance of success.

Let’s suppose that you know that “normal” for Factor K in the general population
includes values from 90 to 140. Ben assures me that the medical standard for “normal” is
95% of the population. That’s convenient, because 95% of the population occurs from 2
standard deviations below the mean to 2 standard deviations above the mean. We know,
then, that the range from 90 to 140 spans four standard deviations, and can quickly
deduce that one standard deviation is about 12.5.

Now suppose that we have a new treatment that we think will beneficially increase Factor
K by 15 points. Go to STAT|POWER AND SAMPLE SIZE. Choose 2-SAMPLE T,
then CALCULATE SAMPLE SIZE FOR EACH POWER VALUE. In POWER
VALUES, enter .95 .9 .87. In DIFFERENCE enter 15. In SIGMA, enter 12.5. Click
OPTIONS, and choose GREATER THAN as your alternative hypothesis. Minitab will
return this:

2-Sample t Test

Testing mean 1 = mean 2 (versus >)


Calculating power for mean 1 = mean 2 + 15
Alpha = 0.05 Sigma = 12.5

Sample   Target   Actual
  Size    Power    Power
    16   0.9500   0.9527
    13   0.9000   0.9077
    12   0.8700   0.8854

In this case, if we want a power of at least .95, we need two samples of 16 each. For a
power of .9, we need only 13 in each sample, and if we are willing to settle for only .87,
we need 12 in each sample.

Similarly, we can calculate our chances of success if sample size is dictated by budget or
the ever innumerate pointy haired boss.
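
Here is a hedged sketch of the same calculation in Python using statsmodels (my own
translation; the paper itself uses Minitab’s Power and Sample Size dialog). Sigma is
taken as (140 - 90) / 4 = 12.5 and the difference of interest is 15.

from math import ceil
from statsmodels.stats.power import TTestIndPower

sigma = (140 - 90) / 4          # about 12.5
effect_size = 15 / sigma        # standardized difference, 1.2

analysis = TTestIndPower()
for power in (0.95, 0.90, 0.87):
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                             power=power, alternative='larger')
    print(power, ceil(n))       # roughly 16, 13, and 12 per group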

2^k Factorial Experiments

These are wonderful.

When you do a T Test, or a single-factor ANOVA, you are changing one variable, and
studying the effect on your test subjects. This is a “one factor at a time” or OFAT
experiment.

OFAT experiments are fine, but you have to realize that any factor that you don’t account
for shows up as noise, which makes your distributions fatter, and makes your effect
harder to see. That is, cholesterol levels depend on a number of factors, such as diet, level of
exercise, and heredity. In an OFAT, you randomize to make sure that these variables do
indeed show up as noise. Would it not be better to make them part of the experiment?
Then they can be made into “signal” instead of noise.

A 2^k experiment allows you to do this. It also allows you to detect interactions between
the variables. The lovely thing is that you can do this for four or five variables, and still
not have to take many more observations.

For example, you can test two treatments simultaneously, and account for any differences
due to gender, and you can test for all the interactions, all for the price of a T Test.

To set up your test, choose two to five variables you want to test. More than this can get
a little messy. Choose a “high” and a “low” value for each variable. It is better if these
are interval values, but they can be categorical/nominal. For example, if I want to test the
effect of ascorbic acid and aspirin on acidity, I might choose zero as my low dose for
both, and 500 mg as my high dose for both. I might want to test subject age as my other
variable. I could pick 20-25 as the low value for my age variable and 55-60 as the high
value.

My output will be some very scientifically arrived at acidity scale, which extends from 0
to 100.

For a simple demonstration, let’s do this with a single replicate, and no center points.
We’ll add these embellishments in class.

In Minitab, go to STAT|DOE|CREATE FACTORIAL DESIGN. Choose 2 LEVEL
FACTORIAL, with 3 as your number of variables. Under DESIGNS, choose FULL
FACTORIAL, with 0 center points, 1 replicate, and 1 block. Click OK, then go to
FACTORS, and in the NAME column, put ascorbic acid next to A, aspirin next to
B and age next to C. Your low value will automatically be coded as a –1, and your high
value will automatically be coded as a 1. Yes, you can change this, but wait until you
understand a little better.

Click OK, and OK again, and Minitab will design your experiment for you. It will
automatically randomize your run order for you. (Actually, with a single replicate,
randomization buys you nothing, so we would have been justified in not randomizing this
time. However, we will shortly drop a variable, effectively creating a second replicate,
and, at that point, randomization starts to have some use.) Your design will look different
from mine, because your random number generator spat out a different sequence than
mine did---well, we hope so, or they are not so random. Anyway, mine looks like this:

StdOrder  RunOrder  CenterPt  Blocks  ascorbic acid  aspirin  age  acidity
   2         1          1        1          1          -1     -1
   3         2          1        1         -1           1     -1
   8         3          1        1          1           1      1
   1         4          1        1         -1          -1     -1
   4         5          1        1          1           1     -1
   6         6          1        1          1          -1      1
   7         7          1        1         -1           1      1
   5         8          1        1         -1          -1      1

This instructs me to find a young subject, give him a high dose of ascorbic acid, a low
dose of aspirin, and note the resulting acidity. I then continue on down the list. One
missing data point is not necessarily the kiss of death when you have multiple replicates
(sets of data), but you do want to try very hard to get complete sets.
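
If you want to see what the design generator is doing, this small Python sketch (mine,
not Minitab’s algorithm) builds the same 2^3 full factorial in coded units and shuffles
the run order.

import itertools
import random

factors = ['ascorbic acid', 'aspirin', 'age']
runs = [dict(zip(factors, levels))
        for levels in itertools.product([-1, 1], repeat=len(factors))]

random.seed(2)        # your shuffle should differ from mine
random.shuffle(runs)  # randomized run order
for i, run in enumerate(runs, start=1):
    print(i, run)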

In my table, I created a made-up set of results, and Minitab obligingly spits out the
following interesting output:

[Pareto chart of the effects (response is acidity, Alpha = .10), with bars for the factors
aspirin, ascorbic, and age and their interactions.]

Fractional Factorial Fit

Estimated Effects and Coefficients for acidity (coded units)

Term                    Effect     Coef
Constant                        43.7500
aspirin                57.5000  28.7500
ascorbic               27.5000  13.7500
age                    10.0000   5.0000
aspirin*ascorbic        7.5000   3.7500
aspirin*age             0.0000   0.0000
ascorbic*age           -0.0000  -0.0000
aspirin*ascorbic*age    0.0000   0.0000

Analysis of Variance for acidity (coded units)

Source               DF   Seq SS   Adj SS   Adj MS  F  P
Main Effects          3  8325.00  8325.00  2775.00  *  *
2-Way Interactions    3   112.50   112.50    37.50  *  *
3-Way Interactions    1     0.00     0.00     0.00  *  *
Residual Error        0     0.00     0.00     0.00
Total                 7  8437.50

The interpretation of this output is that going from the low state of aspirin to the high
state accounts for 57.5 units of the observed change in acidity. Similarly, ascorbic acid is
responsible for 27.5 units of change. Age, and an interaction between aspirin and
ascorbic acid account for lesser amounts of change, and the model predicts that these last
two are not statistically significant. We shall see more about this as we add replicates.
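
To show what an “effect” is, here is a short Python sketch. The eight acidity values are
not printed in the paper, so the ones below are backed out of the effects table above
(they reproduce it exactly); each effect is just the mean response at the factor’s high
level minus the mean at its low level.

import itertools
import numpy as np

levels = list(itertools.product([-1, 1], repeat=3))   # (ascorbic, aspirin, age)
acidity = {(-1, -1, -1):  0, (1, -1, -1): 20, (-1, 1, -1): 50, (1, 1, -1): 85,
           (-1, -1,  1): 10, (1, -1,  1): 30, (-1, 1,  1): 60, (1, 1,  1): 95}

def effect(index):
    hi = np.mean([acidity[run] for run in levels if run[index] == 1])
    lo = np.mean([acidity[run] for run in levels if run[index] == -1])
    return hi - lo

print(effect(1))   # aspirin: 57.5
print(effect(0))   # ascorbic acid: 27.5
print(effect(2))   # age: 10.0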

Since age is not indicated as a significant variable, we can gain more information about
the important variables by eliminating this unimportant one.

By going to STAT|DOE|ANALYZE FACTORIAL DESIGN|TERMS, we can eliminate
all terms except aspirin and ascorbic acid. Since we have only 2 main terms, only 4 runs
are required per replicate. But note that we still have 8 observations. The computer will
recognize this, and turn our 4 extra observations into a replicate, giving us two replicates.
The output now becomes quite a bit stronger.

Minitab does do a quick shuffle on us, without telling us---with a single replicate, it
reports the magnitude of the effect. When we get two replicates, it reports a T value.
The output looks like this:

[Pareto chart of the standardized effects (response is acidity, Alpha = .10), with bars for
aspirin and ascorbic.]

Estimated Effects and Coefficients for acidity (coded units)

Term       Effect    Coef  StDev Coef      T      P
Constant            43.75       2.795  15.65  0.000
aspirin     57.50   28.75       2.795  10.29  0.000
ascorbic    27.50   13.75       2.795   4.92  0.004

Analysis of Variance for acidity (coded units)

Source           DF  Seq SS  Adj SS   Adj MS      F      P
Main Effects      2  8125.0  8125.0  4062.50  65.00  0.000
Residual Error    5   312.5   312.5    62.50
Lack of Fit       1   112.5   112.5   112.50   2.25  0.208
Pure Error        4   200.0   200.0    50.00
Total             7  8437.5

We now get F and P values, and note that we still have about the same estimates for the
constant (average acidity with no treatment), aspirin, and ascorbic acid. We also see that
our model accounts for 8125/8437.5 of the variation, with 312.5/8437.5 of the variation
being attributed to noise (variables we did not account for).

We have no more than a .004 chance of being wrong if we assert that aspirin and
ascorbic acid increase acidity. We also know the magnitude of change that each
variable causes. That’s a pretty substantial model, for only 8 observations.
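
As a cross-check, the reduced two-factor model can be fit with ordinary least squares in
Python using statsmodels (my own translation, not Minitab’s routine). The acidity values
are the same backed-out numbers used in the earlier sketch, and the coefficients, T
values, and P values match the table above.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'ascorbic': [-1,  1, -1,  1, -1,  1, -1,  1],
    'aspirin':  [-1, -1,  1,  1, -1, -1,  1,  1],
    'acidity':  [ 0, 20, 50, 85, 10, 30, 60, 95],
})

model = smf.ols('acidity ~ aspirin + ascorbic', data=df).fit()
print(model.params)    # about 43.75, 28.75 (aspirin), 13.75 (ascorbic)
print(model.tvalues)   # about 15.65, 10.29, and 4.92
print(model.pvalues)   # matching the Minitab table above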

Even if we cannot reduce the model by dropping one variable, a simple 8 observation
experiment for three variables can be very revealing, especially if we are in the screening
stages.

The output of the test assumes a linear model. If we want to test the validity of this
assumption, we can add center points. Three per replicate is pretty normal.

Using Power and Sample Size, we find that if we want to be .95 sure of catching a 15
point shift in a process that normally runs with a standard deviation of 15, and run no
more than a .05 alpha risk, we need 7 replicates, or 56 total observations. For 56
observations, we get full evaluation of three variables and all their interactions. A T Test
to detect a single variable on this level requires 54 observations. Two more observations
give us a lot more information. A four variable experiment with these objectives requires
only 64 observations. Actually, the factorial experiment is much more likely to succeed
than the T Test, besides being much more informative.

Why should this be so? Suppose we are testing blood cholesterol levels, and our
variables are Medicine X, gender, and exercise level. If we test Medicine X vs. placebo
on the general population, we will have relatively broad distributions, since we randomly
select males, females, couch potatoes and athletes to ensure that we represent the whole
population. In a 2^k, our distributions will have relatively smaller standard deviations,
since we will be considering more restricted groups, made up of males who exercise,
male couch potatoes, female athletes, and female couch potatoes. These are more select
groups. We have taken some of the variation out, and made it into variables which we
test. Skinnier distributions make it easier to detect smaller changes, so our chances of
success are better than they would be with a T Test.

To demonstrate the effect of replicates, the same data was copied into an experiment with
the same variables, and 7 replicates instead of 1. Random numbers with a mean of 0 and
a standard deviation of 10 were added to the data, to simulate the effects of unaccounted
for variables and measurement system error. Our results are:

[Pareto chart of the standardized effects (response is acid2, Alpha = .10), with bars for
aspirin, ascorbic, age, and their interactions.]

Fractional Factorial Fit

Estimated Effects and Coefficients for acid2 (coded units)

Term                     Effect     Coef  StDev Coef      T      P
Constant                        42.2642       1.337  31.62  0.000
aspirin                 59.4971  29.7485      1.337  22.26  0.000
ascorbic                28.5346  14.2673      1.337  10.67  0.000
age                     14.8242   7.4121      1.337   5.55  0.000
aspirin*ascorbic        14.2761   7.1380      1.337   5.34  0.000
aspirin*age             -0.0245  -0.0123      1.337  -0.01  0.993
ascorbic*age             0.5717   0.2859      1.337   0.21  0.832
aspirin*ascorbic*age     1.5459   0.7730      1.337   0.58  0.566

Analysis of Variance for acid2 (coded units)

Source               DF   Seq SS   Adj SS   Adj MS       F      P
Main Effects          3  64034.4  64034.4  21344.8  213.34  0.000
2-Way Interactions    3   2857.9   2857.9    952.6    9.52  0.000
3-Way Interactions    1     33.5     33.5     33.5    0.33  0.566
Residual Error       48   4802.5   4802.5    100.1
Pure Error           48   4802.5   4802.5    100.1
Total                55  71728.2

Unusual Observations for acid2

Obs    acid2      Fit  StDev Fit  Residual  St Resid
 11   34.105   12.708      3.781    21.398     2.31R
 44  103.573   84.959      3.781    18.614     2.01R
 55   47.981   26.582      3.781    21.399     2.31R

R denotes an observation with a large standardized residual

[Residual diagnostics for acid2: normal plot of residuals, I chart of residuals (center
0.000, control limits ±30.76), histogram of residuals, and residuals vs. fits, plus an
interaction plot of the data means for aspirin and ascorbic.]
With 7 replicates, we can “see more clearly”, and all the variables have statistical
significance. We can also see that there is a significant interaction between aspirin and
ascorbic acid. This is different from the two producing a simple additive result. The
combination of aspirin and ascorbic acid being in their high state produces a stronger
result than just adding the two results together.

We have also performed an analysis of our residuals. This is something you should
always do when you have replicates. We want to see residuals that are normally
distributed, time stable, and that are fairly equally spaced about zero in the residuals vs.
fits graph. These residuals aren’t perfect, but they are quite adequate.

We have nicely recovered our factors, too, even in the face of a fair amount of deliberate
statistical noise. The recovered factors correspond well with the ones used to generate
the outcomes.

This class of experimental designs has some tremendous benefits:

1. It is vastly more efficient than single factor designs. For the price of a T test,
which provides information on a single variable, we get five variables and all their
possible interactions. For six or more variables, it becomes more efficient to use
advanced versions of this design, which are beyond the scope of this paper.
2. Single factor designs will discover interactions only with the greatest of labor and
successful intuition. 2^k designs discover and characterize them with great ease.
3. 2^k designs are more likely to discover effects than simple T tests are. Factors that
cloud a T test with noise can be made into input variables in this type of test, and
that removes their ability to conceal effects.

Promontory Management Group, Inc.


www.pmg.cc
801 710 5645
This document provided free of charge, for the use
of the person who requested it. It may not be
reproduced, distributed, or incorporated in a
course of study without written permission.
