
STATGRAPHICS - Rev. 7/3/2009
© 2009 by StatPoint Technologies, Inc.
Box-Cox Transformations

Summary
The Box-Cox Transformations procedure is designed to determine an optimal transformation
for Y while fitting a linear regression model. It is useful when the variability of Y changes as a
function of X. Often, an appropriate transformation of Y both stabilizes the variance and makes
the deviations around the model more normally distributed.

The class of transformations considered is the family of power transformations defined by

$$ Y' = (Y + \lambda_2)^{\lambda_1} \qquad (1) $$

in which the data are raised to a power $\lambda_1$ after shifting by an amount $\lambda_2$. Often, the shift
parameter $\lambda_2$ is set equal to 0. This class includes square roots, logarithms, reciprocals, and other
common transformations, depending on the power. Examples include:

Power                   Transformation         Description
$\lambda_1 = 2$         $Y' = Y^2$             square
$\lambda_1 = 1$         $Y' = Y$               untransformed data
$\lambda_1 = 0.5$       $Y' = \sqrt{Y}$        square root
$\lambda_1 = 0.333$     $Y' = \sqrt[3]{Y}$     cube root
$\lambda_1 = 0$         $Y' = \ln(Y)$          logarithm
$\lambda_1 = -0.5$      $Y' = 1/\sqrt{Y}$      inverse square root
$\lambda_1 = -1$        $Y' = 1/Y$             reciprocal

Note that as $\lambda_1 \to 0$, the power transformation approaches a logarithm.
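
The table and the limiting behavior can be checked numerically. The following Python sketch is illustrative only: the names `power_transform` and `scaled_power` are hypothetical and are not part of STATGRAPHICS. It assumes that the logarithm is substituted when the power is 0, as in the table, and it uses the scaled form $(Y^{\lambda_1} - 1)/\lambda_1$ to show the approach to ln(Y) as the power shrinks.

```python
import math

def power_transform(y, lam1, lam2=0.0):
    """Shift y by lam2, then raise to the power lam1 (natural log when lam1 is 0)."""
    shifted = y + lam2
    if shifted <= 0:
        raise ValueError("Y + lambda2 must be positive")
    if lam1 == 0:
        return math.log(shifted)
    return shifted ** lam1

def scaled_power(y, lam1):
    """Scaled power (y**lam1 - 1) / lam1, which tends to ln(y) as lam1 -> 0."""
    return (y ** lam1 - 1.0) / lam1

y = 13.44  # first plasma level in the sample data
for lam1 in (0.1, 0.01, 0.001):
    # The middle column moves toward the last column as lam1 shrinks.
    print(lam1, scaled_power(y, lam1), math.log(y))
```

The unscaled power $Y^{\lambda_1}$ itself tends to 1 as the power goes to 0, which is why the comparison is made on the scaled form.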


Sample StatFolio: boxcox.sgp

Sample Data:
The file plasma.sgd contains data presented by Neter et al. (1998) showing the plasma level of a
polyamine for n =25 healthy children. A portion of the data is shown below:

Age Plasma Level
0 13.44
0 12.84
0 11.91
0 20.09
0 15.6
1 10.11
1 11.38
1 10.28
1 8.96
1 8.59
2 9.83
2 9
2 8.65


It is desired to determine a model relating the plasma level to the age of the child.

Data Input
The data input dialog box requests the names of the columns containing the dependent variable
Y and the independent variable X:



- Y: numeric column containing the n observations for the dependent variable Y.

- X: numeric column containing the n values for the independent variable X.

- Select: subset selection.
Analysis Summary
In relating the two variables, the procedure fits a model of the form

$$ W = \beta_0 + \beta_1 X + \varepsilon \qquad (2) $$

where the dependent variable W is related to Y according to


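$$
W = \begin{cases}
K_1 \left[ (Y + \lambda_2)^{\lambda_1} - 1 \right] & \text{if } \lambda_1 \neq 0 \\
K_2 \ln(Y + \lambda_2) & \text{if } \lambda_1 = 0
\end{cases}
\qquad (3)
$$

and

$$ K_2 = \left( \prod_{i=1}^{n} (Y_i + \lambda_2) \right)^{1/n} \qquad (4) $$

$$ K_1 = \frac{1}{\lambda_1 K_2^{\lambda_1 - 1}} \qquad (5) $$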

Note that $K_2$ is the geometric mean of $Y + \lambda_2$. Following Box and Cox (1964), the optimal
transformation is the one that minimizes the mean squared error for W. The reason for using the
standardized variable W instead of Y' is to adjust the magnitude of the error sum of squares for
the effect of the power transformation.
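
As a rough illustration of this criterion, the sketch below standardizes the response, fits a least squares line, and picks the power with the smallest mean squared error. It is not the STATGRAPHICS implementation: the function names are hypothetical, a simple grid search stands in for the program's optimization search, and only the 13 rows of plasma.sgd listed above are used, so the selected power need not match the Analysis Summary.

```python
import math

def standardized_w(y, lam1, lam2=0.0):
    """Standardized transform W, with K2 the geometric mean of Y + lam2."""
    shifted = [v + lam2 for v in y]
    k2 = math.exp(sum(math.log(v) for v in shifted) / len(shifted))
    if lam1 == 0:
        return [k2 * math.log(v) for v in shifted]
    k1 = 1.0 / (lam1 * k2 ** (lam1 - 1.0))
    return [k1 * (v ** lam1 - 1.0) for v in shifted]

def regression_mse(x, w):
    """Residual mean square of the least squares line W = b0 + b1 * X."""
    n = len(x)
    xbar = sum(x) / n
    wbar = sum(w) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxw = sum((xi - xbar) * (wi - wbar) for xi, wi in zip(x, w))
    slope = sxw / sxx
    intercept = wbar - slope * xbar
    sse = sum((wi - intercept - slope * xi) ** 2 for xi, wi in zip(x, w))
    return sse / (n - 2)

def best_power(x, y, powers, lam2=0.0):
    """Grid search (an assumption of this sketch) for the MSE-minimizing power."""
    return min(powers, key=lambda lam1: regression_mse(x, standardized_w(y, lam1, lam2)))

# Only the portion of plasma.sgd shown in the Sample Data listing.
age = [0] * 5 + [1] * 5 + [2] * 3
plasma = [13.44, 12.84, 11.91, 20.09, 15.6,
          10.11, 11.38, 10.28, 8.96, 8.59,
          9.83, 9.0, 8.65]
grid = [k / 20 for k in range(-40, 41)]  # -2.0 to 2.0 in steps of 0.05
print(best_power(age, plasma, grid))
```

Because W is rescaled by the geometric mean, the MSE values for different powers are on a comparable scale, which is what makes the direct comparison meaningful.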

The Analysis Summary shows the optimal power and the resulting model:

Box-Cox Transformations - Plasma Level vs. Age
Power = -0.506    Shift = 0.0
Dependent variable: Plasma Level
Independent variable: Age

Parameter    Estimate    Standard Error    T Statistic    P-Value
Intercept    37.6283     0.399299          94.2359        0.0000
Slope        -1.99141    0.163013          -12.2162       0.0000

Analysis of Variance
Source           Sum of Squares    Df    Mean Square    F-Ratio    P-Value
Model            198.285           1     198.285        149.24     0.0000
Residual         30.5593           23    1.32866
Total (Corr.)    228.845           24

Correlation Coefficient = -0.93084
R-squared = 86.6463 percent
Standard Error of Est. = 1.15268

Approximate 95% confidence interval for power: -1.116 to 0.063

Included in the output are:

- Power and shift parameters: the values of $\lambda_1$ and $\lambda_2$. By default, the power parameter is
optimized, while the shift parameter is set to 0. This may be changed using Analysis Options.
Also included at the bottom of the screen is an approximate confidence interval for $\lambda_1$ at the
default system confidence level.

- Coefficients: the estimated coefficients, standard errors, t-statistics, and P values. The
estimates of the model coefficients can be used to write the fitted equation, which in the
example is

W = 37.6283 - 1.99141 Age (6)

The t-statistic tests the null hypothesis that the corresponding model parameter equals 0,
versus the alternative hypothesis that it does not equal 0. Small P-Values (less than 0.05 if
operating at the 5% significance level) indicate that a model coefficient is significantly
different from 0. In the sample data, both the intercept and slope are statistically significant.

- Analysis of Variance: decomposition of the variability of the dependent variable W into a
model sum of squares and a residual or error sum of squares. Of particular interest is the
F-test and its associated P-value, which tests the statistical significance of the fitted model. A
small P-Value (less than 0.05 if operating at the 5% significance level) indicates that a
significant linear relationship exists between the transformed response W and X. In the sample
data, the model is highly significant.

- Statistics: summary statistics for the fitted model, including:

Correlation coefficient - measures the strength of the linear relationship between W and X on
a scale ranging from -1 (perfect negative linear correlation) to +1 (perfect positive linear
correlation).

R-squared - represents the percentage of the variability in W that has been explained by the
fitted regression model, ranging from 0% to 100%.

Standard Error of Est. - the estimated standard deviation of the residuals (the deviations
around the model). This value is used to create prediction limits for new observations.

Mean Absolute Error - the average absolute value of the residuals.
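
The summary statistics follow directly from the sums of squares and coefficient estimates. The short Python check below recomputes them from the values printed in the Analysis Summary for the sample data; the variable names are illustrative, and small differences in the last digit reflect rounding in the printed table.

```python
import math

# Values copied from the Analysis Summary for the sample data.
ss_model, df_model = 198.285, 1
ss_resid, df_resid = 30.5593, 23
slope, slope_se = -1.99141, 0.163013

ss_total = ss_model + ss_resid
ms_resid = ss_resid / df_resid                # residual mean square
f_ratio = (ss_model / df_model) / ms_resid    # F-Ratio for the model
r_squared = 100.0 * ss_model / ss_total       # R-squared, in percent
std_error = math.sqrt(ms_resid)               # Standard Error of Est.
slope_t = slope / slope_se                    # t-statistic for the slope
# The correlation takes the sign of the slope in simple regression.
correlation = math.copysign(math.sqrt(ss_model / ss_total), slope)

print(round(f_ratio, 2), round(r_squared, 4), round(std_error, 5))
print(round(slope_t, 4), round(correlation, 5))
```

In a simple regression with one predictor, the square of the slope's t-statistic equals the model F-ratio, which is why the two tests give the same P-value.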

In the sample data, the transformation selected is very close to an inverse square root, implying
that $1/\sqrt{\text{Plasma Level}}$ is a linear function of Age. According to the confidence interval, however,
the actual optimal transformation could be anywhere between a reciprocal and a logarithm.
Analysis Options


- Power: the value of the power parameter $\lambda_1$. If Optimize is selected, this serves as the
starting value of the optimization search when OK is pressed. If Optimize is not selected, this
is the value used for the transformation.

- Shift: the value of the shift parameter $\lambda_2$. This value is added to the dependent
variable Y before the power transformation is performed, as in equation (1).

- Optimize: whether to optimize the power parameter or use the specified value.


Plot of Fitted Model
This pane shows the fitted model, together with confidence limits and prediction limits if desired.
[Figure: Plot of Fitted Model (Power = -0.50625, Shift = 0.0), showing Plasma Level versus Age]


The plot includes:

- The line of best fit or prediction equation. This is the equation that would be used to
predict values of the dependent variable Y given values of the independent variable X.
Note that it does a relatively good job of picking up the increased variability of Plasma
Level at low Ages, as well as the curvature in the relationship.

- Confidence intervals for the mean response at X. These are the inner bounds in the
above plot and describe how well the location of the line has been estimated given the
available data sample. As the size of the sample n increases, these bounds will become
tighter. You should also note that the width of the bounds varies as a function of X, with
the line estimated most precisely near the average value $\bar{x}$.

- Prediction limits for new observations. These are the outer bounds in the above plot and
describe how precisely one could predict where a single new observation would lie.
Regardless of the size of the sample, new observations will vary around the true line.

The inclusion of confidence limits and prediction limits and their default confidence level is
determined by settings on the ANOVA/Regression tab of the Preferences dialog box, accessible
from the Edit menu.

Pane Options



- Include: the limits to include on the plot.

- Confidence Level: the confidence percentage for the limits.

- X-Axis Resolution: the number of values of X at which the line is determined when plotting.
Higher resolutions result in smoother plots.

- Type of Limits: whether to plot two-sided confidence intervals or one-sided confidence
bounds.

MSE Comparison Plot
When optimizing the transformation, the power is sought that minimizes the mean squared error
of the fit of W as a function of X. To illustrate the result of the search, the MSE Comparison Plot
shows the mean squared error in the vicinity of the optimal value:
[Figure: MSE Comparison plot (lambda2 = 0.0), showing MSE versus lambda1 over the range -2 to 2]

Vertical lines are drawn at the derived $\lambda_1$ and its confidence limits. Notice that the MSE reaches
a minimum near $\lambda_1 = -0.5$, although it is relatively flat in a wide region around the optimal
value, indicating that the power could be changed quite a bit without hurting the model
substantially.

Pane Options



- Minimum Lambda1: smallest value of $\lambda_1$ to include in the plot.

- Maximum Lambda1: largest value of $\lambda_1$ to include in the plot.

- Resolution: number of different values of $\lambda_1$ at which to calculate the MSE.

MSE Comparison Table
This table tabulates the values plotted by the MSE Comparison Plot.

MSE Comparison Table
Shift (lambda2): 0.0
lambda1 MSE
-1.0 1.4743
-0.95 1.44668
-0.9 1.42193
-0.85 1.40006
-0.8 1.38107
-0.75 1.36496
-0.7 1.35177
-0.65 1.34151
-0.6 1.33421
-0.55 1.32992
-0.5 1.32868
-0.45 1.33055
-0.4 1.33559
-0.35 1.34388
-0.3 1.35549
-0.25 1.37052
-0.2 1.38907
-0.15 1.41125
-0.1 1.43718
-0.05 1.467
0.0 1.50085

The Pane Options are the same as for the plot.


Skewness and Kurtosis Plot
This plot shows the values of the standardized skewness and standardized kurtosis as a function
of the power parameter $\lambda_1$.
[Figure: Skewness and Kurtosis Plot (lambda2 = 0.0), showing standardized skewness and standardized kurtosis versus lambda1 over the range -2 to 2]

The standardized skewness and standardized kurtosis should both be between -2 and +2 for a
transformation that adequately normalizes the data. The plot shows horizontal lines at -2 and +2,
with the vertical lines indicating the optimal value of $\lambda_1$ and its confidence limits.

Clearly, there is a wide range of values for $\lambda_1$ that would create a reasonable transformation of
the data.

Lack-of-Fit Test
When more than one observation has been recorded at the same value of X, a lack-of-fit test can
be performed to determine whether the selected model adequately describes the relationship
between Y and X. The Lack-of-Fit pane displays the following table:

Analysis of Variance with Lack-of-Fit
Source           Sum of Squares    Df    Mean Square    F-Ratio    P-Value
Model            198.286           1     198.286        149.24     0.0000
Residual         30.5593           23    1.32866
  Lack-of-Fit    3.83617           3     1.27872        0.96       0.4322
  Pure Error     26.7231           20    1.33616
Total (Corr.)    228.846           24

The lack-of-fit test decomposes the residual sum of squares of the transformed values W into 2
components:

1. Pure error: variability of the W values at the same value of X.
2. Lack-of-fit: variability of the average W values around the fitted model.

Of primary interest is the P-Value for lack-of-fit. A small P-value (below 0.05 if operating at the
5% significance level) indicates that the selected model does not adequately describe the
observed relationship.

For the example data, the large P-value indicates that the linear model adequately explains the
relationship between Plasma Level and Age.
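
The decomposition can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the program's code: `lack_of_fit` is a hypothetical name, the data in the usage lines are made up, and the P-value is omitted because it requires the F distribution, which the standard library does not provide.

```python
from collections import defaultdict

def lack_of_fit(x, w):
    """Split the residual SS of a straight-line fit into lack-of-fit and pure error."""
    n = len(x)
    xbar = sum(x) / n
    wbar = sum(w) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (wi - wbar) for xi, wi in zip(x, w)) / sxx
    intercept = wbar - slope * xbar
    resid_ss = sum((wi - intercept - slope * xi) ** 2 for xi, wi in zip(x, w))

    # Pure error: variability of W around its own mean at each distinct X.
    groups = defaultdict(list)
    for xi, wi in zip(x, w):
        groups[xi].append(wi)
    pure_ss = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        pure_ss += sum((v - mean) ** 2 for v in values)

    lof_ss = resid_ss - pure_ss   # what remains is lack-of-fit
    df_pure = n - len(groups)     # observations minus distinct X values
    df_lof = len(groups) - 2      # distinct X values minus two coefficients
    f_ratio = (lof_ss / df_lof) / (pure_ss / df_pure)
    return lof_ss, df_lof, pure_ss, df_pure, f_ratio

# Hypothetical replicated data: two observations at each of three X values.
print(lack_of_fit([0, 0, 1, 1, 2, 2], [1.0, 3.0, 2.0, 4.0, 9.0, 11.0]))
```

For the table above, the same arithmetic gives 25 - 5 = 20 degrees of freedom for pure error and 5 - 2 = 3 for lack-of-fit, with F = 1.27872 / 1.33616, or about 0.96.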

Observed versus Predicted
The Observed versus Predicted plot shows the observed values of Y on the vertical axis and the
predicted values $\hat{Y}$ on the horizontal axis, in the untransformed metric.


[Figure: Plot of Plasma Level, showing observed versus predicted values]

If the model fits well, the points should be randomly scattered around the diagonal line. It is
sometimes possible to see curvature in this plot, which would indicate the need for a curvilinear
model rather than a linear model. In this case, the change in variability in the above plot as the
predicted values increase is not a concern, since that was stabilized by the Box-Cox
transformation.

Residual Plots
As with all statistical models, it is good practice to examine the residuals. In a regression, the
residuals are defined by

$$ e_i = W_i - \hat{W}_i \qquad (7) $$

i.e., the residuals are the differences between the transformed data values and the fitted linear
regression model.

The Box-Cox Transformations procedure creates 3 residual plots:

1. versus X.
2. versus predicted value $\hat{W}$.
3. versus row number.

Residuals versus X
This plot is helpful in visualizing how well the transformation accounted for any curvature in the
data.
[Figure: Residual Plot, showing Studentized residual versus Age]

The residuals should be randomly scattered around 0.


Residuals versus Predicted
This plot is helpful in visualizing how well the model dealt with any heteroscedasticity in the
data.
[Figure: Residual Plot, showing Studentized residual versus predicted Plasma Level]

If the transformation was effective, the variability should be approximately equal everywhere.


Residuals versus Observation
This plot shows the residuals versus row number in the datasheet:
[Figure: Residual Plot, showing Studentized residual versus row number]

If the data are arranged in chronological order, any pattern in the data might indicate an outside
influence.

Pane Options



The following residuals may be plotted on each residual plot:

1. Residuals - the residuals from the least squares fit.
2. Studentized residuals - the difference between the observed values $w_i$ and the predicted
values $\hat{w}_i$ when the model is fit using all observations except the i-th, divided by the
estimated standard error. These residuals are sometimes called externally deleted
residuals, since they measure how far each value is from the fitted model when that
model is fit using all of the data except the point being considered. This is important,
since a large outlier might otherwise affect the model so much that it would not appear to
be unusually far away from the line.
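
A minimal sketch of the deleted (externally Studentized) residual for a straight-line fit is shown below. It uses the standard closed-form identity, so the model does not have to be refit n times; the function name and the sample values are hypothetical, and this is not necessarily how STATGRAPHICS computes the quantity internally.

```python
import math

def studentized_residuals(x, w):
    """Externally Studentized residuals for the line W = b0 + b1 * X."""
    n = len(x)
    xbar = sum(x) / n
    wbar = sum(w) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (wi - wbar) for xi, wi in zip(x, w)) / sxx
    intercept = wbar - slope * xbar
    resid = [wi - intercept - slope * xi for xi, wi in zip(x, w)]
    sse = sum(e * e for e in resid)

    result = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx           # leverage of this point
        s2_deleted = (sse - e * e / (1.0 - h)) / (n - 3)  # MSE with the point removed
        result.append(e / math.sqrt(s2_deleted * (1.0 - h)))
    return result

# Hypothetical transformed values, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
w = [1.0, 2.1, 2.9, 4.2, 4.8]
print(studentized_residuals(x, w))
```

Dividing by the deleted estimate of the standard error is what keeps a single large outlier from inflating its own yardstick.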


Unusual Residuals
Once the model has been fit, it is useful to study the residuals to determine whether any outliers
exist that should be removed from the data. The Unusual Residuals pane lists all observations
that have Studentized residuals of 2.0 or greater in absolute value.

Unusual Residuals
Row    X      Y        Predicted Y    Residual    Studentized Residual
4      0.0    20.09    13.925         6.16496     2.22
18     3.0    5.14     6.6342         -1.4942     -2.64

Studentized residuals greater than 3 in absolute value correspond to points more than 3 standard
deviations from the fitted model, which is an extremely rare event for a normal distribution. Note
that row #18 is more than 2.5 standard deviations out and would be worth investigating further.

Points can be removed from the fit while examining the Plot of the Fitted Model by clicking on a
point and then pressing the Exclude/Include button on the analysis toolbar:
[Figure: Plot of Fitted Model (Power = -0.629648, Shift = 0.0), showing Plasma Level versus Age]

Excluded points are marked with an X. For the sample data, removing row #18 has little effect
on the fitted model or optimal transformation.

Influential Points
In fitting a regression model, all observations do not have an equal influence on the parameter
estimates in the fitted model. In a simple linear regression, points located at very low or very
high values of X have greater influence than those located nearer to the mean of X. The
Influential Points pane displays any observations that have high influence on the fitted model:

Influential Points
Row    X    Y    Predicted Y    Studentized Residual    Leverage
Average leverage of single data point = 0.0833333

The above table shows every point with leverage equal to 3 or more times that of an average data
point, where the leverage of an observation is a measure of its influence on the estimated model
coefficients. In general, values with leverage exceeding 5 times that of an average data value
should be examined closely, since they have unusually large impact on the fitted model. In the
sample data, there are no observations with unusually large leverage.
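
For a straight-line fit, leverage depends only on the X values, and the average leverage equals the number of coefficients (2) divided by the number of observations. The Python sketch below applies the 3-times-average rule described above; the function names and the example designs are hypothetical.

```python
def leverages(x):
    """Leverage h_i = 1/n + (x_i - xbar)^2 / Sxx for a straight-line fit."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

def high_leverage_rows(x, multiple=3.0):
    """Return (row, leverage) pairs at or above `multiple` times the average."""
    average = 2.0 / len(x)  # two coefficients: intercept and slope
    return [(row, h) for row, h in enumerate(leverages(x), start=1)
            if h >= multiple * average]

# Hypothetical balanced design: five observations at each age 0 through 4.
ages = [0, 1, 2, 3, 4] * 5
print(high_leverage_rows(ages))               # balanced design: nothing flagged

# One extreme X value dominates the fit and is flagged.
print(high_leverage_rows([0] * 10 + [100]))
```

The printed average of 0.0833333 equals 2/24, which would be consistent with a fit on 24 observations; that reading is an inference from the number and is not stated in the output.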

Forecasts
The Forecasts pane creates predictions using the fitted model.

Predicted Values
       Predicted    95.00% Prediction Limits    95.00% Confidence Limits
X      Y            Lower        Upper          Lower        Upper
0.0    14.0273      10.1419      21.0949        12.5208      15.8534
1.0    10.5559      8.06887      14.5887        9.85816      11.3378
2.0    8.29468      6.58021      10.8784        7.89718      8.72585
3.0    6.72855      5.4718       8.53447        6.38003      7.1092
4.0    5.59288      4.62323      6.94395        5.23331      5.99452
5.0    4.73945      3.95917      5.80556        4.37337      5.15824

Included in the table are:

- X - the value of the independent variable at which the prediction is to be made.

- Predicted Y - the predicted value of the dependent variable using the fitted model.

- Prediction limits - prediction limits for new observations at the selected level of
confidence (corresponds to the outer bounds on the plot of the fitted model).

- Confidence limits - confidence limits for the mean value of Y at the selected level of
confidence (corresponds to the inner bounds on the plot of the fitted model).

For example, at X =3, 95% of all children would be expected to have plasma levels between
5.47 and 8.53.

Pane Options



- Confidence Level: confidence percentage for the intervals.

- Type of Limits: whether to display two-sided limits or one-sided bounds.

- Forecast at X: up to 10 values of X at which to make predictions.

Save Results
The following results may be saved to the datasheet:

1. Predicted Values - the predicted value of Y corresponding to each of the n observations.
2. Lower Limits for Predictions - the lower prediction limits for each predicted value.
3. Upper Limits for Predictions - the upper prediction limits for each predicted value.
4. Lower Limits for Forecast Means - the lower confidence limits for the mean value of Y
at each of the n values of X.
5. Upper Limits for Forecast Means - the upper confidence limits for the mean value of Y at
each of the n values of X.
6. Residuals - the n residuals.
7. Studentized Residuals - the n Studentized residuals.
8. Leverages - the leverage values corresponding to the n values of X.
9. Transformed Data - the n transformed values W.

Note: If limits are saved, they will correspond to the settings on the Forecasts pane. If two-sided
limits are displayed in the Forecasts table, then the saved limits will also be two-sided. If
one-sided bounds are displayed in the table, then the saved limits will also be one-sided.

Calculations

The linear regression is performed on the transformed values W. Prediction limits are calculated
in the transformed metric and inverted before being displayed.
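
Inverting a value from the transformed metric amounts to solving equation (3) for Y. The Python sketch below shows one way to do this; the function names are hypothetical, and the geometric mean `k2 = 9.0` is a made-up value for illustration rather than the one computed from plasma.sgd.

```python
import math

def forward_w(y, lam1, lam2, k2):
    """Standardized transform W of a single value y (equation 3)."""
    shifted = y + lam2
    if lam1 == 0:
        return k2 * math.log(shifted)
    k1 = 1.0 / (lam1 * k2 ** (lam1 - 1.0))
    return k1 * (shifted ** lam1 - 1.0)

def inverse_w(w, lam1, lam2, k2):
    """Map a value in the W metric back to the original Y metric."""
    if lam1 == 0:
        return math.exp(w / k2) - lam2
    k1 = 1.0 / (lam1 * k2 ** (lam1 - 1.0))
    return (w / k1 + 1.0) ** (1.0 / lam1) - lam2

k2 = 9.0  # hypothetical geometric mean of Y + lam2
for y in (5.0, 10.0, 20.0):
    w = forward_w(y, -0.5, 0.0, k2)
    print(y, w, inverse_w(w, -0.5, 0.0, k2))  # last column recovers y
```

Because $K_1$ carries the sign of $\lambda_1$, W increases with Y for every power, so a lower limit in the transformed metric maps to a lower limit in the original metric. The inverted limits are generally not symmetric about the predicted value, as the Forecasts table shows.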

For details on the calculations, see the Simple Regression documentation.
