
Kleinbaum et al., Applied Regression Analysis and Other Multivariable Methods, 1988.

Hastie et al., The Elements of Statistical Learning, 2009 - an excellent text, available for free at http://www-stat.stanford.edu/~tibs/ElemStatLearn/


3.2 Linear Regression Models and Least Squares


As introduced in Chapter 2, we have an input vector X T = (X1 , X2 , . . . , Xp ),
and want to predict a real-valued output Y . The linear regression model
has the form
f(X) = β0 + Σ_{j=1}^{p} Xj βj.        (3.1)

The linear model either assumes that the regression function E(Y |X) is
linear, or that the linear model is a reasonable approximation. Here the
βj ’s are unknown parameters or coefficients, and the variables Xj can come
from different sources:
• quantitative inputs;
• transformations of quantitative inputs, such as log, square-root or
square;
• basis expansions, such as X2 = X1², X3 = X1³, leading to a polynomial
representation;
• numeric or “dummy” coding of the levels of qualitative inputs. For
example, if G is a five-level factor input, we might create Xj , j =
1, . . . , 5, such that Xj = I(G = j). Together this group of Xj represents
the effect of G by a set of level-dependent constants, since in
Σ_{j=1}^{5} Xj βj, one of the Xj's is one, and the others are zero (see the
sketch following this list).

• interactions between variables, for example, X3 = X1 · X2 .
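To make the derived-input idea concrete, here is a brief sketch (not from the text; the factor levels and values are invented for illustration) of dummy coding for a five-level factor and a polynomial basis expansion, written in Python with NumPy.

```python
import numpy as np

# Hypothetical five-level factor G (levels coded 1..5) and a quantitative input x
G = np.array([1, 3, 2, 5, 4, 1, 2])
x = np.array([0.5, 1.2, 0.7, 2.0, 1.5, 0.3, 0.9])

# Dummy coding: Xj = I(G = j) for j = 1,...,5; exactly one indicator is 1 per case
dummies = np.column_stack([(G == j).astype(float) for j in range(1, 6)])

# Polynomial basis expansion: X1 = x, X2 = x^2, X3 = x^3
poly = np.column_stack([x, x**2, x**3])

print(dummies.shape, poly.shape)  # (7, 5) (7, 3)
```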


No matter the source of the Xj , the model is linear in the parameters.
Typically we have a set of training data (x1 , y1 ) . . . (xN , yN ) from which
to estimate the parameters β. Each xi = (xi1 , xi2 , . . . , xip )T is a vector
of feature measurements for the ith case. The most popular estimation
method is least squares, in which we pick the coefficients β = (β0 , β1 , . . . , βp )T
to minimize the residual sum of squares
RSS(β) = Σ_{i=1}^{N} (yi − f(xi))²
       = Σ_{i=1}^{N} (yi − β0 − Σ_{j=1}^{p} xij βj)².        (3.2)

From a statistical point of view, this criterion is reasonable if the training


observations (xi , yi ) represent independent random draws from their popu-
lation. Even if the xi ’s were not drawn randomly, the criterion is still valid
if the yi ’s are conditionally independent given the inputs xi . Figure 3.1
illustrates the geometry of least-squares fitting in the IRp+1 -dimensional



FIGURE 3.1. Linear least squares fitting with X ∈ IR². We seek the linear
function of X that minimizes the sum of squared residuals from Y .

space occupied by the pairs (X, Y ). Note that (3.2) makes no assumptions
about the validity of model (3.1); it simply finds the best linear fit to the
data. Least squares fitting is intuitively satisfying no matter how the data
arise; the criterion measures the average lack of fit.
How do we minimize (3.2)? Denote by X the N × (p + 1) matrix with
each row an input vector (with a 1 in the first position), and similarly let
y be the N -vector of outputs in the training set. Then we can write the
residual sum-of-squares as
RSS(β) = (y − Xβ)T (y − Xβ). (3.3)
This is a quadratic function in the p + 1 parameters. Differentiating with
respect to β we obtain
∂RSS/∂β = −2X^T (y − Xβ),
∂²RSS/∂β∂β^T = 2X^T X.        (3.4)
Assuming (for the moment) that X has full column rank, and hence XT X
is positive definite, we set the first derivative to zero
XT (y − Xβ) = 0 (3.5)
to obtain the unique solution
β̂ = (XT X)−1 XT y. (3.6)
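A minimal numerical sketch of (3.2)–(3.6), using a small synthetic data set (all values invented for illustration). The normal-equations formula is shown alongside NumPy's QR-based least squares routine, which is numerically preferable in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # 1's column for the intercept
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# (3.6): beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically stabler equivalent
beta_qr, *_ = np.linalg.lstsq(X, y, rcond=None)

rss = np.sum((y - X @ beta_hat) ** 2)  # residual sum of squares (3.2)
print(beta_hat, rss)
```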

FIGURE 3.2. The N -dimensional geometry of least squares regression with two
predictors. The outcome vector y is orthogonally projected onto the hyperplane
spanned by the input vectors x1 and x2 . The projection ŷ represents the vector
of the least squares predictions.

The predicted values at an input vector x0 are given by fˆ(x0 ) = (1 : x0 )T β̂;


the fitted values at the training inputs are
ŷ = Xβ̂ = X(XT X)−1 XT y, (3.7)

where ŷi = fˆ(xi ). The matrix H = X(XT X)−1 XT appearing in equation


(3.7) is sometimes called the “hat” matrix because it puts the hat on y.
Figure 3.2 shows a different geometrical representation of the least squares
estimate, this time in IRN . We denote the column vectors of X by x0 , x1 , . . . , xp ,
with x0 ≡ 1. For much of what follows, this first column is treated like any
other. These vectors span a subspace of IRN , also referred to as the column
space of X. We minimize RSS(β) = ky − Xβk2 by choosing β̂ so that the
residual vector y − ŷ is orthogonal to this subspace. This orthogonality is
expressed in (3.5), and the resulting estimate ŷ is hence the orthogonal pro-
jection of y onto this subspace. The hat matrix H computes the orthogonal
projection, and hence it is also known as a projection matrix.
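Reusing X, y and beta_hat from the sketch above, the projection interpretation is easy to check numerically: the hat matrix is idempotent and the residual is orthogonal to every column of X.

```python
# Hat matrix (3.7): H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(np.allclose(H @ H, H))                          # True: projecting twice changes nothing
print(np.allclose(X.T @ (y - y_hat), 0, atol=1e-8))   # True: residual orthogonal to the column space
```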
It might happen that the columns of X are not linearly independent, so
that X is not of full rank. This would occur, for example, if two of the
inputs were perfectly correlated, (e.g., x2 = 3x1 ). Then XT X is singular
and the least squares coefficients β̂ are not uniquely defined. However,
the fitted values ŷ = Xβ̂ are still the projection of y onto the column
space of X; there is just more than one way to express that projection
in terms of the column vectors of X. The non-full-rank case occurs most
often when one or more qualitative inputs are coded in a redundant fashion.
There is usually a natural way to resolve the non-unique representation,
by recoding and/or dropping redundant columns in X. Most regression
software packages detect these redundancies and automatically implement

some strategy for removing them. Rank deficiencies can also occur in signal
and image analysis, where the number of inputs p can exceed the number
of training cases N . In this case, the features are typically reduced by
filtering or else the fitting is controlled by regularization (Section 5.2.3 and
Chapter 18).
Up to now we have made minimal assumptions about the true distribu-
tion of the data. In order to pin down the sampling properties of β̂, we now
assume that the observations yi are uncorrelated and have constant vari-
ance σ 2 , and that the xi are fixed (non random). The variance–covariance
matrix of the least squares parameter estimates is easily derived from (3.6)
and is given by

Var(β̂) = (XT X)−1 σ 2 . (3.8)

Typically one estimates the variance σ 2 by

σ̂² = (1/(N − p − 1)) Σ_{i=1}^{N} (yi − ŷi)².

The N − p − 1 rather than N in the denominator makes σ̂ 2 an unbiased


estimate of σ 2 : E(σ̂ 2 ) = σ 2 .
To draw inferences about the parameters and the model, additional as-
sumptions are needed. We now assume that (3.1) is the correct model for
the mean; that is, the conditional expectation of Y is linear in X1 , . . . , Xp .
We also assume that the deviations of Y around its expectation are additive
and Gaussian. Hence

Y = E(Y |X1 , . . . , Xp ) + ε
  = β0 + Σ_{j=1}^{p} Xj βj + ε,        (3.9)

where the error ε is a Gaussian random variable with expectation zero and
variance σ 2 , written ε ∼ N (0, σ 2 ).
Under (3.9), it is easy to show that

β̂ ∼ N (β, (XT X)−1 σ 2 ). (3.10)

This is a multivariate normal distribution with mean vector and variance–


covariance matrix as shown. Also

(N − p − 1) σ̂² ∼ σ² χ²_{N−p−1},        (3.11)

a chi-squared distribution with N − p − 1 degrees of freedom. In addition β̂


and σ̂ 2 are statistically independent. We use these distributional properties
to form tests of hypothesis and confidence intervals for the parameters β j .

FIGURE 3.3. The tail probabilities Pr(|Z| > z) for three distributions, t30 , t100
and standard normal. Shown are the appropriate quantiles for testing significance
at the p = 0.05 and 0.01 levels. The difference between t and the standard normal
becomes negligible for N bigger than about 100.

To test the hypothesis that a particular coefficient βj = 0, we form the


standardized coefficient or Z-score

zj = β̂j / (σ̂ √vj),        (3.12)

where vj is the jth diagonal element of (XT X)−1 . Under the null hypothesis
that βj = 0, zj is distributed as tN −p−1 (a t distribution with N − p − 1
degrees of freedom), and hence a large (absolute) value of zj will lead to
rejection of this null hypothesis. If σ̂ is replaced by a known value σ, then
zj would have a standard normal distribution. The difference between the
tail quantiles of a t-distribution and a standard normal become negligible
as the sample size increases, and so we typically use the normal quantiles
(see Figure 3.3).
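Continuing the same synthetic sketch, the variance estimate, the standard errors implied by (3.8), and the Z-scores of (3.12) can be computed as follows; scipy is used only for the t tail probabilities.

```python
from scipy import stats

k = X.shape[1]                                          # p + 1 parameters
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (N - k)  # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))                     # v_j, diagonal of (X^T X)^{-1}
se = np.sqrt(sigma2_hat * v)                            # standard errors from (3.8)
z = beta_hat / se                                       # Z-scores (3.12)
p_values = 2 * stats.t.sf(np.abs(z), df=N - k)          # two-sided p-values, t_{N-p-1}
print(np.round(z, 2), np.round(p_values, 3))
```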
Often we need to test for the significance of groups of coefficients simul-
taneously. For example, to test if a categorical variable with k levels can
be excluded from a model, we need to test whether the coefficients of the
dummy variables used to represent the levels can all be set to zero. Here
we use the F statistic,

F = [(RSS0 − RSS1)/(p1 − p0)] / [RSS1/(N − p1 − 1)],        (3.13)

where RSS1 is the residual sum-of-squares for the least squares fit of the big-
ger model with p1 +1 parameters, and RSS0 the same for the nested smaller
model with p0 + 1 parameters, having p1 − p0 parameters constrained to be

zero. The F statistic measures the change in residual sum-of-squares per


additional parameter in the bigger model, and it is normalized by an esti-
mate of σ 2 . Under the Gaussian assumptions, and the null hypothesis that
the smaller model is correct, the F statistic will have a Fp1 −p0 ,N −p1 −1 dis-
tribution. It can be shown (Exercise 3.1) that the zj in (3.12) are equivalent
to the F statistic for dropping the single coefficient βj from the model. For
large N , the quantiles of the Fp1 −p0 ,N −p1 −1 approach those of the χ2p1 −p0 .
Similarly, we can isolate βj in (3.10) to obtain a 1−2α confidence interval
for βj :
(β̂j − z^{(1−α)} √vj σ̂ ,  β̂j + z^{(1−α)} √vj σ̂).        (3.14)

Here z^{(1−α)} is the 1 − α percentile of the normal distribution:

z^{(1−0.025)} = 1.96,  z^{(1−0.05)} = 1.645, etc.

Hence the standard practice of reporting β̂ ± 2 · se(β̂) amounts to an ap-


proximate 95% confidence interval. Even if the Gaussian error assumption
does not hold, this interval will be approximately correct, with its coverage
approaching 1 − 2α as the sample size N → ∞.
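A two-line sketch of the interval (3.14) for one coefficient, continuing the example above (for a 95% interval, 1 − 2α = 0.95 and z^{(1−α)} = 1.96); the choice of coefficient is purely illustrative.

```python
from scipy import stats

alpha = 0.025
z_q = stats.norm.ppf(1 - alpha)   # 1.96
j = 1                             # illustrative coefficient of interest
ci = (beta_hat[j] - z_q * se[j], beta_hat[j] + z_q * se[j])
print(ci)
```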
In a similar fashion we can obtain an approximate confidence set for the
entire parameter vector β, namely
Cβ = {β : (β̂ − β)^T X^T X (β̂ − β) ≤ σ̂² χ²_{p+1}^{(1−α)}},        (3.15)

where χ²_ℓ^{(1−α)} is the 1 − α percentile of the chi-squared distribution on ℓ
degrees of freedom: for example, χ²_5^{(1−0.05)} = 11.1, χ²_5^{(1−0.1)} = 9.2. This
confidence set for β generates a corresponding confidence set for the true
function f (x) = xT β, namely {xT β|β ∈ Cβ } (Exercise 3.2; see also Fig-
ure 5.4 in Section 5.2.2 for examples of confidence bands for functions).

3.2.1 Example: Prostate Cancer


The data for this example come from a study by Stamey et al. (1989). They
examined the correlation between the level of prostate-specific antigen and
a number of clinical measures in men who were about to receive a radical
prostatectomy. The variables are log cancer volume (lcavol), log prostate
weight (lweight), age, log of the amount of benign prostatic hyperplasia
(lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp),
Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).
The correlation matrix of the predictors given in Table 3.1 shows many
strong correlations. Figure 1.1 (page 3) of Chapter 1 is a scatterplot matrix
showing every pairwise plot between the variables. We see that svi is a
binary variable, and gleason is an ordered categorical variable. We see, for

TABLE 3.1. Correlations of predictors in the prostate cancer data.

lcavol lweight age lbph svi lcp gleason


lweight 0.300
age 0.286 0.317
lbph 0.063 0.437 0.287
svi 0.593 0.181 0.129 −0.139
lcp 0.692 0.157 0.173 −0.089 0.671
gleason 0.426 0.024 0.366 0.033 0.307 0.476
pgg45 0.483 0.074 0.276 −0.030 0.481 0.663 0.757

TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the
coefficient divided by its standard error (3.12). Roughly a Z score larger than two
in absolute value is significantly nonzero at the p = 0.05 level.

Term Coefficient Std. Error Z Score


Intercept 2.46 0.09 27.60
lcavol 0.68 0.13 5.37
lweight 0.26 0.10 2.75
age −0.14 0.10 −1.40
lbph 0.21 0.10 2.06
svi 0.31 0.12 2.47
lcp −0.29 0.15 −1.87
gleason −0.02 0.15 −0.15
pgg45 0.27 0.15 1.74

example, that both lcavol and lcp show a strong relationship with the
response lpsa, and with each other. We need to fit the effects jointly to
untangle the relationships between the predictors and the response.
We fit a linear model to the log of prostate-specific antigen, lpsa, after
first standardizing the predictors to have unit variance. We randomly split
the dataset into a training set of size 67 and a test set of size 30. We ap-
plied least squares estimation to the training set, producing the estimates,
standard errors and Z-scores shown in Table 3.2. The Z-scores are defined
in (3.12), and measure the effect of dropping that variable from the model.
A Z-score greater than 2 in absolute value is approximately significant at
the 5% level. (For our example, we have nine parameters, and the 0.025 tail
quantiles of the t67−9 distribution are ±2.002!) The predictor lcavol shows
the strongest effect, with lweight and svi also strong. Notice that lcp is
not significant, once lcavol is in the model (when used in a model without
lcavol, lcp is strongly significant). We can also test for the exclusion of
a number of terms at once, using the F -statistic (3.13). For example, we
consider dropping all the non-significant terms in Table 3.2, namely age,

lcp, gleason, and pgg45. We get

F = [(32.81 − 29.43)/(9 − 5)] / [29.43/(67 − 9)] = 1.67,        (3.16)
which has a p-value of 0.17 (Pr(F4,58 > 1.67) = 0.17), and hence is not
significant.
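The arithmetic behind (3.16) can be checked in a few lines; the RSS values and degrees of freedom below are the ones quoted in the text, and scipy supplies the F tail probability.

```python
from scipy import stats

rss0, rss1 = 32.81, 29.43   # reduced (5-parameter) and full (9-parameter) fits
p1, p0, N = 8, 4, 67        # so p1 - p0 = 4 and N - p1 - 1 = 58
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
p_value = stats.f.sf(F, p1 - p0, N - p1 - 1)
print(round(F, 2), round(p_value, 2))   # about 1.67 and 0.17
```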
The mean prediction error on the test data is 0.521. In contrast, predic-
tion using the mean training value of lpsa has a test error of 1.057, which
is called the “base error rate.” Hence the linear model reduces the base
error rate by about 50%. We will return to this example later to compare
various selection and shrinkage methods.

3.2.2 The Gauss–Markov Theorem


One of the most famous results in statistics asserts that the least squares
estimates of the parameters β have the smallest variance among all linear
unbiased estimates. We will make this precise here, and also make clear
that the restriction to unbiased estimates is not necessarily a wise one. This
observation will lead us to consider biased estimates such as ridge regression
later in the chapter. We focus on estimation of any linear combination of
the parameters θ = aT β; for example, predictions f (x0 ) = xT0 β are of this
form. The least squares estimate of aT β is

θ̂ = aT β̂ = aT (XT X)−1 XT y. (3.17)

Considering X to be fixed, this is a linear function cT0 y of the response


vector y. If we assume that the linear model is correct, aT β̂ is unbiased
since

E(aT β̂) = E(aT (XT X)−1 XT y)


= aT (XT X)−1 XT Xβ
= aT β. (3.18)

The Gauss–Markov theorem states that if we have any other linear estima-
tor θ̃ = cT y that is unbiased for aT β, that is, E(cT y) = aT β, then

Var(aT β̂) ≤ Var(cT y). (3.19)

The proof (Exercise 3.3) uses the triangle inequality. For simplicity we have
stated the result in terms of estimation of a single parameter aT β, but with
a few more definitions one can state it in terms of the entire parameter
vector β (Exercise 3.3).
Consider the mean squared error of an estimator θ̃ in estimating θ:

MSE(θ̃) = E(θ̃ − θ)²
        = Var(θ̃) + [E(θ̃) − θ]².        (3.20)
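To illustrate why trading a little bias for variance can pay off, here is a small simulation (entirely illustrative; the sizes, coefficients and the crude 0.5 shrinkage factor are invented). It compares the mean squared error of the least squares estimate of a^T β with that of a deliberately biased, shrunken version.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 20, 10, 1.0
beta = np.full(p, 0.1)          # small true coefficients
a = np.ones(p)
theta = a @ beta                # target linear combination a^T beta

err_ls, err_shrunk = [], []
for _ in range(2000):
    X = rng.normal(size=(N, p))
    y = X @ beta + rng.normal(scale=sigma, size=N)
    b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    err_ls.append((a @ b_ls - theta) ** 2)
    err_shrunk.append((a @ (0.5 * b_ls) - theta) ** 2)   # biased but lower-variance estimator

print(np.mean(err_ls), np.mean(err_shrunk))  # the shrunken estimator has the smaller MSE here
```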

The uncertainty of a result from a linear calibration†

D. Brynn Hibbert

DOI: 10.1039/b615398d

The standard error of a result obtained from a straight line calibration is given by a
well known ISO-endorsed expression. Its derivation and use are explained and the
approach is extended for any function that is linear in the coefficients, with an
example of a weighted quadratic calibration in ICPAES. When calculating the
standard error of an estimate, if QC data is available it is recommended to use the
repeatability of the instrumental response, rather than the standard error of
the regression, in the equation.

School of Chemistry, University of New South Wales, Sydney, NSW 2052, Australia. E-mail: b.hibbert@unsw.edu.au; Fax: +61 2 9385 6141; Tel: +61 2 9385 4713

† Electronic supplementary information (ESI) available: Spreadsheet showing errors in calibration data, which contains the worked examples in Box 1 and Box 2. See DOI: 10.1039/b615398d

Introduction

Calibration of a measuring system is at the heart of many chemical measurements. It has direct relevance to the traceability of the measurement and contributes to the measurement uncertainty. A measurement can be seen as a two-step process in which an instrument is calibrated using one or more standards, followed by presentation of a sample to the instrument and the assignment of the value of the measurand. Instrumental analytical methods, particularly chromatographic, spectroscopic and electrochemical methods, are usually calibrated over a range of concentrations of the analyte. Often the calibrations are assumed (or arranged to be) linear and, in the past, a graph was prepared by drawing the best straight line by eye through the points. Having obtained a response from the instrument from the sample to be analysed, the concentration of this sample was read off the graph, going from the instrument response on the y-axis to the concentration on the x-axis. While drawing a graph for the purpose of calibration is no longer done in practice, with a spreadsheet performing a least squares regression to obtain the equation of the best straight line, the calibration function is often still referred to as a 'calibration line' or 'calibration curve'.

In this Education article the commonly used expression for the standard error of a result obtained from a straight line calibration is extended to a quadratic calibration, and to the case where weighted regression is necessary. Spreadsheet recipes are given to accomplish these calculations.

Linear calibration by classical least squares regression

In calibration a series of x,y pairs are obtained, where the response of an instrument y is obtained for a test material with measurand value x. (From now on, the x quantity will be called 'concentration', being the most common quantity measured in chemistry.) A function of the form

Y = a + bxi   (1)

can be fitted to the data, where the estimates of the parameters a and b, still called intercept and slope respectively, are â and b̂, and for a particular response

yi = â + b̂ xi + ei   (2)

Classical least squares regression makes three assumptions about the system: the linear model holds for the data; errors are only in y; and these errors are normally distributed and independent of the value of x (so-called homoscedasticity). If any of these assumptions is not met, then the best fit is not realised by this process. For example, if there is error in x as well as y then an 'error-in-variables' model is indicated.1,2 If the error is proportional to concentration then a weighted least squares model should be used.3 The consequences of failure of the linear model have been demonstrated by Hibbert and Mulholland.4

The least squares estimates of a and b can be obtained directly from the calibration data:

b̂ = Σi {(xi − x̄)(yi − ȳ)} / Σi (xi − x̄)²   (3)

â = ȳ − b̂ x̄   (4)

where the sum is over all data pairs, and x̄ and ȳ are the average values of x and y in the calibration set. These estimates minimise the standard error of the regression, sy/x (also known as the residual standard deviation),

sy/x = √[ Σi (yi − ŷi)² / (n − 2) ]   (5)

where ŷi is the value of y obtained from eqn (2). Having determined a calibration function, the equation must be inverted to assign a concentration (x̂0) given a response (y0) from an unknown test sample:

x̂0 = (y0 − a)/b   (6)

Note that the carets on a and b will now be omitted.
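As a quick cross-check of eqns (3)–(6) outside a spreadsheet, the sketch below uses Python with an invented calibration data set; only the formulas given above are assumed.

```python
import numpy as np

# Invented calibration data: concentration x and instrument response y
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([0.05, 1.98, 4.10, 5.92, 8.06, 10.01])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope, eqn (3)
a = y.mean() - b * x.mean()                                                # intercept, eqn (4)
s_yx = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))                   # standard error of the regression, eqn (5)

y0 = 5.00                  # response from an unknown test sample
x0_hat = (y0 - a) / b      # inverse prediction, eqn (6)
print(b, a, s_yx, x0_hat)
```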
Eqn (6) can be written in terms of the mean x̄ and ȳ values from the calibration, to remove the constant term a and its correlation with b when the standard error is calculated:

x̂0 = x̄ + (y0 − ȳ)/b   (7)

The standard error of the estimate of the concentration from the mean of m responses, y0, is usually given as

sx̂0 = (sy/x / b) √[ 1/m + 1/n + (y0 − ȳ)² / (b² Σ_{i=1}^{n} (xi − x̄)²) ]   (8)

where there are n points in the calibration, and x̄ and ȳ are the means of the calibration data.5 Eqn (8) is quoted with a caveat that this is an approximation, which stems from the statistical difficulties of an error model applied to the inversion of eqn (2).6,7 A rigorous derivation of the confidence interval on x̂0 was given by Fieller in 1954.8

Derivation of the equation for the standard error of an estimated value

The derivation stems from a first-order expansion of the variance by Taylor's theorem. The procedure is straightforward and only requires knowledge of the variances, and possibly covariances, of the parameters and the ability to differentiate the equation assigning the result with respect to each parameter. For a general function

X = f(x1, x2, …, xn)   (9)

V(X) = (∂X/∂x1)² V(x1) + (∂X/∂x2)² V(x2) + … + 2(∂X/∂x1)(∂X/∂x2) C(x1,x2) + 2(∂X/∂x1)(∂X/∂x3) C(x1,x3) + …   (10)

where V(x1) is the variance of x1 and C(x1,x2) is the covariance between x1 and x2.

Eqn (10) is applied to the variance of x̂0 given by eqn (7), with the assumptions that the variables are independent (all covariance terms are zero) and that V(x̄) = 0, from the base assumptions of classical least squares (all error in y). This is why we use eqn (7) and not eqn (6): in eqn (6), a and b are correlated, but b and ȳ are not.

V(x̂0) = (∂x̂0/∂y0)² V(y0) + (∂x̂0/∂ȳ)² V(ȳ) + (∂x̂0/∂b)² V(b)
       = V(y0)/b² + V(ȳ)/b² + [V(b)/b²] × [(y0 − ȳ)²/b²]   (11)

The variance in the response to the unknown (y0) is usually estimated by the variance of the regression s²y/x. If y0 is the mean of m independent observations then

V(y0) = s²y/x / m   (12)

Similarly the variance of the mean of the calibration responses (ȳ) is

V(ȳ) = s²y/x / n   (13)

The variance of the slope is9

V(b) = s²y/x / Σi (xi − x̄)²   (14)

and therefore

V(x̂0) = (s²y/x / b²) [ 1/m + 1/n + (y0 − ȳ)² / (b² Σ_{i=1}^{n} (xi − x̄)²) ]   (15)

which is eqn (8) squared. It is seen that V(y0) is estimated by eqn (12), that is, from the standard error of the regression. It is possible that calibration measurements have been made under different conditions than routine measurements, for which a separate estimate of the standard deviation of the responses might well be available from in-house QC measurements of repeatability. Therefore, if such data is known, then instead of eqn (8) it would be clearer and better to calculate sx̂0 as

sx̂0 = sr / (b√m) + (sy/x / b) √[ 1/n + (y0 − ȳ)² / (b² Σ_{i=1}^{n} (xi − x̄)²) ]   (16)

which distinguishes between the variance of the response when the instrument is presented with the sample (first term) and the component due to the lack of fit of the calibration line (second term).
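Continuing the same invented data from the earlier sketch, eqns (8) and (16) translate line for line; s_r below stands for a hypothetical repeatability standard deviation taken from QC data.

```python
m = 3                              # number of replicate responses averaged to give y0
Sxx = np.sum((x - x.mean()) ** 2)
t2 = (y0 - y.mean()) ** 2 / (b ** 2 * Sxx)

# eqn (8): standard error of the estimated concentration, from the calibration alone
s_x0 = (s_yx / b) * np.sqrt(1.0 / m + 1.0 / n + t2)

# eqn (16): replace the response term with a QC repeatability s_r
s_r = 0.04                         # hypothetical repeatability of the instrumental response
s_x0_qc = s_r / (b * np.sqrt(m)) + (s_yx / b) * np.sqrt(1.0 / n + t2)
print(s_x0, s_x0_qc)
```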
Other linear calibration functions

Equations that are linear in the parameters, for example a quadratic, can be derived through eqn (10). Quadratic functions are used in calibration of ICPAES and ICPMS analyses of elements with a wide range of concentrations (x), and quadratic functions have been shown to be a better function than linear for some HPLC applications.10 The observed instrumental response y (usually a number of 'counts' of a detector, or absorbance of a spectrophotometric detector) is

y = a + b1x + b2x²   (17)

As with the straight line calibration, the constant a is eliminated by moving the origin of the calibration to the coordinate origin:

y − ȳ = b1(x − x̄) + b2(x² − x̄²)   (18)

where x̄² denotes the mean of x² over the calibration set. In the analysis of a sample, a response (y0) allows calculation of a concentration (x̂0):

x̂0 = [−b1 + √(b1² − 4b2(ȳ − y0 − b1x̄ − b2x̄²))] / (2b2)   (19)

Applying eqn (10) to the variance of x̂0 from eqn (19),

V(x̂0) = (∂x̂0/∂b1)² V(b1) + (∂x̂0/∂b2)² V(b2) + (∂x̂0/∂ȳ)² V(ȳ) + (∂x̂0/∂y0)² V(y0) + 2(∂x̂0/∂b1)(∂x̂0/∂b2) C(b1,b2)   (20)

The assumptions of the regression give V(x̄) = 0 and V(x̄²) = 0, and the further assumption of independence between the indications and the parameters of the regression is made. Note that eqn (19) can be differentiated, and for a linear system the covariance matrix of the coefficients (b) is given by s²(x^T x)^(−1), where s² is the variance of y, which can be estimated by s²y/x, and the matrix x is the design matrix of the calibration (a column of 1's, followed by columns of the x-values and x²-values used in the calibration). Table 1 gives expressions for the differentials in eqn (20). Kirkup and Mulholland10 have derived a similar expression but retained the constant term. In the practical implementation of their scheme, three covariance terms must be calculated (C(a,b), C(a,c), C(b,c)) in contrast to the single term in eqn (20).

Table 1  Differentials in eqn (20) for the calculation of the variance of an estimated value from a quadratic calibration y − ȳ = b1(x − x̄) + b2(x² − x̄²). The discriminant of the solution for x is D = b1² − 4b2(ȳ − y0 − b1x̄ − b2x̄²)

X in ∂x̂0/∂X    ∂x̂0/∂X
b1             [−1 + (1/2)D^(−1/2)(2b1 + 4b2x̄)] / (2b2)
b2             (b1 − D^(1/2)) / (2b2²) + [(1/2)D^(−1/2)(4y0 − 4ȳ + 4b1x̄ + 8b2x̄²)] / (2b2)
ȳ              −D^(−1/2)
y0             D^(−1/2)

Weighted linear regression

Where a weighted regression is required, the equation for the variance of the estimated concentration remains as eqn (15), but the form of the covariance matrix of the coefficients and the standard error of the regression sy/x need to take account of the weighting matrix. There are two common situations in which data must be weighted. First, if a transformation of the indications has been used to obtain a linear form (y → f(y)), but the original indications have normally distributed error, the weights are [∂f(y)/∂y]^(−2). Secondly, if the data itself is heteroscedastic, with variance si² for the ith datum, then the weighting matrix is 1/si². If the system is both transformed and heteroscedastic then the product of the weights is applied. If W is a diagonal matrix of the weights, the coefficients b are given by

b = (x^T W x)^(−1) (x^T W y)   (21)

The covariance matrix is

V = (x^T W x)^(−1) × [y^T W y − b^T x^T W y] / (n − p)   (22)

with p the number of coefficients in the model and n the number of independent concentrations.

Confidence intervals

To obtain a confidence interval the standard error of the regression is multiplied by an appropriate point on the distribution function. In the case of normally-distributed data this is the two-tailed Student t value for the degrees of freedom of the calibration (n − 1 or n − 2). The reason that at least five calibration solutions of different concentrations should be used is that the resulting three degrees of freedom have an associated Student's t value for α = 0.05 of 3.18, which then multiplies sx̂0 to give a larger confidence interval than is the case with more points. For example, ten points with eight degrees of freedom have a Student's t value of 2.30.

Box 1. Calculation of standard error and 95% confidence interval for an estimated concentration in a linear calibration. (a) Data and calculations for error formula including output from LINEST. (b) First rows of a calculation of the 95% confidence interval on an estimated concentration.
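A matrix sketch of eqns (21) and (22), with invented data and weights; the design matrix here is for a straight line, but adding an x² column gives the quadratic case.

```python
import numpy as np

x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([0.05, 1.98, 4.10, 5.92, 8.06, 10.01])
s_i = np.array([0.02, 0.03, 0.04, 0.05, 0.07, 0.09])   # hypothetical per-point standard deviations

Xd = np.column_stack([np.ones_like(x), x])   # design matrix: column of 1's, then the x-values
W = np.diag(1.0 / s_i ** 2)                  # weighting matrix for heteroscedastic data

n_pts, n_coef = Xd.shape
b = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ y)                                        # eqn (21)
V = np.linalg.inv(Xd.T @ W @ Xd) * (y @ W @ y - b @ (Xd.T @ W @ y)) / (n_pts - n_coef)  # eqn (22)
print(b)   # weighted intercept and slope
print(V)   # variance-covariance matrix of the coefficients
```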
Example calculations

Two systems are given here to illustrate the methods described above, a linear calibration of sulfite using a channel biosensor and the weighted quadratic calibration of the ICPAES analysis of K+. An illustrative spreadsheet is given in the electronic supplementary information (ESI)† to this paper.

Linear calibration of sulfite

The work up of data to produce a calibration graph with 95% confidence intervals on estimated values is given in Box 1 and in the spreadsheet in the ESI.† All that is required for the calculations are the internal Excel functions LINEST, AVERAGE, SQRT and SUMSQ.

Quadratic calibration of ICP data

The instrumental software performs a weighted quadratic calibration on the blank-subtracted data, forcing the line through zero. While this procedure can be questioned, forcing through zero leads to the following equations for the concentration (x) in terms of the coefficients (b1, b2) and the blank-corrected response of the sample (Y):

Y = b1x + b2x²   (23)

x̂0 = [−b1 + √(b1² + 4b2y0)] / (2b2)   (24)

The variance of the estimate is obtained from

V(x̂0) = (∂x̂0/∂b1)² V(b1) + (∂x̂0/∂b2)² V(b2) + (∂x̂0/∂y0)² V(y0) + 2(∂x̂0/∂b1)(∂x̂0/∂b2) C(b1,b2)   (25)

with V(b1), V(b2) and C(b1,b2) from the covariance matrix, eqn (22), and V(y0) estimated from QC or validation data, or the weights. Table 2 gives the equations of the differentials in eqn (25).

Table 2  Differentials in eqn (25) for the calculation of the variance of an estimated value from a quadratic calibration forced through the origin, y = b1x + b2x². The discriminant of the function for an indication y0 is D = b1² + 4b2y0

X in ∂x̂0/∂X    ∂x̂0/∂X
b1             (−1 + D^(−1/2)b1) / (2b2)
b2             (b1 − D^(1/2)) / (2b2²) + D^(−1/2)y0 / b2
y0             D^(−1/2)

The regression line and 95% confidence interval of estimates of concentration are calculated in the spreadsheet shown in Box 2, and are graphed in Fig. 1. The confidence intervals are quite dependent on the errors in the calibration points, but Fig. 1 is typical of a number of data sets processed. The confidence intervals diverge with increasing concentration, as the contribution of the uncertainty of the quadratic term (b2) increases (Fig. 2). The effect is ameliorated by the increasing negative correlation term.

Fig. 1 Calibration for the routine ICPAES analysis of potassium. Five calibration points, measured in triplicate, blank-corrected and fitted with a weighted quadratic regression through zero. Error bars are the 95% confidence interval of the mean of each point. Dashed lines are the 95% confidence interval on estimated concentrations from the calibration.

Fig. 2 The fractional contributions of the components in the calibration function to the standard error of the estimates in Fig. 1.
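A sketch of the inverse prediction (24) and the propagation of variance (25) using the differentials of Table 2; the coefficient values, their variance–covariance terms and V(y0) below are invented placeholders for the quantities that eqn (22) and QC data would supply.

```python
import numpy as np

# Hypothetical weighted quadratic fit through the origin: y = b1*x + b2*x^2
b1, b2 = 1500.0, -2.0
V_b1, V_b2, C_b1b2 = 25.0, 0.04, -0.8    # from the covariance matrix, eqn (22)
y0, V_y0 = 30000.0, 1.0e4                # blank-corrected sample response and its variance

D = b1 ** 2 + 4 * b2 * y0                # discriminant (Table 2)
x0 = (-b1 + np.sqrt(D)) / (2 * b2)       # eqn (24)

# Differentials from Table 2
d_b1 = (-1 + b1 / np.sqrt(D)) / (2 * b2)
d_b2 = (b1 - np.sqrt(D)) / (2 * b2 ** 2) + y0 / (b2 * np.sqrt(D))
d_y0 = 1 / np.sqrt(D)

# eqn (25): first-order (Taylor) variance of the estimated concentration
V_x0 = (d_b1 ** 2 * V_b1 + d_b2 ** 2 * V_b2
        + d_y0 ** 2 * V_y0 + 2 * d_b1 * d_b2 * C_b1b2)
print(x0, np.sqrt(V_x0))                 # concentration estimate and its standard error
```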
Box 2. Calculation of standard error for a concentration of potassium from an ICPAES analysis
using a weighted quadratic calibration. The data are blank corrected and fitted through the
origin. (a) Data and calculation of standard deviations for the weights for each point. (b)
Matrix calculations for the variance/covariance matrix of the coefficients. (c) Calculation of
first rows for standard error of the estimate of concentration. The 95% confidence intervals
(not shown) are calculated from s(x0) as in Box 1(b).

Some points to consider for calibrations

Although there has been much discussion about appropriate calibration methods and the correct incorporation of components in an estimate of measurement uncertainty, for routine methods that adopt linear calibration in a range in which there is a good fit to the linear model, the calibration is usually not a major source of uncertainty. The traditional "four nines r²", i.e. a squared correlation coefficient greater than 0.9999, invariably yields a useable function. Despite arguments about the unthinking use of r²,11,12 Ellison has shown that there is some utility in this statistic.13

Correlations between parameters of a calibration function are usually a significant component of the uncertainty and therefore can rarely be ignored. However, the covariance matrix is usually available and the inverse calibration function can be differentiated. Thus the correct uncertainty of an estimate can be calculated. Correlations usually reduce the uncertainty, so not including them will lead to overestimated confidence intervals.

In the expression for the standard error of an estimate from a linear calibration (eqn (8)), the uncertainty of the observed response when the unknown is presented to the instrument usually has the greatest contribution. When its standard deviation is estimated by sy/x, the standard error of the regression, it enters the equation as sy/x/(b√m), where m is the number of replicate measurements. As m is usually one or two, the term is always greater than that for ȳ, sy/x/(b√n), where n is the number of points in the regression (typically five or more). The contribution to uncertainty of the slope, b, is zero in the middle of the calibration, when (y0 − ȳ) is zero, and rarely is greater than the other terms. As discussed above, there is likely to be knowledge of the standard deviation of the indication from QC data, and this is likely to be a better estimate than the standard error of the regression. Therefore I recommend use of eqn (16) in such cases.

Acknowledgements

The author thanks Dr Michael Wu of the National Measurement Institute, Australia, for the ICPAES data used here, and Dr Edith Chow for the sulfite data. Spreadsheets and reproductions of screen images are with permission of Microsoft Corporation.

References

1 S. De Jong and A. Phatak, in Partial Least Squares Regression, ed. S. Van Huffel, SIAM, Philadelphia, PA, 1997, pp. 25–36.
2 A. Martínez, J. Riu and F. X. Rius, Chemom. Intell. Lab. Syst., 2000, 54, 61–73.
3 W. Bremser and W. Hasselbarth, Anal. Chim. Acta, 1997, 348, 61–69.
4 M. Mulholland and D. B. Hibbert, J. Chromatogr., A, 1997, 762, 73–82.
5 ISO 11095, Linear calibration using reference materials, International Organization for Standardization, Geneva, 1996.
6 J. N. Miller and J. C. Miller, Statistics and Chemometrics for Analytical Chemistry, Pearson Education Ltd, Harlow, UK, 2005.
7 P. D. Lark, B. R. Craven and R. L. L. Bosworth, The Handling of Chemical Data, Pergamon Press, Oxford, 1968.
8 E. C. Fieller, J. R. Stat. Soc., Ser. B, 1954, 16, 175–183.
9 D. B. Hibbert and J. J. Gooding, Data Analysis for Chemistry, Oxford University Press, New York, 2005.
10 L. Kirkup and M. Mulholland, J. Chromatogr., A, 2004, 1029, 1–11.
11 D. B. Hibbert, Accredit. Qual. Assur., 2005, 10, 300–301.
12 W. Huber, Accredit. Qual. Assur., 2004, 9, 726.
13 S. L. R. Ellison, Accredit. Qual. Assur., 2006, 11, 146–152.
