
Least Squares Estimation

SARA A. VAN DE GEER


Volume 2, pp. 1041–1045

in

Encyclopedia of Statistics in Behavioral Science

ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4

Editors

Brian S. Everitt & David C. Howell

John Wiley & Sons, Ltd, Chichester, 2005


Least Squares Estimation

The method of least squares is about estimating parameters by minimizing the squared discrepancies between observed data, on the one hand, and their expected values on the other (see Optimization Methods). We will study the method in the context of a regression problem, where the variation in one variable, called the response variable $Y$, can be partly explained by the variation in other variables, called covariables $X$ (see Multiple Linear Regression). For example, variation in exam results $Y$ is mainly caused by variation in the abilities and diligence $X$ of the students, or variation in survival times $Y$ (see Survival Analysis) is primarily due to variation in environmental conditions $X$. Given the value of $X$, the best prediction of $Y$ (in terms of mean square error; see Estimation) is the mean $f(X)$ of $Y$ given $X$. We say that $Y$ is a function of $X$ plus noise:

\[
Y = f(X) + \text{noise}.
\]

The function $f$ is called a regression function. It is to be estimated from a sample of $n$ covariables and their responses, $(x_1, y_1), \ldots, (x_n, y_n)$.

Suppose $f$ is known up to a finite number $p$ of parameters $\beta = (\beta_1, \ldots, \beta_p)'$, that is, $f = f_\beta$. We estimate $\beta$ by the value $\hat{\beta}$ that gives the best fit to the data. The least squares estimator, denoted by $\hat{\beta}$, is that value of $b$ that minimizes

\[
\sum_{i=1}^{n} (y_i - f_b(x_i))^2, \tag{1}
\]

over all possible $b$.

The least squares criterion is a computationally convenient measure of fit. It corresponds to maximum likelihood estimation when the noise is normally distributed with equal variances. Other measures of fit are sometimes used, for example, least absolute deviations, which is more robust against outliers (see Robust Testing Procedures).

Linear Regression. Consider the case where $f_\beta$ is a linear function of $\beta$, that is,

\[
f_\beta(X) = X_1 \beta_1 + \cdots + X_p \beta_p. \tag{2}
\]

Here, $(X_1, \ldots, X_p)$ stand for the observed variables used in $f_\beta(X)$.

To write down the least squares estimator for the linear regression model, it will be convenient to use matrix notation. Let $y = (y_1, \ldots, y_n)'$ and let $X$ be the $n \times p$ data matrix of the $n$ observations on the $p$ variables

\[
X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,p} \\ \vdots & & \vdots \\ x_{n,1} & \cdots & x_{n,p} \end{pmatrix} = ( x_1 \ \cdots \ x_p ), \tag{3}
\]

where $x_j$ is the column vector containing the $n$ observations on variable $j$, $j = 1, \ldots, p$. Denote the squared length of an $n$-dimensional vector $v$ by $\|v\|^2 = v'v = \sum_{i=1}^{n} v_i^2$. Then expression (1) can be written as

\[
\|y - Xb\|^2,
\]

which is the squared distance between the vector $y$ and the linear combination $Xb$ of the columns of the matrix $X$. The distance is minimized by taking the projection of $y$ on the space spanned by the columns of $X$ (see Figure 1).

[Figure 1: The projection of the vector $y$ on the plane spanned by the columns of $X$.]

Suppose now that $X$ has full column rank, that is, no column of $X$ can be written as a linear combination of the other columns. Then, the least squares estimator $\hat{\beta}$ is given by

\[
\hat{\beta} = (X'X)^{-1} X'y. \tag{4}
\]
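For illustration, (4) can be evaluated with any numerical linear algebra library. The following minimal Python/NumPy sketch (the helper name is illustrative, not from the article) assumes only that X is an n-by-p array of full column rank and y an n-vector; it uses an orthogonal decomposition rather than forming the inverse explicitly, which yields the same estimate.

import numpy as np

def least_squares(X, y):
    """Least squares estimator beta_hat = (X'X)^{-1} X'y, as in (4)."""
    # np.linalg.lstsq solves the least squares problem via an orthogonal
    # decomposition; for full-column-rank X this equals (X'X)^{-1} X'y.
    beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat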
The Variance of the Least Squares Estimator. In order to construct confidence intervals for the components of $\hat{\beta}$, or for linear combinations of these components, one needs an estimator of the covariance matrix of $\hat{\beta}$. Now, it can be shown that, given $X$, the covariance matrix of the estimator $\hat{\beta}$ is equal to

\[
(X'X)^{-1} \sigma^2,
\]

where $\sigma^2$ is the variance of the noise. As an estimator of $\sigma^2$, we take

\[
\hat{\sigma}^2 = \frac{1}{n-p} \|y - X\hat{\beta}\|^2 = \frac{1}{n-p} \sum_{i=1}^{n} \hat{e}_i^2, \tag{5}
\]

where the $\hat{e}_i$ are the residuals

\[
\hat{e}_i = y_i - x_{i,1}\hat{\beta}_1 - \cdots - x_{i,p}\hat{\beta}_p. \tag{6}
\]

The covariance matrix of $\hat{\beta}$ can, therefore, be estimated by

\[
(X'X)^{-1} \hat{\sigma}^2.
\]

For example, the estimate of the variance of $\hat{\beta}_j$ is

\[
\widehat{\mathrm{var}}(\hat{\beta}_j) = \tau_j^2 \, \hat{\sigma}^2,
\]

where $\tau_j^2$ is the $j$th element on the diagonal of $(X'X)^{-1}$. A confidence interval for $\beta_j$ is now obtained by taking the least squares estimator $\hat{\beta}_j$ plus or minus a margin:

\[
\hat{\beta}_j \pm c \sqrt{\widehat{\mathrm{var}}(\hat{\beta}_j)}, \tag{7}
\]

where $c$ depends on the chosen confidence level. For a 95% confidence interval, the value $c = 1.96$ is a good approximation when $n$ is large. For smaller values of $n$, one usually takes a more conservative $c$, using the tables of the Student distribution with $n - p$ degrees of freedom.
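To make (5)–(7) concrete, here is a short Python/NumPy sketch added for illustration (the function name and the default c = 1.96 for large n are choices, not the article's code). It returns the estimate, the noise variance estimate, and a confidence interval for each coefficient.

import numpy as np

def ols_confidence_intervals(X, y, c=1.96):
    """Estimate beta, sigma^2, and c-margin confidence intervals, as in (5)-(7)."""
    n, p = X.shape
    beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat                       # the residuals of (6)
    sigma2_hat = residuals @ residuals / (n - p)       # (5)
    cov_hat = np.linalg.inv(X.T @ X) * sigma2_hat      # estimated covariance of beta_hat
    margin = c * np.sqrt(np.diag(cov_hat))             # c * sqrt(var_hat(beta_hat_j)), as in (7)
    return beta_hat, sigma2_hat, np.column_stack([beta_hat - margin, beta_hat + margin])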

Numerical Example. Consider a regression with constant, linear, and quadratic terms:

\[
f_\beta(X) = \beta_1 + X\beta_2 + X^2\beta_3. \tag{8}
\]

We take $n = 100$ and $x_i = i/n$, $i = 1, \ldots, n$. The matrix $X$ is now

\[
X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}. \tag{9}
\]

This gives

\[
X'X = \begin{pmatrix} 100 & 50.5 & 33.8350 \\ 50.5 & 33.8350 & 25.5025 \\ 33.8350 & 25.5025 & 20.5033 \end{pmatrix},
\qquad
(X'X)^{-1} = \begin{pmatrix} 0.0937 & -0.3729 & 0.3092 \\ -0.3729 & 1.9571 & -1.8189 \\ 0.3092 & -1.8189 & 1.8009 \end{pmatrix}. \tag{10}
\]

We simulated $n$ independent standard normal random variables $e_1, \ldots, e_n$, and calculated, for $i = 1, \ldots, n$,

\[
y_i = 1 - 3x_i + e_i. \tag{11}
\]

Thus, in this example, the parameters are

\[
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 1 \\ -3 \\ 0 \end{pmatrix}. \tag{12}
\]

Moreover, $\sigma^2 = 1$. Because this is a simulation, these values are known.

To calculate the least squares estimator, we need the values of $X'y$, which, in this case, turn out to be

\[
X'y = \begin{pmatrix} -64.2007 \\ -52.6743 \\ -42.2025 \end{pmatrix}. \tag{13}
\]

The least squares estimate is thus

\[
\hat{\beta} = \begin{pmatrix} 0.5778 \\ -2.3856 \\ -0.0446 \end{pmatrix}. \tag{14}
\]

From the data, we also calculated the estimated variance of the noise, and found the value

\[
\hat{\sigma}^2 = 0.883. \tag{15}
\]

The data are represented in Figure 2. The dashed line is the true regression $f(x) = 1 - 3x$. The solid line is the estimated regression $\hat{f}(x) = 0.5778 - 2.3856x - 0.0446x^2$.

[Figure 2: Observed data, the true regression $1 - 3x$ (dashed line), and the least squares estimate $0.5778 - 2.3856x - 0.0446x^2$ (solid line).]

The estimated regression is barely distinguishable from a straight line. Indeed, the value $\hat{\beta}_3 = -0.0446$ of the quadratic term is small. The estimated variance of $\hat{\beta}_3$ is

\[
\widehat{\mathrm{var}}(\hat{\beta}_3) = 1.8009 \times 0.883 = 1.5902. \tag{16}
\]

Using $c = 1.96$ in (7), we find the confidence interval

\[
\beta_3 \in -0.0446 \pm 1.96\sqrt{1.5902} = [-2.5162,\ 2.4270]. \tag{17}
\]

Thus, $\hat{\beta}_3$ is not significantly different from zero at the 5% level, and, hence, we do not reject the hypothesis $H_0: \beta_3 = 0$.
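The example can be replicated in a few lines. The Python/NumPy sketch below is an added illustration; it draws a fresh noise vector (the article's noise realization is not available), so its output will be close to, but not identical with, the values in (13)–(17).

import numpy as np

rng = np.random.default_rng()                  # any seed; the article's realization is unknown
n = 100
x = np.arange(1, n + 1) / n                    # x_i = i/n
X = np.column_stack([np.ones(n), x, x**2])     # design matrix (9)
y = 1 - 3 * x + rng.standard_normal(n)         # model (11) with beta = (1, -3, 0)'

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (4), via the normal equations
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 3)           # (5) with p = 3
var3_hat = np.linalg.inv(X.T @ X)[2, 2] * sigma2_hat
ci = (beta_hat[2] - 1.96 * np.sqrt(var3_hat),
      beta_hat[2] + 1.96 * np.sqrt(var3_hat))  # interval (17) for this realization
print(beta_hat, sigma2_hat, ci)                # compare with (14), (15), (17)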
Below, we will consider general test statistics for testing hypotheses on $\beta$. In this particular case, the test statistic takes the form

\[
T^2 = \frac{\hat{\beta}_3^2}{\widehat{\mathrm{var}}(\hat{\beta}_3)} = 0.0012. \tag{18}
\]

Using this test statistic is equivalent to the above method based on the confidence interval. Indeed, as $T^2 < (1.96)^2$, we do not reject the hypothesis $H_0: \beta_3 = 0$.

Under the hypothesis $H_0: \beta_3 = 0$, we use the least squares estimator

\[
\begin{pmatrix} \hat{\beta}_{1,0} \\ \hat{\beta}_{2,0} \end{pmatrix} = (X_0'X_0)^{-1} X_0'y = \begin{pmatrix} 0.5854 \\ -2.4306 \end{pmatrix}. \tag{19}
\]

Here,

\[
X_0 = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}. \tag{20}
\]

It is important to note that setting $\beta_3$ to zero changes the values of the least squares estimates of $\beta_1$ and $\beta_2$:

\[
\begin{pmatrix} \hat{\beta}_{1,0} \\ \hat{\beta}_{2,0} \end{pmatrix} \neq \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}. \tag{21}
\]

This is because $\hat{\beta}_3$ is correlated with $\hat{\beta}_1$ and $\hat{\beta}_2$. One may verify that the correlation matrix of $\hat{\beta}$ is

\[
\begin{pmatrix} 1 & -0.8708 & 0.7529 \\ -0.8708 & 1 & -0.9689 \\ 0.7529 & -0.9689 & 1 \end{pmatrix}.
\]

Testing Linear Hypotheses. The testing problem considered in the numerical example is a special case of testing a linear hypothesis $H_0: A\beta = 0$, where $A$ is some $r \times p$ matrix. As another example of such a hypothesis, suppose we want to test whether two coefficients are equal, say $H_0: \beta_1 = \beta_2$. This means there is one restriction, $r = 1$, and we can take $A$ as the $1 \times p$ row vector

\[
A = (1, -1, 0, \ldots, 0). \tag{22}
\]

In general, we assume that there are no linear dependencies among the $r$ restrictions $A\beta = 0$. To test the linear hypothesis, we use the statistic

\[
T^2 = \frac{\|X\hat{\beta}_0 - X\hat{\beta}\|^2 / r}{\hat{\sigma}^2}, \tag{23}
\]

where $\hat{\beta}_0$ is the least squares estimator under $H_0: A\beta = 0$. In the numerical example, this statistic takes the form given in (18). When the noise is normally distributed, critical values can be found in a table for the F distribution with $r$ and $n - p$ degrees of freedom. For large $n$, approximate critical values are in the table of the $\chi^2$ distribution with $r$ degrees of freedom.
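As an added sketch (not the article's own code), the statistic in (23) can be computed for a general restriction by using the standard restricted least squares formula $\hat{\beta}_0 = \hat{\beta} - (X'X)^{-1}A'[A(X'X)^{-1}A']^{-1}A\hat{\beta}$, which is assumed here rather than derived in the text. The helper below also returns a p-value from the F distribution with r and n − p degrees of freedom.

import numpy as np
from scipy import stats

def test_linear_hypothesis(X, y, A):
    """T^2 from (23) for H0: A beta = 0, with an F-based p-value."""
    n, p = X.shape
    r = A.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    # restricted least squares estimator under A beta = 0
    beta_0 = beta_hat - XtX_inv @ A.T @ np.linalg.solve(A @ XtX_inv @ A.T, A @ beta_hat)
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    T2 = np.sum((X @ beta_0 - X @ beta_hat) ** 2) / r / sigma2_hat
    return T2, stats.f.sf(T2, r, n - p)

# For H0: beta_3 = 0 in the numerical example, take A = np.array([[0.0, 0.0, 1.0]]).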
Some Extensions

Weighted Least Squares. In many cases, the variance $\sigma_i^2$ of the noise at measurement $i$ depends on $x_i$. Observations where $\sigma_i^2$ is large are less accurate and, hence, should play a smaller role in the estimation of $\beta$. The weighted least squares estimator is that value of $b$ that minimizes the criterion

\[
\sum_{i=1}^{n} \frac{(y_i - f_b(x_i))^2}{\sigma_i^2}
\]

over all possible $b$. In the linear case, this criterion is numerically of the same form as before, since we can make the change of variables $\tilde{y}_i = y_i/\sigma_i$ and $\tilde{x}_{i,j} = x_{i,j}/\sigma_i$.

The minimum $\chi^2$-estimator (see Estimation) is an example of a weighted least squares estimator in the context of density estimation.
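A minimal sketch of this rescaling, added here for illustration and assuming the standard deviations sigma_i are known (or known up to a common factor):

import numpy as np

def weighted_least_squares(X, y, sigma):
    """WLS via the change of variables y_i/sigma_i and x_{i,j}/sigma_i."""
    sigma = np.asarray(sigma, dtype=float)
    Xw = X / sigma[:, None]      # divide row i of X by sigma_i
    yw = y / sigma
    beta_hat, _, _, _ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta_hat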
Nonlinear Regression. When $f_\beta$ is a nonlinear function of $\beta$, one usually needs iterative algorithms to find the least squares estimator. The variance can then be approximated as in the linear case, with $\dot{f}_{\hat{\beta}}(x_i)$ taking the role of the rows of $X$. Here, $\dot{f}_\beta(x_i) = \partial f_\beta(x_i)/\partial \beta$ is the row vector of derivatives of $f_\beta(x_i)$. For more details, see, for example, [4].
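For illustration of the iterative approach (again an addition, not the article's code), scipy.optimize.least_squares can minimize the residual sum of squares for a nonlinear model; the exponential-decay model and starting values below are hypothetical.

import numpy as np
from scipy.optimize import least_squares

def fit_exponential(x, y, beta_init=(1.0, 1.0)):
    """Nonlinear least squares for the hypothetical model f_beta(x) = b0 * exp(-b1 * x)."""
    def residuals(beta):
        return y - beta[0] * np.exp(-beta[1] * x)
    result = least_squares(residuals, beta_init)
    # result.jac is the Jacobian of the residuals at the solution; up to sign, its
    # rows play the role of the derivative vectors f_dot(x_i) in the variance
    # approximation described above.
    return result.x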
Nonparametric Regression. In nonparametric regression, one only assumes a certain amount of smoothness for $f$ (e.g., as in [1]), or alternatively, certain qualitative assumptions such as monotonicity (see [3]). Many nonparametric least squares procedures have been developed, and their numerical and theoretical behavior is discussed in the literature. Related developments include estimation methods for models where the number of parameters $p$ is about as large as the number of observations $n$. The curse of dimensionality in such models is handled by applying various complexity regularization techniques (see, e.g., [2]).

References

[1] Green, P.J. & Silverman, B.W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, Chapman & Hall, London.
[2] Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, New York.
[3] Robertson, T., Wright, F.T. & Dykstra, R.L. (1988). Order Restricted Statistical Inference, Wiley, New York.
[4] Seber, G.A.F. & Wild, C.J. (2003). Nonlinear Regression, Wiley, New York.

SARA A. VAN DE GEER
