
Recursive Least Squares Estimation

(Com 477/577 Notes)

Yan-Bin Jia
Dec 9, 2014

Estimation of a Constant

We start with estimation of a constant based on several noisy measurements. Suppose we have a
resistor but do not know its resistance. So we measure it several times using a cheap (and noisy)
multimeter. How do we come up with a good estimate of the resistance based on these noisy
measurements?
More formally, suppose $x = (x_1, x_2, \ldots, x_n)^T$ is a constant but unknown vector, and $y = (y_1, y_2, \ldots, y_l)^T$ is an $l$-element noisy measurement vector. Our task is to find the best estimate of $x$. Here we look at perhaps the simplest case, where each $y_i$ is a linear combination of the $x_j$, $1 \le j \le n$, plus some measurement noise $\epsilon_i$. Thus, we are working with the following linear system,
$$y = Hx + \epsilon,$$
where $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_l)^T$ and $H$ is an $l \times n$ matrix; or, with all terms listed,
$$\begin{pmatrix} y_1 \\ \vdots \\ y_l \end{pmatrix} =
\begin{pmatrix} H_{11} & \cdots & H_{1n} \\ \vdots & \ddots & \vdots \\ H_{l1} & \cdots & H_{ln} \end{pmatrix}
\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} +
\begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_l \end{pmatrix}.$$

Given an estimate $\hat{x}$, we consider the difference between the noisy measurements and the projected values $H\hat{x}$:
$$\epsilon = y - H\hat{x}.$$
Under the least squares principle, we will try to find the value of $\hat{x}$ that minimizes the cost function
$$J(\hat{x}) = \epsilon^T\epsilon = (y - H\hat{x})^T(y - H\hat{x}) = y^Ty - \hat{x}^TH^Ty - y^TH\hat{x} + \hat{x}^TH^TH\hat{x}.$$
The necessary condition for the minimum is the vanishing of the partial derivative of $J$ with respect to $\hat{x}$, that is,
$$\frac{\partial J}{\partial \hat{x}} = -2y^TH + 2\hat{x}^TH^TH = 0.$$

The material is adapted from Sections 3.1–3.3 in Dan Simon's book Optimal State Estimation [1].

We solve the equation, obtaining
$$\hat{x} = (H^TH)^{-1}H^Ty. \tag{1}$$
The inverse $(H^TH)^{-1}$ exists if $l \ge n$ and $H$ has full column rank; in other words, when the number of measurements is at least the number of unknowns and the measurements are linearly independent.
Example 1. Suppose we are trying to estimate the resistance $x$ of an unmarked resistor based on $l$ noisy measurements using a multimeter. In this case,
$$y = Hx + \epsilon, \tag{2}$$
where
$$H = (1, \ldots, 1)^T. \tag{3}$$
Substitution of the above into equation (1) gives us the optimal estimate of $x$ as
$$\hat{x} = (H^TH)^{-1}H^Ty = \frac{1}{l}H^Ty = \frac{y_1 + \cdots + y_l}{l}.$$
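As a quick sanity check (not part of the notes), the following Python sketch evaluates equation (1) with NumPy for the setup of Example 1; the resistance value, noise level, and sample size are made up for illustration.

```python
import numpy as np

# Least squares estimate of a constant, equation (1), for Example 1's setup:
# H = (1, ..., 1)^T, so the estimate reduces to the sample mean.
rng = np.random.default_rng(0)

x_true = 100.0                     # hypothetical resistance in ohms
l = 20                             # number of noisy measurements
H = np.ones((l, 1))
y = H @ np.array([x_true]) + rng.normal(0.0, 2.0, size=l)

x_hat = np.linalg.solve(H.T @ H, H.T @ y)   # equation (1)
print(x_hat.item(), y.mean())               # the two values agree
```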

Weighted Least Squares Estimation

So far we have placed equal confidence on all the measurements. Now we look at varying confidence
in the measurements. For instance, some of our measurements of an unmarked resistor were taken
with an expensive multimeter with low noise, while others were taken with a cheap multimeter by
a tired student late at night. Even though the second set of measurements is less reliable, we could
get some information about the resistance. We should never throw away measurements, no matter
how unreliable they may seem. This will be shown in this section.
We assume that each measurement $y_i$, $1 \le i \le l$, may be taken under a different condition, so that the variance $\sigma_i^2$ of the measurement noise may be distinct too:
$$E(\epsilon_i^2) = \sigma_i^2, \qquad 1 \le i \le l.$$

Assume that the noise for each measurement has zero mean and is independent of the others. The covariance matrix for all measurement noise is
$$R = E(\epsilon\epsilon^T) = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_l^2 \end{pmatrix}.$$
Write the difference $y - H\hat{x}$ as $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_l)^T$. We will minimize the sum of squared differences weighted by the variances of the measurements:
$$J(\hat{x}) = \epsilon^TR^{-1}\epsilon = \frac{\epsilon_1^2}{\sigma_1^2} + \frac{\epsilon_2^2}{\sigma_2^2} + \cdots + \frac{\epsilon_l^2}{\sigma_l^2}.$$
If a measurement $y_i$ is noisy, we care less about the discrepancy between it and the $i$th element of $H\hat{x}$, because we do not have much confidence in this measurement. The cost function $J$ can be expanded as follows:
$$J(\hat{x}) = (y - H\hat{x})^TR^{-1}(y - H\hat{x}) = y^TR^{-1}y - \hat{x}^TH^TR^{-1}y - y^TR^{-1}H\hat{x} + \hat{x}^TH^TR^{-1}H\hat{x}.$$
At a minimum, the partial derivative of $J$ must vanish, yielding
$$\frac{\partial J}{\partial \hat{x}} = -2y^TR^{-1}H + 2\hat{x}^TH^TR^{-1}H = 0.$$
Immediately, we solve the above equation for the best estimate of $x$:
$$\hat{x} = (H^TR^{-1}H)^{-1}H^TR^{-1}y. \tag{4}$$
Note that the measurement noise matrix R must be non-singular for a solution to exist. In other
words, each measurement yi must be corrupted by some noise for the estimation method to work.
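The sketch below (not from the notes; the matrix, noise levels, and true values are invented) evaluates equation (4) with NumPy for a generic $H$, showing how noisier measurements are down-weighted.

```python
import numpy as np

# Weighted least squares, equation (4), with a made-up 6 x 3 measurement matrix
# and per-measurement noise standard deviations.
rng = np.random.default_rng(1)

x_true = np.array([1.0, -2.0, 0.5])
H = rng.normal(size=(6, 3))
sigma = np.array([0.1, 0.1, 0.1, 1.0, 1.0, 1.0])    # first three rows reliable
y = H @ x_true + rng.normal(0.0, sigma)

R_inv = np.diag(1.0 / sigma**2)
x_hat = np.linalg.solve(H.T @ R_inv @ H, H.T @ R_inv @ y)   # equation (4)
print(x_hat)   # close to x_true; the noisy rows contribute less
```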
Example 2. We get back to the problem in Example 1 of resistance estimation, for which the equations are given in (2) and (3). Suppose each of the $l$ noisy measurements has variance
$$E(\epsilon_i^2) = \sigma_i^2.$$
The measurement noise covariance is given as
$$R = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_l^2).$$
Substituting $H$, $R$, $y$ into (4), we obtain the estimate
$$\hat{x} = \left[(1, \ldots, 1)\begin{pmatrix} 1/\sigma_1^2 & & \\ & \ddots & \\ & & 1/\sigma_l^2 \end{pmatrix}\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\right]^{-1}
(1, \ldots, 1)\begin{pmatrix} 1/\sigma_1^2 & & \\ & \ddots & \\ & & 1/\sigma_l^2 \end{pmatrix}\begin{pmatrix} y_1 \\ \vdots \\ y_l \end{pmatrix}
= \left(\sum_{i=1}^l \frac{1}{\sigma_i^2}\right)^{-1}\sum_{i=1}^l \frac{y_i}{\sigma_i^2}.$$
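A minimal numerical illustration of Example 2 (all values invented): the weighted estimate from (4) coincides with the variance-weighted average derived above.

```python
import numpy as np

# Weighted estimate of a constant: equation (4) versus the weighted average.
rng = np.random.default_rng(2)

x_true = 100.0
sigma = np.array([0.5, 0.5, 5.0, 5.0, 5.0])     # mixed-quality multimeters
y = x_true + rng.normal(0.0, sigma)

H = np.ones((len(y), 1))
R_inv = np.diag(1.0 / sigma**2)
x_hat = np.linalg.solve(H.T @ R_inv @ H, H.T @ R_inv @ y)

x_weighted = np.sum(y / sigma**2) / np.sum(1.0 / sigma**2)
print(x_hat.item(), x_weighted)                 # identical up to round-off
```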

Recursive Least Squares Estimation

Equation (4) is adequate when we have made all the measurements. More often, we obtain measurements sequentially and want to update our estimate with each new measurement. In this case, the matrix $H$ needs to be augmented, and we would have to recompute the estimate $\hat{x}$ according to (4) for every new measurement. This update can become very expensive, and the overall computation can become prohibitive as the number of measurements becomes large.
This section shows how to recursively compute the weighted least squares estimate. More specifically, suppose we have an estimate $\hat{x}_{k-1}$ after $k-1$ measurements, and obtain a new measurement $y_k$. To be general, every measurement is now an $m$-vector with values yielded by, say, several measuring instruments. How can we update the estimate to $\hat{x}_k$ without solving equation (4)?

A linear recursive estimator can be written in the following form:
$$y_k = H_kx + \epsilon_k,$$
$$\hat{x}_k = \hat{x}_{k-1} + K_k(y_k - H_k\hat{x}_{k-1}). \tag{5}$$
Here $H_k$ is an $m \times n$ matrix, and $K_k$ is $n \times m$ and referred to as the estimator gain matrix. We refer to $y_k - H_k\hat{x}_{k-1}$ as the correction term. Namely, the new estimate $\hat{x}_k$ is obtained from the previous estimate $\hat{x}_{k-1}$ by adding a correction weighted by the gain matrix. The measurement noise has zero mean, i.e., $E(\epsilon_k) = 0$.
The current estimation error is
$$\begin{aligned}
e_k &= x - \hat{x}_k \\
    &= x - \hat{x}_{k-1} - K_k(y_k - H_k\hat{x}_{k-1}) \\
    &= e_{k-1} - K_k(H_kx + \epsilon_k - H_k\hat{x}_{k-1}) \\
    &= e_{k-1} - K_kH_k(x - \hat{x}_{k-1}) - K_k\epsilon_k \\
    &= (I - K_kH_k)e_{k-1} - K_k\epsilon_k,
\end{aligned} \tag{6}$$
where $I$ is the $n \times n$ identity matrix and $e_{k-1} = x - \hat{x}_{k-1}$ is the previous estimation error. The mean of this error is then
$$E(e_k) = (I - K_kH_k)E(e_{k-1}) - K_kE(\epsilon_k).$$
If $E(\epsilon_k) = 0$ and $E(e_{k-1}) = 0$, then $E(e_k) = 0$. So if the measurement noise $\epsilon_k$ has zero mean for all $k$, and the initial estimate of $x$ is set equal to its expected value, then $E(\hat{x}_k) = E(x)$ for all $k$. With this property, the estimator (5) is called unbiased. The property holds regardless of the value of the gain matrix $K_k$. It says that on average the estimate $\hat{x}_k$ will be equal to the true value $x$.
The key is to determine the optimal value of the gain matrix $K_k$. The optimality criterion we use is to minimize the aggregated variance of the estimation errors at time $k$:
$$J_k = E(\|x - \hat{x}_k\|^2) = E(e_k^Te_k) = E\big(\mathrm{Tr}(e_ke_k^T)\big) = \mathrm{Tr}(P_k), \tag{7}$$
where $\mathrm{Tr}$ is the trace operator¹, and the $n \times n$ matrix $P_k = E(e_ke_k^T)$ is the estimation-error covariance. Next, we obtain $P_k$ with a substitution of (6):
$$\begin{aligned}
P_k &= E\Big(\big[(I - K_kH_k)e_{k-1} - K_k\epsilon_k\big]\big[(I - K_kH_k)e_{k-1} - K_k\epsilon_k\big]^T\Big) \\
    &= (I - K_kH_k)E(e_{k-1}e_{k-1}^T)(I - K_kH_k)^T - K_kE(\epsilon_ke_{k-1}^T)(I - K_kH_k)^T \\
    &\qquad - (I - K_kH_k)E(e_{k-1}\epsilon_k^T)K_k^T + K_kE(\epsilon_k\epsilon_k^T)K_k^T.
\end{aligned}$$
The estimation error $e_{k-1}$ at time $k-1$ is independent of the measurement noise $\epsilon_k$ at time $k$, which implies that
$$E(\epsilon_ke_{k-1}^T) = E(\epsilon_k)E(e_{k-1}^T) = 0, \qquad E(e_{k-1}\epsilon_k^T) = E(e_{k-1})E(\epsilon_k^T) = 0.$$
¹The trace of a matrix is the sum of its diagonal elements.

Given the definition of the $m \times m$ matrix $R_k = E(\epsilon_k\epsilon_k^T)$ as the covariance of $\epsilon_k$, the expression for $P_k$ becomes
$$P_k = (I - K_kH_k)P_{k-1}(I - K_kH_k)^T + K_kR_kK_k^T. \tag{8}$$
Equation (8) is the recurrence for the covariance of the least squares estimation error. It is consistent with the intuition that as the measurement noise ($R_k$) increases, the uncertainty ($P_k$) increases. Note that $P_k$, as a covariance matrix, is positive definite.
What remains is to find the value of the gain matrix $K_k$ that minimizes the cost function given by (7). The mean of the estimation error is already zero regardless of the value of $K_k$; the minimizing value of $K_k$ additionally keeps the estimation error consistently small. We need to differentiate $J_k$ with respect to $K_k$.
The derivative of a function $f$ with respect to a matrix $A = (a_{ij})$ is the matrix $\frac{\partial f}{\partial A} = \big(\frac{\partial f}{\partial a_{ij}}\big)$.
Theorem 1  Let $X$ be an $r \times s$ matrix, and let $C$ be a matrix that does not depend on $X$, of dimension $r \times s$ in (9) and $s \times s$ in (10). Then the following hold:
$$\frac{\partial}{\partial X}\mathrm{Tr}(CX^T) = C, \tag{9}$$
$$\frac{\partial}{\partial X}\mathrm{Tr}(XCX^T) = XC + XC^T. \tag{10}$$
A proof of the theorem is given in Appendix A. In the case that $C$ is symmetric, $\frac{\partial}{\partial X}\mathrm{Tr}(XCX^T) = 2XC$. With these facts in mind, we first substitute (8) into (7) and then differentiate the resulting expression with respect to $K_k$:
$$\begin{aligned}
\frac{\partial J_k}{\partial K_k}
 &= \frac{\partial}{\partial K_k}\mathrm{Tr}\Big(P_{k-1} - K_kH_kP_{k-1} - P_{k-1}H_k^TK_k^T + K_k(H_kP_{k-1}H_k^T)K_k^T\Big) + \frac{\partial}{\partial K_k}\mathrm{Tr}(K_kR_kK_k^T) \\
 &= -2\,\frac{\partial}{\partial K_k}\mathrm{Tr}(P_{k-1}H_k^TK_k^T) + 2K_k(H_kP_{k-1}H_k^T) + 2K_kR_k && \text{(by (10))} \\
 &= -2P_{k-1}H_k^T + 2K_kH_kP_{k-1}H_k^T + 2K_kR_k && \text{(by (9))} \\
 &= -2P_{k-1}H_k^T + 2K_k(H_kP_{k-1}H_k^T + R_k).
\end{aligned}$$
In the second equation above, we also used that $P_{k-1}$ does not depend on $K_k$ and that $K_kH_kP_{k-1}$ and $P_{k-1}H_k^TK_k^T$ are transposes of each other (since $P_{k-1}$ is symmetric), so their traces are equal. Setting the partial derivative to zero, we solve for $K_k$:
$$K_k = P_{k-1}H_k^T(H_kP_{k-1}H_k^T + R_k)^{-1}. \tag{11}$$
Write $S_k = H_kP_{k-1}H_k^T + R_k$, so that
$$K_k = P_{k-1}H_k^TS_k^{-1}. \tag{12}$$

Substitute the above expression (12) for $K_k$ into equation (8) for $P_k$. This substitution, followed by an expansion, leads to a few steps of manipulation as follows:
$$\begin{aligned}
P_k &= (I - P_{k-1}H_k^TS_k^{-1}H_k)P_{k-1}(I - P_{k-1}H_k^TS_k^{-1}H_k)^T + P_{k-1}H_k^TS_k^{-1}R_kS_k^{-1}H_kP_{k-1} \\
    &= P_{k-1} - P_{k-1}H_k^TS_k^{-1}H_kP_{k-1} - P_{k-1}H_k^TS_k^{-1}H_kP_{k-1} \\
    &\qquad + P_{k-1}H_k^TS_k^{-1}(H_kP_{k-1}H_k^T)S_k^{-1}H_kP_{k-1} + P_{k-1}H_k^TS_k^{-1}R_kS_k^{-1}H_kP_{k-1} \\
    &= P_{k-1} - 2P_{k-1}H_k^TS_k^{-1}H_kP_{k-1} + P_{k-1}H_k^TS_k^{-1}S_kS_k^{-1}H_kP_{k-1} \\
    &\qquad\text{(after merging $H_kP_{k-1}H_k^T$ and $R_k$ into $S_k$)} \\
    &= P_{k-1} - 2P_{k-1}H_k^TS_k^{-1}H_kP_{k-1} + P_{k-1}H_k^TS_k^{-1}H_kP_{k-1} \\
    &= P_{k-1} - P_{k-1}H_k^TS_k^{-1}H_kP_{k-1} && (13) \\
    &= P_{k-1} - K_kH_kP_{k-1} && \text{(by (12))} \\
    &= (I - K_kH_k)P_{k-1}. && (14)
\end{aligned}$$

Note that in the above $P_k$ is symmetric as a covariance matrix, and so is $S_k$.


We take the inverses of both sides of equation (13) and plug into the expression for Sk . Expansion and merging of terms yield
1
Pk1 = Pk1
+ HkT Rk1 Hk ,

from which we obtain an alternative expression for the convariance matrix:


1
1
.
Pk = Pk1
+ HkT Rk1 Hk

(15)

(16)

This expression is more complicated than (14) since it requires three matrix inversions. Nevertheless, it has computational advantages in certain situations in practice [1, pp. 156–158].
We can also derive an alternative form for the gain $K_k$ as follows. Start by multiplying the right-hand side of (11) on the left by $P_kP_k^{-1}$ (which is the identity). Then substitute (15) for $P_k^{-1}$ in the resulting expression. Distribute the factor $P_{k-1}H_k^T$ into the parenthesized sum on its left, and extract $H_k^TR_k^{-1}$ out of the parentheses. The last two parenthesized factors cancel each other, yielding
$$K_k = P_kH_k^TR_k^{-1}. \tag{17}$$
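The following NumPy check (with arbitrary, made-up matrices) confirms numerically that the two routes agree: the covariance-form update (11)/(14) and the information-form update (16)/(17) produce the same $K_k$ and $P_k$.

```python
import numpy as np

# Numerical check that equations (11)/(14) and (16)/(17) give the same result.
rng = np.random.default_rng(3)

n, m = 3, 2
P_prev = np.eye(n)                       # P_{k-1}
H = rng.normal(size=(m, n))              # H_k
R = np.diag([0.1, 0.2])                  # R_k

# Covariance form: (11) then (14)
S = H @ P_prev @ H.T + R
K1 = P_prev @ H.T @ np.linalg.inv(S)
P1 = (np.eye(n) - K1 @ H) @ P_prev

# Information form: (16) then (17)
P2 = np.linalg.inv(np.linalg.inv(P_prev) + H.T @ np.linalg.inv(R) @ H)
K2 = P2 @ H.T @ np.linalg.inv(R)

print(np.allclose(P1, P2), np.allclose(K1, K2))   # True True
```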

The Estimation Algorithm

The algorithm for recursive least squares estimation is summarized as follows.
1. Initialize the estimator:
$$\hat{x}_0 = E(x),$$
$$P_0 = E\big((x - \hat{x}_0)(x - \hat{x}_0)^T\big).$$
In the case of no prior knowledge about $x$, simply let $P_0 = \infty I$. In the case of perfect prior knowledge, let $P_0 = 0$.
2. Iterate the following two steps.
(a) Obtain a new measurement $y_k$, assuming that it is given by the equation
$$y_k = H_kx + \epsilon_k,$$
where the noise $\epsilon_k$ has zero mean and covariance $R_k$. The measurement noise at each time step $k$ is independent of that at the other steps, so
$$E(\epsilon_i\epsilon_j^T) = \begin{cases} 0, & \text{if } i \ne j, \\ R_i, & \text{if } i = j. \end{cases}$$
Essentially, we assume white measurement noise.

(b) Update the estimate $\hat{x}$ and the covariance of the estimation error sequentially according to (11), (5), and (14), which are re-listed below:
$$K_k = P_{k-1}H_k^T(H_kP_{k-1}H_k^T + R_k)^{-1}, \tag{18}$$
$$\hat{x}_k = \hat{x}_{k-1} + K_k(y_k - H_k\hat{x}_{k-1}), \tag{19}$$
$$P_k = (I - K_kH_k)P_{k-1}, \tag{20}$$
or according to (16), (17), and (19):
$$P_k = \big(P_{k-1}^{-1} + H_k^TR_k^{-1}H_k\big)^{-1},$$
$$K_k = P_kH_k^TR_k^{-1},$$
$$\hat{x}_k = \hat{x}_{k-1} + K_k(y_k - H_k\hat{x}_{k-1}).$$
Note that (19) and (20) can switch their order in one round of the update. A small code sketch of one round of this update is given after the list.
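Here is a minimal Python sketch of one round of the update in equations (18)–(20), followed by a toy usage loop; the function name, the toy system, and all numerical values are my own choices, not from the notes.

```python
import numpy as np

def rls_update(x_hat, P, H, y, R):
    """One round of recursive least squares, equations (18)-(20)."""
    S = H @ P @ H.T + R                    # innovation covariance S_k
    K = P @ H.T @ np.linalg.inv(S)         # gain K_k, equation (18)
    x_hat = x_hat + K @ (y - H @ x_hat)    # estimate update, equation (19)
    P = (np.eye(len(x_hat)) - K @ H) @ P   # covariance update, equation (20)
    return x_hat, P

# Toy usage: scalar measurements of an unknown 2-vector x.
rng = np.random.default_rng(4)
x_true = np.array([1.0, -2.0])
x_hat, P = np.zeros(2), 100.0 * np.eye(2)  # vague prior
R = np.array([[0.01]])
for _ in range(50):
    H = rng.normal(size=(1, 2))
    y = H @ x_true + rng.normal(0.0, 0.1, size=1)
    x_hat, P = rls_update(x_hat, P, H, y, R)
print(x_hat)                               # approaches x_true
```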
Example 3. We revisit the resistance estimation problem presented in Examples 1 and 2. Now we want to iteratively improve our estimate of the resistance $x$. At the $k$th sampling, our measurement is
$$y_k = H_kx + \epsilon_k = x + \epsilon_k, \qquad R_k = E(\epsilon_k^2).$$
Here, the measurement matrix $H_k$ is simply the scalar 1. Furthermore, we suppose that each measurement has the same covariance, so $R_k$ is a constant written as $R$.
Before the first measurement, we have some idea about the resistance $x$; this becomes our initial estimate. Also, we have some uncertainty about this initial estimate, which becomes our initial covariance. Together we have
$$\hat{x}_0 = E(x), \qquad P_0 = E\big((x - \hat{x}_0)^2\big).$$
If we have no idea about the resistance, set $P_0 = \infty$. If we are certain about the resistance value, set $P_0 = 0$. (Of course, then there would be no need to take measurements.)
After the first measurement ($k = 1$), we update the estimate and the error covariance according to equations (18)–(20) as follows:
$$K_1 = \frac{P_0}{P_0 + R},$$
$$\hat{x}_1 = \hat{x}_0 + \frac{P_0}{P_0 + R}(y_1 - \hat{x}_0),$$
$$P_1 = \left(1 - \frac{P_0}{P_0 + R}\right)P_0 = \frac{P_0R}{P_0 + R}.$$

After the second measurement, the estimates become
$$K_2 = \frac{P_1}{P_1 + R} = \frac{P_0}{2P_0 + R},$$
$$\hat{x}_2 = \hat{x}_1 + \frac{P_1}{P_1 + R}(y_2 - \hat{x}_1) = \frac{P_0 + R}{2P_0 + R}\,\hat{x}_1 + \frac{P_0}{2P_0 + R}\,y_2,$$
$$P_2 = \frac{P_1R}{P_1 + R} = \frac{P_0R}{2P_0 + R}.$$

By induction, we can show that
$$K_k = \frac{P_0}{kP_0 + R},$$
$$\hat{x}_k = \frac{(k-1)P_0 + R}{kP_0 + R}\,\hat{x}_{k-1} + \frac{P_0}{kP_0 + R}\,y_k,$$
$$P_k = \frac{P_0R}{kP_0 + R}.$$
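As an optional check (constants invented), the short loop below runs the scalar recursion (18)–(20) and compares $K_k$ and $P_k$ with the closed forms just stated.

```python
import numpy as np

# Verify the induction formulas for K_k and P_k against the scalar recursion.
P0, R = 2.0, 0.5
P = P0
for k in range(1, 6):
    K = P / (P + R)            # gain with H_k = 1
    P = (1.0 - K) * P
    print(k,
          np.isclose(K, P0 / (k * P0 + R)),
          np.isclose(P, P0 * R / (k * P0 + R)))   # True, True for every k
```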

Note that if $x$ is known perfectly a priori, then $P_0 = 0$, which implies that $K_k = 0$ and $\hat{x}_k = \hat{x}_0$ for all $k$. The optimal estimate of $x$ is then independent of any measurements that are obtained. At the opposite end of the spectrum, if $x$ is completely unknown a priori, then $P_0 = \infty$. The above equation for $\hat{x}_k$ becomes
$$\begin{aligned}
\hat{x}_k &= \lim_{P_0\to\infty}\left(\frac{(k-1)P_0 + R}{kP_0 + R}\,\hat{x}_{k-1} + \frac{P_0}{kP_0 + R}\,y_k\right) \\
          &= \frac{k-1}{k}\,\hat{x}_{k-1} + \frac{1}{k}\,y_k \\
          &= \frac{1}{k}\big((k-1)\hat{x}_{k-1} + y_k\big).
\end{aligned}$$

The right-hand side of the last equation above is just the running average $\bar{y}_k = \frac{1}{k}\sum_{j=1}^k y_j$ of the measurements. To see this, we first have
$$\sum_{j=1}^k y_j = \sum_{j=1}^{k-1} y_j + y_k = (k-1)\left(\frac{1}{k-1}\sum_{j=1}^{k-1} y_j\right) + y_k = (k-1)\bar{y}_{k-1} + y_k,$$
so that $\bar{y}_k = \frac{1}{k}\big((k-1)\bar{y}_{k-1} + y_k\big)$. Since $\hat{x}_1 = y_1 = \bar{y}_1$, the recurrences for $\hat{x}_k$ and $\bar{y}_k$ are the same. Hence $\hat{x}_k = \bar{y}_k$ for all $k$.
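An illustrative check of this fact (numbers made up): running the scalar recursion with a very large $P_0$, standing in for $P_0 = \infty$, reproduces the running average.

```python
import numpy as np

# With a huge P_0 the recursive estimate tracks the running average exactly.
rng = np.random.default_rng(5)
R, P = 1.0, 1e12                         # P_0 = 1e12 approximates infinity
x_hat = 0.0                              # initial estimate is irrelevant here
y_all = 100.0 + rng.normal(0.0, 1.0, size=200)

for y in y_all:
    K = P / (P + R)                      # H_k = 1
    x_hat = x_hat + K * (y - x_hat)
    P = (1.0 - K) * P

print(x_hat, y_all.mean())               # essentially identical
```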
Example 4. Suppose that a tank contains a concentration $x_1$ of chemical 1 and a concentration $x_2$ of chemical 2. We have an instrument that detects the combined concentration $x_1 + x_2$ of the two chemicals but is not able to tell the values of $x_1$ and $x_2$ apart. Chemical 2 leaks from the tank so that its concentration decreases by 1% from one measurement to the next. The measurement equation is given as
$$y_k = x_1 + 0.99^{k-1}x_2 + \epsilon_k,$$
where $H_k = (1, 0.99^{k-1})$ and $\epsilon_k$ is a random variable with zero mean and variance $R = 0.01$.
Let the real values be $x = (x_1, x_2)^T = (10, 5)^T$. Suppose the initial estimates of the two concentrations are 8 and 7, that is, $\hat{x}_0 = (8, 7)^T$, with $P_0$ equal to the identity matrix. We apply the recursive least squares algorithm. Figure 3.1 of [1] (p. 92) shows the evolution of the two estimated concentrations, along with that of the variances of the estimation errors. It can be seen that after a couple dozen measurements the estimates get very close to the true values 10 and 5, while the variances of the estimation errors asymptotically approach zero. This means that we have increasingly more confidence in the estimates as more measurements are obtained.
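A small simulation of this example (random seed and loop length are my own choices; the actual plots are in [1]) exhibits the same qualitative behavior.

```python
import numpy as np

# Recursive least squares for the two-chemical problem of Example 4.
rng = np.random.default_rng(6)

x_true = np.array([10.0, 5.0])
x_hat = np.array([8.0, 7.0])             # initial estimate x_hat_0
P = np.eye(2)                            # initial covariance P_0
R = 0.01

for k in range(1, 101):
    H = np.array([[1.0, 0.99 ** (k - 1)]])      # 1 x 2 measurement matrix H_k
    y = H @ x_true + rng.normal(0.0, np.sqrt(R))
    S = H @ P @ H.T + R
    K = P @ H.T / S                              # S is 1 x 1 here
    x_hat = x_hat + (K @ (y - H @ x_hat)).ravel()
    P = (np.eye(2) - K @ H) @ P

print(x_hat, np.diag(P))   # estimates near (10, 5); error variances shrink
```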

Appendix A: Proof of Theorem 1

Proof  Denote $C = (c_{ij})$, $X = (x_{ij})$, and $CX^T = (d_{ij})$. The trace of $CX^T$ is
$$\mathrm{Tr}(CX^T) = \sum_{t=1}^r d_{tt} = \sum_{t=1}^r\sum_{k=1}^s c_{tk}x_{tk}.$$
From the above, we easily obtain its partial derivatives with respect to the entries of $X$:
$$\frac{\partial}{\partial x_{ij}}\mathrm{Tr}(CX^T) = c_{ij}.$$
This establishes (9).
To prove (10), we have
$$\begin{aligned}
\frac{\partial}{\partial X}\mathrm{Tr}(XCX^T)
 &= \left.\frac{\partial}{\partial X}\mathrm{Tr}(XCY^T)\right|_{Y=X} + \left.\frac{\partial}{\partial X}\mathrm{Tr}(YCX^T)\right|_{Y=X} \\
 &= \left.\frac{\partial}{\partial X}\mathrm{Tr}(YC^TX^T)\right|_{Y=X} + \left.YC\right|_{Y=X} && \text{(by (9))} \\
 &= \left.YC^T\right|_{Y=X} + XC \\
 &= XC^T + XC,
\end{aligned}$$
where $Y$ stands for a constant copy of $X$ (product rule over the two appearances of $X$), and the second step uses $\mathrm{Tr}(XCY^T) = \mathrm{Tr}(YC^TX^T)$ together with (9).
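For the skeptical reader, here is a finite-difference spot check of identities (9) and (10) in NumPy; the matrix sizes are arbitrary choices for illustration.

```python
import numpy as np

# Central-difference check of the trace-derivative identities (9) and (10).
rng = np.random.default_rng(7)

def grad_fd(f, X, h=1e-6):
    """Entrywise central-difference gradient of the scalar function f at X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2.0 * h)
    return G

X = rng.normal(size=(3, 2))
C9 = rng.normal(size=(3, 2))    # same shape as X, for identity (9)
C10 = rng.normal(size=(2, 2))   # square, for identity (10)

print(np.allclose(grad_fd(lambda Z: np.trace(C9 @ Z.T), X), C9, atol=1e-4))
print(np.allclose(grad_fd(lambda Z: np.trace(Z @ C10 @ Z.T), X),
                  X @ C10 + X @ C10.T, atol=1e-4))
```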

References
[1] D. Simon. Optimal State Estimation. John Wiley & Sons, Inc., Hoboken, New Jersey, 2006.

