In this section we’re going to take a look at an important
application of orthogonal
projections to inconsistent systems of equations. Recall that a system is called inconsistent
if there are no solutions to the system.
The natural question should probably arise at this point of just why we
would care about this. Let’s take a look
at the following examples that we can use to motivate the reason for looking
into this.
The previous two examples were asking for pretty much the
same thing and in the first example we were able to answer the question while
in the second we were not able to answer the question. It is the second example that we want to look
at a little closer. Here is a graph of
the points given in this example.

We can see that these points do almost fall on a line. Without the reference line that we put into
the sketch it would not be clear that these points did not fall onto a line and
so asking the question that we did was not totally unreasonable.
Let’s further suppose that the four points that we have in
this example came from some experiment and we know for some physical reason
that the data should all lie on a straight line. However, inaccuracies in the measuring
equipment caused some (or all) of the numbers to be a little off. In light of this the question in Example 2 is
again not unreasonable and in fact we may still need to answer it in some way.
That is the point of this section. Given this set of data, can we find the
equation of a line that will as closely as possible (whatever this means…)
approximate each of the data points. Or
more generally, given an inconsistent system of equations, 
,
can we find a vector, let’s call it 
,
so that 
will be as close to b as possible (again, what ever this means…).
To answer this question let’s step back a bit and take a
look at the general situation. So, we
will suppose that we have an inconsistent system of n equations in m
unknowns, 
, so the coefficient matrix, A, will have size 
. Let’s rewrite the system a little and make
the following definition.
We will call 
the error
vector and we’ll call 
the error
since it will measure the distance between Ax and b for any vector x in 
(there are m
unknowns and so x will be in 
).
Note that we’re going to be using the standard Euclidean inner product
to compute the norm in these cases. The
least squares problem is then the following problem.
Solving this problem is actually easier than it might look
at first. The first thing that we’ll
want to do is look at a more general situation.
The following theorem will be useful in solving the least squares
problem.
Proof : For any
vector w in W we can write.
Notice that 
is a difference of vectors in W and hence must also be in W.
Likewise, 
is in fact 
,
the component of u orthogonal to W, and so is orthogonal to any vector in
W.
Therefore 
and 
are orthogonal vectors. So, by the Pythagorean Theorem we have,
Or, upon dropping the middle term,
Finally, if we have 
then we know that 
and so if we drop this term we get,
This is equivalent to,
and so we’re done.

So, just what does this theorem do for us? Well for any vector x in 
we know
that Ax will be a linear combination of the column vectors from A.
Now, let W be the subspace of 
(yes, 
since each column of A has n entries) that is
spanned by the column vectors of A. Then Ax will not only be in W (since it’s a linear combination of
the column vectors) but as we let x
range over all possible vectors in 
Ax will range over all of W.
Now, the least squares problem is asking us to find the
vector x in 
,
we’re calling it 
,
so that 
is smaller than (i.e. smaller norm) than all other possible values of 
,
i.e. 
. If we plug in for the definition of the
errors we arrive at.
With the least squares problem we are looking for the
closest that we can get Ax to b. However, this is exactly
the type of situation that Theorem 1 is telling us how to solve. The Ax range over all possible vectors in W and we want the one that is closed to
some vector b in 
. Theorem 1 tells us that the one that we’re
after is,
Of course we are actually after 
and not 
but this does give us one way to find 
. We could first compute 
and then solve 
for 
and we’d have the solution that we’re
after. There is however a better way of
doing this.
Before we give that theorem however, we’ll need a quick
fact.
Proof : From Theorem 8 in the
Fundamental Subspaces section we know that if 
has only the trivial solution then 
will be an invertible matrix. So, let’s suppose that 
. This tells us that Ax is in the null space
of 
,
but we also know that Ax is in the column space of A.
Theorem 7 from the
section on Inner Product Spaces tells us that these two spaces are orthogonal
complements and Theorem 6
from the same section tells us that the only vector in common to both must be
the zero vector and so we know that 
. If 
,

,
… , 
are the columns of A then we know that Ax
can be written as,
Then using 
we also know that,
However, since the columns of A are linearly independent this equations can only have the trivial
solution, 
.
Therefore 
has only the trivial solution and so 
is an invertible matrix.

The following theorem will now give us a better method for
finding the least squares solution to a system of equations.
Proof :
Let’s suppose that 
is a least squares solution and so,
Now, let’s consider,
However as pointed out in the proof of Theorem 1 we know
that 
is in the orthogonal complement of W.
Next, W is the column space of
A and by Theorem 7 from the section on
Inner Product Spaces we know that the orthogonal complement of the column space
of A is in fact the null space of 
and so, 
must be in the null space of 
.
So, we must then have,
Or, with a little rewriting we arrive at,
and so we see that 
must also be a solution to the normal system
of equations.
For the second part we don’t have much to do. If the columns of A are linearly independent then 
is invertible by Theorem 2 above. However, by Theorem 8 in the
Fundamental Subspaces section this means that 
has a unique solution.
To find the unique solution we just need to multiply both
sides by the inverse of 
.

So, to find a least squares solution to 
all we need to do is solve the normal system
of equations,
and we will have a least squares solution.
Now we should work a couple of examples. We’ll start with Example 2 from above.
|
Example 3 Use
least squares to find the equation of the line that will best approximate the
points  ,
 ,
 and  .
Solution
The system of equations that we need to solve from Example
2 is,

So, we
have,

The normal
system that we need to solve is then,

This is a
fairly simple system to solve and upon doing so we get,

So, the
line that best approximates all the points above is given by,

The sketch of the line and points after Example 2 above
shows this line in relation to the points.
|
|
Example 4 Find
the least squares solution to the following system of equations.

Solution
Okay there
really isn’t much to do here other than run through the formula. Here are the various matrices that we’ll
need here.


The normal
system if then,

This system
is a little messier to solve than the previous example, but upon solving we
get,

In vector
form the least squares solution is then,

|
We need to address one more issue before we move on to the
next section. When we opened this
discussion up we said that we were after a solution, denoted 
,
so that 
will be as close to b as possible in some way.
We then defined,
and stated that what we meant by as close to b as possible was that we wanted to
find the 
for which,
for all 
.
Okay, this is all fine in terms of mathematically defining
what we mean by “as close as possible”, but in practical terms just what are we
asking for here? Let’s go back to
Example 3. For this example the general
formula for 
is,
So, the components of the error vector, 
,
each measure just how close each possible choice of m and b will get us to
the exact answer (which is given by the components of b).
We can also think about this in terms of the equation of the
line. We’ve been given a set of points 
and we want to determine an m and a b so that when we plug 
,
the x coordinate or our point, into 
the error,

is as small as possible (in some way that we’re trying to
figure out here) for all the points that we’ve been given. Then if we plug in the points that we’ve been
given we’ll see that this formula is nothing more than the components of the
error vector.
Now, in the case of our example we were looking for,
so that,
is as small as possible, or in other words is smaller than
all other possible choices of x.
We can now answer just what we mean by “as small as
possible”. First, let’s compute the
following,
The least squares solution, 
,
will be the value of x for which,
and hence the name “least squares”. The solution we’re after is the value that
will give the least value of the sum of the squares of the errors.
|
Example 5 Compute
the error for the solution from Example 3.
Solution
First, the line that we found using least squares is,

We can
compute the errors for each of the points by plugging in the given x value into this line and then taking
the difference of the result form the equation and the known y value. Here are the error computations for each of
the four points in Example 3.

On a side
note, we could have just as easily computed these by doing the following
matrix work.

The square
of the error and the error is then,

|
Now, according to our discussion above this means that if we
choose any other value of m and b and compute the error we will arrive
at a value that is larger than 8.0125.