Metrics

There is more than one way to define the norm || u ||, and these correspond to different ways of measuring distance, i.e., to different metrics. Two of the most common are

Euclidean metric: || u || = sqrt( u₁² + u₂² + ... + u_d² )

Manhattan (or taxicab) metric: || u || = |u₁| + |u₂| + ... + |u_d|

In our template-matching example of classifying characters by counting the number of disagreements, we were implicitly using a Manhattan metric.* For the rest of these notes, we will use either the Euclidean distance or something called the Mahalanobis distance. We will see that

Contours of constant Euclidean distance are circles (or spheres)

Contours of constant Manhattan distance are squares (or boxes)

Contours of constant Mahalanobis distance are ellipses (or ellipsoids)

Covariance

The covariance of two features measures their tendency to vary together, i.e., to co-vary. Where the variance is the average of the squared deviation of a feature from its mean, the covariance is the average of the products of the deviations of feature values from their means. To be more precise, consider Feature i and Feature j. Let { x(1,i), x(2,i), ... , x(n,i) } be a set of n examples of Feature i, and let { x(1,j), x(2,j), ... , x(n,j) } be a corresponding set of n examples of Feature j. (That is, x(k,i) and x(k,j) are features of the same pattern, Pattern k.) Similarly, let m(i) be the mean of Feature i, and m(j) be the mean of Feature j. Then the covariance of Feature i and Feature j is defined by
c(i,j) = { [ x(1,i) - m(i) ] [ x(1,j) - m(j) ] + ... + [ x(n,i) - m(i) ] [ x(n,j) - m(j) ] } / ( n - 1 ) .
The covariance has several important properties:

If Feature i and Feature j tend to increase together, then c(i,j) > 0

If Feature i tends to decrease when Feature j increases, then c(i,j) < 0

If Feature i and Feature j are independent, then c(i,j) = 0

| c(i,j) | <= s(i) s(j), where s(i) is the standard deviation of Feature i

c(i,i) = s(i)² = v(i)

Thus, the covariance c(i,j) is a number between - s(i) s(j) and + s(i) s(j) that measures the dependence between Feature i and Feature j, with c(i,j) = 0 if there is no dependence. The correspondence between the covariance and the shape of the data cluster is illustrated below.

Covariance Matrix

All of the covariances c(i,j) can be collected together into a covariance matrix C:

This matrix provides us with a way to measure distance that is invariant to linear transformations of the data. Suppose that we start with a d-dimensional feature vector x that has a mean vector m_x and a covariance matrix C_x. If we use the d-by-d matrix A to transform x into y through
y = A x ,
it is not hard to show that the mean vector for y is given by
m_y = A m_x,
and the covariance matrix for y is given by
C_y = A C_x A' .
Suppose now that we want to measure the distance from x to m_x, or from y to m_y. We could, of course, use the Euclidean norm, but it would be very unusual if the Euclidean distance from x to m_x turned out to be the same as the Euclidean distance from y to m_y. (Geometrically, that would happen only if A happened to correspond to a rotation or a reflection, which is not very interesting.) What we want to do is to normalize the distance, much like we did when we defined the standardized distance for a single feature. The question is: What is the matrix generalization of the scalar expression

The answer turns out to be

.
If you know some linear algebra, you should be able to prove that this expression is invariant to any nonsingular linear transformation. That is, if you substitute y = A x and use the formulas above for m_y and C_y, you will get the very same numerical value for r, no matter what the matrix A is. Now, suppose there is a feature space in which the clusters are spherical and the Euclidean metric provides the right way to measure the distance from y to m_y. In that space, the covariance matrix is the identity matrix, and r is exactly the Euclidean distance from y to m_y. But since we can get to that space from the x space through a linear transformation, and since r is invariant to linear transformation, we can equally well compute r directly from

Invariants

We can use Invariants to recognise simple objects:

They provide a simple means of comparison.
They can provide position and orientation information (i.e. Moments etc.)
Too simplistic for some applications
Do not give a unique means of identification.

Invariant Measures

Several of the statistical measures we have met are invariant measures, which is to say that the value of the measure does not vary with, for example, the position of the region in the image, or perhaps its orientation or scale.
Thus, while the centre of gravity of a region obviously varies with its position, its area does not.
While area does vary with scale (and thus closeness of the camera to the object, for example), compactness as defined above is invariant with respect to scale as well as position and orientation. Invariant measures can be quite useful in recognising objects.
Other invariant measures we have met before are:

Moments of inertia are invariant with respect to scaling, rotation and translation.
Eccentricity is invariant with respect to scaling, rotation and translation.
Fourier Transforms are rotation invariant. We can use the Fourier transform to compute Fourier descriptors of an object which are invariant with respect to position and orientation.

Linear Discriminants

Recall that when we use a minimum-distance classifier to classify a feature vector x, we measure the distance from x to the templates m₁, m₂, ..., m_c and assign x to the class of the nearest template. Using the inner product to express the Euclidean distance from x to m_k, we can write

Notice that the x'x term is the same for every class, i.e., for every k. To find the template m_k that minimizes || x - m_k ||, it is sufficient to find the m_k that maximizes the bracketed expression, m_k' x - 0.5 m_k' m_k. Let us define the linear discriminant function g(x) by
g(x) = m' x - 0.5 || m ||² .
Then we can say that a minimum-Euclidean-distance clasifier classifies an input feature vector x by computing c linear discriminant functions g₁(x), g₂(x), ... , g_c(x) and assigning x to the class corresponding to the maximum discriminant function. We can also think of the linear discriminant functions as measuring the correlation between x and m_k, with the addition of a correction for the "template energy" represented by || m_k ||². With this correction included, a minimum-Euclidean-distance classifier is equivalent to a maximum-correlation classifier.