University of Ulster, Magee College.
BSc. Applied Computing (1220), Year 4.
Course: Image Processing (AC460). Autumn Term 1995.
Lecturer: J.G. Campbell.   Date: / /95

8. Pattern Recognition.
-----------------------

8.1 Introduction.
-----------------

8.1.1 Motivation.
----------------

Much of our interaction with our environment requires recognition of
'things': shapes in a scene (text characters, faces, animals, plants,
etc.), sounds, smells, etc. If we are to automate certain tasks, clearly
we must give machines some capability to recognise. And, obviously, from
the point of view of this module we must pay attention to automatic
recognition of patterns in images. Indeed, the term 'pattern recognition'
is strongly linked to image processing, particularly for historical
reasons: the first studies in so-called pattern recognition were aimed at
optical character recognition.

However, the allusion to human recognition of 'things', in the initial
paragraph, is merely by way of motivation; usually, it will not be
helpful to try to imitate the human recognition process -- even if we
claim to know how it is done! and no matter how much we would like to
emulate its marvellous abilities.

On the other hand, pattern recognition encompasses a very broad area of
study to do with automatic decision making. Typically, we have a
collection of data about a situation; completely generally, we can assume
that these data come as a set of p numbers, {x1, x2, ... xp}; usually,
they will be arranged as a tuple or vector, x = (x1, x2, ... xp)T.
Extreme examples are: the decision whether a burglar alarm state is
(intruder, no intruder) -- based on a set of radar, acoustic, and
electrical measurements; whether the state of a share is (sell, no sell,
buy) -- based on the current market state; or the decision by a bank on
whether to loan money to an individual -- based on certain data gathered
in an interview. Hence pattern recognition principles are applicable to a
wide range of problems, from electronic detection to decision support.

From the point of view of this chapter, and that of most pattern
recognition theory, we define / summarise a pattern recognition system
using the block diagram in Figure 8.1-1.

                 +---------------------+
        x ------>| Pattern Recognition |------> w
                 |       System        |
                 +---------------------+

        Input: data vector x          Output: class label, w

   Figure 8.1-1 A General Pattern Recognition System
   -------------------------------------------------

The input to the system is a vector, x = (x1, x2, ... xp)T; the output is
a label, w, taken from a set of possible labels {w1, w2, ..., wC}. In the
case of OCR (optical character recognition) of upper-case alphanumeric
characters used in post-codes, we have w = {w1, w2, ... w36}, i.e. 36
classes, one corresponding to each of the possible text characters. In
the case of a burglar alarm, or of a credit-worthiness evaluation system,
we have just two classes -- alarm or quiet, loan or don't loan.

Because it is deciding / selecting to which of a number of classes the
vector x belongs, a pattern recognition system is often called a
CLASSIFIER -- or a pattern classification system.

For the purposes of most pattern recognition theory, a PATTERN is merely
an ordered collection of numbers (just reiterating the comments of a few
paragraphs back); this abstraction may seem a bit simple, but it is
amazingly powerful and widely applicable.
In addition, we will focus strongly on binary classification (two
classes); although this may seem limiting, it is not, for any problem
involving, in general, C classes can in fact be reduced to a succession
of two-class problems.

8.1.2 Features & Feature Extraction.
------------------------------------

In the scheme described above, our p numbers could be simply 'raw'
measurements -- from the burglar alarm sensors, or bank account numbers
for the loan approval. Quite often it is useful to apply some 'problem
dependent' processing to the 'raw' data -- before submitting them to the
decision mechanism; in fact, what we try to do is to derive some data
(another vector) that are sufficient to discriminate (classify) patterns,
but eliminate all superfluous and irrelevant details (e.g. noise). This
process is called FEATURE EXTRACTION. The crucial principles behind
feature extraction are:

   1. Descriptive and discriminating feature(s).
   2. As few as possible of them -- leading to a simpler classifier.

Figure 8.1-2 gives an expanded summary of a pattern recognition system,
showing the intermediate stage of feature extraction. Actually, an
earlier 'pre-processing' step may precede feature extraction, e.g.
filtering out noise.

                      +------------+           +------------+
     Observation      |  Feature   |  Feature  |            |  Decision
     ---------------->| Extraction |---------->| Classifier |---------->
     vector x'        |            |  vector x |            |     w
                      +------------+           +------------+

   Figure 8.1-2 Pattern Recognition System
   ---------------------------------------

The observation vector (x') is the input; the final output is the
decision (w). In this setting, the classifier can be made more general,
and the feature extraction made to hide the problem specific matters.
Hence, we can have a classifier algorithm that can be applied to many
different problems -- all we need to do is change the parameters.

8.1.3 Classifier.
-----------------

As already mentioned, the job of the classifier is to identify, on the
basis of the components of the feature vector (x), what class x belongs
to. This might involve computing the 'similarity measure' between x and a
number of stored 'perfect' / prototype vectors, one or more for each
class w, and choosing the class with maximum similarity. Another common
scheme is to compute the mean (average) vector for each class, mi,
i = 1..C, and to propose these as the single prototype vector for the
classes. If we equate 'closeness' with similarity, we can derive a simple
but effective classifier which measures the distance (see chapter 3.2,
and later in this chapter) between a pattern, x, and each of the mi:

   di = |x - mi|        (note x and mi are vectors)

and decides class on the basis of MINIMUM di. Such a classifier is called
NEAREST MEAN, and though simple in concept, it has remarkably wide
applicability.

Mathematically speaking, the feature vectors form a FEATURE SPACE; and,
formally, the job of the classifier is to partition this space into
regions, each region corresponding to a class.

These concepts are best illustrated by a numerical example. Assume that
we are trying to discriminate between penny (class 0) and 20p coins
(class 1); we have two measurements: weight, x1, and brightness, x2.
Assume we have taken some measurements, and these are tabulated below.

   ----------------------------------------
   class (y)   weight, x1   brightness, x2
   ----------------------------------------
       0          0.40           1.50
       0          1.00           0.50
       0          1.00           1.50
       0          1.00           2.50
       0          1.60           1.50
       1          1.40           3.00
       1          2.00           2.00
       1          2.00           3.00
       1          2.00           4.00
       1          2.60           3.00
   ----------------------------------------

   Table 8.1-1 Measured Patterns.
   ------------------------------
To emphasise the idea of a feature space, the data are plotted in Figure
8.1-3. Notice how the situation is clarified, compared with the table;
feature space diagrams are important for gaining insight into such
problems. On the other hand, when the dimension of x, p, is greater than
two, we have problems visualising it.

   [Figure 8.1-3 Feature Space Diagram: the class 0 points plotted around
   their mean m0 = (1.0, 1.5); the class 1 points around their mean
   m1 = (2.0, 3.0); a possible linear boundary (* * *) separating the two
   clusters; the patterns of unknown class, xa = (3, 2) and xb = (1, 1),
   also marked.]

The 20ps are heavier than the 1ps, but errors (noise) in the weighing
machine make this unreliable on its own, i.e. you cannot just apply a
threshold:

   x1 >  1.5  =>  class = 1  (20p)
   x1 <= 1.5  =>  class = 0  ( 1p)

There would be an overlap, and some errors. Likewise brightness, on its
own, is not a reliable discriminator.

Now, we can test a number of classification rules on two patterns whose
classes are unknown, xa = (x1=3, x2=2) and xb = (1, 1).

(a) Nearest neighbour. Find the 'training' / prototype pattern in the
    table which is closest to the pattern, and take its class; thus
    xa -> class 1, and xb -> class 0.

(b) Nearest mean. The class means are m0 = (1.0, 1.5), m1 = (2.0, 3.0).
    Find the class mean closest to the pattern, and take its class; thus
    xa -> class 1, and xb -> class 0.

(c) Linear partition. Draw a straight line between the classes (in three
    dimensions it is a plane, in general p dimensions it is a
    hyperplane); one side class 0, the other class 1. We will find that
    such a rule can be expressed as:

       a0 + a1.x1 + a2.x2 >  0   class 1
       a0 + a1.x1 + a2.x2 <= 0   class 0
       a0 + a1.x1 + a2.x2 =  0,  the boundary.

8.1.5 Training & Supervised Classification.
-------------------------------------------

The classifier rules we have described above are all SUPERVISED
classifiers, and the data in the table correspond to TRAINING data.
Hence, we can have a subdivision of classifiers into supervised and
UNSUPERVISED, i.e. according to whether or not the classifier rule is
based on training data. The clustering / segmentation covered in Chapter
7 is unsupervised - the algorithm segments the image with no prior
knowledge of the problem; i.e. in the example we did, the algorithm had
not been 'trained' to recognise water pixels and land pixels. On the
other hand, if the algorithm is SUPERVISED, like all those covered in
this chapter, it will have already been given examples of water and
land, or 1ps and 20ps, i.e. TRAINED. In this chapter, we will say no more
about unsupervised classification - see chapter 7.

8.1.6 Statistical Classification.
---------------------------------

In the example of the 1ps and 20ps, we have made the pattern vectors lie
in a nice symmetric distribution centred on the means. Usually, in a
practical situation, the training data will form a cloud of points. And,
maybe, there will be an overlap, i.e. we have to decide a 'best'
boundary, whose errors are a minimum, yet not zero. Here, we can usually
develop decision rules based on statistical / probabilistic criteria.
Classification is based on some decision theoretic criterion such as
maximum likelihood. Note: statistical decision theory has a long history
-- from before that of computers, starting around the early 1900s.
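The two distance-based rules above are easy to verify by direct
computation. The following is a minimal sketch, in Python, of rules (a)
and (b) applied to the data of Table 8.1-1; it is not part of the
original notes, and the names (train, classify_nn, classify_nm) are
illustrative only.

   import math

   # Training data from Table 8.1-1: (class, weight x1, brightness x2)
   train = [(0, 0.40, 1.50), (0, 1.00, 0.50), (0, 1.00, 1.50),
            (0, 1.00, 2.50), (0, 1.60, 1.50),
            (1, 1.40, 3.00), (1, 2.00, 2.00), (1, 2.00, 3.00),
            (1, 2.00, 4.00), (1, 2.60, 3.00)]

   def dist(a, b):
       # Euclidean distance between two 2-d patterns
       return math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)

   def classify_nn(x):
       # (a) nearest neighbour: class of the closest training pattern
       return min(train, key=lambda t: dist(x, t[1:]))[0]

   def classify_nm(x):
       # (b) nearest mean: class of the closest class mean
       means = {}
       for c in (0, 1):
           pts = [t[1:] for t in train if t[0] == c]
           means[c] = tuple(sum(v) / len(pts) for v in zip(*pts))
       return min(means, key=lambda c: dist(x, means[c]))

   for x in [(3.0, 2.0), (1.0, 1.0)]:            # xa and xb
       print(x, classify_nn(x), classify_nm(x))  # both rules: xa -> 1, xb -> 0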
In statistical pattern recognition, we usually adopt the model:

   pattern vector = perfect-pattern + noise;

thus, for each class we have a cloud of points centred on the
'perfect-pattern' vector. The class means are often good estimates of the
class perfect-vectors. The spread of each class 'cloud' / cluster will
depend on the noise amplitude. Obviously, the degree to which the classes
are separable depends on two factors: (a) the distance between the
'perfect-patterns', and (b) the cluster spread.

The major characteristic of the statistical pattern recognition approach
is that it is 'low-level'. The image (pattern) is merely an array of
numbers; the only model applied is that of the random process. There
would be no difference in the process whether the data represented
scanned text characters, or parts on a conveyer belt, or faces, or,
indeed, if all three were mixed together. Only the data, and the
estimated parameters, would change.

8.1.7 Feature Vector -- Update.
-------------------------------

The components of a pattern vector are commonly called features, thus the
term feature vector introduced above. Other terms are attribute,
characteristic. Often all patterns are called feature vectors, despite
the literal unsuitability of the term if the pattern is composed of raw
data.

It can be useful to classify feature extractors according to whether they
are high- or low-level. A typical low-level feature extractor is a
transformation Rp' -> Rp which, presumably, either enhances the
separability of the classes, or, at least, reduces the dimensionality of
the data (p < p') to the extent that the recognition task is more
computationally tractable, or simply compresses the data (see Chapter 6
-- many data compression schemes are used as feature extractors, and
vice-versa). Examples of low-level feature extractors are:

   - Fourier power spectrum of a signal - appropriate if frequency
     content is a good discriminator; additionally, it has the property
     of shift invariance - it doesn't matter what co-ordinates the object
     appears at in the image,
   - Karhunen-Loeve transform - transforms the data to a space in which
     the features are ordered according to their variance / information
     content, see Chapter 6.

At a higher level, for example in image shape recognition, we could have
a vector composed of: length, width, circumference. Such features are
more in keeping with the everyday usage of the term feature. A recent
paper on face recognition has classified recognition techniques into:

   (a) geometric feature based, i.e. using high-level features such as
       distance between eyes, width of head at ears, length of face,
       etc.,
   (b) template based, where we use the whole of the face image, possibly
       Fourier transformed, or such-like.

8.1.8 Distance.
---------------

In Chapter 8.1.6, we mentioned that statistical classifiers may use
maximum likelihood (probability) as a criterion. In a wide range of
cases, likelihood corresponds to 'closeness' to the class cluster, i.e.
closeness to the centre / mean, or closeness to individual points. Hence,
DISTANCE is an important criterion / metric.

8.1.9 Other Classification Paradigms.
-------------------------------------

8.1.9.1 Introduction.
--------------------

Just to ensure that you are aware of the meanings of some terms that you
may come across in textbooks, we will try to introduce some other
paradigms -- though, for the most part, you can ignore them.

8.1.9.2 Structural Pattern Recognition.
---------------------------------------
Because in statistical pattern recognition patterns are viewed as
vectors, the information carried by the pattern is no more than its
position in pattern space; nevertheless, there may exist structural
relationships between pattern components arising from shape, for example.
Thus, STRUCTURAL pattern recognition has the richer model of a pattern as
a structured collection of symbols. Here the distance metric is no longer
the sole basis for recognition (nor is it really suitable); simple
methods such as string matching may be applied, or syntactic / linguistic
methods which are based on the model of pattern generation by one of a
family of grammars (classes); the recognition task usually involves
parsing to find which grammar generated the unidentified pattern. In
relational pattern matching, the symbols are stored in a relational graph
structure and the recognition process involves graph matching.

Usually, there exists 'structure' in the pattern (e.g., in an image,
shapes formed by conjunctions of specific grey level values in adjacent
regions, or, in a signal, peaks, troughs, valleys, etc.) which is not
extracted when the pattern is treated as a mere p-dimensional vector;
obviously, if the structure is the major discriminatory characteristic,
and the position in p-space is relatively insignificant, this form of
recognition will not work well.

As an example, the letter 'A' might be described, not as a vector
containing invariant moments - or grey levels - but as a SENTENCE:

   stroke-diagonal1  stroke-horizontal  stroke-diagonal2

i.e. a sentence constructed of three PRIMITIVES. Classification is then
done by PARSING the sentence. Note: the primitives would probably be
obtained by conventional pattern recognition. Syntactic pattern
recognition is difficult to understand and apply - this may explain the
comparative lack of interest in it.

8.1.9.3 Knowledge-based Pattern Recognition.
--------------------------------------------

Even with the inclusion of structure, the knowledge (pattern)
representation may be low-level; the identification of the structure is
probably done automatically and abstractly. On the other hand, if we
bring human knowledge to bear, we can call the approach
'KNOWLEDGE-based'. A possible justification of this step is to say that
we are performing interpretation of the pattern (image) rather than mere
recognition of low-level features.

Knowledge-based systems form a branch of artificial intelligence; to some
extent they represent a milder form of 'expert system' - with, perhaps,
the aims slightly lowered. Knowledge-based systems try to automate the
sort of complex decision task that confronts, for example, a medical
doctor during diagnosis. No two cases are the same, and different cases
may carry different amounts of evidence. In essence, the doctor makes a
decision based on a large number of variables, but some variables may be
unknown for some patients, some may pale into insignificance given
certain values for others, etc. This process is most difficult to codify.
However, there are sufficient advantages for us to try: if we could
codify the expertise of a specialist, waiting lists could be shortened,
the expertise could be distributed more easily, the expertise would not
die with the specialist, etc.
There are four major parts in a knowledge based system:

   - KNOWLEDGE ELICITATION: this is the extraction of knowledge from the
     expert; it may be done by person to person interview, by
     questionnaire, or by specialised computer program,
   - KNOWLEDGE REPRESENTATION: we must code the knowledge in a manner
     that allows it to be stored and retrieved on a computer,
   - KNOWLEDGE DATABASE: where the knowledge is stored (using the
     'representation' code mentioned above),
   - INFERENCE ENGINE: this takes data and questions from a user and
     provides answers, and/or updates the knowledge database.

Figure 8.1-4 depicts a possible organisation and operation of a knowledge
based system. Actually, a well designed database with a good database
management system, coupled with a query language that is usable by
non-computing people, goes a long way to fulfilling the requirements of a
knowledge based system. Also, some of the pattern recognition systems we
mention could be called knowledge based - after all, they store knowledge
(the training data or some summary of it) and make inferences (based on
the measured data or feature vector); feature extraction is very similar,
in principle, to knowledge representation. Furthermore, we note that
neural networks show much promise as knowledge-based systems.

   [Figure 8.1-4 Knowledge Based System: the expert supplies examples and
   raw knowledge via KNOWLEDGE ELICITATION; this feeds KNOWLEDGE
   REPRESENTATION, which stores knowledge in, and retrieves it from, the
   DATABASE; the INFERENCE ENGINE draws on the representation to answer
   users' questions and to add new data / answers to the database.]

8.1.9.4 Neural Networks.
------------------------

Automated pattern recognition was the original motivation for NEURAL
NETWORKs. Here, we simply replace the 'pattern recognition' black box in
Figure 8.1-2 with a neural network. In essence, neural network pattern
recognition is closely related to statistical pattern recognition. Neural
networks are covered in some detail in Chapter 9.

8.1.10 Summary and Conclusions.
-------------------------------

A pattern is simply an ordered collection of numbers -- usually expressed
in the form of a pattern vector. The components of a pattern vector are
commonly called features -- even if the pattern consists merely of data
straight from a sensor. However, the recognition problem can usually be
simplified by extracting features from the raw data, usually reducing the
size of the feature vector, and removing irrelevancies. Use of an
appropriate & effective feature extractor can allow us to apply a quite
general classifier; the feature extractor solves the application specific
problems, hence we arrive at an abstract pattern / feature vector and the
classifier changes from application to application only by its
parameters.

It is useful to view feature vectors as forming a p-dimensional space;
obviously for p greater than two this can be hard to visualise exactly,
nevertheless the 'points in space' interpretation is strongly
recommended. With the concept of feature space comes the concept of
distance as a criterion for measuring the similarity between patterns.

8.2 A Simple but Practical Problem.
----------------------------------

8.2.1 Introduction.
------------------

So, what is a typical pattern recognition problem in computer-based
picture processing? The following problem looks deceptively easy.

8.2.2 Naive Character Recognition.
----------------------------------
Figure 8.2-1 shows a letter 'C' in a small (3 x 3) part of a digital
image (a digital picture). A digital picture is represented by brightness
numbers (pixels) for each picture point. [It is possible, in a very
simple model of visual sensing, to imagine each of these nine cells
corresponding to a light sensitive cell in the retina at the back of your
eye.]

   Pixels:    1,1    1,2    1,3
            +------+------+------+
            | **** | **** | **** |
            +------+------+------+
        2,1 | **** | 2,2  | 2,3  |
            | **** |      |      |
            +------+------+------+
            | **** | **** | **** |
            +------+------+------+
              3,1    3,2    3,3

   Figure 8.2-1 A Letter C
   -----------------------

Assume that the picture has been quantised -- into ink (1) and background
(0). How can we mechanise the process of recognition of the 'C'?
Certainly if we can do it for one letter, we can do it for all the others
- by looping over all the characters. The naive approach tackles the
problem almost like that of a lookup table:

1. Pixel by pixel comparison: compare the input (candidate) image, pixel
   for pixel, with each perfect letter, and select the character that
   fits exactly:

      for j = 1..nletters
         match = true
         for r = 0..NR-1 do
            for c = 0..NC-1 do
               if(f[r,c] != X[j][r,c])   /* r,c-th pixel of jth letter */
                  match = false
                  exit for r             /* just ONE mismatch causes a complete FAIL */
               endif
            endfor c
         endfor r
         if(match == true) return j      /* j is the label */
      endfor j
      return NULL

   As we will see later, this naive approach has many shortcomings -
   especially since, in general, pairwise comparison of collections of
   numbers is usually fraught with difficulty: what if just one is wrong
   - by just a small amount? what if two are wrong, each by an even
   smaller amount? is one pixel more important than the others?

So, we're uncomfortable with comparison of individual elements of the
pattern. Can we compare the full vectors? Yes; in fact much of classical
pattern recognition is about the comparison of vectors, finding distances
between them, etc. Hence, to make this into a general pattern recognition
problem, we represent the nine pixel values as elements of a vector
(array), i.e. we collect them together in an array; assuming the
character is white-on-black and that bright (filled in with '*')
corresponds to '1', and dark to '0', the array corresponding to the 'C'
is

   x[0]=1, x[1]=1, x[2]=1, x[3]=1, x[4]=0, x[5]=0, x[6]=1, x[7]=1, x[8]=1.

Note: from now on, in the usual fashion, we will index vector elements
from 0 to p-1.

   'Feature'
   number:      0      1      2
            +------+------+------+
            | **** | **** | **** |
            +------+------+------+
          3 | **** |  4   |  5   |
            | **** |      |      |
            +------+------+------+
            | **** | **** | **** |
            +------+------+------+
               6      7      8

   Figure 8.2-2 A Letter C
   -----------------------

The letter 'T' would give a different observation vector:

   'T':  1,1,1,  0,1,0,  0,1,0
   'O':  1,1,1,  1,0,1,  1,1,1
   'C':  1,1,1,  1,0,0,  1,1,1
   etc...

So how is the recognition done? We could base decision making on a set of
vectors (X[j]) corresponding to 'perfect' characters ('T', 'C' etc.); let
these be denoted by X[1] - C, X[2] - T, etc. Let the input (unknown) be x
- a 9 dimensional vector; and let us assume that the components can be
real, i.e. no longer binary; let them be any real number between 0 and 1,
and subject to noise. The recognition system needs to be invariant
(tolerant) to noise. What if there is a minor change in grey level? Grey
Cs are the same as white Cs: the system needs to be amplitude invariant -
tolerant to changes in amplitude.
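Before moving on, it is worth seeing how brittle rule 1 is. The following
is a minimal Python sketch (not part of the original notes; the letter
vectors are those listed above): an exact pixel-for-pixel match fails as
soon as a single component is disturbed, even slightly.

   # Perfect letters as 9-element vectors (row by row, as in Figure 8.2-2)
   letters = {
       'T': [1, 1, 1, 0, 1, 0, 0, 1, 0],
       'O': [1, 1, 1, 1, 0, 1, 1, 1, 1],
       'C': [1, 1, 1, 1, 0, 0, 1, 1, 1],
   }

   def exact_match(x):
       # Rule 1: return the label whose template matches x pixel for pixel
       for label, template in letters.items():
           if all(xi == ti for xi, ti in zip(x, template)):
               return label
       return None        # no exact match

   print(exact_match([1, 1, 1, 1, 0, 0, 1, 1, 1]))     # 'C'
   print(exact_match([1, 1, 0.95, 1, 0, 0, 1, 1, 1]))  # None: one slightly 'grey' pixel defeats the rule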
2. Maximum correlation - template matching: compute the correlation
   (match) of x with each of the X[j][.] and choose the character with
   maximum correlation:

      maxcorr = 0
      maxlett = NULL
      for j = 1..nletters
         corr = 0
         for i = 0..p-1 do
            corr = corr + X[j][i]*x[i]
         endfor
         if(corr > maxcorr)
            maxcorr = corr
            maxlett = j
         endif
      endfor
      return maxlett

   That is, we choose the letter with maximum corr(elation) or match;
   this is described mathematically as follows:

      (8.2-2)  corrj = x . Xj

   i.e. the dot product of x and Xj:

                        p-1
      (8.2-3)  corrj  =  sum  xi * Xji
                        i=0

   the dot product of x with the jth 'perfect' letter. This is called
   TEMPLATE MATCHING because we are matching (correlating) x against each
   template (the 'perfect' X[j][.]s), and choosing the one that matches
   best. Template matching is more immune to noise - we expect the ups
   and downs of the noise to balance one another. Template matching can
   be made amplitude invariant by NORMALISING the correlation sum of
   products:

      (8.2-4)  corrj = x.X[j] / sqrt(x.x * X[j].X[j])

   where x and X[j] are p dimensional vectors. This is tantamount to
   equalising the total 'brightness' in each character.

   We are still far from having a perfect character recogniser: what
   happens if the character moves slightly up, or down, left or right?

3. Shifting correlation: use two-dimensional correlation, as described in
   Chapter 4, i.e. move the character successively across and down, find
   the position of maximum correlation (match), then compare that maximum
   with the maxima for the other characters. But now, what about
   rotation?

4. We could put a loop around the shifting correlation, each time
   rotating the character a little. What happens if the size of the
   character changes? It's still the same character, but the observation
   vector will change radically. We require scale invariance.

5. Now, we could loop over scale factor.

By now, I hope you have realised that things have fairly well got out of
hand; and we haven't mentioned small variations in character shape. THE
MAIN POINT IS: this sort of AD HOC adding to, and changing of, algorithms
rarely ever works. We need to build a general theory which is applicable
to a wide variety of problems.

8.2.3 Invariance for Two-Dimensional Patterns.
---------------------------------------------

From the foregoing discussion, we can summarise some of the requirements
of a system for recognising two-dimensional shapes in images:

   - noise tolerance
   - amplitude invariance
   - shift invariance
   - rotation invariance
   - scale invariance.

Note that, in the examples given above, 'C' differs from 'O' by only one
element - so, naturally, 'C' is likely to be confused with 'O' and
vice-versa. There is very little we can do about that - except to change
the character shapes! - this is why they use funny shaped characters on
cheques.

8.2.4 Feature Extraction -- Another Update.
-------------------------------------------

Let us say we have a simple two-class subset of the previous character
recognition problem: 'C' versus 'O'. Clearly, the only attribute of
discriminating value is the one in which they differ -- number 5. Hence,
a sensible feature extractor extracts this, call it y; we end up with a
scalar (one-dimensional) pattern, and the classifier becomes a simple
thresholding:

   y >  0.5   class O
   y <= 0.5   class C

8.3 Classification Rules.
-------------------------

8.3.1 Introduction
------------------

In this section we discuss the common classification rules used with
feature-based classifiers. In summary, roughly speaking, a supervised
classifier involves:

   - TRAINING: gathering and storing example feature vectors - or some
     summary of them,
   - OPERATION: extracting features, and classifying, i.e. by computing
     similarity measures, and either finding the maximum, or applying
     some sort of thresholding.

When developing a classifier, we distinguish between TRAINING data and
TEST data:

   - training data are used to 'train' the classifier - i.e. set its
     parameters,
   - test data are used to check if the trained classifier works,
     especially if it can generalise to new / unseen data.

Although it is normal enough to do an initial test of a classifier using
the same training data, in general it is a dangerous practice - we must
check if the classifier can generalise to data it has not seen before.

Example Data.
------------

The following 'toy' data will be used to exemplify some of the
algorithms. They are two-dimensional, two-class. If you run DataLab batch
file 'ip2dcl', that will set up the configuration and input the data for
you, and show some histograms and scatter plots. The data are in file
dl\dat\toy4.asc. Here they are - DataLab output:

   Type = REAL
   Bounds = Rows: 0 - 0, Cols: 0 - 9, Dims: 0 - 1.
   Number of classes: 2
   The 2 labels are: 1 2
   Class frequencies: 5 5
   Prior probabilities: 0.500000 0.500000

   Class means
   -----------
   Class 1: 1.000000 1.500000
   Class 2: 2.000000 3.000000

   DataLab IMage Data
   ------------------
   [0, 0]
   label cllab     x0    x1
     1     1   :  0.40  1.50
     1     1   :  1.00  0.50
     1     1   :  1.00  1.50
     1     1   :  1.00  2.50
     1     1   :  1.60  1.50
     2     2   :  1.40  3.00
     2     2   :  2.00  2.00
     2     2   :  2.00  3.00
     2     2   :  2.00  4.00
     2     2   :  2.60  3.00

Here is another set of data, which would be typical of data from the same
classes, but with different noise. These data are in file
'dl\dat\toy41.asc'. (These data are TEST data.)

   DataLab I-01
   Type = REAL
   Bounds = Rows: 0 - 0, Cols: 0 - 9, Dims: 0 - 1.
   Number of classes: 2
   The 2 labels are: 1 2
   Class frequencies: 5 5
   Prior probabilities: 0.500000 0.500000
   [0, 0]
     1     1   :  0.80  2.00
     1     1   :  1.20  1.80
     1     1   :  1.40  2.20
     1     2   :  1.60  2.40
     1     2   :  1.80  2.60
     2     2   :  1.60  2.60
     2     2   :  2.20  3.20
     2     2   :  2.40  3.40
     2     2   :  2.60  3.60
     2     2   :  2.80  3.80

8.3.2 Similarity Measures Between Vectors.
-----------------------------------------

Two possibilities are CORRELATION and DISTANCE; these are closely
related, and in most cases give identical results. See Chapter 3,
sections 3.2.15 - 18 for the mathematical foundation.

1. Correlation. Correlation is a very natural and obvious form of
   similarity measure. It is sometimes called 'template matching' because
   of the similarity with comparing a two-dimensional shape with a
   template. The correlation between two vectors is given by their dot
   product (recall section 8.1). It is usually necessary to normalise the
   vectors, which is the purpose of the denominator in eqn. 8.3-1:

      (8.3-1)  cj = x.Xj / sqrt(x.x * Xj.Xj)

   Here cj gives the correlation (or match) between our unknown x and Xj,
   the jth 'perfect' vector. The dot product can be written as a
   summation (see also section 8.1 for an algorithm):

                       p-1
      (8.3-2)  x.Xj  =  sum  xi * Xji,
                       i=0

   where xi is the ith component of vector x and Xji is the ith component
   of 'perfect' vector Xj (class j), for p dimensional vectors. Usually,
   classification would involve finding the Xj that has the MAXIMUM
   correlation value - the output from the classifier (the class) is then
   j. Later, we shall see that correlation rules are at the basis of most
   neural networks.
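As a concrete illustration of eqns. 8.3-1 and 8.3-2, here is a minimal
Python sketch (not part of the original notes) of normalised correlation
used as a classifier over a set of prototype vectors; the names proto and
classify_corr are illustrative only.

   import math

   def norm_corr(x, X):
       # eqn. 8.3-1: normalised correlation between x and X
       dot = sum(xi * Xi for xi, Xi in zip(x, X))              # eqn. 8.3-2
       return dot / math.sqrt(sum(xi * xi for xi in x) *
                              sum(Xi * Xi for Xi in X))

   def classify_corr(x, proto):
       # choose the prototype (class) with MAXIMUM normalised correlation
       return max(proto, key=lambda j: norm_corr(x, proto[j]))

   # the 'perfect' letters of section 8.2 as 9-element prototype vectors
   proto = {'T': [1, 1, 1, 0, 1, 0, 0, 1, 0],
            'O': [1, 1, 1, 1, 0, 1, 1, 1, 1],
            'C': [1, 1, 1, 1, 0, 0, 1, 1, 1]}

   # a noisy, dimmed 'C' still correlates most strongly with the 'C' template
   print(classify_corr([0.8, 0.7, 0.9, 0.8, 0.1, 0.0, 0.7, 0.9, 0.8], proto))  # 'C'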
2. Distance. Distance is another natural similarity measure. Points
   (feature vectors) that are close in feature space are similar; those
   far apart are dissimilar. The distance between two vectors is given by
   eqns. 8.3-3, 4 and 5:

      (8.3-3)  D(x,Xj) = |x - Xj|

                                  p-1
      (8.3-4)  De(x,Xj) = sqrt[    sum  (xi - Xji)^2 ]
                                  i=0

   Eqn. 8.3-4 gives the 'true' Euclidean distance, i.e. that which would
   be obtained using a piece of string! There are other distances, the
   most notable being the city-block (or Manhattan) distance -- which,
   for binary vectors, reduces to the Hamming distance:

                            p-1
      (8.3-5)  Dm(x,Xj)  =   sum  |xi - Xji|
                            i=0

   Why is it called Manhattan / city-block? Ans. Think of getting from
   point a to point b in New York -- you have to go up, then across, ...
   Eqn. 8.3-5 requires no multiplication, and this explains its
   popularity in the days before hardware multipliers and arithmetic
   co-processors.

   Using distance for classification involves finding the MINIMUM
   distance. Actually, maximum normalised correlation can be shown to be
   EXACTLY EQUIVALENT to minimum Euclidean distance - see Chapter 3.2.18.
   Distance is used more often than correlation. However, even when
   minimum distance is used, most writers call it template matching; for
   those unfamiliar with the similarity mentioned in the previous
   paragraph, this use of terminology can cause confusion.

8.3.3 Nearest Mean Classifier.
------------------------------

(We have already encountered a nearest mean classifier in the k-means
clustering algorithm in Chapter 7, and in Chapter 8.1.)

No matter how 'perfect' a 'pattern' may seem, it is usually inadvisable
to classify on the basis of a single archetype. It is better to use the
average (mean) of a set of examples (a sample). Thus the nearest mean
classifier. Training a nearest mean classifier involves collecting a set
of training patterns, and computing the mean vector for each class:

                             nj
      (8.3-6)  mji = (1/nj)  sum  xjik
                             k=1

   where xjik is the kth example of component i for class j.

Classification then involves computing minimum distance, with mji
substituted for Xji in eqn. 8.3-4:

                                  p-1
      (8.3-7)  De(x,mj) = sqrt[    sum  (xi - mji)^2 ]
                                  i=0

Example 8.3-1. Train a nearest mean classifier on data set 'toy4.asc',
and test it on the same data; note: the class means are class 1 =
(1.0, 1.5), class 2 = (2.0, 3.0). Here is the output, where label = true
label, and cllab = classifier label. It gets them all correct, surprise,
surprise!

   label cllab     x0    x1
     1     1   :  0.40  1.50
     1     1   :  1.00  0.50
     1     1   :  1.00  1.50
     1     1   :  1.00  2.50
     1     1   :  1.60  1.50
     2     2   :  1.40  3.00
     2     2   :  2.00  2.00
     2     2   :  2.00  3.00
     2     2   :  2.00  4.00
     2     2   :  2.60  3.00

Example 8.3-2. Train on the same data, but test on data set 'toy41.asc'.
The results are below. Note the two errors, marked (*). That is, point
(1.6, 2.4) is closer to (2.0, 3.0) than it is to (1.0, 1.5). So is
(1.8, 2.6).

   [0, 0]
     1     1   :  0.80  2.00
     1     1   :  1.20  1.80
     1     1   :  1.40  2.20
     1     2*  :  1.60  2.40
     1     2*  :  1.80  2.60
     2     2   :  1.60  2.60
     2     2   :  2.20  3.20
     2     2   :  2.40  3.40
     2     2   :  2.60  3.60
     2     2   :  2.80  3.80

Exercise 8.3-3. Verify that the nearest mean classifier will give the
results below.

   (a)  1   1   :  1.40  2.20
   (b)  1   2*  :  1.60  2.40
   (c)  1   2*  :  1.80  2.60

Ans. Use:

                                  p-1
      (8.3-7)  De(x,mj) = sqrt[    sum  (xi - mji)^2 ]
                                  i=0

(a) Take (1.4, 2.2); m1 = (1.0, 1.5), m2 = (2.0, 3.0). Let us forget
    about the sqrt(), since minimum squared distance is the same as
    minimum distance. Call the squared distance Ds().

       Ds(x,m1) = (1.4-1.0)^2 + (2.2-1.5)^2 = 0.4^2 + 0.7^2
                = 0.16 + 0.49 = 0.65
       Ds(x,m2) = (1.4-2.0)^2 + (2.2-3.0)^2 = 0.6^2 + 0.8^2
                = 0.36 + 0.64 = 1.00

    Therefore m1 is closest, so the class is 1.

(b) Take (1.60, 2.40).

       Ds(x,m1) = (1.6-1.0)^2 + (2.4-1.5)^2 = 0.6^2 + 0.9^2
                = 0.36 + 0.81 = 1.17
       Ds(x,m2) = (1.6-2.0)^2 + (2.4-3.0)^2 = 0.4^2 + 0.6^2
                = 0.16 + 0.36 = 0.52

    Therefore m2 is closest, so the class is 2.

(c) (1.80, 2.60)  (left as an exercise).
(d) Plot these data on a two-dimensional surface -- and verify from this
    feature space diagram that the results are plausible.

8.3.4 Nearest Neighbour Classifier.
-----------------------------------

If the classes occupy irregularly shaped regions in feature space, the
nearest mean classifier may not perform well, i.e. the mean does not
provide an adequate summary of the class. One solution is to store all
the example patterns - no summarising - and, in classification, go
through all the examples and choose the class of the closest example.
Although this sounds crude, the nearest neighbour classifier has very
good theoretical credentials, and performs well in practice. The major
problem is the amount of processing. If the dimensionality of the feature
vector is large (p = 50, say) then we may need a large number of
examples, and so a large number of distance computations for each
classification.

8.3.5 Condensed Nearest Neighbour Algorithm.
--------------------------------------------

(See Duda and Hart.) If the nearest neighbour rule requires excessive
computation, it is possible to 'condense' the set of example vectors to
those at the boundaries of the class regions - these are the only ones
that affect the final result.

8.3.6 k-Nearest Neighbour Classifier
------------------------------------

The performance of the nearest neighbour rule can sometimes be improved
(especially if examples are sparse) by finding the k nearest neighbours
(e.g. k = 5) and taking a vote over the classes of the k.

8.3.7 Box Classifier.
---------------------

Again, if the mean does not provide a good summary, it may be possible to
describe the class region as a 'box' surrounding the mean. The
classification rule then becomes:

   for each class j = 1..c
      INj = true
      for each dimension i = 0..p-1
         if not (lXji < xi < uXji) then INj = false
      endfor
   endfor

where lXji, uXji represent the lower and upper boundaries of the box for
class j in dimension i; x is assigned to a class whose INj remains true.
This is precisely the algorithm mentioned in example 1.4-2 (see chapter
1), albeit using less glamorous terminology. The standard deviation for
class j, in dimension i, gives an objective method for determining the
bounds - see example 1.4-3.

Example 8.3-3. Here are the data given in section 8.1, with an
appropriate 'box' drawn to enclose class 1; actually, for the case shown
it just about works, but in general these 'box' shapes can be improved on
-- see the next section. Box classifiers are easy to compute, however,
and were very popular in the days of special-purpose hardware image
processing and classification.

   [Figure 8.3-1 Feature Space Diagram: the class 0 and class 1 points of
   Figure 8.1-3, with a rectangular 'box' drawn around the class 1
   cluster, centred near m1 = (2.0, 3.0).]

8.3.8 Hypersphere Classifier.
-----------------------------

Instead of a 'box' - a hyperrectangle - we define a hypersphere, of
radius rj, surrounding the mean mj. The classification rule then becomes:
"if the distance between vector x and mj is less than rj, then the class
is j".
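The box and hypersphere rules are easily sketched in code. The following
Python fragment is not from the original notes: it builds, for each
class, a box from the per-dimension mean plus or minus a multiple of the
standard deviation (the factor k = 2 and the radius r = 1.5 are arbitrary
illustrative choices) and applies both rules to the class 1 data of
Table 8.1-1.

   import math

   def class_stats(examples):
       # per-dimension mean and standard deviation of equal-length vectors
       p = len(examples[0])
       means = [sum(e[i] for e in examples) / len(examples) for i in range(p)]
       stds = [math.sqrt(sum((e[i] - means[i])**2 for e in examples) / len(examples))
               for i in range(p)]
       return means, stds

   def in_box(x, means, stds, k=2.0):
       # box rule: inside if every component lies within mean +/- k*std
       return all(m - k*s < xi < m + k*s for xi, m, s in zip(x, means, stds))

   def in_sphere(x, means, r):
       # hypersphere rule: inside if Euclidean distance to the mean is less than r
       return math.sqrt(sum((xi - m)**2 for xi, m in zip(x, means))) < r

   class1 = [(1.4, 3.0), (2.0, 2.0), (2.0, 3.0), (2.0, 4.0), (2.6, 3.0)]  # class 1 of Table 8.1-1
   m, s = class_stats(class1)
   print(in_box((2.0, 3.0), m, s), in_sphere((2.0, 3.0), m, r=1.5))  # True True
   print(in_box((3.0, 5.0), m, s), in_sphere((3.0, 5.0), m, r=1.5))  # False False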
8.3.9 Statistical Classifier.
-----------------------------

Consider the noisy image of (say) a mark on a page shown in Figure 8.3-2.
And assume that all we want to do is classify each pixel according to
ink / no-ink (two classes).

   [Figure 8.3-2 Noisy Image of Mark on a Page: a 26 row x 52 column
   character-map display of the grey levels, in which a brighter mark
   stands out from the darker, noisy background. The original DataLab
   (LISSP) output also lists the raw grey levels of rows 11 - 13: values
   of roughly 75 - 125 at the left and right margins (background) and
   roughly 150 - 250 across the mark itself.]

A histogram / probability distribution of the image would look something
like Figure 8.3-2:

      z    p(z)
      ...
       60  *
       70  ***
       80  *****
       90  ******
      100  *****
      110  **
      120  *
      ...
      160  @
      170  @@@
      180  @@@@@
      190  @@@@@@
      200  @@@@@
      210  @@
      220  @
      ...

      * represents non-ink, pn(z);  @ represents ink, pi(z)

   Figure 8.3-2 Histogram of Grey Levels
   -------------------------------------

In Figure 8.3-2, pi(z) and pn(z) can be taken to represent the
probability, at grey level z, of ink and non-ink, respectively. This
suggests a classifier rule (z is the input pattern, i.e. a
one-dimensional feature):

   if (pi(z) > pn(z)) then ink
   else non-ink

i.e. maximum probability, or maximum likelihood. This is the basis of all
STATISTICAL PATTERN RECOGNITION. In the case under discussion, TRAINING
the classifier simply involves histogram estimation. Unfortunately,
except for very small dimensionalities, histograms are hard to measure.
Usually we must use PARAMETRIC representations of probability density.
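The ink / non-ink rule above amounts to comparing two estimated
histograms at the observed grey level. Below is a minimal Python sketch
(not part of the original notes) of such a histogram-based maximum
likelihood classifier; the training grey levels are invented for
illustration, standing in for samples gathered from labelled ink and
non-ink regions.

   def histogram(samples, nbins=26, lo=0, hi=260):
       # normalised histogram: fraction of samples falling in each bin
       counts = [0] * nbins
       width = (hi - lo) / nbins
       for z in samples:
           counts[min(int((z - lo) / width), nbins - 1)] += 1
       return [c / len(samples) for c in counts], width, lo

   # illustrative training grey levels (in practice, taken from labelled regions)
   non_ink = [62, 75, 81, 84, 88, 90, 93, 97, 101, 105, 112, 119]
   ink     = [158, 171, 178, 183, 188, 191, 196, 199, 203, 208, 214, 221]

   pn, width, lo = histogram(non_ink)
   pi, _, _ = histogram(ink)

   def classify_pixel(z):
       # maximum likelihood: compare the two histogram estimates at grey level z
       b = min(int((z - lo) / width), len(pi) - 1)
       return 'ink' if pi[b] > pn[b] else 'non-ink'

   print(classify_pixel(85), classify_pixel(200))   # non-ink ink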
Quite often, a maximum likelihood classifier turns out to be remarkably
similar in practice to a minimum distance (nearest mean) classifier; the
reason for this is that likelihood / probability is maximum at the mean,
and tapers off as we move away from the mean.

8.3.10 Bayes Classifier.
------------------------

[Readers with a strong dislike of statistics may avoid this section.]

Assume two classes, w0, w1. Assume we have the two probability densities
p0(x), p1(x); these may be denoted by p(x|w0), p(x|w1), the class
conditional probability densities of x. Another piece of information is
vital: what is the relative probability of occurrence of w0 and w1? These
are the PRIOR probabilities, P0, P1 - upper-case Ps represent priors. In
this case the 'knowledge' of the classifier is represented by the
p(x|wj), Pj; j = 0, 1.

Now if we receive a feature vector x, we want to know what is the
probability (likelihood) of each class. In other words, what is the
probability of wj given x - the POSTERIOR probability. Bayes' Law gives a
method of computing the posterior probabilities:

                                           1
   (8.3-8)  p(wj|x) = Pj*p(x|wj) / [      sum  Pk*p(x|wk) ]
                                          k=0

(each of the quantities on the right-hand side of eqn. 8.3-8 is known -
through training). In eqn. 8.3-8, the denominator of the right hand side
is merely a normalising factor - to ensure that p(wj|x) is a proper
probability - and so can be neglected in cases where we just want maximum
probability.

Now, classification becomes a matter of computing eqn. 8.3-8, and
choosing the class, j, with maximum p(wj|x).

The classifiers mentioned in the previous sections were all based on
intuitive criteria. The Bayes classifier is OPTIMUM based on an objective
criterion - the class chosen is the most probable, with the consequence
that the Bayes rule is also MINIMUM ERROR; i.e. in the long run it will
make fewer errors than ANY other classifier: this is what maximum
likelihood means.

8.3.11 Linear Partitions of Feature Space.
------------------------------------------

This section forms the basis of the link between classifiers and neural
networks. Consider eqn. 8.3-2, in the case of two classes. Actually, to
be perfectly correct, we must subtract the mean of the data, call it m,
with mi = ith component of the mean; NB. the overall mean, NOT the class
means:

                        p-1
   (8.3-2')  x.Xj   =    sum  (xi - mi)*(Xji - mi),
                        i=0

where xi is the ith component of vector x and Xji the ith component of
Xj, for p dimensional vectors. Therefore finding maximum correlation
reduces to:

             p-1                        p-1
   (8.3-9)    sum (xi-mi)*(X0i-mi)  >    sum (xi-mi)*(X1i-mi)  => class = 0
             i=0                        i=0
                                                               else class = 1
or

             p-1                        p-1
   (8.3-10)   sum (xi-mi)*(X0i-mi)  -    sum (xi-mi)*(X1i-mi)  > 0  => class = 0
             i=0                        i=0
                                                                    else class = 1

Expanding the summand:

   = sum { xi*X0i + mi*mi - mi*X0i - mi*xi
           - [xi*X1i + mi*mi - mi*X1i - mi*xi] }
   = sum { xi*X0i - mi*X0i - xi*X1i + mi*X1i }    (the mi*mi and mi*xi terms cancel)
   = sum { xi*(X0i - X1i) - mi*X0i + mi*X1i }

               p-1
   (8.3-11)  =  sum  xi*ai  +  a0   > 0  => class = 0, else class = 1
               i=0

                      p-1
   where   a0    =     sum [ -mi*X0i + mi*X1i ]   =  constant
                      i=0
   and     ai    =   X0i - X1i.

Geometrically, the decision boundary is defined by the line

              p-1
   (8.3-12)    sum  xi*ai  +  a0  =  0
              i=0

which is the equation of a straight line (in general, a hyperplane); it
passes through the overall mean m. Points on one side of the line are
judged class 0, those on the other side class 1. Usually X0, X1 would be
m0, m1, the mean vectors of the classes; in this case the line given by
eqn. 8.3-12 bisects, and is perpendicular to, the line joining the two
means.
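To make eqns. 8.3-11 and 8.3-12 concrete, here is a minimal Python sketch
(not in the original notes) which builds the linear boundary from the two
class means of Table 8.1-1 and classifies the two unknown patterns of
section 8.1; it simply evaluates the sign of sum ai*xi + a0, with
ai = m0i - m1i and m the overall mean.

   m0 = (1.0, 1.5)                    # class 0 mean (pennies)
   m1 = (2.0, 3.0)                    # class 1 mean (20p coins)
   m  = tuple((a + b) / 2 for a, b in zip(m0, m1))   # overall mean (equal class sizes)

   a  = [m0i - m1i for m0i, m1i in zip(m0, m1)]      # ai = X0i - X1i  (eqn. 8.3-11)
   a0 = sum(-mi*m0i + mi*m1i for mi, m0i, m1i in zip(m, m0, m1))

   def classify_linear(x):
       # eqn. 8.3-11: positive side of the hyperplane -> class 0, otherwise class 1
       s = sum(ai*xi for ai, xi in zip(a, x)) + a0
       return 0 if s > 0 else 1

   print(classify_linear((3.0, 2.0)))   # xa -> 1
   print(classify_linear((1.0, 1.0)))   # xb -> 0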
Example. Look again at the data from Chapter 8.1. Figure 8.3-3 shows a
linear partition (* * *) for the two classes.

   [Figure 8.3-3 Linear Boundary in Feature Space: the class 0 and
   class 1 points of Figure 8.1-3, with the straight-line boundary
   (* * *) drawn between the two clusters.]

8.3.12 Discriminants.
--------------------

We can look at eqns. 8.3-11 and 8.3-12 in yet another way:

                  p-1
   (8.3-13)  y  =  sum  xi*ai
                  i=0

   (8.3-14)     =  xT.a

the dot product of the vectors x and a, and

   (8.3-15)  y >  a0  =>  class 1
             y <  a0  =>  class 0
             y =  a0  =>  class chosen arbitrarily.

Eqns. 8.3-14 and 8.3-15 form what is called a linear discriminant. A
diagrammatic form is shown in Figure 8.3-4.

   [Figure 8.3-4 Linear Discriminant: the inputs xi are weighted by the
   ai, summed to give y = sum ai*xi, and y is compared with the threshold
   T (= a0); the output is 1 (class 1) if y > T, and 0 (class 0)
   otherwise.]

Readers familiar with neural networks will notice the similarity between
Figure 8.3-4 (see the earlier section 3.2, Figure 3.2-1) and a single
neuron using a hard-limiting activation function.

A general discriminant function, g, is a function of the form:

   (8.3-15')  y = g(x)

   (8.3-16)   y >  T  =>  class 1
              y <  T  =>  class 2
              y =  T  =>  class chosen arbitrarily.

8.3.13 Linear Discriminant as Projection.
----------------------------------------

Figure 8.3-5 shows feature vectors (points, data) for a two-class,
two-dimensional feature space.

   [Figure 8.3-5 Feature Space, Two-class, Two-dimensional: the class w1
   points form a cloud towards the lower left, the class w2 points a
   cloud towards the upper right; their projections onto the x1 axis
   overlap slightly, between the points marked t1 and t2.]

We can now examine the utility of a simple linear discriminant - eqn.
8.3-14:

   (8.3-14)  y = aTx

Projecting the data onto the x1 axis would result in a little overlap of
the classes (between the points t1 and t2 -- see the Figure). Projecting
the data onto the x2 axis (discriminant vector a = (0.0, 1.0)) would
result in much overlap. However, projecting the vectors onto the line
shown in Figure 8.3-6 would be close to optimal -- no class overlap.

   [Figure 8.3-6 Linear Discriminant as Projection: the same two clouds
   of points, with a projection line (* * *, the new axis y0) drawn
   through the data along the direction that separates the two classes;
   projecting the points onto this line separates them with no overlap.]

Projecting the data onto different axes is equivalent to rotating the
axes -- i.e. in the case above, we would rotate to a new set of axes y0,
y1, where y0 is the axis shown as * * *; obviously, in the case above, we
would ignore the second, y1, dimension, since it would contribute nothing
to the discrimination of the classes.
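The projection view is easy to demonstrate numerically. The following
Python sketch (not from the original notes) projects the Table 8.1-1
patterns onto the direction joining the two class means -- one reasonable
choice of discriminant vector a -- and shows that the resulting
one-dimensional values y = aTx separate the classes, whereas the two
classes' raw x1 values overlap.

   # Table 8.1-1 patterns, keyed by class
   data = {0: [(0.4, 1.5), (1.0, 0.5), (1.0, 1.5), (1.0, 2.5), (1.6, 1.5)],
           1: [(1.4, 3.0), (2.0, 2.0), (2.0, 3.0), (2.0, 4.0), (2.6, 3.0)]}

   m0 = (1.0, 1.5)
   m1 = (2.0, 3.0)
   a = (m1[0] - m0[0], m1[1] - m0[1])    # discriminant / projection direction

   def project(x):
       # y = aTx : the 1-d coordinate of x along the direction a
       return a[0]*x[0] + a[1]*x[1]

   for c in (0, 1):
       print(c, sorted(round(project(x), 2) for x in data[c]))
   # class 0 projections: 1.75 .. 4.75, class 1 projections: 5.0 .. 8.0 (no overlap),
   # whereas the raw x1 values overlap (class 0 up to 1.6, class 1 down to 1.4)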
8.3.14 The Connection with Neural Networks
------------------------------------------

Recall eqn. 8.3-12 - which interprets correlation as a linear
partitioning (assume we are using means as 'perfect' vectors). Note: many
treatments of pattern recognition deal only with two-class cases - the
maths is usually easier, and it is usually easy to extend to the
multiclass case without any loss of generality.

              p-1
   (8.3-12)    sum  xi*ai  +  a0  =  0
              i=0

As noted above, this is a straight line. It bisects the line joining the
two means, and is perpendicular to it. In neural networks, a0 is called
the BIAS term. Alternatively, for mathematical tidiness, we can include
a0 in the summation, and insist that the zeroth element of each feature
vector is always 1 - the so-called BIAS input in neural networks.

                  p-1
   (8.3-16')  s =  sum  xi*ai  +  a0
                  i=0

We then subject s to a threshold function,

   (8.3-17)  s >  0  =>  class 1
             s <= 0  =>  class 0

as in eqn. 8.3-15. In neural networks nowadays, this thresholding is
usually done with the so-called sigmoid function, see Chapter 9.

8.3.15 Other Connections.
------------------------

In the early days of pattern recognition (late 1950s, early 60s) there
were two major influences:

   - communication and radar engineering, to whom eqn. 8.3-16' is a
     MATCHED FILTER, i.e. a maximum correlation device,
   - those who wished to model human pattern recognition. To these, eqn.
     8.3-16' is a single neuron. See Chapter 9.

8.3.16 Discussion
-----------------

In all the pattern recognition approaches discussed in this section, the
classification rules end up as minimum distance, or something very
similar. It is comforting to know that correlation and minimum distance
boil down to the same thing, and that these are related to maximum
likelihood, AND, very closely, to neural networks. Furthermore, it can be
shown that the nearest neighbour classifier - which is based on intuition
- is very closely related to the Bayes classifier. Also, many practical
implementations of the Bayes classifier end up using minimum distance
(e.g. nearest mean). Usually, it is the choice of features that governs
the effectiveness of the system, not the classifier rule.

We now consider two practical pattern recognition problems related to
image processing.

8.4 Two-dimensional Shape Recognition
-------------------------------------

We now return to the character recognition problem. Of course, this is
representative of many two-dimensional shape recognition problems, e.g.
text-character recognition. The major problem is locating the
character/shape, and extracting invariant features. A possible solution
is given by the following steps:

(1) Segmentation: label ink/object pixels versus background/non-ink
    pixels. See Chapter 7. This removes the effects of
    amplitude/brightness; it also makes step (2) easier.
(2) Isolate the object: e.g. starting off at the top-left corner of the
    object, move right and down, simultaneously scanning each column and
    each row; when there is a break - no ink - this marks the
    bottom-right corner of the object. (A minimal sketch of this step is
    given after this list.)
(3) Feature extraction: we need invariant features, e.g. 2-d central
    moments, see G&W section 8.3.4 (p. 514) and the next section (8.5) of
    these notes.
(4) Gather training data: obtain a representative set of feature vectors
    for each class of object.
(5) Test data: same requirement as training.
(6) Train the classifier: train on the training set for each class, e.g.
    learn statistics (mean, standard deviation), or somehow find out the
    occupancy of the classes in measurement/feature space.
(7) Classifier: see chapter 8.3:
    - Statistical: Bayes' rule.
    - Geometric: simple nearest mean classifier, or 'box' classifier,
    - nearest neighbour, or k-nn.
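For step (2), one simple realisation is to take the bounding box of the
segmented object pixels. The sketch below (Python, not from the original
notes) assumes the image has already been quantised to 0/1 as in step
(1), and returns the bounds xl..xh, yl..yh used in section 8.5.

   def bounding_box(img):
       # img: 2-d list of 0/1 pixels (1 = ink/object) after segmentation
       rows = [r for r, row in enumerate(img) if any(row)]
       cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
       # (xl, xh, yl, yh): column and row bounds of the object
       return min(cols), max(cols), min(rows), max(rows)

   img = [[0, 0, 0, 0, 0],
          [0, 1, 1, 1, 0],
          [0, 1, 0, 0, 0],
          [0, 1, 1, 1, 0],
          [0, 0, 0, 0, 0]]
   print(bounding_box(img))   # (1, 3, 1, 3)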
8.5 Two-dimensional Invariant Moments for Planar Shape Recognition.
------------------------------------------------------------------

Assume we have isolated the object in the image (see chapter 8.4): its
bounds are xl..xh, yl..yh. Two-dimensional moments are given by:

   (8.5-1)  mpq = sum sum  x^p * y^q * f[x,y]
                   x   y

for p, q = 0, 1, 2, ... These are not invariant to anything, yet.

   (8.5-2)  x~ = m10/m00

gives the x-centre of gravity of the object, and

   (8.5-3)  y~ = m01/m00

gives the y-centre of gravity.

Now we can obtain shift invariant features by referring all coordinates
to the centre of gravity (x~, y~). These are the so-called central
moments:

   (8.5-4)  m'pq = sum sum  (x - x~)^p * (y - y~)^q * f[x,y]
                    x   y

Note: these are called mu (Greek letter) in the literature. The first few
m' can be interpreted as follows:

   m'00 = m00 = sum of the grey levels in the object,
   m'10 = m'01 = 0, always, i.e. the centre of gravity is (0,0) with
          respect to itself,
   m'20 = measure of width along the x-axis,
   m'02 = measure of width along the y-axis.

From the m'pq can be derived a set of normalised moments:

   (8.5-5)  npq = m'pq / (m'00^g)     where g = (p+q)/2 + 1

Finally, a set of seven fully shift, rotation, and scale invariant
moments can be defined:

   (8.5-6)  p1 = n20 + n02
            p2 = (n20 - n02)^2 + 4*n11^2
            etc... see G&W if interested.

Example 8.5-1. An apple pie manufacturer is setting up an automatic
conveyer belt inspection system for the pastry tops for his pies. He
requires: (1) the tops must be circular (to within some defined
tolerance), (2) there must be no holes in the top.

The inspection can be carried out using the model given in section 8.4
and chapter 2. Assuming the background grey level sufficiently contrasts
with that of the pastry, location and segmentation should be simple.
After segmentation, inspection (2) can be carried out by searching for
any background pixels in the interior of the pastry. Inspection (1) can
be done using invariant moments; we demand feature vectors within a
certain small distance (the tolerance) of those of the perfect shape.

Note: if the input camera is always the same distance from the conveyer
belt, we don't need scale invariance. And, because the pie tops are
circular, we don't need rotation invariance. Thus, simple features might
be (say) four, or eight, diameters sampled at equally spaced angles
around the pie; of course, x~ and y~ give the centre of the pie top,
through which the diameters must pass. An alternative would be to detect
the edges of the pie top, and thereby compute the circumference, c.
Compute the diameter as above, d. Then c/d must be close to PI, otherwise
it's a bad pi(e)!!

Ex. 8.5-2. At Easter the pie manufacturer branches into hot-cross buns.
In this case, he is interested not only in the roundness, but in the
perfection of the cross on top of the bun. Suggest possible features, and
classifier techniques.
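Here is a minimal Python sketch (not from the original notes) of eqns.
8.5-1 to 8.5-6, computing the central and normalised moments of a binary
object and the first invariant moment p1; for a binary image, f[x,y] is
simply 0 or 1.

   def moment(img, p, q):
       # eqn. 8.5-1: ordinary two-dimensional moment m_pq
       return sum((x**p) * (y**q) * img[y][x]
                  for y in range(len(img)) for x in range(len(img[0])))

   def central_moment(img, p, q):
       # eqns. 8.5-2 to 8.5-4: moments about the centre of gravity (x~, y~)
       m00 = moment(img, 0, 0)
       xc, yc = moment(img, 1, 0) / m00, moment(img, 0, 1) / m00
       return sum(((x - xc)**p) * ((y - yc)**q) * img[y][x]
                  for y in range(len(img)) for x in range(len(img[0])))

   def normalised_moment(img, p, q):
       # eqn. 8.5-5: n_pq = m'_pq / m'_00^g,  g = (p+q)/2 + 1
       return central_moment(img, p, q) / (central_moment(img, 0, 0) ** ((p + q) / 2 + 1))

   def p1(img):
       # eqn. 8.5-6: first invariant moment, n20 + n02
       return normalised_moment(img, 2, 0) + normalised_moment(img, 0, 2)

   square4 = [[1]*4 for _ in range(4)]
   square8 = [[1]*8 for _ in range(8)]
   print(round(p1(square4), 3), round(p1(square8), 3))
   # 0.156 0.164 : nearly equal - p1 is scale invariant (exactly so in the continuous limit)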
8.6 Classification Based on Spectral Features.
----------------------------------------------

For example, we require a system to automate the process of land-use
mapping using multispectral (i.e. multicolour) satellite images. Or,
maybe, to discriminate rain clouds from non-rain clouds; or to detect
pollution in water.

Feature Extraction:
------------------

In this case the features are easier to obtain; in fact, the pixel values
(in each colour) will do; i.e. the feature vector is the same as the
observation vector - the plain radiance value in each spectral band. It
might be necessary to do some preprocessing, e.g. correction for
atmospheric effects.

Training data:
-------------

Obtain representative (NB. representative) samples of each land-use class
(this is called ground data, or ground-truth). This requires field work,
or the use of existing maps.

Test data:
---------

Same requirement as training; again NB. we need fully representative
data.

Train classifier:
----------------

Train on the samples for each class: learn statistics (mean, standard
deviation), or somehow find out the occupancy of the classes in
measurement/feature space.

Classifier:
----------

See above. The output is an image containing class labels; these can be
appropriately colour coded for map production.

Test / Evaluation:
-----------------

Run the classifier on the test data. Compare 'true' classes and
'classifier' classes. Produce an error matrix, or simply the percentage
of pixels incorrectly classified. NB. We must use different data for
testing - NOT the training data.

Geometric Correction:
--------------------

Scale, shift, rotate the image to correspond to map coordinates.
Sometimes called geometric calibration.

Pitfalls and difficulties:
-------------------------

Poor features. The features just do not discriminate the classes. If the
raw spectral bands do not discriminate, we may need to take the ratio of
colours/bands. Or, we may need to use texture. Or, we may need to combine
images from different parts of the season - this will require image
registration, see Chapter 3.

Poor training data. Not representative - maybe containing only one
sub-set of the real class - and so not representative of the true class
variability. Or, we have pixels of another class mixed in.

Mixed pixels, i.e. pixels that 'straddle' boundaries between class
regions; bad for training data. Also difficult to classify correctly.

Overlapping classes. Some of our classes may actually have the same
'colour'; to solve this problem we need additional 'features', e.g.
texture. Or use context, or try multiseason data - see above.

Testing on training data. This is a common error, somewhat akin to the
same person specifying, writing, and testing a piece of software: the
results of the tests usually give good news - whatever the reality!

8.7 Some Common Problems in Pattern Recognition.
------------------------------------------------

- Bad features. No amount of 'clever' classification can remedy the
  situation if the features are not properly descriptive.
- Large dimensionality. The larger the dimensionality of the feature
  vectors, the more training data required. Typically four or more
  samples per class, per dimension; e.g. for dimensionality p = 50,
  number of samples = 250 per class.
- Mismatch in the ranges of features. E.g. feature 1 ranges 0 to 1,
  feature 2 ranges 0 to 100: variations in feature 2 will swamp
  variations in feature 1 in, for example, distance or correlation
  measures. We need to normalise the components of the feature vector to
  equal ranges of (say) 0 to 1.
- Testing using training data.
- Multimodal distributions, i.e. really two classes combined. May fool
  statistical algorithms.

8.8 Problems that may be solved by Pattern Recognition Techniques.
-----------------------------------------------------------------

Engine Monitoring.
-----------------

Observation vector: sound signal.
Feature vector: Fourier spectrum of the sound signal.
Classes: healthy; not healthy - needs maintenance.

Medical Diagnosis.
-----------------

Observation vector: binary - a list of symptoms, e.g.
   x1 = pain in hamstring,
   x2 = history of back trouble,
   x3 = pain intermittent,
   x4 = pain constant,
   x5 = recent athletic activity.
Feature vector: = observation vector.
Classes: w1 = sciatica, w2 = pulled muscle, w3 = stiffness,
         w4 = don't know - refer to specialist.

Information Retrieval.
---------------------

Features: presence of keywords.
Classes: subject areas.

Burglar Alarm.
--------------
8.8 Problems that may be solved by Pattern Recognition Techniques.
------------------------------------------------------------------

Engine Monitoring.
------------------
Observation vector: sound signal.
Feature vector: Fourier spectrum of the sound signal.
Classes: healthy; not healthy - needs maintenance.

Medical Diagnosis.
------------------
Observation vector: binary - a list of symptoms, e.g.
x1 = pain in hamstring,
x2 = history of back trouble,
x3 = pain intermittent,
x4 = pain constant,
x5 = recent athletic activity.
Feature vector: = observation vector.
Classes: w1 = sciatica, w2 = pulled muscle, w3 = stiffness, w4 = don't know - refer to a specialist.

Information Retrieval.
----------------------
Features: presence of keywords.
Classes: subject areas.

Burglar Alarm.
--------------
Features: sound level in a microphone, current level in loop sensors, readings from infra-red temperature sensors.
Classes: intruder - ring bell; no intruder.

8.10. Exercises.
----------------

1. Draw a histogram for the following data (one-dimensional features) from two classes:

class 0, w0:
    1.21  3.11  3.97  6.21
    1.32  3.12  4.12  6.58
    1.40  3.21  4.30  7.00
    1.56  3.31  4.70
    2.07  3.37  4.86
    2.21  3.45  4.92
    2.22  3.50  4.97
    2.73  3.78  5.10
    3.00  3.90  5.70

class 1, w1:
    6.89  10.03  11.23  11.71  12.37
    8.01  10.31  11.25  11.82  13.01
    8.76  10.45  11.34  11.99  13.50
    9.25  10.56  11.37  12.22  13.57
    9.33  10.72  11.45  12.32  14.60
    9.76  10.80  11.60  12.33

Determine a decision boundary (threshold) that classifies the points with minimum error.

2. Determine the means of each class. What is the effective decision boundary for the nearest mean classifier?

3. If you wanted to use a nearest neighbour classifier, but decided to 'condense' the points, which points are significant and must be retained in order to give minimum error?

4. Pick 10 men and 10 women at random (more if you have time). Find their heights in inches. Plot a histogram. Design a histogram based (statistical) classifier that distinguishes men from women.

5. Design a nearest mean classifier using the data from exercise 4. Compare the results.

6. Add an additional, second, feature to exercise 4, i.e. shoe size. Plot a two-dimensional scatter plot (or histogram, if you wish). Plot the optimum boundary between the classes - based on the scatter plot.

7. Using the data from exercise 6, work out the mean vectors of the classes. Plot the linear boundary that would be formed by a nearest mean classifier.

8. Design a pattern recognition system for coins (money). What are the:
- observations,
- features?
Suggest simple classifier rules.

9. In the character recognition example given in section 8.1 the feature space is nine-dimensional. Thus, visualisation of the data in feature space is difficult. The following example is easier to visualise. Consider an imaging system which has just two pixels - or an animal which has just two light sensitive cells in its retina, see Figure 8.10-1. Call the outputs of these x1 and x2; together they form a two-dimensional vector x = (x1, x2).

       x1    x2
    +-----+-----+
    |     |     |
    |     |     |
    +-----+-----+

Figure 8.10-1 Two Pixel Image
-----------------------------

(a) If the components are binary (0 or 1) we can consider a problem which wishes to distinguish 'bright' objects - class 1, from dark - class 0. For now we will define class 1 as 'both pixels light'. I.e. we have ['*' denotes light (1)]:

       x1    x2
    +-----+-----+
    |*****|*****|   class 1
    |*****|*****|
    +-----+-----+

       x1    x2
    +-----+-----+
    |*****|     |   class 0
    |*****|     |
    +-----+-----+

       x1    x2
    +-----+-----+
    |     |*****|   class 0
    |     |*****|
    +-----+-----+

       x1    x2
    +-----+-----+
    |     |     |   class 0
    |     |     |
    +-----+-----+

Figure 8.10-2
-------------

Note the similarity with the Boolean AND function. The feature space representation of these classes is shown in Figure 8.10-3; '@' represents class 0, '*' represents class 1. We have shown a linear boundary which segregates the classes.

       ^          \
     1 @           \                  *
       |            \
       |             \
    x2 |   class 0    \    class 1
       |               \
       |                \
     0 @-----------------\-----------@--->
       0                  \          1
                    x1

Figure 8.10-3 Two-dimensional Scatter Diagram - Feature Space
-------------------------------------------------------------
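The linear boundary in Figure 8.10-3 can be written as a simple discriminant function, e.g. g(x) = x1 + x2 - 1.5: g(x) > 0 on the class 1 side of the line, g(x) < 0 on the class 0 side. The following is a minimal sketch; the particular threshold 1.5 is an invented choice, just one of many lines that separate the four binary patterns.

    #include <stdio.h>

    /* Linear discriminant for the two-pixel 'AND-like' problem of
       Figure 8.10-3: class 1 if x1 + x2 - 1.5 > 0, class 0 otherwise.
       The boundary x1 + x2 = 1.5 is one possible separating line. */
    int classify(double x1, double x2)
    {
        return (x1 + x2 - 1.5 > 0.0) ? 1 : 0;
    }

    int main(void)
    {
        double patterns[4][2] = { {0,0}, {0,1}, {1,0}, {1,1} };
        int i;
        for (i = 0; i < 4; i++)
            printf("x = (%.0f,%.0f) -> class %d\n",
                   patterns[i][0], patterns[i][1],
                   classify(patterns[i][0], patterns[i][1]));
        return 0;
    }

Only (1,1) gives class 1, as required; parts (b) to (d) below ask whether a single line of this kind is still enough.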
(b) Let us change to a problem which wishes to distinguish striped objects (class 0, say) from plain objects (class 1). I.e. we have ['*' denotes light (1)]:

       x1    x2
    +-----+-----+
    |*****|*****|   class 1
    |*****|*****|
    +-----+-----+

       x1    x2
    +-----+-----+
    |*****|     |   class 0
    |*****|     |
    +-----+-----+

       x1    x2
    +-----+-----+
    |     |*****|   class 0
    |     |*****|
    +-----+-----+

       x1    x2
    +-----+-----+
    |     |     |   class 1
    |     |     |
    +-----+-----+

Figure 8.10-4
-------------

Draw the feature space diagram. Draw appropriate decision boundary line(s) - note the difficulty compared to (a). Note the similarity with the Boolean XOR function.

(c) Let us change to a problem which wishes to distinguish left-handed objects (class 1, say) from right-handed objects (class 2), with objects that are neither left- nor right-handed rejected as class 0. I.e. we have ['*' denotes light (1)]:

       x1    x2
    +-----+-----+
    |*****|*****|   class 0
    |*****|*****|
    +-----+-----+

       x1    x2
    +-----+-----+
    |*****|     |   class 1
    |*****|     |
    +-----+-----+

       x1    x2
    +-----+-----+
    |     |*****|   class 2
    |     |*****|
    +-----+-----+

       x1    x2
    +-----+-----+
    |     |     |   class 0
    |     |     |
    +-----+-----+

Figure 8.10-5
-------------

Draw the feature space diagram. Show the linear boundaries.

(d) (See (a).) Describe a state of affairs that corresponds to the Boolean OR function; draw the diagrams. Will a single linear boundary do? [Yes].

10. Change the system in Ex. 9 to allow non-binary data. Allow the data to extend from 0 to +1 and assume real values (e.g. 0.995, 0.0256). Now extend 9(a) to (d) assuming that there are small amounts of noise on the pixels, e.g. we have values spread over the range 0.9 to 1.0 for light, and 0.0 to 0.1 for dark. Draw the feature space diagrams for each case (a) to (d). Draw suggested linear boundaries.

11. Now let the noise in Ex. 10 increase, so that we have values spread over the range 0.55 to 1.0 for light, and 0.0 to 0.45 for dark. (i) Draw the feature space diagrams for each case (a) to (d). (ii) Draw suggested linear boundaries.

12. Now let the noise in Ex. 11 increase further, so that we have values spread over the range 0.4 to 1.0 for light, and 0.0 to 0.6 for dark. (i) Draw the feature space diagrams for each case (a) to (d). (ii) Draw suggested linear boundaries. (iii) Suggest alternatives to the 'linear' classifiers.

13. In section 8.6 we mentioned the problem of mixed pixels. Research 'fuzzy sets' and suggest how they could be applied to this problem. [Note: a class can be considered to be defined as a BINARY membership function in feature space; i.e. it is a SET - a feature value is either a member (1) or not (0). A fuzzy set is one which can take on continuous membership values in the range 0 to 1, e.g. 0.0, 0.1 (low membership); 0.9, 0.95, 1.0 (high membership); an illustrative membership function is sketched below.] How would fuzzy classes change our ability to define decision boundaries?
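As a hint of what a fuzzy membership function might look like (this is only an illustrative sketch, not a prescription for the exercise; the one-dimensional feature, the breakpoints 0.3 and 0.7, and the function name are all invented), a 'ramp' function assigns full membership of the 'light' class above one value, zero membership below another, and intermediate membership in between:

    #include <stdio.h>

    /* Illustrative fuzzy membership function for a one-dimensional feature x
       in the range 0..1: membership of the 'light' class rises linearly from
       0 (at x = 0.3) to 1 (at x = 0.7).  A crisp (binary) set would jump
       straight from 0 to 1 at a single threshold. */
    double light_membership(double x)
    {
        if (x <= 0.3) return 0.0;
        if (x >= 0.7) return 1.0;
        return (x - 0.3) / (0.7 - 0.3);
    }

    int main(void)
    {
        int i;
        for (i = 0; i <= 10; i++) {
            double x = i / 10.0;
            printf("x = %.1f  membership = %.2f\n", x, light_membership(x));
        }
        return 0;
    }

A mixed pixel would then receive partial membership of two (or more) classes, rather than being forced into exactly one.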
14. IRIS data. [R.A. Fisher. 1936. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, Vol. 7, pp. 179-188.]

The following data (in the file 'ih1.asc') contain a subset of the famous 'IRIS' (flowers) data. The data are four-dimensional: x0 = sepal length, x1 = sepal width, x2 = petal length, x3 = petal width. There are two classes, corresponding to two families of iris.

    /*dh rh ch 3, 0, 49
    cl   x0  x1  x2  x3
    1    5.1,3.5,1.4,0.2
    1    4.9,3.0,1.4,0.2
    1    4.7,3.2,1.3,0.2
    1    4.6,3.1,1.5,0.2
    1    5.0,3.6,1.4,0.2
    1    5.4,3.9,1.7,0.4
    1    4.6,3.4,1.4,0.3
    1    5.0,3.4,1.5,0.2
    1    4.4,2.9,1.4,0.2
    1    4.9,3.1,1.5,0.1
    1    5.4,3.7,1.5,0.2
    1    4.8,3.4,1.6,0.2
    1    4.8,3.0,1.4,0.1
    1    4.3,3.0,1.1,0.1
    1    5.8,4.0,1.2,0.2
    1    5.7,4.4,1.5,0.4
    1    5.4,3.9,1.3,0.4
    1    5.1,3.5,1.4,0.3
    1    5.7,3.8,1.7,0.3
    1    5.1,3.8,1.5,0.3
    1    5.4,3.4,1.7,0.2
    1    5.1,3.7,1.5,0.4
    1    4.6,3.6,1.0,0.2
    1    5.1,3.3,1.7,0.5
    1    4.8,3.4,1.9,0.2
    2    7.0,3.2,4.7,1.4
    2    6.4,3.2,4.5,1.5
    2    6.9,3.1,4.9,1.5
    2    5.5,2.3,4.0,1.3
    2    6.5,2.8,4.6,1.5
    2    5.7,2.8,4.5,1.3
    2    6.3,3.3,4.7,1.6
    2    4.9,2.4,3.3,1.0
    2    6.6,2.9,4.6,1.3
    2    5.2,2.7,3.9,1.4
    2    5.0,2.0,3.5,1.0
    2    5.9,3.0,4.2,1.5
    2    6.0,2.2,4.0,1.0
    2    6.1,2.9,4.7,1.4
    2    5.6,2.9,3.6,1.3
    2    6.7,3.1,4.4,1.4
    2    5.6,3.0,4.5,1.5
    2    5.8,2.7,4.1,1.0
    2    6.2,2.2,4.5,1.5
    2    5.6,2.5,3.9,1.1
    2    5.9,3.2,4.8,1.8
    2    6.1,2.8,4.0,1.3
    2    6.3,2.5,4.9,1.5
    2    6.1,2.8,4.7,1.2
    2    6.4,2.9,4.3,1.3

I have split the original data into two halves; the other half is in 'ih2.asc'.

(a) Run the DataLab batch file 'ipiris', which will configure and read ih1.asc into image 0, and ih2.asc into image 1. It will also show some scatter plots and histograms.

(b) (b-1) Run 'clnm' (nearest mean classifier) on image 0 (source - training data) and image 1 (destination - test data). (b-2) Run 'clcfus' on image 1 to see the result. (b-3) Check by examining image 1 using 'tpsv'.

(c) (c-1) Run 'clnn' (nearest neighbour classifier) on image 0 (source - training data) and image 1 (destination - test data). (c-2) Run 'clcfus' on image 1 to see the result. (c-3) Check by examining image 1 using 'tpsv'.

(d) (d-1) Run 'clbpnn' (backpropagation neural network classifier) on image 0 (source - training data) and image 1 (destination - test data). (d-2) Run 'clcfus' on image 1 to see the result. (d-3) Check by examining image 1 using 'tpsv'.

15. Here are the data introduced in section 8.3.

    label  cllab     x0    x1
      1      1   :  0.40  1.50
      1      1   :  1.00  0.50
      1      1   :  1.00  1.50
      1      1   :  1.00  2.50
      1      1   :  1.60  1.50
      2      2   :  1.40  3.00
      2      2   :  2.00  2.00
      2      2   :  2.00  3.00
      2      2   :  2.00  4.00
      2      2   :  2.60  3.00

(a) Draw a scatter plot for the data [do this first; it will make the rest of the question easier].
(b) Work out the class means.
(c) Hence, apply a nearest mean classifier to the patterns:

     x0   x1
    1.0, 1.0
    2.0, 3.5
    5.0, 5.0

(d) Compare the error results that you would obtain by applying (i) a nearest mean classifier, and (ii) a nearest neighbour classifier to the training data (i.e. the training data are the data in the table above, and you are also using these data for testing).
(e) Illustrate a linear boundary classifier that would suit these training data.
(f) Design and demonstrate a neural network classifier for these data; comment on the relationship between your result and that obtained in (e).
(g) Design and demonstrate a maximum likelihood classifier for these data.

end of file.