Character Recognition by Feature Point Extraction

Eric W. Brown

feneric@ccs.neu.edu

Abstract

The ability to identify machine printed characters in an automated or a semi-automated manner has obvious applications in numerous fields. Since creating an algorithm with a one hundred percent correct recognition rate is quite probably impossible in our world of noise and different font styles, it is important to design character recognition algorithms with these failures in mind so that when mistakes are inevitably made, they will at least be understandable and predictable to the person working with the program. This paper explores one such algorithm and tests it on two different fonts using a third font as a reference. The results are discussed and several improvements are suggested.

Keywords

Text recognition, Optical character recognition, Feature extraction, Pattern recognition, Feature point, OCR, Off-line character recognition, Machine printed character recognition.

Optical character recognition (OCR) has been a topic of interest since possibly the late 1940s, when Jacob Rabinow started his work in the field [1]. The earliest OCR machines were primitive mechanical devices with fairly high failure rates. As the amount of new written material increased, so did the need to process it all quickly and reliably, and these machines were clearly not up to the task. They quickly gave way to computer-based OCR devices that could outperform them in both speed and reliability.

Today there are many OCR devices in use based on a plethora of different algorithms. All of the popular algorithms sport high accuracy and most offer high speed, but many still suffer from a fairly simple flaw: when they do make mistakes (and they all do), the mistakes are often very unnatural from the human point of view. That is, mistaking a "5" for an "S" is not too surprising, because most people are willing to agree that these two characters are similar, but mistaking a "5" for an "M" is counter-intuitive and unexpected. Algorithms make such mistakes because, for computational reasons, they generally operate on a different set of features than humans do. Humans observe strokes and the relations between them, while algorithms measure anything from Transformation Ring Projections [2] of a character to the Fourier Transform of its Horizontal-Vertical Projections [3]. These methods do work and are often computationally efficient, but they make the computer see letters through a decidedly non-human set of eyes.

The importance of making the same sorts of mistakes as a human may not be immediately obvious, but it is important to realize that the main purpose of OCR is to facilitate communication between humans. Mistakes typical of humans can be more readily corrected by humans (e.g., "5ave" is easier to connect with "Save" than "Mave" is).

This paper describes an algorithm that attempts to work with a subset of the features a human would typically use to identify machine-printed English characters. Its recognition rate is currently not as high as those of the older, more developed character recognition algorithms, but it is expected that expanding it to work with a larger set of features would remove this shortcoming. Such an expansion would also make it correspondingly slower, but with the advent of faster microprocessors this is not viewed as a crippling problem.

The characters in most Western character sets can comfortably be described using only an eight by eight grid. Since the focus of this algorithm is currently just machine printed English characters, the algorithm was designed to handle eight by eight data. It should not be impossible to use this algorithm on characters with higher resolution, but some method of data reduction would have to be applied to the raw data first. A block reduction method should produce acceptable results for most machine printed data, but there is no reason that a good thinning algorithm [4] could not do just as well, provided that the input needs of this algorithm are properly understood.
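One plausible block reduction scheme is a majority vote within each block: divide the high-resolution bitmap into equal blocks, and turn an output pixel on when at least half of its block is on. The sketch below illustrates this idea; the paper does not prescribe a specific reduction method, so the majority-vote rule and the function name are assumptions for illustration.

```python
def block_reduce(bitmap, out_size=8):
    """Reduce a square binary bitmap to out_size x out_size by majority
    vote within each block. This is a sketch of one plausible reduction
    scheme, not the paper's prescribed method."""
    n = len(bitmap)
    step = n // out_size  # assumes n is a multiple of out_size
    reduced = []
    for br in range(out_size):
        row = []
        for bc in range(out_size):
            # Gather all source pixels covered by this output pixel.
            block = [bitmap[br * step + r][bc * step + c]
                     for r in range(step) for c in range(step)]
            # Turn the output pixel on if at least half the block is on.
            row.append(1 if sum(block) * 2 >= len(block) else 0)
        reduced.append(row)
    return reduced
```

For example, reducing a sixteen by sixteen bitmap whose top-left quadrant is filled yields an eight by eight grid whose top-left four by four quadrant is filled.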

The algorithm presently avoids thinning (and other preprocessing) by assuming that the input eight by eight data is not particularly aberrant -- lines in an eight by eight grid should not normally be thicker than two pixels. With this assumption, it then proceeds to look for feature points.

A feature point is a point of human interest in an image, a place where something happens. It could be an intersection between two lines, or it could be a corner, or it could be just a dot surrounded by space. Such points serve to help define the relationship between different strokes. Two strokes could fully cross each other, come together in a "Y" or a "T" intersection, form a corner, or avoid each other altogether. People tend to be sensitive to these relationships; the fact that the lines in a "Z" connect in a certain way is more important than the individual lengths of those lines. These relationships are what should be used for character identification, and the feature points can be exploited for the task.

The procedure this algorithm uses to extract these feature points is fairly straightforward. Since an eight by eight character consists of only sixty-four pixels, it is viable to simply loop through the entire character and examine each pixel in turn. If a pixel is on, its eight neighbors are checked. Since each neighbor can also only be on or off, there are only 256 possible neighborhood combinations. Of these 256, fifty-eight were found to represent significant feature points in a fairly unambiguous way. Extracting feature points thus reduces to calculating a number between zero and 255 to describe a pixel's neighborhood and then comparing that number against a table of known feature points (see Table 1, "Enumeration of Possible Pixel Neighborhoods"). While it is true that this method does not always catch every feature point (some can only be seen in a larger context), it catches the majority; missing feature points is certainly not a limiting factor in the algorithm's accuracy. Nor does it suffer from labelling too many uninteresting points as feature points: it has virtually no false positives. The feature point extractor is thus fast and reliable.

Table 1: Enumeration of Possible Pixel Neighborhoods

[Table 1 enumerates the 256 possible three by three pixel neighborhoods, numbered 0 through 255. Each entry is drawn as a three by three cell with the center pixel marked "+" and on neighbors marked "*"; the entry's number is the byte value formed from its neighbor bits.]