Sargur N. Srihari & Stephen W. Lam Center of Excellence for Document Analysis and Recognition State University of New York at Buffalo 520 Lee Entrance, Suite 202 Amherst, NY 14228-2567
Optical Character Recognition (OCR) is the process of converting scanned images of machine printed or handwritten text (numerals, letters, and symbols) into a computer-processable format (such as ASCII). This article describes the design of OCR systems and their applications.
A typical OCR system (Fig. 1) contains three logical components: an image scanner, OCR software and hardware, and an output interface.
Fig.1 Components of an OCR system: image scanner, OCR software and hardware, and output interface.
The image scanner optically captures text images to be recognized. Text images are processed with OCR software and hardware. The process involves three operations: document analysis (extracting individual character images), recognizing these images (based on shape), and contextual processing (either to correct misclassifications made by the recognition algorithm or to limit recognition choices). The output interface is responsible for communicating OCR system results to the outside world.
Four basic building blocks make up a functional image scanner: a detector (and associated electronics), an illumination source, a scan lens, and a document transport. The document transport places the document in the scanning field, the light source floods the object with illumination, and the lens forms the object's image on the detector. The detector consists of an array of elements, each of which converts incident light into a charge, or analog signal. These analog signals are then converted into an image. Scanning is performed by the detector and the motion of the text object with respect to the detector. After an image is captured, the document transport removes the document from the scanning field.
Recent advances in scanner technology have made higher resolutions available, often in the range of 300 to 400 pixels per inch (ppi). Recognition methods that use features (as opposed to template matching) require resolutions of at least 200 ppi and careful treatment of gray scale. Lower resolutions and simple thresholding tend to break thin lines or fill gaps, thus invalidating features.
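A minimal sketch of the thresholding effect, in Python with NumPy (the synthetic image and threshold values are illustrative assumptions), shows how a single global cutoff set too aggressively erases a thin, faint stroke entirely:

    import numpy as np

    def binarize(gray, threshold):
        """Global thresholding: pixels darker than the threshold become ink (1)."""
        return (gray < threshold).astype(np.uint8)

    # Synthetic 8-bit image: a dark stroke (value 30) and a faint stroke (value 160)
    # on a white background (value 255).
    img = np.full((5, 5), 255, dtype=np.uint8)
    img[:, 1] = 30     # dark stroke
    img[:, 3] = 160    # faint stroke

    print(binarize(img, 200).sum(axis=0))  # both strokes survive: [0 5 0 5 0]
    print(binarize(img, 100).sum(axis=0))  # the faint stroke is lost: [0 5 0 0 0]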
The software/hardware system that recognizes characters from a registered image can be divided into three operational steps: document analysis, character recognition, and contextual processing.
Extracting text from the document image is a process known as document analysis. Reliable character segmentation and recognition depend upon both original document quality and registered image quality. Processes that attempt to compensate for poor quality originals and/or poor quality scanning include image enhancement, underline removal, and noise removal. Image enhancement methods emphasize the distinction between characters and non-characters. Underline removal erases printed guidelines and other lines that may touch characters and interfere with character recognition, and noise removal erases portions of the image that are not part of the characters.
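As a sketch of one common noise-removal step (Python with SciPy; the size threshold is an illustrative assumption), specks much smaller than a character can be erased by dropping tiny connected components:

    import numpy as np
    from scipy import ndimage

    def remove_specks(binary_img, min_pixels=10):
        """Erase connected components too small to be part of a character."""
        labels, count = ndimage.label(binary_img)            # label each blob
        sizes = ndimage.sum(binary_img, labels, range(1, count + 1))
        keep_ids = 1 + np.flatnonzero(np.asarray(sizes) >= min_pixels)
        return binary_img * np.isin(labels, keep_ids)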
Prior to character recognition it is necessary to isolate individual characters from the text image. Many OCR systems use connected components for this process. For those connected components that represent multiple or partial characters, more sophisticated algorithms are used. In low quality or non-uniform text images these algorithms may not correctly extract characters, and recognition errors may occur. Recognition of unconstrained handwritten text can be very difficult because characters cannot be reliably isolated, especially when the text is cursive.
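A sketch of the connected-component step (Python with SciPy; the toy binary image is an illustrative assumption), where each component's bounding box becomes a character candidate:

    import numpy as np
    from scipy import ndimage

    def character_candidates(binary_img):
        """Label 8-connected components and return their bounding boxes."""
        labels, _ = ndimage.label(binary_img, structure=np.ones((3, 3)))
        boxes = ndimage.find_objects(labels)   # one (row_slice, col_slice) per label
        return [(s[0].start, s[0].stop, s[1].start, s[1].stop) for s in boxes]

    # Two separate blobs standing in for two characters.
    img = np.array([[1, 1, 0, 0, 1],
                    [1, 1, 0, 0, 1],
                    [0, 0, 0, 0, 1]])
    print(character_candidates(img))  # [(0, 2, 0, 2), (0, 3, 4, 5)]

Boxes that are unusually wide or narrow would then be handed to the splitting and merging heuristics mentioned above.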
Two essential components in a character recognition algorithm are the feature extractor and the classifier. Feature analysis determines the descriptors, or feature set, used to describe all characters. Given a character image, the feature extractor derives the features that the character possesses. The derived features are then used as input to the character classifier.
Template matching, or matrix matching, is one of the most common classification methods. In template matching, individual image pixels are used as features. Classification is performed by comparing an input character image with a set of templates (or prototypes) from each character class. Each comparison results in a similarity measure between the input character and the template. A typical measure increases the similarity score when a pixel in the observed character is identical to the corresponding pixel in the template image; if the pixels differ, the score may be decreased. After all templates have been compared with the observed character image, the character's identity is assigned as the identity of the most similar template.
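A minimal sketch of this comparison loop in Python with NumPy (the tiny 3x3 templates are illustrative assumptions):

    import numpy as np

    # Toy 3x3 binary templates, one per class (illustrative only).
    templates = {
        "I": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
        "L": np.array([[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
    }

    def classify(char_img):
        """Assign the label of the template with the most matching pixels."""
        def similarity(template):
            # +1 for each agreeing pixel, -1 for each disagreeing pixel.
            return np.where(char_img == template, 1, -1).sum()
        return max(templates, key=lambda label: similarity(templates[label]))

    noisy_I = np.array([[0, 1, 0], [1, 1, 0], [0, 1, 0]])  # an 'I' with one noise pixel
    print(classify(noisy_I))  # 'I'

In practice the templates are full-resolution character images, with one or more templates per class per font.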
Template matching is a trainable process because template characters may be changed. In many commercial systems, PROMs (programmable read-only memory) store templates containing single fonts. To retrain the algorithm the current PROMs are replaced with PROMs that contain images of a new font. Thus, if a suitable PROM exists for a font then template matching can be trained to recognize that font. The similarity measure of template matching may also be modified, but commercial OCR systems typically do not allow this.
Structural classification methods utilize structural features and decision rules to classify characters. Structural features may be defined in terms of character strokes, character holes, or other character attributes such as concavities. For instance, the letter P may be described as a vertical stroke with a hole attached on the upper right side. For a character image input, the structural features are extracted and a rule-based system is applied to classify the character. Structural methods are also trainable but construction of a good feature set and a good rule-base can be time-consuming.
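A sketch of the rule-based step (pure Python; the feature names and rules are illustrative assumptions that echo the description of P above):

    def classify(features):
        """Apply hand-built decision rules to a set of structural features."""
        if "vertical_stroke" in features:
            if {"hole_upper_right", "hole_lower_right"} <= features:
                return "B"   # two holes stacked on the right side
            if "hole_upper_right" in features:
                return "P"   # a single hole attached on the upper right
        return None          # no rule fired; defer to a fallback classifier

    print(classify({"vertical_stroke", "hole_upper_right"}))  # P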
Many character recognizers are based on mathematical formalisms that minimize a measure of misclassification. These recognizers may use pixel-based features or structural features. Some examples are discriminant function classifiers, Bayesian classifiers, artificial neural networks (ANNs), and template matchers. Discriminant function classifiers use hypersurfaces to separate the featural description of characters from different semantic classes and in the process reduce the mean-squared error. Bayesian methods seek to minimize the loss function associated with misclassification through the use of probability theory. ANNs, which are closer to theories of human perception, employ mathematical minimization techniques. Both discriminant functions and ANNs are used in commercial OCR systems.
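As an illustration of the discriminant function idea, the following sketch (Python with NumPy; the two-class, two-feature data are invented for illustration) fits a linear discriminant by minimizing mean-squared error and classifies by the largest discriminant score:

    import numpy as np

    # Feature vectors (e.g., two structural measurements) and one-hot class targets.
    X = np.array([[0.9, 0.1], [0.8, 0.2],   # class 0 samples
                  [0.1, 0.9], [0.2, 0.7]])  # class 1 samples
    T = np.array([[1, 0], [1, 0],
                  [0, 1], [0, 1]], dtype=float)

    # Augment with a bias term and solve for weights minimizing ||XW - T||^2.
    Xa = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(Xa, T, rcond=None)

    def classify(x):
        scores = np.append(x, 1.0) @ W   # one discriminant score per class
        return int(np.argmax(scores))

    print(classify(np.array([0.85, 0.15])))  # 0
    print(classify(np.array([0.15, 0.80])))  # 1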
Character misclassifications stem from two main sources: poor quality character images and poor discriminatory ability. Poor document quality, image scanning, and preprocessing can all degrade performance by yielding poor quality characters. On the other hand, the character recognition method may not have been trained for a proper response on the character causing the error. This type of error is difficult to overcome because the recognition method may have limitations and all possible character images cannot be considered in training the classifier. Recognition rates for machine printed characters can reach over 99%, but handwritten character recognition rates are typically lower because every person writes differently. This variability often manifests itself as misclassifications. Fig.2 shows several examples of machine printed and handwritten capital O's. Each capital O can easily be confused with the numeral 0, and the number of different styles of capital O's demonstrates the difficulties recognizers must cope with.
Fig.2 Machine printed and handwritten capital O's.
Contextual information can be used in recognition. The number of word choices for a given field can be limited by knowing the content of another field; for example, when recognizing the street name in an address, correctly recognizing the ZIP Code limits the street name choices to a small lexicon. Alternatively, the result of recognition can be postprocessed to correct recognition errors. One postprocessing method is to apply a spelling checker to verify word spelling. Similarly, other postprocessing methods use lexicons to verify word results, or recognition results may be verified interactively with the user. Additional methods to correct or prevent errors using contextual knowledge are state-of-the-art and should appear in commercial systems shortly.
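A minimal sketch of lexicon-based correction (pure Python; the street-name lexicon and the edit-distance strategy are illustrative assumptions):

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i-1][j] + 1,                       # deletion
                              d[i][j-1] + 1,                       # insertion
                              d[i-1][j-1] + (a[i-1] != b[j-1]))    # substitution
        return d[-1][-1]

    def correct(word, lexicon):
        """Replace an OCR output word with the closest lexicon entry."""
        return min(lexicon, key=lambda entry: edit_distance(word, entry))

    streets = ["MAIN ST", "MAPLE AVE", "MARINE DR"]  # lexicon keyed by ZIP Code
    print(correct("MA1N 5T", streets))  # MAIN ST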
Recognition of scripts other than Roman has worldwide interest. There are some 26 different scripts in use today. Little recognition work has been done on some of these scripts, e.g., Kannada, while a significant amount has been done on others, e.g., Japanese. In addition to alphanumerics, Japanese text uses Kanji characters (Chinese ideographs) and Kana (Japanese syllables). Japanese text is therefore conceivably more difficult to recognize because of the size of the character set (usually more than 3,300 characters) and the complexity and similarity of the Kanji character structures (see Fig.3). Low data quality is an additional problem in all OCR systems. A Japanese OCR system is usually composed of two individual classifiers (a pre-classifier and a secondary classifier) in a cascade structure. The pre-classifier first performs a fast coarse classification to reduce the character set to a short candidate list (usually no more than 100 candidates). The secondary classifier then uses more complex features to determine which candidate in the list best matches the test pattern; a sketch of this two-stage scheme follows the figure captions below.
Difficulties in Japanese character recognition:
Fig.3(a) Presence of complex Kanji characters.
Fig.3(b) Many characters share the same lexicographical element.
Fig.3(c) Diverse print qualities (each row is the same character).
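The two-stage cascade can be sketched as follows (pure Python; the distance functions and toy data are illustrative assumptions):

    def cascade_classify(image, reference_set,
                         cheap_distance, fine_distance, shortlist=100):
        """Two-stage classification over a large (e.g., Kanji) character set."""
        # Stage 1: fast coarse classification keeps only a short candidate list.
        candidates = sorted(reference_set,
                            key=lambda ref: cheap_distance(image, ref))[:shortlist]
        # Stage 2: an expensive fine match is run only on the candidates.
        return min(candidates, key=lambda ref: fine_distance(image, ref))

    # Toy demo: "images" are numbers; the distances compare them at two precisions.
    refs = list(range(3000))
    print(cascade_classify(1234.2, refs,
                           cheap_distance=lambda x, r: abs(x - r) // 10,
                           fine_distance=lambda x, r: abs(x - r)))  # 1234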
The output interface allows character recognition results to be electronically transferred into the domain that uses the results. For example, many commercial systems allow recognition results to be placed directly into spreadsheets, databases, and word processors. Other commercial systems use recognition results directly in further automated processing, and when the processing is complete, the recognition results are discarded. In any event, the output interface, while simple, is vital to the commercial success of OCR systems because it communicates results to the world outside of the OCR system.
Modern OCR technology is said to have been born in 1951 with M. Sheppard's invention, GISMO - A Robot Reader-Writer. In 1954, J. Rainbow developed a prototype machine that was able to read uppercase typewritten output at the ``fantastic'' speed of one character per minute. Several companies, including IBM, Recognition Equipment, Inc., Farrington, Control Data, and Optical Scanning Corporation, marketed OCR systems by 1967. During the late 1960's, the technology underwent many dramatic developments, but OCR systems were considered exotic and futuristic, being used only by government agencies or large corporations. Systems that cost one million dollars were not uncommon.
In the early years of OCR many standards were developed to help guide automatic document processing. These standards included:
Standardized fonts:
Fig.4(a) OCR-A font
Fig.4(b) OCR-B font
Fig.4(c) Handwritten font
Today, OCR systems are less expensive, faster, and more reliable. It is not uncommon to find PC-based OCR systems for under $8,000 capable of recognizing several hundred characters per minute. More fonts can be recognized than ever before, and some systems advertise themselves as omnifont - able to read any machine printed font. Less expensive electronic components and extensive research have paved the way for these new systems. With continued commercial demand for OCR systems these trends will continue. Two major selling points are increased productivity through reduced human intervention and the ability to store text efficiently.
Current research areas in OCR include handwriting recognition and form ``reading''. Reliable recognition of handwritten cursive script is now under intense investigation. In addition, research is being conducted in ``reading'' forms, that is, using all available information to formulate an interpretation of the document. For instance, some United States Postal Service research focuses on assigning ZIP Codes to letter images which may not contain any ZIP Code. By understanding the various address fields such an assignment can be made. The use of contextual information in both handwriting recognition and form reading is essential.
Hundreds of OCR systems have been developed since the 1950s, and many are commercially available today. Commercial OCR systems can largely be grouped into two categories: task-specific readers and general purpose page readers. A task-specific reader handles only specific document types. Some of the most common task-specific readers read bank checks, letter mail, or credit card slips. These readers usually utilize custom-made image lift hardware that captures only a few predefined document regions. For example, a bank check reader may scan just the courtesy amount field and a postal OCR system may scan just the address block on a mail piece. Such systems emphasize high throughput rates and low error rates. Applications such as letter mail reading have throughput rates of 12 letters per second with error rates less than 2%. The character recognizer in many task-specific readers is able to recognize both handwritten and machine printed text.
General purpose page readers are designed to handle a broader range of documents, such as business letters, technical writing, and newspapers. These systems capture an image of a document page and separate the page into text regions and non-text regions. Non-text regions such as graphics and line drawings are often saved separately from the text and the associated recognition results. Text regions are segmented into lines, words, and characters, and the characters are passed to the recognizer. Recognition results are output in a format that can be postprocessed by application software. Most of these page readers can read machine written text, but only a few can read hand-printed alphanumerics.
Task-specific readers are used primarily for high-volume applications which require high system throughput. Since high throughput rates are desired, handling only the fields of interest helps reduce processing time. Since similar documents possess similar size and layout structure, it is straightforward for the image scanner to focus on those fields where the desired information lies. This approach can considerably reduce the image processing and text recognition time. Some application areas to which task-specific readers have been applied include:
The address reader in a postal mail sorter locates the destination address block on a mail piece and reads the ZIP Code in this address block. If additional fields in the address block are read with high confidence, the system may generate a 9-digit ZIP Code for the piece. The resulting ZIP Code is used to generate a bar code which is sprayed on the envelope. The flow of mail in a postal address reading and sorting system is shown in Fig.5.
Fig.5 Architecture of postal address reading and sorting system.
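The bar code step can be sketched in Python (assuming the standard POSTNET digit table, in which each digit maps to five bars, two tall and three short, plus a check digit that makes the digit sum divisible by 10):

    # Each digit -> five bars; '|' is a tall (full) bar, '.' a short (half) bar.
    POSTNET = {"0": "||...", "1": "...||", "2": "..|.|", "3": "..||.",
               "4": ".|..|", "5": ".|.|.", "6": ".||..", "7": "|...|",
               "8": "|..|.", "9": "|.|.."}

    def zip_to_barcode(zip_code):
        """Encode a ZIP Code (5 or 9 digits) as a POSTNET-style bar pattern."""
        digits = [d for d in zip_code if d.isdigit()]
        check = str(-sum(map(int, digits)) % 10)    # digit sum + check = 0 mod 10
        bars = "".join(POSTNET[d] for d in digits + [check])
        return "|" + bars + "|"                     # tall frame bars at both ends

    print(zip_to_barcode("14228-2567"))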
The Multiline Optical Character Reader (MLOCR) used by the United States Postal Service (USPS) locates the address block on a mail piece, reads the whole address, identifies the ZIP+4 code, generates a 9-digit bar code, and sorts the mail to the correct stacker. The character classifier recognizes up to 400 fonts and the system can process up to 45,000 mail pieces per hour.
A form reading system needs to discriminate between pre-printed form instructions and filled-in data. The system is first trained with a blank form: it registers those areas on the form where data should be entered. During the form recognition phase, the system uses this spatial information to scan the regions that should contain data. Some readers read hand-printed data as well as various machine written text, without being confused by the form instructions. Some systems can process forms at a rate of 5,800 forms per hour.
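A sketch of the blank-form idea (Python with NumPy; the field coordinates and the simple mask-out approach are illustrative assumptions): ink present in the filled form but absent from the blank form is kept, and only the trained field regions are examined.

    import numpy as np

    def extract_field_data(filled, blank, fields):
        """Drop pre-printed ink and crop the field regions learned from a blank form.

        filled, blank -- boolean images (True = ink) of the same registered form
        fields        -- {name: (row0, row1, col0, col1)} learned during training
        """
        data_only = filled & ~blank   # keep only ink absent from the blank form
        return {name: data_only[r0:r1, c0:c1]
                for name, (r0, r1, c0, c1) in fields.items()}

    blank  = np.array([[1, 1, 1, 0], [0, 0, 0, 0]], dtype=bool)  # pre-printed label
    filled = np.array([[1, 1, 1, 0], [0, 1, 1, 0]], dtype=bool)  # label + entered data
    print(extract_field_data(filled, blank, {"amount": (1, 2, 0, 4)})["amount"].astype(int))
    # [[0 1 1 0]]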
A check reader captures check images and recognizes courtesy amounts and account information on the checks. Some readers also recognize the legal amount on checks and use the information in both fields to cross-check the recognition results. An operator can correct misclassified characters by cross-validating the recognition results against the check image displayed on a system console.
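A sketch of the cross-check between the two amount fields (pure Python; the tiny number-word table and simple grammar are illustrative assumptions):

    # Minimal number-word table; a real reader covers the full legal-amount grammar.
    UNITS = {"ONE": 1, "TWO": 2, "THREE": 3, "FOUR": 4, "FIVE": 5, "SIX": 6,
             "SEVEN": 7, "EIGHT": 8, "NINE": 9, "TEN": 10, "TWENTY": 20,
             "THIRTY": 30, "FORTY": 40, "FIFTY": 50}

    def legal_amount_to_number(words):
        """Parse a simple legal amount such as 'FIFTY THREE AND 20/100'."""
        dollars, cents = words.split(" AND ")
        value = sum(UNITS[w] for w in dollars.split())
        return value + int(cents.split("/")[0]) / 100

    def amounts_agree(courtesy, legal_words, tolerance=0.005):
        return abs(courtesy - legal_amount_to_number(legal_words)) < tolerance

    print(amounts_agree(53.20, "FIFTY THREE AND 20/100"))  # True
    print(amounts_agree(58.20, "FIFTY THREE AND 20/100"))  # False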
In general, a bill processing system is used to read payment slips, utility bills, and inventory documents. The system focuses on certain regions of a document where the expected information is located, e.g., the account number and payment value.
In order to claim revenue from an airline passenger ticket, an airline needs to match three records: the reservation record, the travel agent record, and the passenger ticket. However, it is impossible to match all three records for every ticket sold. Current methods, which rely on manual random sampling of tickets, are far from adequate for claiming the maximal amount of revenue.
Several airlines are using a passenger revenue accounting system to account accurately for passenger revenue. The system reads the ticket number on a passenger ticket and matches it with the one in the airline reservation database. It scans up to 260,000 tickets per day and achieves a sorting rate of 17 tickets per second.
An automated passport reader is used to speed returning American passengers through customs inspections. The reader reads a traveler's name, date of birth, and passport number on the passport and checks these against database records that contain information on fugitive felons and smugglers.
There are two general categories of page readers: high-end page readers and low-end page readers. High-end page readers have more advanced recognition capabilities and higher data throughput than low-end page readers. A low-end page reader usually does not come with a scanner, but is compatible with many flat-bed scanners. Low-end readers are mostly used in office environments with desktop workstations, which are less demanding in system throughput. Since they are designed to handle a broad range of documents, some recognition accuracy has to be sacrificed. Some commercial OCR software allows users to adapt the recognition engine to customer data to improve recognition accuracy. Some high-performance readers can detect typefaces (such as boldface and italic) and output formatted ASCII text in the corresponding style.