Optical Character Recognition Through the Use of a Kohonen Neural Network

Brian Vuyk, student at Redeemer University College
April 12, 2006

Optical Character Recognition (OCR) is a field of computer science that received much attention from the scientific community throughout the latter half of the 20th century. The ability of a computer to read a series of characters, determine their meaning, and take appropriate action based on that meaning was a sought-after goal, due in part to the high commercial value of these techniques.

For example, OCR was deemed to be of great value within the postal system, in which millions of envelopes must typically be read and sorted daily to determine their delivery location. There was also great demand in the corporate world for the ability to digitize older documents so that they could be more easily read and modified. Libraries and other educational institutions wished to digitize old books in order to better store their content. Another use of OCR is to recognize handwriting entered with a stylus in many personal planners and PDAs.1

The usefulness of any OCR method is directly related to its performance. If a document is recognized with few errors, the time humans must spend correcting and perfecting its contents is greatly reduced. OCR performance is typically measured in terms of unrecognized characters, substitution errors, insertion errors, and deletion errors. An unrecognized character is any character that the OCR engine cannot determine, and for which a standard placeholder is inserted. A substitution error occurs when a character is misrecognized and an incorrect character is substituted for it. An insertion error occurs when an extra character is added to a word, such as a ‘w’ decomposing into ‘vv’. A deletion error occurs when a character is missed in the recognition process.2
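The four error types can be tallied by aligning the OCR output against a known-correct transcription. The following is a minimal sketch, not a standard scoring tool: it assumes an edit-distance alignment, and the function name `count_ocr_errors` and the placeholder character `~` are hypothetical choices made for illustration.

```python
def count_ocr_errors(truth, ocr, placeholder="~"):
    """Align OCR output with the true text and tally each error type.

    Assumption: errors are attributed via a minimum edit-distance
    alignment; `placeholder` is a hypothetical unrecognized-character mark.
    """
    n, m = len(truth), len(ocr)
    # dist[i][j] = minimum number of edits turning truth[:i] into ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # character missed
                             dist[i][j - 1] + 1)         # extra character

    counts = {"unrecognized": 0, "substitution": 0,
              "insertion": 0, "deletion": 0}
    # Walk back from the end of both strings along one optimal alignment.
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    if ocr[j - 1] == placeholder:
                        counts["unrecognized"] += 1
                    else:
                        counts["substitution"] += 1
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


print(count_ocr_errors("window", "vvindow"))
```

Under this alignment, the ‘w’ to ‘vv’ example is counted as one substitution plus one insertion. When several alignments are equally short, the attribution of individual errors may differ, so the counts should be read as one reasonable tally rather than the only one.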

OCR systems have used a variety of methods to determine the contents of a digitized page. Most follow a structure similar to the following:

  • Pre-process the image of the page.
  • Locate regions of text on a page.
  • Build bounding boxes around each word or character.
  • Process these boxes to determine their contents.
  • Post-process the results (spell checking, etc.).

The preprocessing required by OCR nearly always includes binarization of the image. Binarization is the process by which an image is reduced to two levels of intensity, typically black and white, with black pixels represented by 0 and white pixels represented by either 255 or 1.
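As a minimal sketch of binarization, the following assumes the image is a list of rows of grayscale values from 0 to 255 and uses a fixed cutoff. The function name `binarize` and the default threshold of 128 are illustrative assumptions; practical systems often choose the threshold from the image itself.

```python
def binarize(gray, threshold=128):
    """Reduce a grayscale image to black (0) and white (255).

    `gray` is a list of rows, each a list of integers in 0..255.
    The fixed `threshold` is an assumption made for illustration.
    """
    return [[0 if pixel < threshold else 255 for pixel in row]
            for row in gray]


sample = [[10, 200],
          [128, 127]]
print(binarize(sample))
```

Pixels darker than the threshold become black and all others become white, so the result contains only the two intensity levels described above.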

Before or after binarization, other preprocessing steps may be applied to reduce extraneous noise, perform normalization, or assist in line finding.3

One feature of text on a page is that it has a characteristic statistical signature by which it can be divided into regions such as words or letters. For example, the lines of text on this page run horizontally. If one were to take an image of this page and count the number of black pixels in each row of pixels, it would quickly become apparent that a clear difference exists between the lines of text and the white space separating them.
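The row-by-row count can be sketched in a few lines. This assumes a binarized image stored as a list of rows, with black as 0 and white as 255; the name `row_profile` is a hypothetical helper, not part of any library.

```python
def row_profile(image):
    """Count the black pixels (value 0) in each row of a binarized image."""
    return [sum(1 for pixel in row if pixel == 0) for row in image]


B, W = 0, 255  # black and white pixel values
page = [
    [W, W, W, W, W],  # white space
    [W, B, W, B, W],  # line of text
    [W, B, B, B, W],  # line of text
    [W, W, W, W, W],  # white space
]
print(row_profile(page))
```

Rows that cross a line of text produce large counts, while rows in the white space between lines produce counts at or near zero.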

If we were to divide the page into strips along these rows of minimal black-pixel count, each strip would contain one row of text. If we were then to analyze each strip in the vertical direction, this time counting the black pixels in each column, we could separate the text into words and characters by splitting the strips again along the columns of minimal count.4
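The two-pass splitting can be sketched as follows. This is an illustration under simplifying assumptions: the image is binarized with black as 0, a "minimal" row or column is taken to be one with no black pixels at all, and the helper names `runs` and `segment` are hypothetical. It does not attempt to tell gaps between words from gaps between characters.

```python
def runs(profile, minimum=0):
    """Return (start, end) spans of consecutive entries above `minimum`."""
    spans, start = [], None
    for index, count in enumerate(profile):
        if count > minimum and start is None:
            start = index                    # a run of ink begins
        elif count <= minimum and start is not None:
            spans.append((start, index))     # the run ended at a gap
            start = None
    if start is not None:
        spans.append((start, len(profile)))  # run reaches the edge
    return spans


def segment(image):
    """Split a binarized image into (top, bottom, left, right) boxes."""
    boxes = []
    # First pass: black pixels per row separate the lines of text.
    row_counts = [sum(1 for p in row if p == 0) for row in image]
    for top, bottom in runs(row_counts):
        strip = image[top:bottom]
        # Second pass: black pixels per column within this strip.
        col_counts = [sum(1 for row in strip if row[x] == 0)
                      for x in range(len(strip[0]))]
        for left, right in runs(col_counts):
            boxes.append((top, bottom, left, right))
    return boxes


B, W = 0, 255  # black and white pixel values
page = [
    [W, W, W, W, W],
    [W, B, W, B, W],
    [W, B, W, B, W],
    [W, W, W, W, W],
    [W, B, B, W, W],
]
print(segment(page))
```

Each box uses half-open ranges, so `image[top:bottom]` restricted to columns `left:right` yields the pixels of one candidate character. On noisy scans, a `minimum` above zero may be needed so that stray specks do not bridge the gaps.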