logo
Published on InfoHatter.com (http://www.infohatter.com)

Optical Character Recognition Through the Use of a Kohonen Neural Network

Brian Vuyk, student at Redeemer University College [1]
April 12, 2006

Optical Character Recognition (OCR) is a field in computer science, which throughout the latter half of the 20th century received much attention from the scientific community. The ability of a computer to read a series of characters, determine their meaning, and take appropriate action based on the interpreted meaning was a goal sought after, due in part to the high value of the commercial applications of these techniques.

For example, OCR was deemed to be of great value within the postal system, in which typically millions of envelopes must be read and sorted daily, to determine their delivery location. There was also great demand in the corporate world to have the ability to digitize older documents, so that they may be more easily read and modified. Within libraries and other educational institutions, there existed a desire to digitize old books, in order to better store their content. Another use of OCR is to determine the handwriting impressed with a stylus in many personal planners and PDAs.1

The usefulness of any OCR method is directly related to it’s performance. If the OCR document has been correctly recognized with few errors, it can vastly reduce any time required by humans to correct and perfect the document contents. OCR performance is typically measured in terms of unrecognized character, substitution errors, insertion errors, or deletion errors. An unrecognized character is considered any character which cannot be determined by the OCR engine, for which a standard placeholder is inserted. A substitution error is when a character is misrecognized, and an incorrect character is substituted instead. And insertion error is caused when when an extra character is added to a word, such as a ‘w’ decomposing into ‘vv’. A deletion error is caused when a character is missed in the recognization process. 2

There have been a variety of different methods used by OCR to determine the contents of a digitized page. Most typical methods of performing OCR generally are structured similar to the following:

The preprocessing required by OCR nearly always required the binarization of the image. Binarization is the process by which an image is reduced to two main levels of intensity, typically black and white, with black pixels represented by 0, and white regions represented by either 255 or 1.

Prior to, or following binarization, other preprocessing methods may or may not occur in order to reduce extraneous noise, perform normalization, or to assist in line finding.3

One feature of text is that, when viewed statistically on a page, it acquires a certain statistical signature through which it can be commonly divided into regions such as words or letters. For example, the lines of text on this page are oriented in a horizontal fashion. If one was to take an image of this page, and to compute the number of black pixels present in each row of pixels in the image, it would become quickly apparent that a clear difference exists between lines of text, and the white space separating them.

If we were to divide the page into strips along these statistically minimal lines, we would find that the result of this method would be that each strip now contains a row of text. If we were to now statistically analyze each of these individual rows of text in a vertical direction, again measuring the number of black pixels in each column, we would be able to separate the text into words and characters by splitting the rows again along the statistically minimal lines4.

The result of this method is that we now have characters contained within individual bounding boxes. Once we are ready to begin to determine that nature of the character within the bounding box, a variety of different techniques present themselves.

The majority of OCR techniques can be somewhat divided into two categories: template matching methods, and structure analysis methods. While many early methods are easily separated into these two categories, more recent methods are a hybrid of the two given.

Template matching is generally performed via two processes: superimposing an input shape on a template, and measuring the degree of coincidence between the input shape and the template. This degree of similarity may be measured by a variety of different methods: overlap, Fourier analysis, or by ‘peephole’ correlation5.

Because of the variations of shape present among them, it can be difficult to create templates for handwritten characters. Therefore certain structural analysis methods have been developed to measure such characteristics such as line count and orientation, or relative thicknesses if in Kanji or other eastern fonts.

Structural Analysis has no defined basis in mathematical principle. It is instead a problem which can be attacked from a variety of different intuitive levels. Seeing as it is possible to break a structure such as a character down into parts, it should then be possible to classify both these parts and the relationships between them.6

A neural network may be used to determine the information held within a character or bounding box. Typical neural network OCR techniques are a hybridization of template matching and structure analysis.

A neural network is a series of interconnected nodes called ‘neurons’ connected by ‘synapses’ in such a way as to somewhat mimic the structure of the human brain. A neural network works by assigning each neuron an ‘activation value’, and each synapse a weight. The activation value of a neuron is a value which typically lies between a pair of set thresholds at which point the neuron will fire, and produce an output of a certain value onto the rest of the network. The weight assigned to a synapse is typically a number signifying the amount of resistance to information passing across it.7

There are multiple types of neurons, classified by their transfer functions or summation functions. The transfer function is the function by which an input is processed while firing. The most common transfer function is the sigmoid function, shown below. Other common transfer functions include the hyperbolic tangent, a linear function, sine function and others.

The summation function deals with how a neuron treats its inputs before processing. Because a neuron is often receiving input from multiple sources, how it handles the input greatly affects it's output. The most common method is to sum the inputs before applying the transfer function. Other common methods include obtaining the min or max input value, or to apply a bitwise AND, OR, or XOR operations to the inputs.8

A key feature of a neural network is that it can be taught. By providing a set of sample inputs to the neural network, it can train itself to improve it’s output by continually modifying the weights of it’s synapses to output a proper value for the input given.

Neuron networks have proved to be quite useful for pattern matching and classification. Because of their self-training nature, once a neural network is set up, it can recognize a wide range of objects if the input given to them falls within their input bounds. In this project, I have used a Kohonen neural network to perform the recognition given an image containing a single character.

A Kohonen neural network’s purpose by design is to take a large number of inputs, and place them into a map of single or dual dimensions. In image recognition terms, the input could be a set of pixel values, while the output would be a single output neuron firing representing a certain letter corresponding to the input image.

When designing my neural network-based OCR algorithm, I discovered an online book written by Jeff Heaton in which he describes neural networks in detail, and provides by way of example, an algorithm and implementation for a neural network.9 His algorithm can be described by the following steps:

  1. 1.Find the bounding box around the character
  2. 2.Crop the image around the bounding box
  3. 3.Down sample the image to a 5x7 grid
  4. 4.send the 35 values (after normalization) to the neural network for processing

Due to the fact that implementing a neural network falls well outside the scope of the course, I have relied heavily on Heaton's source code for the actual neural net functions. Some core Kohonen network classes in my sources have been written by Heaton, and are clearly marked as such. These classes fall under a non-commercial use license.

The inputs specified for this algorithm are a binary image free of any noise. This unfortunately removes any real world examples. To create samples for testing, I used the paintbrush tool in the GIMP to create simple letter-shapes for the program to recognize.

The output of this algorithm is a letter.

To find the bounding boxes, I again followed a n algorithm suggested by Heaton. To find the top of the bounding box, simply move down row by row until you encounter the first pixel. The previous row is the top of your bounding box. This is repeated on all four sides until a box is found around the character.

The second step of the algorithm involves cropping the image about the bounding box. This was a trivial step, and accomplished with the cropping code I have written earlier in the semester.

Down sampling the image proved to be an interesting chore. The first step was to break up the image into 35 separate quadrants. This was accomplished by dividing the width of the cropped image by the number of cells we want across (in this case 5). This yielded a ratio of pixels to a quadrant width. This same calculation was performed to find a ratio of pixels to a quadrant height (7 cells). These number were then used to step through each pixel in the quadrant to determine if a pixel existed within the quadrant. If a pixel was found, then the quadrant was considered occupies.

The data was stored in a 35-position boolean array, with true indicating that the quadrant was occupied. This data was then copied into a double array, and passed into the neural net for processing.

Before the neural net could begin processing the data, it had to be trained. I was able to obtain a set of sample data for the first 26 capital letters of the alphabet. This data was used to quickly calibrate the array before processing.

Once the neural net delivered the index of the output neuron which fired for the input data, I had to determine which letter corresponded to the node which had just fired. This was accomplished by feeding the sample data back into the net until I located a value which consistently caused the same output neuron to fire. I then retrieved the letter which was stored in the sample data file with the data. This letter was then returned to the user via a JOptionDialogue.

The Kohonen network as it is implemented in the CSC371 final project consists of two layers. The first layer is the input layer, composed of sigmoids, which sum the input before the transfer function. This layer contains 35 neurons, one for each element in the input data. The second layer consisted of 26 output neurons, one for each of the possible letter-outcomes. The output layer consisted of sum-based sigmoids similar to the input layer.

The input layer was connected to the output layer by a full synapse. This means that every node in the input layer was connected to every node in the output layer by a one-way link.

In tests, it appears that the neural network performs quite well when presented with relatively neat handwritten characters. When presented with more sloppily formed characters, the recognition capabilities of the neural network began to fall off a bit. An interesting item to notice is that in many cases where the neural net gives a wrong result, visual inspection of the data often shows a similarity between the erroneous guess of the neuron network and the visual appearance of the letter. In one case in particular, the neural network nearly always incorrectly recognizes a poorly formed 'B' as an 'S'. The form of an 'S' can be seen somewhat in the sloppy shape of the 'B'.

While it is true that a neural network has some shortcomings due to stringent input formats, I believe that it can be of great use in pattern matching, especially in OCR. Due to the statistical nature of forming guesses based on the training set, a neural network performs surprisingly well with mild to moderate variations of the input character.

Footnotes
  1. Udo Miletzki, "Character Recognition in Practice Today and Tomorrow," ICDAR, p. 903, Fourth International Conference Document Analysis and Recognition (ICDAR'97), 1997.
  2. W. Cushman, P. Ojha, and C. Daniels, "Usable OCR: What are the Minimum Performance Requirements?", Proc. ACM SIGCH11990.
  3. G. Nagy, At the Frontiers of OCR, Proc. IEEE, Vol. 80, No. 7, pp. 1094, 1992.
  4. Jaekyu Ha, R.M. Haralick, I.T. Phillips, "Document page decomposition by the bounding-box project," ICDAR, p. 1119, Third International Conference on Document Analysis and Recognition (ICDAR'95) - Volume 2, 1995.
  5. S. Mori, C.Y. Suen, and K. Yamamoto, Historical Review of OCR Research and Development, Proc. IEEE, vol. 80, no. 7, pp. 1,030, July 1992.
  6. Mori, Suen, and Yamamoto 1992 pg. 1032
  7. http://www.heatonresearch.com/articles/2/page2.html Neural Network Structure (Accessed April 12, 2006)
  8. http://www.dacs.dtic.mil/techs/neural/neural2.html, Artificial Neural Networks Technology (Accessed April 12, 2006)
  9. http://www.heatonresearch.com/articles/series/1/ Introduction to Neural Networks is Java (Accessed April 11, 2006)
Bibliography
  1. Udo Miletzki, "Character Recognition in Practice Today and Tomorrow," ICDAR, p. 902, Fourth International Conference Document Analysis and Recognition (ICDAR'97), 1997.
  2. W. Cushman, P. Ojha, and C. Daniels, "Usable OCR: What are the Minimum Performance Requirements?", Proc. ACM SIGCH11990.
  3. G. Nagy, At the Frontiers of OCR, Proc. IEEE, Vol. 80, No. 7, pp. 1093-1100, 1992.
  4. Jaekyu Ha, R.M. Haralick, I.T. Phillips, "Document page decomposition by the bounding-box project," ICDAR, p.1119, Third International Conference on Document Analysis and Recognition (ICDAR'95) - Volume 2, 1995.
  5. S. Mori, C.Y. Suen, and K. Yamamoto, Historical Review of OCR Research and Development, Proc. IEEE, vol. 80, no. 7, pp. 1,029-1,058, July 1992.
  6. http://www.dacs.dtic.mil/techs/neural/neural2.html, Artificial Neural Networks Technology (Accessed April 12, 2006)
  7. http://www.heatonresearch.com/articles/series/1/ Introduction to Neural Networks is Java (Accessed April 11, 2006)

Source URL:
http://www.infohatter.com//optical_character_recognition_through_the_use_of_a_kohonen_neural_network