Simply put, Optical Character Recognition (OCR) is the art of extracting text from images or documents of any form. OCR can extract text from a scanned document or from a photograph of a document.


CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. In other words, a CAPTCHA determines whether the user is a real person or a spam robot. Below is what one form of captcha looks like:





To read the captcha we will use a utility called OCRopus. OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do some image preprocessing, and possibly also train new models. It provides utilities that help in reading the images.


OCRopus reads an image successfully once it has been passed through the process of

Binarization > Page Layout Analysis > Recognition

or, sometimes, we need to train the model to recognise parts of the image; in that scenario the flow changes to:

Binarization > Page Layout Analysis > Training > Recognition

The utilities used for the above process are mentioned below:

  • Binarization: ocropus-nlbin
  • Page Layout Analysis: ocropus-gpageseg
  • Training: ocropus-rtrain
  • Recognition: ocropus-rpred



In order to install OCRopus, fire up your terminal and issue the commands below:

$ git clone https://github.com/ocropus/ocropy
$ cd ocropy
$ sudo apt-get install $(cat PACKAGES)
$ wget -nd https://github.com/zuphilip/ocropy-models/raw/master/en-default.pyrnn.gz
$ mv en-default.pyrnn.gz models/
$ sudo python setup.py install



For our test we will try to read the image below (4.png). Although it seems simple, remember that it has a lot of black-and-white noise, and if you are unaware, this noise and the extra characters serve the purpose of defeating OCR solutions.



  • The first step is Binarization, in which OCRopus converts the grayscale image to black and white.
ocropus-nlbin -n images/4.png -o images/out

The result will be two PNG files in the out directory, as specified by the -o switch:

  • images/out/0001.bin.png    (Useful flattened image) [top]
  • images/out/0001.nrm.png   (Not so useful, non-flattened image) [bottom]
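Conceptually, binarization is just thresholding: every grayscale pixel brighter than a cutoff becomes white, everything else black. The toy sketch below (plain Python lists standing in for an image) shows only the core idea; ocropus-nlbin additionally normalizes and deskews the page and estimates its threshold adaptively.

```python
# Toy binarization: threshold a grayscale "image" (rows of 0-255 intensities).
# This illustrates the idea behind the binarization step only, not the
# actual algorithm inside ocropus-nlbin.
def binarize(image, threshold=128):
    return [[255 if px > threshold else 0 for px in row] for row in image]

gray = [
    [ 30, 200, 180],
    [120, 250,  10],
]
print(binarize(gray))  # → [[0, 255, 255], [0, 255, 0]]
```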




  • The second step is Page Layout Analysis, in which OCRopus tries to find the individual lines of text. You might need to play with the scale value to get a correct segmentation.
ocropus-gpageseg -n --scale 45 images/out/0001.bin.png

This will produce two files as output (the numbers could differ):

  • out/0001.pseg.png
  • out/0001/010001.bin.png
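A classical way to find individual text lines, sketched below, is a horizontal projection profile: mark every pixel row that contains ink, then cut the page wherever a run of inked rows ends. ocropus-gpageseg is considerably smarter than this, but the underlying idea is similar.

```python
# Toy line segmentation via a horizontal projection profile.
# A binary image is rows of pixels where 0 = black ink, 255 = background;
# consecutive rows containing ink are grouped into (start, end) line bands.
def find_line_bands(binary):
    ink = [any(px == 0 for px in row) for row in binary]
    bands, start = [], None
    for y, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # the line ended on the previous row
            start = None
    if start is not None:                  # a line running to the bottom edge
        bands.append((start, len(ink)))
    return bands

img = [
    [255, 255, 255],
    [255,   0, 255],   # text line 1
    [255, 255, 255],
    [  0,   0, 255],   # text line 2
]
print(find_line_bands(img))  # → [(1, 2), (3, 4)]
```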



  • The third step, Recognition, involves reading the characters from the image. Here we provide our downloaded English model via the -m switch:
ocropus-rpred -n -m models/en-default.pyrnn.gz images/out/0001/010001.bin.png


Unfortunately, this did not give us the correct output. It was able to recognize only some parts of the image; as you can see in the red box, it says 234zV.



Let's move back a step and try to teach our model the correct characters, heading back to the Training phase. In order to train the model, you need to create a file in the folder from the last step, where your 010001.bin.png resides.

The file should contain what the captcha says, i.e. 263S2V

The file name should be the same as our image's, with an extension of .gt.txt
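As a small sketch, the ground-truth file can also be written from Python (the paths below are the ones used in this walkthrough; adjust them to your own output directory):

```python
from pathlib import Path

# The .gt.txt file must sit next to the segmented line image and share its
# base name: 010001.bin.png -> 010001.gt.txt. Paths follow this walkthrough.
line_image = Path("images/out/0001/010001.bin.png")
gt_file = line_image.with_name("010001.gt.txt")
gt_file.parent.mkdir(parents=True, exist_ok=True)  # ensure the folder exists
gt_file.write_text("263S2V\n")                     # what the captcha says
```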


Now let's start the training with ocropus-rtrain. Here we have to provide OCRopus with what is called truth data.

ocropus-rtrain -o models/testmodel images/out/0001/010001.bin.png


Here we are saving our model in the models directory as testmodel with the -o switch.

You should train your model with lots of similar images; this is performed with just a single image for demonstration purposes only.

In the output you will see three headings labelled:

  • TRU: the truth data (ground-truth transcript)
  • ALN: the truth transcript aligned to the model output
  • OUT: the raw output from the model
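To see how far OUT still is from TRU, you can compute a character edit (Levenshtein) distance by hand; training has converged on this sample once the distance reaches zero. This is just a helper sketch, not something ocropus-rtrain prints itself.

```python
# Levenshtein distance: minimum number of single-character insertions,
# deletions, and substitutions needed to turn string a into string b.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution/match
        prev = cur
    return prev[-1]

print(edit_distance("263S2V", "234zV"))   # → 3 (before training)
print(edit_distance("263S2V", "263S2V"))  # → 0 (TRU and OUT match)
```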


Keep it running for a while as it learns the characters. After every 1000 iterations it saves the model under the filename specified; you can change that behavior with the -F switch.


After a while, the output in TRU and OUT became the same. Let's try to run the recognition again, providing our model name:

ocropus-rpred -n -m models/testmodel-00001200.pyrnn.gz images/out/0001/010001.bin.png


And now it recognizes the characters just fine.


Remember, I performed this demonstration with just a single image to explain the steps. You should train your models with multiple images.