Combining Optical Character Recognition and Object Detection for Document Processing

Let’s face it: processing documents is tedious and paperwork is boring. Computer vision can help us do less of it!

Confession: I have terrible handwriting. I attribute part of the blame to my childhood, when I learned to write cursive in England and then had to switch back to non-cursive in Sri Lanka, where cursive is uncommon. It doesn’t help that my ADHD tends to make my writing hurried and careless as well. Whenever someone complains, I like to joke that my brain moves too fast for my hand to keep up. 😉

The field of computer vision has seen tremendous development in the recent past, leading to a host of practical applications in a wide variety of industries and use cases. Optical Character Recognition (OCR) is one such application, with the potential to automate many tedious but necessary tasks. OCR technology can be used to process digital documents (PDFs, scanned documents, images of documents, and the like) far more efficiently than humans can. In a nutshell, OCR can “read” a document and convert images of text into actual text. Current state-of-the-art algorithms are capable of near-flawless recognition of printed text, with handwriting recognition not too far behind (as long as the handwriting isn’t somewhere between a child’s scrawl and a doctor’s note, like mine).

Optical Character Recognition (OCR)

To get a quick taste of OCR, let’s take a look at some examples using Tesseract.

Tesseract is a popular, open-source package that can be used for OCR. The LSTM-based OCR engine added in Tesseract 4 significantly improves upon the performance of the previous versions, chalking up another big win for deep learning. There’s a Python wrapper for Tesseract named Python-tesseract (duh), which is fairly straightforward to work with. The package can be installed easily by following the installation instructions given in the link.

Once everything is installed, you can use the script below to test it out. Keep in mind that the accuracy of Tesseract will depend heavily on the quality of the image.
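A minimal sketch of such a test script is given here, assuming the `pytesseract` and `Pillow` packages and the Tesseract binary are installed. The function names and the clean-up step are my own illustration rather than part of Python-tesseract, and the image path you pass in is whatever test image you have at hand.

```python
def transcribe_image(image_path, lang="eng"):
    """Run Tesseract on an image file and return the raw transcript."""
    # Imported lazily so the clean-up helper below works without OCR installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def clean_transcript(text):
    """Strip surrounding whitespace from each line and drop blank lines."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Calling `clean_transcript(transcribe_image(path))` on a clear scan of printed text should give a tidy transcript; on a blurry or low-resolution photo, expect the quality to drop.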


Tesseract transcript:


Tesseract transcript:

As you can see, Tesseract does an excellent job of detecting and transcribing the text in the images. The transcribed text can then be used for many different purposes as required. For example, it can be fed into natural language processing models to perform classification, named entity recognition, question answering, and other NLP tasks.

While it is undoubtedly useful to be able to transcribe a whole image, sometimes we are only interested in a part of the image (think fields in an application form). Also, OCR models can struggle when given a complicated document with lots of formatting, lines, and fields (again, think application form). Even if the OCR model manages to transcribe a complicated document accurately, it is unlikely to preserve the formatting, making it difficult to identify which part of the text is which. Consider the example given below.

Example application form:

Tesseract transcription:

Not nearly as impressive anymore. There’s more than one way to skin a cat (as they say) and there’s definitely more than one way to get around this issue. One such method we’ve used with great success is to partition the image into sub-images and only use the parts that we are interested in. The partitioning can be achieved using another computer vision technique, object detection.

Object Detection

The name does give everything away when it comes to Object Detection. Unsurprisingly, this is a computer vision technique which deals with detecting (and identifying) objects in images. Unless you’ve been living under a particularly large rock (or you are new to machine learning in which case you are forgiven), I’m sure you are aware that deep learning has changed the game here as well.

Mask R-CNN is a deep learning-based computer vision model that has shown great success with object detection tasks. Facebook’s Detectron 2 library grew out of the Mask R-CNN work and can be used to perform object detection (among other things) quickly, easily, and accurately. It is fairly straightforward to fine-tune a Detectron 2 model on a custom dataset, and the power of transfer learning means that we can expect good performance even with tiny datasets. In transfer learning, we take a model that has already been trained on a (typically large) dataset and fine-tune it on a new, usually much smaller, dataset.
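To make the fine-tuning idea concrete, here is a rough sketch of a Detectron 2 setup for a custom dataset, assuming the `detectron2` package is installed and the annotations are in COCO format. The choice of a Faster R-CNN base config (enough for bounding boxes), the solver values, and the function names are assumptions for illustration, not the configuration we actually used.

```python
import os

# Assumed base model from the Detectron 2 model zoo (box detection only).
BASE_CONFIG = "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"


def build_finetune_cfg(train_name, annotations_json, image_dir, num_classes):
    """Register a COCO-format dataset and return a config that starts from
    COCO pre-trained weights (this is the transfer learning step)."""
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.data.datasets import register_coco_instances

    register_coco_instances(train_name, {}, annotations_json, image_dir)

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(BASE_CONFIG))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(BASE_CONFIG)
    cfg.DATASETS.TRAIN = (train_name,)
    cfg.DATASETS.TEST = ()
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = num_classes
    # Illustrative values only; tune them for your own dataset.
    cfg.SOLVER.IMS_PER_BATCH = 2
    cfg.SOLVER.BASE_LR = 0.00025
    cfg.SOLVER.MAX_ITER = 500
    return cfg


def train(cfg):
    """Fine-tune the pre-trained model described by cfg."""
    from detectron2.engine import DefaultTrainer

    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()
```

Only the final prediction layers need to learn the new classes from scratch; everything else starts from features already learned on COCO, which is why a small dataset can go a long way.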

Detecting checkboxes

As an example, we used a Detectron 2 model to detect checkboxes and identify whether or not they were ticked (ticked checkboxes are defined as one class, while unticked checkboxes are defined as another). Despite using a tiny (in deep learning terms) dataset to fine-tune, our model is yet to miss or misclassify a single checkbox on the documents we have tested on so far.
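The step from raw detections to ticked/unticked answers can be sketched as follows. In Detectron 2, the boxes, class ids, and scores would come from the `pred_boxes`, `pred_classes`, and `scores` fields of the predicted instances; here they are plain Python values. The class ids, the `Detection` type, and the score threshold are assumptions made for this example.

```python
from dataclasses import dataclass

# Assumed class ids; they depend on how the training data was labelled.
CLASS_NAMES = {0: "unticked", 1: "ticked"}


@dataclass
class Detection:
    box: tuple       # (x0, y0, x1, y1) in pixels
    class_id: int    # key into CLASS_NAMES
    score: float     # model confidence between 0 and 1


def summarize_checkboxes(detections, min_score=0.5):
    """Drop low-confidence detections and return (box, is_ticked) pairs."""
    kept = [d for d in detections if d.score >= min_score]
    return [(d.box, CLASS_NAMES[d.class_id] == "ticked") for d in kept]
```

Treating the two states as separate classes means a single detection pass both finds each checkbox and tells us its state.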

Extracting information

Another example is in situations where we need only a particular piece of information from a document. Here, we would train the Detectron 2 model to detect the part of the document we are interested in and simply crop out the rest of the image. Again, Detectron 2 performs flawlessly despite being fine-tuned on relatively small datasets.

Once the image is cropped using Detectron 2, we use Tesseract to transcribe the cropped image, which now contains only the text that we are interested in. As a bonus, Tesseract OCR is significantly more reliable on the cropped image, as it is far simpler than the original and does not contain the stray lines and formatting that tend to mess with OCR accuracy.
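A rough sketch of the crop-then-transcribe step, assuming the bounding box is given as `(x0, y0, x1, y1)` in pixels and the image is a Pillow image. The small padding margin and the `--psm 7` setting (which tells Tesseract to treat the image as a single line of text) are my assumptions for the example, not necessarily the settings we used.

```python
import math


def padded_box(box, image_size, pad=4):
    """Expand a box by pad pixels on every side, clamped to the image bounds."""
    x0, y0, x1, y1 = box
    width, height = image_size
    return (
        max(0, math.floor(x0) - pad),
        max(0, math.floor(y0) - pad),
        min(width, math.ceil(x1) + pad),
        min(height, math.ceil(y1) + pad),
    )


def transcribe_region(image, box, pad=4):
    """Crop the detected region out of a Pillow image and run Tesseract on it."""
    import pytesseract  # imported lazily; requires the Tesseract binary

    region = image.crop(padded_box(box, image.size, pad))
    return pytesseract.image_to_string(region, config="--psm 7").strip()
```

The padding is there because a box that hugs the text too tightly can clip the edges of characters, which tends to hurt recognition.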

Cropped image:

Alternate telephone

Tesseract transcription:

Alternate telephone

When the image is cropped and all the complicated formatting is removed, Tesseract can accurately identify the text which it had previously misidentified (as ‘Aernate telephone’).

Finding out who’s who

Combining OCR with Object Detection is highly useful when we need to extract particular pieces of information, but it can still be difficult to determine what is what (determining which piece of text is the answer to which question for example). A similar problem exists in the checkbox example where it is quite straightforward to find and identify checkboxes, but decidedly less straightforward to determine what the checkboxes relate to. It’s rarely useful to know that there are 5 ticked checkboxes and 3 unticked checkboxes in a document, without the context of what the checkboxes indicate.

One solution to this problem is to use the relative positions of whatever objects we are interested in to define a template for each document. We find that this is easy to implement and easy to generalize to new documents, as we just need to define a single template for each new type of document. The template is defined based on the relative positions of the objects to each other and to an anchor point on the document (we use the top-left corner). Note that we don’t use absolute distances; we rely on relative positions instead. For example, a template can specify a certain number of rows of objects in the image and the number of objects in each row. This is done so that the technique keeps working regardless of the scale of the image, and tolerates some rotation or skew.

The bounding boxes for the objects can be obtained from the Detectron 2 model and these can be used to calculate the predicted positions of the objects on the document. By comparing this information to the relevant template, we can automatically determine which checkbox is related to which question in the document, and then use the predicted class to determine the actual answer to the question. An identical approach can be used when it is necessary to extract multiple pieces of text from a document using the object detection + OCR technique.
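The matching step can be sketched as follows, assuming a template is simply a list of rows, with each row listing its field names from left to right. The field names in the usage note are hypothetical, and this is an illustration of the idea rather than our production code.

```python
def box_center(box):
    """Return the (x, y) center of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def assign_to_template(detections, template):
    """Map detected objects to template fields using relative order only.

    detections: list of (box, value) pairs, in any order.
    template:   list of rows; each row is a list of field names, left to right.
    """
    expected = sum(len(row) for row in template)
    if len(detections) != expected:
        raise ValueError(f"expected {expected} objects, got {len(detections)}")

    # Order top to bottom, then slice off one template row at a time.
    by_y = sorted(detections, key=lambda d: box_center(d[0])[1])
    result, start = {}, 0
    for row in template:
        chunk = by_y[start:start + len(row)]
        # Within a row, order left to right.
        chunk.sort(key=lambda d: box_center(d[0])[0])
        for name, (_, value) in zip(row, chunk):
            result[name] = value
        start += len(row)
    return result
```

For a form with a yes/no pair on the first row and a single checkbox on the second, the template would be `[["citizen_yes", "citizen_no"], ["newsletter"]]`. Because only the ordering of the objects is used, never their pixel distances, the same template works at any image resolution, and it should hold up under mild skew as long as the rows do not start to overlap vertically.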


In this article, I’ve discussed some potential applications of computer vision techniques related to the processing of documents. The techniques shown here can be used to automate tedious paperwork, saving significant time and effort. This is particularly effective on documents containing mainly printed or typed text, but it can also be used with handwritten documents by utilizing handwriting recognition algorithms.
You can see some of the cool stuff we are doing in the real world by using these techniques at the link here.
I am a consultant in Deep Learning and AI-related technology. As part of the Deep Learning Research team, we work towards making AI accessible to small businesses and big tech alike. This article is aimed at sharing our knowledge.
