
Text Extraction via Optical Character Recognition

July 15, 2021

Machine learning (ML)-based document processing is becoming an essential component of unstructured data analytics. In this blog we dive into modern optical character recognition (OCR), discussing big tech and startup solutions for converting unstructured data sources such as documents, images, and videos into structured data ready for analytics to drive insights.

In Figure 1 below we show a high-level general architecture of a natural language processing (NLP) task. There are two streams of data flow in document-based NLP analytics:

  1. Structured textual data (such as already digitized text, JSON, XML, etc.) that goes directly to NLP, or
  2. Unstructured textual information embedded in documents (such as PDF, images, and videos) that first goes through an OCR module to extract the text from these unstructured documents.
Figure 1: High-level NLP analytics pipeline from both structured and unstructured sources. (Credit)

The OCR Pipeline

OCR is used to convert text embedded in scanned documents, images, or videos into a format that is easily editable, searchable, and ready for downstream NLP analytics tasks. It requires a combination of computer vision (CV) modules, recognition (ML) modules, and text modules to extract the text into a readily usable structured form (Figure 2). The two main components of the CV modules are the preprocessing and segmentation stages. These two components are used to prepare the input for an OCR engine, while the postprocessing stage is used to convert the raw text into structured text output.

Figure 2.   A high-level generalized OCR pipeline with the computer vision modules, ML-based or traditional feature extraction and character/word recognition, and text processing modules.

Real-world scanned documents and text embedded in images and videos are not readily available for applying OCR directly. Most often, the input needs to pass through several preprocessing stages before OCR is applied. Documents may be in various orientations, deformations can occur during scanning, and noise may be introduced by the scanning cameras and/or the environment in which the scanning happened.

Stage 1: OCR Preprocessing

The input scanned documents or images are usually not in an ideal size, shape, and orientation. Several preprocessing stages need to be applied to increase the overall recognition accuracy of the OCR system. This is the most crucial stage next to the OCR engine itself. Among the most commonly applied preprocessing steps are cropping, alignment correction, distortion correction, binarization, and denoising (filtering noise out) of the input document or image.


Cropping

Cropping the relevant region of interest (ROI) that contains the text is the first step in preprocessing for OCR. Automatic cropping may be achieved by training a dedicated cropping model, or by using existing OCR engines together with heuristics and image processing to detect the rough boundaries of all detected text in the image. Figure 3 shows an example of cropping a receipt from the background.
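As a minimal illustration of the bounding-box idea behind cropping, the sketch below finds the extent of dark pixels with plain NumPy. It assumes dark text on a light background and a fixed threshold; a real pipeline would use a trained detector or OCR-driven heuristics instead.

```python
import numpy as np

def crop_text_roi(gray, threshold=128, margin=2):
    """Crop the bounding box around dark (text) pixels of a grayscale image.
    Assumes dark text on a light background; real pipelines use a trained
    detector or OCR-driven heuristics rather than a fixed threshold."""
    mask = gray < threshold                  # foreground (text) pixels
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing text
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing text
    if rows.size == 0:
        return gray                          # nothing detected: keep input
    r0, r1 = max(rows[0] - margin, 0), min(rows[-1] + margin + 1, gray.shape[0])
    c0, c1 = max(cols[0] - margin, 0), min(cols[-1] + margin + 1, gray.shape[1])
    return gray[r0:r1, c0:c1]

# Toy example: white 10x10 "page" with a dark 2x3 block of "text"
page = np.full((10, 10), 255, dtype=np.uint8)
page[4:6, 3:6] = 0
print(crop_text_roi(page, margin=0).shape)  # (2, 3)
```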

Alignment Correction

Most scanned documents and text source images are not properly aligned, and this significantly affects the accuracy of the OCR result. Techniques ranging from simple histogram projection (see Figure 4) to model-based alignment correction may be needed, depending on the complexity of the input document.
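The histogram-projection idea can be sketched as follows: for each candidate angle, project the foreground pixels onto skew-corrected rows and keep the angle whose projection profile shows the sharpest line peaks. This is a simplified NumPy sketch that uses a small-angle shear in place of a true rotation.

```python
import numpy as np

def estimate_skew(binary, angles=np.linspace(-5, 5, 101)):
    """Estimate the skew angle (degrees) of a binarized text image.

    Projection-profile method: the horizontal projection of a correctly
    aligned page has sharp peaks at the text lines, so we pick the
    candidate angle that maximizes the variance of that profile."""
    ys, xs = np.nonzero(binary)                     # foreground pixel coords
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        # Project each pixel onto its skew-corrected row index
        rows = np.round(ys - xs * np.tan(np.radians(angle))).astype(int)
        profile = np.bincount(rows - rows.min())
        score = profile.var()                       # sharp peaks -> high variance
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Synthetic page: text lines every 10 rows, skewed by 2 degrees
h, w, a = 120, 300, 2.0
y, x = np.mgrid[0:h, 0:w]
img = ((y - np.round(x * np.tan(np.radians(a)))) % 10 < 2)
print(round(float(estimate_skew(img)), 1))  # close to the injected 2-degree skew
```

Once the angle is estimated, the image would be rotated back by that amount (e.g., with an image library's rotate function) before further processing.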

Distortion Correction

Several factors can create distortion in the scanned image or document that contains the text of interest. Geometric rectification methods such as trapezoidal distortion correction and line-straightening corrections may be needed before sending the input to the OCR engine (see Figure 5).
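At its core, trapezoidal correction is a perspective (homography) transform. The sketch below solves the 3x3 homography from four corner correspondences with plain NumPy; libraries such as OpenCV expose the same computation (e.g., as getPerspectiveTransform). The corner coordinates are made up for illustration.

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve the 3x3 homography H that maps 4 src points to 4 dst points,
    the core of trapezoidal distortion correction."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, pt):
    """Apply H to a 2-D point via homogeneous coordinates."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)

# Corners of a photographed page (trapezoid) -> upright rectangle
src = [(50, 40), (520, 60), (560, 420), (30, 400)]
dst = [(0, 0), (500, 0), (500, 400), (0, 400)]
H = perspective_matrix(src, dst)
u, v = warp_point(H, (50, 40))
print(f"({abs(u):.1f}, {abs(v):.1f})")  # (0.0, 0.0)
```

In practice the full image is resampled through the inverse of H so every pixel, not just the corners, lands on the rectified page.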

Contrast and Sharpness Correction

Increasing the contrast of the text foreground over the background can increase OCR accuracy. Similarly, sharpening the edges of the text can help in text segmentation both before and during OCR. Histogram-equalization-based adaptive contrast enhancement is usually applied in OCR.
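As a rough sketch of the idea, global histogram equalization can be written in a few lines of NumPy. Production pipelines typically use the adaptive, tile-based variant (CLAHE), but the remapping principle is the same.

```python
import numpy as np

def equalize_hist(gray):
    """Global histogram equalization for an 8-bit grayscale image.
    (Adaptive variants such as CLAHE equalize per tile instead, which
    handles uneven illumination better.)"""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    # Remap gray levels so the output histogram is approximately uniform
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255), 0, 255)
    return lut.astype(np.uint8)[gray]

# Low-contrast "scan": gray levels squeezed into [100, 140]
low = np.random.default_rng(0).integers(100, 141, size=(64, 64)).astype(np.uint8)
out = equalize_hist(low)
print(int(out.min()), int(out.max()))  # 0 255
```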

Denoising and Deblurring

Reducing blur and noise in the source document image also improves the accuracy of text recognition by the OCR engine. Gaussian, median, and bilateral filtering can be applied to remove or reduce Gaussian blur, salt-and-pepper noise, and general noise, respectively. General denoising and deblurring can also be achieved using deep denoising autoencoder models. Figure 6 shows the effect of removing ISO noise introduced by image sensor imperfections.
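For instance, a 3x3 median filter, which is well suited to salt-and-pepper noise, can be sketched with NumPy alone:

```python
import numpy as np

def median_filter3(img):
    """3x3 median filter: replaces each pixel with the median of its
    neighborhood, which removes isolated salt-and-pepper outliers while
    preserving edges better than simple averaging."""
    padded = np.pad(img, 1, mode='edge')
    # Stack the 9 shifted views covering each pixel's 3x3 neighborhood
    stack = np.stack([padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
                      for dy in range(3) for dx in range(3)])
    return np.median(stack, axis=0).astype(img.dtype)

# Gray page with a few salt (255) and pepper (0) pixels
img = np.full((20, 20), 128, dtype=np.uint8)
img[5, 5], img[10, 12] = 255, 0
clean = median_filter3(img)
print(int(clean[5, 5]), int(clean[10, 12]))  # 128 128
```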


Binarization

The last preprocessing step is image binarization, which renders the foreground text on a constant-color background. Binarization may be achieved by turning the foreground text white on a black background or vice versa. Figure 7 shows a comparison of simple global thresholding with an adaptive thresholding approach to binarizing the original scanned image.

Figure 7. Comparison of simple global thresholding with adaptive thresholding to binarize an input scanned image.
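The difference between the two approaches can be sketched in NumPy: a single global threshold versus a per-pixel threshold derived from the local mean (similar in spirit to OpenCV's adaptiveThreshold with the mean method). The synthetic page below has an illumination gradient, which is exactly the case where a global threshold fails.

```python
import numpy as np

def global_threshold(gray, t=128):
    """Binarize with one threshold for the whole image."""
    return (gray > t).astype(np.uint8) * 255

def adaptive_mean_threshold(gray, block=15, c=10):
    """Binarize each pixel against the mean of its local block minus a
    constant c, which copes with uneven illumination across the page."""
    pad = block // 2
    padded = np.pad(gray.astype(float), pad, mode='edge')
    # Integral image: local block sums in O(1) per pixel
    ii = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h, w = gray.shape
    local_sum = (ii[block:block + h, block:block + w]
                 - ii[:h, block:block + w]
                 - ii[block:block + h, :w]
                 + ii[:h, :w])
    local_mean = local_sum / (block * block)
    return (gray > local_mean - c).astype(np.uint8) * 255

# Page with an illumination gradient and one dark text stroke (row 20)
page = np.tile(np.linspace(60, 200, 40), (40, 1))
page[20, :] -= 50
page = page.astype(np.uint8)
g, a = global_threshold(page), adaptive_mean_threshold(page)
print(int(g[20, -1]), int(a[20, -1]))  # global misses the bright-side text; adaptive keeps it
```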

Stage 2: OCR Segmentation

After preprocessing, the scanned document or image is segmented into manageable chunks, typically returning segmented words for the OCR engine.

Line/Word/Character-level detection

There are several text detection methods, ranging from traditional image processing to model-based detection. Traditional image processing-based text detection methods include Stroke Width Transform (SWT) and Maximally Stable Extremal Regions (MSER), which extract text regions based on edge detection and extremal region extraction, respectively. Deep-learning model-based methods include Connectionist Text Proposal Network (CTPN) and Efficient and Accurate Scene Text Detector (EAST).

Based on the type of text detection, these may range from character-level detections to line-level detections. Figure 8 shows the three levels of text detection.
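A much simpler classical baseline, projection-profile segmentation, illustrates how line- and word-level detection can work on clean, binarized input (the model-based detectors above are needed for scene text and noisy documents). The sketch below is NumPy-only and assumes a deskewed binary page.

```python
import numpy as np

def runs(mask):
    """(start, end) index pairs of consecutive True runs in a 1-D mask."""
    edges = np.flatnonzero(np.diff(np.r_[False, mask, False].astype(int)))
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]

def segment(binary):
    """Line- then word-level boxes via projection profiles: rows containing
    any foreground form lines; within each line, columns containing
    foreground form words (or characters, with a smaller gap tolerance)."""
    boxes = []
    for y0, y1 in runs(binary.any(axis=1)):
        for x0, x1 in runs(binary[y0:y1].any(axis=0)):
            boxes.append((y0, y1, x0, x1))
    return boxes

# Toy binary page: one text line containing two "words"
page = np.zeros((12, 30), dtype=np.uint8)
page[4:7, 2:10] = 1
page[4:7, 14:25] = 1
print(segment(page))  # [(4, 7, 2, 10), (4, 7, 14, 25)]
```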

OCR Feature Extraction & Recognition

The core component of any OCR pipeline is the OCR engine. Once the entire document is preprocessed and segmented into manageable chunks, usually in the form of segmented words, the recognition step is either applied directly, or a further feature extraction stage may be applied. Deep learning-based solutions combine feature extraction and recognition in a single model. Since the input is in the form of images, convolutional layers handle it best, and because text is sequential, a combination of Convolutional Neural Networks (CNNs) and bidirectional Long Short-Term Memory (LSTM) based Recurrent Neural Networks (RNNs) is usually applied as the recognition unit. The final outputs of the bidirectional LSTM layers are fed into a Connectionist Temporal Classification (CTC) layer, which decodes the per-frame predictions into the final character sequence.
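The decoding performed on top of the CTC outputs can be illustrated with greedy (best-path) decoding: take the most likely class per frame, collapse consecutive repeats, and drop blanks. A minimal sketch with a made-up probability matrix:

```python
import numpy as np

def ctc_greedy_decode(probs, charset, blank=0):
    """Greedy (best-path) CTC decoding: argmax class per frame, collapse
    consecutive repeats, then drop blanks. Class i > 0 maps to charset[i-1]."""
    best = probs.argmax(axis=1)          # most likely class per frame
    decoded, prev = [], blank
    for c in best:
        if c != prev and c != blank:     # collapse repeats, skip blanks
            decoded.append(charset[c - 1])
        prev = c
    return ''.join(decoded)

# 7 frames over classes [blank, 'c', 'a', 't']; best path:
# c, c, blank, a, t, t, blank  ->  "cat"
probs = np.eye(4)[[1, 1, 0, 2, 3, 3, 0]]   # one-hot stand-in for softmax outputs
print(ctc_greedy_decode(probs, "cat"))      # cat
```

Note how the blank class lets CTC represent genuinely repeated characters: a blank between two identical labels keeps them from being collapsed into one.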

Text extraction from documents can be classified into three types depending on the type of document processed.

Unstructured Documents

Unstructured sources of text include free-flowing text in scanned documents such as books, as well as text embedded in images and videos.

Structured Documents

Structured documents are those in which the embedded text is structured but exists only as scanned copies or images. Examples include formal structured forms such as IRS Form 1040. These documents are processed by isolating individual units of text from the form and matching them to a document template.
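The template-matching idea can be sketched in a few lines, assuming the form has already been rectified to the template's coordinate system; the field names and boxes below are hypothetical, for illustration only.

```python
import numpy as np

# Hypothetical template for a structured form: field names mapped to fixed
# (row0, row1, col0, col1) boxes on the rectified page image.
TEMPLATE = {"name": (10, 30, 40, 200), "ssn": (40, 60, 40, 160)}

def extract_fields(page, template):
    """Crop each template field's region; each crop would then be sent to
    the OCR engine, with the recognized text keyed by field name."""
    return {field: page[r0:r1, c0:c1]
            for field, (r0, r1, c0, c1) in template.items()}

page = np.zeros((100, 300), dtype=np.uint8)   # stand-in for a rectified scan
crops = extract_fields(page, TEMPLATE)
print({k: v.shape for k, v in crops.items()})  # {'name': (20, 160), 'ssn': (20, 120)}
```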

Semi-structured Documents

Semi-structured documents combine structured embedded text in forms with free-floating text. Examples include purchase orders, receipts, etc.

OCR Big Tech Tools

The following big tech companies offer OCR solutions bundled with their cloud offerings. These OCR-based solutions provide the most basic components needed for document processing via OCR.

  • Google Cloud Vision API: Enables text extraction from images and has two primary annotations for character recognition: 1) Text Detection that detects and extracts text from an image, and 2) Document Text Detection that extracts text from an image and is optimized for dense text (handwritten or in images) and documents (PDF / TIFF). Google has a freemium business model for Vision API involving feature-driven, usage-based pricing beyond 1000 units/month.
  • Amazon Textract: ML service by AWS to automatically extract text, handwriting, and image data from documents, forms, and tables using OCR with SOC, ISO, PCI, HIPAA, and GDPR compliance. Textract is unique in that it provides Amazon Private Cloud support endpoints via AWS Privatelink that allows customers to avoid public cloud usage for heightened security. Additionally, customers can also request human reviews to manage sensitive workflows. Textract is aimed at document detection/analysis and offers a usage-based pricing model with volume discounts on 1M+ pages/month.
  • Microsoft Computer Vision API: Full-stack suite aimed at automating text extraction from images, documents, and real-time video using visual data processing for automated data labeling, image description generation, text extraction, and content moderation. Freemium business model with the first 5,000 transactions free and a tiered usage-based pricing model for 1M, 10M, 100M, and 100M+ transactions/month.

End-to-end Document Processing Platforms

Most startups in this space build end-to-end document ingestion, processing, analytics, and visualization platforms. The following are among the top OCR-based document processing startups.

  • Impira: is building an AI platform to automate workflows on unstructured data from documents, text, voice, and video content using a combination of OCR and statistical modeling algorithms that can extract meaning from these large-scale datasets for a “system of record” solution. The platform is designed to process semi-structured forms and documents.
  • Automation Hero: is developing a no-code robotic process automation (RPA) platform with OCR at its center, using deep learning-based ML models for each stage of the OCR pipeline. In addition, the platform provides Excel-like, function-based process automation tools to visually build specific pipelines, and can process any document, including handwritten documents.
  • HyperScience: is building a full-stack document ingestion, classification, extraction, and processing platform. The company’s solution is capable of ingesting diverse document types (PDFs, forms, handwritten notes) directly from the source (folder, email, message queue) and using OCR ML models (on-prem / cloud) to classify and extract info shared using a downstream API.
  • Veryfi: is developing a full set of solutions to process semi-structured forms and documents such as receipts and invoices, as well as a full automation tool suite for bookkeeping solutions such as expense reports. It provides mobile and email document capture apps.

Open-source Tools

  • OpenCV: is an open-source image processing library written in C++/C with bindings offered for Python and Java. It provides a comprehensive list of image processing and machine learning functions. OpenCV was designed with performance in mind, especially for real-time vision applications such as vision in robotics. OpenCV has several OCR preprocessing and recognition functions such as MSER, noise and blur filters, and OCR engines.
  • Leptonica: is a general-purpose image processing/analysis library that provides some of the preprocessing functions to an OCR pipeline such as text segmentation and binarization. It is used by Google’s Tesseract (see details below) for these two preprocessing stages.
  • ImageMagick: is a library that provides several image processing functions with many command-line options for customization. It provides functions such as image scaling, cropping, rotation, and affine transformations.
  • Unpaper: is designed to enhance the quality of scanned paper documents. Originally created to make scanned book pages more readable on screen, the same set of tools can enhance scanned pages before they are passed to an OCR engine. It provides several document enhancement functions such as noise filters, black filters, blur filters, and automatic skew correction.
  • Google Tesseract: is a comprehensive OCR engine that provides basic preprocessing capabilities such as binarization and segmentation using Leptonica. It supports more than 100 languages with the possibility to retrain for further language support. Google uses Tesseract to detect text in videos and spam in Gmail.
  • GIMP: is the GNU Image Manipulation Program. It is designed to be used as a stand-alone application for individual image editing and manipulation, as well as a framework for scripted image manipulation via C/C++, Perl, Python, etc. Its functions can be extended using plugins.

Market Landscape

Manual document processing is slow, tedious, error-prone, and requires employees to perform repetitive, monotonous tasks that add little value while diminishing efficiency and job satisfaction. AI-driven OCR solutions that provide full-stack support from data ingestion and processing to analysis and visualized insights will become the de facto method while reducing a business's operational overhead costs. McKinsey & Co. reports that workplace automation solutions powered by OCR could improve operational activities by more than 30%. OCR has the potential to revamp daily operations, and market research firm KBV estimates the OCR market opportunity to reach $12.6B by 2025, a CAGR of 12.5% from 2019 to 2025.

Workflow automation needs have accelerated due to tailwinds from COVID-19, and VC funding for companies in this space is at record highs: over $250M was invested across 24 deals in 2020, and more than $135M across 10 deals in 2021 alone, compared to an average of ~$130M per year across ~15 deals since 2014.

Open-core, usage-based business models will replace legacy, licensing-driven contracts with the tech giants (Google, Amazon, and Microsoft) competing for share among F500 companies leveraging their cloud computing capabilities. However, much of the innovation within the SMB and niche enterprise IT customer base will likely be captured by end-to-end document processing startups that provide more customization and flexibility in terms of technical enhancements.


Although open-source tools may help build bespoke OCR pipelines and are useful for preprocessing documents, dedicated startup platforms can provide end-to-end tools covering the full life cycle from unstructured document to structured text. In addition to OCR-specific capabilities, they provide user and project management tools, robust dashboards to track and visualize statistics, and labeling tools for semi-structured documents, among other features. The big tech companies may also provide modularized solutions for each component of the OCR pipeline. Enterprises requiring OCR have a bevy of options, whether a comprehensive end-to-end system or modularized solutions.

IQT Blog
