January 14, 2021

While information exchange within an enterprise is largely digitized, a lot of the information exchanged with customers and other business parties still moves in physical form, on paper. A large volume of legacy information also exists only on paper, and it would have tremendous value if it were converted to digital form.

Traditionally, a large human workforce has read, interpreted and extracted information from these documents as part of operational processes, and fed it into operational systems for further processing. With the advent of RPA, machine learning and AI technology, we have entered an era of automated information and data extraction. These technologies have tremendous potential to change the way enterprises work and do business.

Document Varieties and Their Formats

Most physical documents are digitized with a scanner and saved as PDF or in image formats such as JPG, PNG or TIFF. More recent documents may also exist as regular, searchable text PDFs. A document can contain free-flowing text or paragraphs, label and value pairs, tables, charts, figures or pictures, bar codes, text printed on stamps, small image regions such as a logo or signature, drawings, and handwritten text, among many other elements. A representative variety of the document types that enterprises deal with are:

  • Agreements: These include all forms of legal agreements. Typically, key metadata attributes, the text of specific clauses and inference-based outcomes need to be extracted. It may also involve inferring which specific group or groups within the organization need to act on a clause.

  • Proof of identity / income / address documents: These are required documents from which key information about a person or business is extracted. Layouts and formats can vary by country, state or region.

  • Transactional documents: These include documents such as invoices, work orders, etc. The format and layout can be fixed or variable.

  • Table-based documents: Some documents, such as pricing sheets and product specifications, are table-based, and all the information in the table must be extracted and fed to downstream systems.

  • Drawing documents: These include diagrammatic representations such as a geographical area map, a perspective drawing or any other such drawing. The requirement could be to extract only certain labels and values from the drawing, or it could be as complex as extracting most of the information so that the drawing can be reproduced in another system.

  • Pre-printed forms with input: Many application / survey forms contain pre-printed instructions and let users either enter numbers and characters in boxes and combs or write freely by hand.

  • Product pictures: Many times, information such as brand name, weight or volume, nutrition facts, ingredients, etc. must be read from pictures of product packages.

The Role of Foundational Technology

Technology has made it possible to understand a document's layout and content, associate labels with their values, and in turn extract key information from these documents. These technology components are available as paid or open source software and include:

Computer Vision – Every page of a scanned document is an image. Hence computer vision libraries are used first to detect every element of interest in the image, such as paragraphs, tables, logos, handwritten text and boxes, using techniques such as thresholding and contour detection. OpenCV is a widely used open source library for this.
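
As a rough illustration of the thresholding step mentioned above, the sketch below implements Otsu's method in plain Python on a tiny hand-made grayscale grid. Otsu's method is one common thresholding technique and is an assumption here, since the text does not name a specific one; the function names otsu_threshold and binarize and the sample pixel values are hypothetical. In practice a library such as OpenCV would provide an optimized equivalent.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


def otsu_threshold(pixels: List[int]) -> int:
    """Pick the grey level that best separates dark ink from light paper.

    Otsu's method tries every threshold and keeps the one that maximizes
    the variance between the two resulting pixel classes.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # pixels at or below the candidate threshold
    sum_dark = 0  # sum of their grey levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue          # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break             # everything is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image: Image, threshold: int) -> Image:
    """Map pixels at or below the threshold to 0 (ink) and the rest to 255."""
    return [[0 if p <= threshold else 255 for p in row] for row in image]


if __name__ == "__main__":
    # Hypothetical 3x3 scan: light paper with a few dark ink pixels.
    scan = [[200, 210, 30],
            [205, 25, 20],
            [215, 220, 35]]
    t = otsu_threshold([p for row in scan for p in row])
    print(t, binarize(scan, t))
```

Once the page is reduced to ink and background like this, later steps such as contour detection have a much cleaner image to work on.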

Optical Character Recognition – Once a region of interest is detected, OCR libraries extract all text characters present in the region. The OCR library has an AI model that is trained on a large number of character sets in different fonts and sizes.

Natural Language Processing – NLP is used to interpret the language of the document when information is present in the form of clauses, as in agreements. It can help identify entities and their values. Open source libraries such as spaCy, NLTK and Rasa are popular; they come pre-trained to interpret grammar and can also be trained to extract the values of custom entities.

Intelligent Character Recognition – ICR technology is used for handwritten character recognition. It involves AI models that are trained on large amounts of data consisting of handwritten text images annotated with their actual text values. While there are enterprise products that offer ICR, for specific problems one may start from an existing pre-trained open source model and train it further on problem-specific data.

Data extraction uses a combination of the above foundational technology components.
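
To make the idea of associating labels with their values concrete, here is a minimal sketch that pairs labels and values in text that an OCR step has already produced. It assumes the simplest case, where each pair sits on one line in a "Label: value" pattern; real documents with variable layouts need far more than this. The function name extract_label_values, the regular expression and the sample invoice lines are all hypothetical.

```python
import re
from typing import Dict, List

# A label starts with a letter, may contain words, spaces and a few symbols,
# and is separated from its value by a colon.
PAIR_PATTERN = re.compile(r"^\s*([A-Za-z][\w .#/]*?)\s*:\s*(\S.*?)\s*$")


def extract_label_values(ocr_lines: List[str]) -> Dict[str, str]:
    """Return a dict of normalized label -> value for lines that match."""
    pairs: Dict[str, str] = {}
    for line in ocr_lines:
        match = PAIR_PATTERN.match(line)
        if not match:
            continue  # free text, table rows, etc. are ignored in this sketch
        label, value = match.groups()
        # Normalize "Invoice No" to "invoice_no" so downstream systems
        # get a stable key regardless of spacing or capitalization.
        key = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
        pairs[key] = value
    return pairs


if __name__ == "__main__":
    # Hypothetical OCR output for an invoice header.
    lines = [
        "Invoice No: INV-1042",
        "Date : 14/01/2021",
        "Total Amount: 1,250.00",
        "Thank you for your business",
    ]
    print(extract_label_values(lines))
```

A rule like this only works for fixed, well-behaved layouts, which is why variable-format documents typically call for layout analysis or trained models instead.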

Challenges in Data Extraction

In most cases, documents are available only in image form. Many noise and quality issues make it hard to extract information and transform it into a digital format. These include watermarks; pen scribbles over printed text; wrinkled, torn, discolored or smudged paper; stamps imprinted on the text; randomly occurring black and white grains; dark backgrounds; faded ink; low-contrast or colored ink; and poor scan resolution (dpi). Tables may or may not have grid lines and can get complex, with merged and split cells, tilted tables, unclear boundaries and many other variations. Furthermore, cursive writing and the lack of clear character separation make it difficult to extract information from handwritten text. Deriving an inference from multiple related sentences in a clause or section is another challenge when processing legal documents. Documents without a fixed format, such as invoices, pose their own challenge: the layout and the placement of key information must be understood before anything can be extracted.
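
One of the noise types listed above, randomly occurring black and white grains, is often reduced with a median filter before OCR. The sketch below shows the idea in plain Python on a small grid; the function name median_filter_3x3 and the sample image are hypothetical, and a production pipeline would more likely call an optimized library routine.

```python
from statistics import median_low
from typing import List

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


def median_filter_3x3(image: Image) -> Image:
    """Replace each pixel with the median of its 3x3 neighborhood.

    Isolated black or white grains are outliers within their neighborhood,
    so the median discards them. At the borders the window is simply
    clipped to the pixels that exist.
    """
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value that occurs in the window.
            cleaned[y][x] = median_low(window)
    return cleaned


if __name__ == "__main__":
    # Hypothetical patch of light paper with one black and one white grain.
    patch = [[200] * 5 for _ in range(5)]
    patch[2][2] = 0      # black grain
    patch[0][0] = 255    # white grain
    print(median_filter_3x3(patch))
```

The trade-off is that a median filter can also erase very thin strokes, so the window size has to be chosen with the scan resolution in mind.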

This is why we need an additional solution layer on top of the foundational technology, one that can solve specific data extraction problems. A business-user-friendly workbench to verify and correct extracted information adds a lot of value. In the next blog in this extraction series, we will see the elements of an extraction solution that can address the above challenges as we peek into future advances in the data extraction and digital data transformation space.

Devang Patel heads solutions in the Incubation Group of TCS' Research and Innovation (R&I) unit. With over 24 years of experience in the IT industry, Devang has mastered the art of blending technology and business towards successfully incubating an idea into money-making products and solutions for TCS. As a practitioner, Devang is responsible for ideating, strategizing, conceptualizing, designing and shaping ideas and leading them to a successful scale-up. Devang holds a Bachelor’s degree in Computer Science and Engineering from the University of Mumbai, India, and an MBA from the Narsee Monjee Institute of Management Studies, Mumbai, India. Devang's current technology interest is in Artificial Intelligence and Cognitive Information Extraction.