Data Extraction with Intelligent Document Processing and Automation

Devang Patel

Head CTO TCS Incubation

In my previous blog, we saw the various formats in which enterprises receive documents for information extraction and the role of foundational technology in enabling it. We also observed the challenges faced, such as noise, quality, region of interest detection, document classification and many more. Now, I will explain how an extraction solution built on top of the foundational technologies can address these challenges. Below are some of the solution elements that solve specific extraction problems:

Noise elements and quality issues: Image correction techniques are applied to reduce the noise present in the document. Geometric affine transformations can help correct tilt and orientation issues. Gaussian filters can smooth out noise in the document image, and Laplacian filters can sharpen the edges of the text. Background logos, textures and patterns can also be removed using thresholding techniques.
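As a rough illustration of the thresholding step, the sketch below uses Otsu's method, one common way to pick a threshold automatically, to separate dark ink from a lighter background pattern. It assumes the page is already a grayscale grid of 0-255 values; the function names otsu_threshold and binarize are hypothetical and not taken from any particular product.

```python
def otsu_threshold(pixels):
    """Pick the grey level that best separates dark pixels from light ones."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: larger means a cleaner split.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(image):
    """Map pixels at or below the threshold to black (0), the rest to white (255)."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 255 for value in row] for row in image]
```

In this toy setting a faint watermark (grey level around 200) ends up on the white side of the threshold together with the paper, while the dark text survives. Real pages with uneven lighting usually need an adaptive, per-region threshold instead of a single global one.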

Region of interest (ROI) detection: Identifying and narrowing down to the ROI helps focus on key areas of interest and cut out the background noise present in the document. This is achieved using either image analytics or deep learning-based models that detect word blocks by drawing boundaries around words; information is then extracted from these word blocks.
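The image-analytics route to word blocks can be sketched with connected-component labelling: touching ink pixels are grouped and a bounding box is drawn around each group. This is a minimal sketch that assumes a binarized page where 1 means ink and 0 means background; find_blocks is a hypothetical helper name, and production systems typically merge nearby components into words before extraction.

```python
from collections import deque


def find_blocks(binary):
    """Return bounding boxes (top, left, bottom, right) of connected ink regions."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

Each returned box can then be cropped and passed to OCR on its own, which keeps background clutter out of the recognition step.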

Document classification: Documents are classified using deep learning-based models trained for image classification, combined with regular expression-based classification. This helps in identifying the document type and applying focused secondary processing to improve extraction accuracy.
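One possible way to combine the two signals is shown below. It is only a sketch: the rule patterns, the label names and the idea of letting an unambiguous regular-expression match take precedence over the model are all assumptions made for illustration, and the model's label and confidence are passed in as plain values so that no specific deep learning library is implied.

```python
import re

# Hypothetical keyword rules; a real deployment would maintain these per document type.
RULES = {
    "invoice": re.compile(r"\binvoice\s*(no|number|#)", re.IGNORECASE),
    "purchase_order": re.compile(r"\bpurchase\s+order\b", re.IGNORECASE),
}


def classify(text, model_label, model_confidence, min_confidence=0.8):
    """Prefer an unambiguous rule match, otherwise fall back to the image model."""
    matches = [label for label, pattern in RULES.items() if pattern.search(text)]
    if len(matches) == 1:
        return matches[0]
    if model_confidence >= min_confidence:
        return model_label
    return "unknown"
```

Returning "unknown" instead of a low-confidence guess lets such documents be routed to manual triage instead of through the wrong secondary processing.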

Handling variable layouts: Understanding the geometry, placement and associated content of the element blocks in a document helps in associating values with labels and in positioning elements relative to each other. This makes it possible to extract information from documents whose formats and layouts vary.
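A simple geometric heuristic for pairing labels with values might look like the following. It assumes each text block carries its top-left position and size in pixels, looks first for the nearest block to the right on the same line and then for the nearest block directly below. The Block type, the value_for function and the line_tolerance parameter are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def value_for(label, blocks, line_tolerance=5):
    """Find the value block nearest to a label: right on the same line, else below."""
    right = [
        b for b in blocks
        if b is not label
        and abs(b.y - label.y) <= line_tolerance
        and b.x >= label.x + label.w
    ]
    if right:
        return min(right, key=lambda b: b.x).text
    below = [
        b for b in blocks
        if b is not label
        and b.y > label.y
        and b.x < label.x + label.w   # horizontal overlap with the label
        and b.x + b.w > label.x
    ]
    if below:
        return min(below, key=lambda b: b.y).text
    return None
```

Because the rule depends only on relative position, the same label can be resolved whether a layout prints the value beside it or underneath it.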

Inference-making: Language models offered by natural language processing (NLP) software can be trained to recognize context-specific named entities using transfer learning. The part-of-speech tagging capabilities of NLP libraries can help dissect and associate multiple sentences and arrive at an inference from the clauses present in legal documents.

Handwritten text and multiple languages: Separating printed text from handwritten text helps in processing each type of information separately and applying secondary processing specific to each type, resulting in better extraction accuracy. Multilingual support from the underlying optical character recognition (OCR) and NLP technology is important for handling documents in multiple languages.

Processing: A combination of the above is applied in a processing pipeline to solve a specific extraction problem. Files can be fetched from existing document repositories using compatible adapters. Extracted information can finally be exported as Excel or JSON files, or exposed through APIs for integration with enterprise applications.
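The pipeline idea can be sketched as a list of steps applied in order, with the result exported as JSON at the end. Everything here is illustrative: the step functions, the dictionary layout and the "Total" pattern are assumptions, and a real pipeline would plug in the image correction, classification and extraction stages described above.

```python
import json
import re


def normalize_whitespace(doc):
    """Collapse runs of whitespace left behind by OCR."""
    return {**doc, "text": " ".join(doc["text"].split())}


def extract_fields(doc):
    """Pull a total amount out of the text (hypothetical pattern)."""
    match = re.search(r"Total[:\s]+([\d,]+\.\d{2})", doc["text"])
    fields = {"total": match.group(1) if match else None}
    return {**doc, "fields": fields}


def run_pipeline(doc, steps):
    """Apply each step to the output of the previous one."""
    for step in steps:
        doc = step(doc)
    return doc


def export_json(doc):
    """Serialize the extracted fields for downstream enterprise applications."""
    payload = {"source": doc["source"], "fields": doc["fields"]}
    return json.dumps(payload, sort_keys=True)
```

Keeping each step as a plain function that takes and returns a document makes it straightforward to reorder steps or add a new one for a specific extraction problem.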

Business-friendly workbench: A business-friendly workbench is key to technology adoption by business users. It can allow users to train extraction models, submit files, execute extraction jobs, and verify and correct auto-extracted information. Providing a confidence score for every extracted value can help users decide how much attention verification requires. Closed-loop learning, in which the extraction model is re-trained based on user corrections, is key to improving extraction accuracy.
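The role of confidence scores and user corrections can be illustrated with a small sketch. It assumes each extracted field is stored with a value and a confidence between 0 and 1; the 0.9 threshold, the function names and the shape of the training log are hypothetical choices, not a description of any specific workbench.

```python
def fields_needing_review(extracted, threshold=0.9):
    """Return the names of fields whose confidence falls below the threshold."""
    return sorted(
        name for name, field in extracted.items()
        if field["confidence"] < threshold
    )


def apply_correction(extracted, name, corrected_value, training_log):
    """Store a user correction and keep the pair for later re-training."""
    training_log.append({
        "field": name,
        "predicted": extracted[name]["value"],
        "corrected": corrected_value,
    })
    # A value confirmed by a user is treated as fully trusted.
    extracted[name] = {"value": corrected_value, "confidence": 1.0}
```

The accumulated log of predicted and corrected values is the raw material for the closed loop: it can be fed back as labelled examples when the extraction model is re-trained.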

Looking to the Future

The availability of better computing power and more training data has made the use of artificial intelligence and deep learning increasingly prominent in solving extraction problems. GPUs and other technology advancements now enable faster training and execution, and help address previously unsolved problems with improved accuracy.

Computer Vision: Instead of using traditional image-processing algorithms, we can now plug deep learning models into computer vision libraries to clean noise and perform many other image enhancements with better precision. Object detection models based on deep learning also improve the detection of logos, pictures and signatures in documents.

Optical Character Recognition: Newer versions of OCR software come with integrated long short-term memory (LSTM) deep learning models that help improve OCR accuracy by learning the sequence of characters or words. They also support more languages, which makes multilingual extraction easier. The OCR model can also be trained with data specific to an enterprise context to further improve accuracy.

Natural Language Processing: Advancements in NLP, such as transformer architectures that use a self-attention mechanism, significantly improve many NLP capabilities, including translation, search and sentiment analysis. The recently launched BERT model is especially useful in arriving at inferences from legal clauses.

Intelligent Character Recognition: Traditionally, Hidden Markov Model-based techniques have been used for handwritten character recognition. Typical processing for handwritten data extraction now involves segmentation at the line, word or character level; pre-processing for de-slanting, contrast and image improvements; character recognition using deep learning models; and then post-processing to predict the right character, word or group of words.
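The post-processing step can be approximated by snapping a recognized word to the closest entry in a known vocabulary. The sketch below uses the standard library's difflib for fuzzy matching as a simple stand-in; the correct_word name, the lexicon and the 0.7 cutoff are assumptions, and production systems often use language models or domain dictionaries for this step.

```python
import difflib


def correct_word(word, lexicon, cutoff=0.7):
    """Replace a recognized word with its closest lexicon entry, if one is close enough."""
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

With a lexicon such as ["invoice", "amount", "total"], a misread like "lnvoice" or "am0unt" is pulled back to the intended word, while a string with no close match is left unchanged so that it can be flagged for review.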

Use of Generative Adversarial Networks (GANs): Deep learning-based GANs can be trained to remove noise elements from documents, or they can be used to generate synthetic data that can be further used for training extraction models.

Making the Right Choice

We have made a successful transition from manual to automated data extraction. Faster cycle times and improved productivity deliver a positive return on technology investments. Enterprises should choose an automated extraction solution that offers coverage across a wide variety of documents and business functions with a good degree of extraction accuracy. We are certainly looking at an exciting journey ahead in the world of information extraction.

About the author

Devang Patel

Devang Patel heads solutions in the Incubation Group of TCS' Research and Innovation (R&I) unit. With over 24 years of experience in the IT industry, Devang has mastered the art of blending technology and business to successfully incubate ideas into money-making products and solutions for TCS. As a practitioner, Devang is responsible for ideating, strategizing, conceptualizing, designing and shaping ideas and leading them to a successful scale-up. Devang holds a bachelor's degree in computer science and engineering from the University of Mumbai, India, and an MBA from the Narsee Monjee Institute of Management Studies, Mumbai, India. Devang's current technology interest is in artificial intelligence and cognitive information extraction.

Extraction Solution: A swiss army knife for solving extraction challenges
