December 24, 2020

The emphasis on data quality in enterprise systems has increased manifold as data processing has become closely linked with data insights. Organizations are increasingly using data analytics to drive business decisions in this digital era. The growing use of artificial intelligence (AI) and machine learning (ML) applications is further complicating data quality processes. Although data plays a vital role, its quality remains an obstacle to making business decisions using analytics. Inaccurate or poor-quality data can mislead organizations, and if the data is inadequate, ML and AI tools can be of little help. The success of ML and AI models is completely dependent on data quality.

The Need for User-friendly Data Preparation

In the new analytical world, data is also used by non-IT users, who need answers to specific questions for decision making. This has led to the emergence of self-service data preparation through automated tools. Almost all analytical data modelling efforts spend 70 to 80 percent of their time on tasks such as profiling, cleaning, transforming, combining and shaping data. This investment is necessary to ensure that raw data can be converted into reliable, useful information to drive business decisions. To achieve this, data quality is being redefined to cover both traditional data quality parameters (data cleaning) and data preparation, or wrangling, which is the process of finding, treating, combining, merging and transforming data using different statistical or ML measures. Analytical data quality requires comprehensive data cleansing and data preparation so that data-driven analysis yields insights that support analytical decision making. An emerging concept is the Citizen Data Scientist, where non-IT and non-statistician data users perform complex analytical modelling using these products. Hence, the aim is to develop self-service products with a simple, user-friendly graphical user interface (GUI) and easy-to-use features such as drag and drop.

The Data Quality Benchmark

Analytical data quality can be assessed by measuring the ‘Three Cs’, which cover the major data cleansing and data preparation processes. The factors constituting the ‘Three Cs’ are:

  • Completeness: There must be a check for missing and corrupt data. This is a part of the data cleansing process to check for missing instances, null values and counts.

  • Correctness: It is important to validate data accuracy. This is a part of the data cleansing and preparation process to check formats, data types and methods of treating data.

  • Consistency: It is essential to standardize identifiers. An important part of the data preparation function is to standardize, bucket and transform data.

These three dimensions address data quality aspects for creating analytical databases. The major tasks of data cleansing and preparation intersect with these ‘Three Cs’ of data quality. These activities include fixing data errors, adding missing values, combining, bucketing and transforming data, deleting rows and deriving new columns, all supported by frequent profiling, visualizing data to validate it and identifying relationships among the data.
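As a rough illustration of how the ‘Three Cs’ could be turned into measurable scores, the following minimal Python sketch profiles a handful of records. Everything in it is an assumption made for the example rather than part of any specific product: the sample records, the field names, the validators, the country alias mapping and the choice of a simple ratio as the score.

```python
from datetime import datetime

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"customer_id": "C001", "country": "India", "signup_date": "2020-01-15", "age": "34"},
    {"customer_id": "C002", "country": "india", "signup_date": "2020-02-30", "age": "41"},
    {"customer_id": "C003", "country": None, "signup_date": "2020-03-10", "age": ""},
    {"customer_id": "C004", "country": "IN", "signup_date": "2020-04-01", "age": "abc"},
]

# Assumed mapping from lower-cased variants to one canonical identifier.
COUNTRY_ALIASES = {"india": "India", "in": "India"}


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def present_values(rows, field):
    """Collect the non-missing values of a field."""
    return [r[field] for r in rows if not is_missing(r.get(field))]


def completeness(rows, field):
    """Completeness: share of rows in which the field is filled in."""
    return len(present_values(rows, field)) / len(rows)


def correctness(rows, field, validator):
    """Correctness: share of non-missing values that pass a format or type check."""
    values = present_values(rows, field)
    if not values:
        return 1.0
    return sum(1 for v in values if validator(v)) / len(values)


def consistency(rows, field, aliases):
    """Consistency: share of non-missing values already in canonical form."""
    values = present_values(rows, field)
    if not values:
        return 1.0
    canonical = set(aliases.values())
    return sum(1 for v in values if v in canonical) / len(values)


def is_valid_date(value):
    """Accept only real calendar dates written as YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def is_valid_age(value):
    """Accept only digit strings in a plausible age range (assumed 1 to 119)."""
    return isinstance(value, str) and value.isdigit() and 0 < int(value) < 120


report = {
    "completeness(country)": completeness(records, "country"),
    "correctness(signup_date)": correctness(records, "signup_date", is_valid_date),
    "correctness(age)": correctness(records, "age", is_valid_age),
    "consistency(country)": consistency(records, "country", COUNTRY_ALIASES),
}
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Scores below 1.0 point to the cleansing or preparation step that is still needed: a low completeness score suggests treating missing values, a low correctness score suggests fixing formats or data types, and a low consistency score suggests standardizing identifiers.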

Metadata Management

ML and AI analytical models use large volumes, varied formats and various types of data. Metadata management helps extract value from large data sets and makes data solutions easier and more efficient for users to build, deploy and operate. The majority of products for analytics, data science, data quality and governance have started building metadata-based data management solutions.

This has triggered a paradigm shift in the data quality capabilities of software tools, as they incorporate additional data preparation functionality for statistical or ML-based data treatment. This enables transformation with self-service capability for metadata management. Data treatment, or wrangling, depends on clean data. Hence, analytical data quality is a sequential activity, where data is first cleaned and then prepared. Any new software tool for data quality or analytics must be equipped with a self-service data pipeline with metadata solutions.
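The clean-then-prepare sequence can be sketched as a small metadata-driven pipeline. This is only an illustration under stated assumptions, not the design of any particular tool: the METADATA dictionary, its keys ("type", "default", "aliases", "buckets"), the bucket boundaries and the sample rows are all hypothetical.

```python
# Hypothetical column metadata that drives both steps of the pipeline.
METADATA = {
    "age": {
        "type": int,
        "default": None,
        # Assumed bucket edges: (lower bound inclusive, upper bound exclusive, label).
        "buckets": [(0, 30, "young"), (30, 60, "middle"), (60, 200, "senior")],
    },
    "country": {
        "type": str,
        "default": "Unknown",
        # Assumed mapping from lower-cased variants to one canonical value.
        "aliases": {"india": "India", "in": "India"},
    },
}


def clean(rows, metadata):
    """Step 1, cleansing: cast each value to its declared type and
    substitute the declared default when a value is missing or unparseable."""
    cleaned = []
    for row in rows:
        out = {}
        for column, meta in metadata.items():
            raw = row.get(column)
            try:
                if raw is None or str(raw).strip() == "":
                    raise ValueError("missing value")
                out[column] = meta["type"](str(raw).strip())
            except ValueError:
                out[column] = meta["default"]
        cleaned.append(out)
    return cleaned


def prepare(rows, metadata):
    """Step 2, preparation: standardize values through aliases and
    derive a bucket column wherever the metadata declares buckets."""
    prepared = []
    for row in rows:
        out = dict(row)
        for column, meta in metadata.items():
            value = out[column]
            aliases = meta.get("aliases")
            if aliases and isinstance(value, str):
                out[column] = aliases.get(value.lower(), value)
            buckets = meta.get("buckets")
            if buckets:
                out[column + "_bucket"] = next(
                    (label for low, high, label in buckets
                     if value is not None and low <= value < high),
                    None,
                )
        prepared.append(out)
    return prepared


raw_rows = [
    {"age": "34", "country": "india"},
    {"age": "abc", "country": ""},
    {"age": " 67 ", "country": "IN"},
]

# Preparation runs on the output of cleansing, never on the raw rows.
result = prepare(clean(raw_rows, METADATA), METADATA)
for row in result:
    print(row)
```

Because the rules live in the metadata rather than in the code, a new column can be brought into the pipeline by adding a metadata entry, which is one plausible reading of how a self-service pipeline could stay usable for non-IT users.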

The typical data pipeline workflow for analytical data development is depicted in the following figure:


Data Quality Management – The Way Ahead

On-premise and cloud-based software tools for analytical data development now provide self-service pipelines and metadata-based platforms. Organizations are adopting AI or ML-based decision making in accordance with data availability. An interesting development is that organizations moving their data to the cloud are performing on-premise data processing before moving the data.

The emergence of roles such as the Chief Data Officer (CDO) and data democratization activities ensure faster decision-making processes, improved customer experience and operational efficiency. All of this is driving the evolution of data quality processes, with added emphasis on data cleansing and preparation based on the ‘Three Cs’. New and existing products are aligning with this paradigm to enhance data quality management and remain relevant in the market.

Dr. Anantha Desik is currently spearheading data analytics, modelling, machine learning and deep learning with the integrated data management platform, TCS MasterCraft™ DataPlus, at Tata Consultancy Services. He has also played a significant role in setting up the data analytics and actuarial COE for the insurance business at TCS. He has over 30 years of experience across business consulting and delivery in industry domains such as insurance, healthcare and finance. He has published many papers in conferences and journals on digital technologies. Dr. Desik holds a Ph.D. in Operations Research and an Executive MBA from IIM Kozhikode.