Crafting Smarter Data Cleansing Processes

 
January 23, 2018

As the opportunities to collect information from touch points within and outside the organization continue to grow, the sheer volume, variety, and velocity of the incoming data can be daunting. From social media and edge sensors to enterprise resource planning (ERP) and customer relationship management (CRM) systems, each data stream is unique and generates files in different formats. This presents a peculiar problem: 56% of CEOs have concerns over the integrity of the data being used to support business decisions.

This is not surprising. According to a Salesforce survey, data quality tops the list of challenges that companies are currently grappling with. In the US alone, bad data costs businesses approximately US$3.1 trillion annually and erodes a company’s revenue by 12% on average. Clearly, there is an issue with data credibility, and the reasons range from incomplete or duplicate records and poor acquisition methodologies to abnormalities in storage and formatting. This forces data scientists and CIOs to spend as much as 60% of their time cleaning and organizing data, diverting their efforts from the actual task of building analytical and predictive models based on these datasets.

Coding-Free Cleansing

Typically, organizations have relied on data scientists and their intuition to code the cleaning process for incoming datasets. The current set of cleansing tools is capable of efficiently manipulating messy datasets and organizing them into neat tables, but these tools do not offer any insights or recommend treatment options.

Using open source technologies such as R, Spark, and Python, and embedding knowledge gleaned from data scientists and domain experts across industries, it is possible to build an automated scrubbing engine that can also provide recommendations. For instance, to help impute missing values, this system can generate descriptive insights for every column in the dataset and recommend the right scrubbing techniques. In terms of input, it can extract data in Excel and CSV formats, and can be further enhanced to accommodate PDF, TXT, and other such file types. While analysts will still need to intervene in the cleaning and pre-processing stages to handle domain-specific, complex data, they will be relieved of the tedious task of coding these processes.
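To make the idea of per-column insights and recommendations concrete, here is a minimal Python sketch that uses only the standard library. It is an illustration under stated assumptions, not the engine described in this post: the function names (`profile_column`, `profile_csv`), the list of missing-value markers, the 50% drop threshold, and the median/mode rules are all hypothetical choices made for this example.

```python
import csv
import io
import statistics

# Assumption: these tokens mark a missing value; real data may use others.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "nan"}


def _is_missing(value):
    """Treat None and any assumed missing-value token as missing."""
    return value is None or value.strip().lower() in MISSING_TOKENS


def _to_float(value):
    """Return the value as a float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def profile_column(name, values, drop_threshold=0.5):
    """Describe one column and suggest a treatment (illustrative rules only)."""
    present = [v.strip() for v in values if not _is_missing(v)]
    missing_share = 1 - len(present) / len(values) if values else 0.0
    numbers = [_to_float(v) for v in present]
    is_numeric = bool(present) and all(n is not None for n in numbers)

    profile = {
        "column": name,
        "rows": len(values),
        "missing_share": round(missing_share, 3),
        "type": "numeric" if is_numeric else "categorical",
    }
    if is_numeric:
        profile["mean"] = statistics.mean(numbers)
        profile["median"] = statistics.median(numbers)
    elif present:
        profile["mode"] = statistics.mode(present)

    # Hypothetical recommendation rules; a real engine would encode expert knowledge.
    if missing_share == 0:
        profile["recommendation"] = "no imputation needed"
    elif missing_share > drop_threshold:
        profile["recommendation"] = "review or drop column"
    elif is_numeric:
        profile["recommendation"] = "impute with median"
    else:
        profile["recommendation"] = "impute with mode"
    return profile


def profile_csv(text):
    """Profile every column of a CSV document supplied as a string."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return [
        profile_column(col, [row.get(col) for row in rows])
        for col in (reader.fieldnames or [])
    ]


if __name__ == "__main__":
    sample = "age,city,notes\n34,Pune,\n,Delhi,\n41,Pune,NA\n29,,call back\n"
    for column_profile in profile_csv(sample):
        print(column_profile)
```

The rules here are deliberately simple: a production engine would weigh distribution shape, outliers, and domain context before suggesting a treatment, which is where the embedded expert knowledge comes in.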

Learning On The Job

Not only will this engine reduce manual effort, it can also enable employees with little or no analytics expertise to perform data pre-processing, making this a highly cost-efficient solution. Moreover, this tool can manage, intermittently store, and clean data at an enterprise level while allowing end users to access structured datasets once the process is complete. Over time, as the model matures, it could work with a wider range of treatment methods and insights.

Through the recommendations provided by this scrubbing engine, users can learn about cleansing techniques for each data type. This will also help train staff by giving them a contextual understanding of distinct scrubbing methods and their applications. In the near future, this engine could grow ‘intelligent’ enough to stop relying on analysts altogether and make independent decisions on the cleaning processes to be performed, based on data variables across domains.

What other possibilities do you think such an automated solution can help your enterprise realize? Please tell us in the comments section below.

Saloni Kakkar works as a Data Scientist with the Analytics and Insights R&D team in Tata Consultancy Services’ (TCS) Business Operations unit. She holds an M.Sc. degree in Statistics from Panjab University, Chandigarh, India. She is an expert in building various statistical and mathematical models across clusters.