Differential Privacy – The Future of Data Anonymization for ML/AI Analytics

September 24, 2020

Privacy and Utility Trade-off – Striking the Right Balance

Strategic business decisions rest on facts, figures and other data, which makes data-driven decision making central to streamlining operations and improving customer satisfaction. Accurate data can fuel fact-based innovation, mature strategies, encourage critical thinking and help secure funding. Organizations are increasingly turning to machine learning (ML) and artificial intelligence (AI) for analytical models that can deliver these advantages.

Facebook, Google and several mobile providers collect detailed usage information from apps and websites, including scrolls, swipes and clicks. This information is used, individually or in combination, by private and government bodies to understand trends and build ML/AI models. However, data privacy regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have restricted access to personal, sensitive data. Given the demand for data and its role in driving business decisions, the way forward is to strike a balance between compliance and data usage. Organizations are torn between the need to extract useful insights from sensitive data and the need to protect privacy. Hence, several organizations are using Data Anonymization for ML/AI-based model development.

Understanding Data Anonymization 

Data Anonymization is the process of protecting private or sensitive information by erasing or encrypting the identifiers that make an individual identifiable. Traditional data masking, or Pseudonymization, replaces personally identifiable information fields with artificial values, or pseudonyms. However, these techniques replace real data with unrelated numbers and text, so the domain-specific character of the data is lost and the masked data cannot support intelligent insights for decision making. Differential Privacy (DP), which masks data with random noise while preserving the characteristic domain features of the aggregated data, has been a game changer. When such data is used in analytical modelling, the model's output is very close to that obtained from the original data.
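
The limitation of pseudonymization described above can be seen in a minimal sketch. It assumes a keyed hash (HMAC-SHA256) as the masking function; the key, the field names and the two records are hypothetical and are not taken from any product mentioned in this article.

```python
import hashlib
import hmac

# Hypothetical key, for illustration only.
SECRET_KEY = b"demo-key-not-for-production"

def pseudonymize(value: str) -> str:
    """Replace a value with a keyed-hash pseudonym (HMAC-SHA256, truncated)."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]

# Hypothetical records.
records = [
    {"name": "Alice Example", "city": "Pune", "age": 34},
    {"name": "Bob Example", "city": "Pune", "age": 41},
]

# Mask the identifying fields and leave the numeric field untouched.
masked = [
    {"name": pseudonymize(r["name"]), "city": pseudonymize(r["city"]), "age": r["age"]}
    for r in records
]
for row in masked:
    print(row)
```

Equal inputs map to equal pseudonyms, so records can still be joined, but the masked city is an opaque token that no longer reads as a city. That is the loss of domain character the paragraph above refers to.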

Differential Privacy – A Smarter Option 

Differential Privacy adds ‘noise’ to an aggregate query result to protect privacy without significantly affecting the outcome. Introduced by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith, it addresses most of the limitations of traditional approaches such as k-anonymity. Consider two databases that are identical except that one contains a given individual's record and the other does not. DP ensures that the probability of a statistical query producing a given result is nearly the same for both databases, so the result reveals very little about whether that individual is present. The ingenuity of DP is that it protects privacy while still allowing meaningful analysis of the dataset. Noise can also be added to individual records; it then largely averages out when the data is aggregated, so the results stay close to the original.
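
As a rough illustration of the idea above, the sketch below applies Laplace noise to a counting query over two databases that differ in one record. The ages, the helper names `laplace_noise` and `private_count`, and the privacy parameter `epsilon=0.5` are assumptions made for this example, not values from the article.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(rows, predicate, epsilon: float, rng: random.Random) -> float:
    """Counting query with Laplace noise.

    Adding or removing one record changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)

# Hypothetical ages; `neighbour` is the same database minus one person.
database = [23, 35, 41, 52, 67, 29, 58, 44, 36, 71]
neighbour = database[:-1]

def over_40(age):
    return age > 40

# True counts are 6 and 5; the noisy answers are hard to tell apart.
print(private_count(database, over_40, epsilon=0.5, rng=rng))
print(private_count(neighbour, over_40, epsilon=0.5, rng=rng))
```

Because the two noisy answers are drawn from heavily overlapping distributions, an observer cannot reliably tell from a single answer whether the last person was in the database.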

There are many differentially private data release mechanisms, such as the Laplace and Exponential mechanisms, which release aggregate statistics perturbed with random noise. Different privacy mechanisms suit different analytical tasks. For instance, the release may be a histogram, an average or noise-perturbed micro-data. Machine learning models are then used to check the quality of the released data. During the computation of the release, noise is added to provide privacy while maximizing the accuracy of results. In DP algorithms, Epsilon (ε) is an important parameter that represents the strength of the privacy protection. A higher ε indicates less protection, while a lower ε indicates more. DP frameworks provide tools for analyzing how the value of ε affects both the privacy and the accuracy of analysis results.
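
The effect of ε on accuracy can be sketched numerically. For the Laplace mechanism the noise scale is the query sensitivity divided by ε, so the expected absolute error grows as ε shrinks. The helper name `mean_abs_error`, the sensitivity of 1 and the ε values below are assumptions chosen for illustration.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def mean_abs_error(epsilon, sensitivity=1.0, trials=20000, seed=0):
    """Average absolute Laplace noise for a query of the given sensitivity."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return sum(abs(laplace_noise(scale, rng)) for _ in range(trials)) / trials

# Lower epsilon: more privacy and more noise. Higher epsilon: the reverse.
errors = {eps: mean_abs_error(eps) for eps in (0.1, 0.5, 1.0, 5.0)}
for eps, err in errors.items():
    print(f"epsilon={eps:<4} mean absolute error ~ {err:.2f}")
```

The printed errors should fall roughly in proportion to 1/ε, which is the trade-off a practitioner tunes when choosing the parameter.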

Examples to Understand Differential Privacy 

To demonstrate DP, consider a sample dataset to which the basic Laplace DP method has been applied on five variables, with ML-based KMeans clustering used to test the quality of the privatized data. Privacy and accuracy were tested for different ε values and finally baselined at 0.0005, where privacy and accuracy were best balanced, as shown in the following table and histogram. With a four-cluster KMeans solution, the clusters from the original and DP data were closely similar, and their mean values were almost the same.
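
The experiment's data and settings are not reproduced here, so the following is only a loose sketch of the same kind of comparison. It assumes one synthetic variable instead of five, a small hand-written k-means instead of a library implementation, and a placeholder `EPSILON` that is not comparable to the 0.0005 reported above, since the article does not state how its noise was scaled.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's algorithm on one variable; returns sorted centroids."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            groups[nearest].append(v)
        # Keep the old centroid if a group happens to be empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return sorted(centroids)

rng = random.Random(7)

# Hypothetical data: four well-separated groups of 500 values each.
true_centres = [20.0, 40.0, 60.0, 80.0]
original = [rng.gauss(c, 2.0) for c in true_centres for _ in range(500)]

LOW, HIGH = 0.0, 100.0   # assumed public bounds for the variable
EPSILON = 50.0           # placeholder, chosen only so the demo stays readable
noise_scale = (HIGH - LOW) / EPSILON

# Clip each value to the bounds, then add Laplace noise to each record.
perturbed = [min(max(v, LOW), HIGH) + laplace_noise(noise_scale, rng) for v in original]

orig_centroids = kmeans_1d(original, k=4)
dp_centroids = kmeans_1d(perturbed, k=4)
for o, d in zip(orig_centroids, dp_centroids):
    print(f"original {o:6.2f}   perturbed {d:6.2f}")
```

With these settings the two sets of cluster means should land close together. Lowering `EPSILON` increases the noise scale and blurs the clusters, which is the balance that has to be tuned in practice.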

Sample Data:   

Cluster Solutions 

This example suggests that DP algorithms can preserve both privacy and data similarity when used with ML/AI techniques. Companies such as Google, Facebook, Apple and IBM are already working on DP for data anonymization in their ML/AI algorithms, and research into further improvements is ongoing.

The Future of Differential Privacy 

DP is a relatively new field of research and users are still learning to use it effectively. There are operational difficulties in moving from research to practice, and in managing DP parameters. Is DP the ultimate route to achieving the ‘privacy-utility trade-off’? The answer depends on the availability of generic domain-based algorithms and on the opinion of regulatory bodies. If these are addressed, DP is well placed to emerge as the leading data anonymization technique for real-time data dashboarding and ML/AI application model development.

Dr. Anantha Desik is currently spearheading data analytics, modelling, machine learning and deep learning with the integrated data management platform, TCS MasterCraft™ DataPlus at Tata Consultancy Services (TCS). He has also played a significant role in setting up the data analytics & actuarial COE for the insurance business in TCS. He has over 25 years of experience spanning business consulting, delivery and functional roles across industry domains such as insurance, healthcare and finance. He has published many papers in conferences and journals on digital technologies.

Ashim Roy is the Global Product Head of TCS MasterCraft™ DataPlus, which is an integrated data management software from TCS. Ashim has more than 26 years of industry experience in business verticals such as manufacturing, investment banking and IT. Ashim pursues his interest in enterprise product development, particularly in the area of data privacy and data quality, and has over 10 patents to his name. Ashim holds a master's degree in Robotics from IIT, Kanpur.

Shirish Damle is Lead Solution Architect for TCS MasterCraft™ DataPlus, which is an integrated data management software from TCS. Shirish has more than 16 years of industry experience in engineering, application development and maintenance of finance domain products. Shirish has extensive experience in handling data from various domains and has worked as an engineering lead in data privacy, TDM & data governance areas for TCS MasterCraft™ DataPlus.