Privacy and Utility Trade-off – Striking the Right Balance
Strategic business decisions are based on facts, figures and other data, which makes data-driven decision making the key to streamlining operations and delighting customers. Accurate data can fuel fact-based innovation, mature strategies, encourage critical thinking and attract adequate funding. Organizations are therefore increasingly turning to machine learning (ML) and artificial intelligence (AI) for analytical models that deliver these advantages.
Facebook, Google and several mobile providers collect detailed usage information from apps and websites, including scrolls, swipes and clicks. This data is leveraged, individually or in combination, by private and government bodies to understand trends and build ML/AI models. However, data privacy regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have restricted access to personal, sensitive data. Given the demand for data and its role in propelling business decisions, the way forward is to strike a balance between compliance and data usage. Organizations are torn between the need to extract useful insights from sensitive data and the obligation to protect privacy. Hence, several organizations are using Data Anonymization for ML/AI-based model development.
Understanding Data Anonymization
Data Anonymization is the process of protecting private or sensitive information by erasing or encrypting the identifiers that make an individual identifiable. Traditional data masking, or Pseudonymization, replaces personally identifiable information fields with artificial values, or pseudonyms. However, these techniques mask real data with unrelated numbers and text, losing the domain-specific character of the data, so the result cannot provide intelligent insights for decision making. The emergence of Differential Privacy (DP), which masks data with random noise while maintaining the characteristic domain facets of the aggregated data, has been a game changer. When such data is used in analytical modelling, the insights produced are very close to those obtained from the original data.
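To make the contrast concrete, here is a minimal sketch of pseudonymization in Python; the record fields, salt and helper function are purely illustrative assumptions, not drawn from the article.

import hashlib

def pseudonymize(value, salt="example-salt"):
    # Replace a direct identifier with a deterministic pseudonym.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

record = {"name": "Jane Doe", "email": "jane@example.com", "purchase_total": 182.50}
masked = {k: pseudonymize(v) if k in ("name", "email") else v for k, v in record.items()}
print(masked)   # identifiers are replaced; numeric attributes keep their analytic value

As the text notes, the pseudonymized identifiers no longer carry any domain meaning, which is exactly the utility loss that motivates Differential Privacy.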
Differential Privacy – A Smarter Option
Differential Privacy adds ‘noise’ to an aggregate query result to protect privacy without significantly affecting the outcome. Introduced by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith, it addresses most of the limitations of traditional approaches such as k-anonymity. Given two databases that are identical except that one contains a particular individual’s record and the other does not, DP ensures that the probability of a statistical query producing a given result is nearly the same for both databases. The ingenuity of DP is that it protects privacy while still allowing meaningful analysis of the dataset. It adds noise to individual data points, which averages out when the data is aggregated, delivering results much closer to the original.
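Formally, a randomized mechanism M satisfies ε-differential privacy if, for any two databases D and D′ that differ in a single individual’s record and for any set of possible outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S].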
There are many differentially private data release techniques, such as the Laplace and Exponential mechanisms, which release aggregate statistics perturbed with random noise. Different privacy mechanisms are suited to different analytical tasks: for instance, when building a histogram, noisy aggregates are released instead of the raw micro-data, and machine learning models are used to validate the quality of the outcomes. During the computation of the release, noise is added to provide privacy while maximizing the accuracy of the results. In DP algorithms, Epsilon (ε) is an important parameter that represents the strength of the privacy protection: a higher ε indicates less protection, while a lower ε indicates more. DP frameworks provide tools for analyzing how the value of ε affects both the privacy and the accuracy of the data analysis result.
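As an illustration, here is a minimal Python sketch of the Laplace mechanism applied to a simple counting query, assuming NumPy is available; the toy dataset, function name and ε values are assumptions for demonstration, not figures from the article.

import numpy as np

def laplace_count(data, epsilon, sensitivity=1.0):
    # Differentially private count of the records in `data`.
    # A counting query changes by at most 1 when a single record is
    # added or removed, so its sensitivity is 1.
    true_count = len(data)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

records = list(range(10_000))              # toy dataset of 10,000 records
for eps in (0.01, 0.1, 1.0):
    print(eps, round(laplace_count(records, eps), 1))

Running the loop shows the trade-off directly: the smaller ε values add far more noise to the released count, giving stronger privacy at the cost of accuracy.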
Examples to Understand Differential Privacy
To demonstrate DP, consider a sample dataset to which the basic Laplace DP mechanism has been applied. Noise was added to five variables, and ML-based KMeans clustering was used to test the model quality under data privacy. Privacy and accuracy were tested for different ε values and finally baselined at 0.0005, where the optimal levels of privacy and accuracy are maintained, as shown in the following table and histogram. When tested with a four-cluster KMeans solution, considerable similarity was found between the original and DP data clusters: when the DP clusters were compared with the original clusters, the mean values were almost the same.
Sample Data:
Cluster Solutions
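A minimal sketch of such a workflow, assuming NumPy and scikit-learn, is shown below; the synthetic data, sensitivity and ε values are illustrative placeholders rather than the sample data and parameters used in the article.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
original = rng.normal(loc=50, scale=10, size=(1_000, 5))     # five numeric variables

epsilon, sensitivity = 1.0, 1.0                              # assumed values, not the article's
dp_data = original + rng.laplace(0.0, sensitivity / epsilon, size=original.shape)

km_orig = KMeans(n_clusters=4, n_init=10, random_state=0).fit(original)
km_dp = KMeans(n_clusters=4, n_init=10, random_state=0).fit(dp_data)

# Sort cluster centres by their first coordinate so the two solutions
# can be compared side by side despite arbitrary label ordering.
orig_centres = km_orig.cluster_centers_[np.argsort(km_orig.cluster_centers_[:, 0])]
dp_centres = km_dp.cluster_centers_[np.argsort(km_dp.cluster_centers_[:, 0])]
print(np.round(orig_centres, 1))
print(np.round(dp_centres, 1))

How close the two sets of cluster centres are depends on the chosen ε and on the sensitivity of each variable, which is the comparison the article's table and histogram summarize.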
This shows that DP algorithms preserve both privacy and data similarity when used with ML/AI techniques. Companies such as Google, Facebook, Apple and IBM are already applying DP for data anonymization in their ML/AI algorithms, and ongoing research continues to improve it.
The Future of Differential Privacy
DP is a relatively new field of research, and users are still learning to apply it effectively. There are operational difficulties in moving from research to practice and in managing DP parameters. Is DP the ultimate route to achieving the ‘privacy-utility trade-off’? The answer depends on the availability of generic, domain-based algorithms and the stance of regulatory bodies. If these are addressed, DP can undoubtedly emerge as the leading data anonymization approach for real-time data dashboarding and ML/AI application model development.