With the ever-expanding volumes of data involved in digital transformation, the need for adequate data privacy has become imperative. Data privacy describes the practices used to ensure that the data shared by customers is used only for its intended purpose. Government organizations across the globe are concerned about the protection of personal data available in the public domain, and have therefore introduced complex regulatory requirements such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Health Insurance Portability and Accountability Act (HIPAA). All these regulations precisely define what they mean by data privacy, suggest best practices, and spell out the rights of consumers and businesses. However, the practical implications of data privacy differ from one piece of legislation to another.
Data Anonymization for Safeguarding Privacy
Regulations often refer to techniques such as data masking, data encryption and data hashing. These procedures de-identify data, preventing leakage of personal information, and thereby protect sensitive data and establish data privacy controls. In data masking, the original data is changed irreversibly, whereas in data encryption, the original data is transformed into encoded data that can be restored using a key. Well-known masking techniques include shuffling, substitution, number variance, character scrambling, perturbation/randomization, and checksum masking. Encryption techniques include algorithms such as the Advanced Encryption Standard (AES), Rivest–Shamir–Adleman (RSA), and Blowfish.
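Several of the masking techniques named above can be sketched in a few lines of Python. The function names below are illustrative, not from any particular library, and a production system would use a vetted masking tool rather than these toy routines:

```python
import hashlib
import random

def shuffle_column(values, seed=42):
    # shuffling: permute a column so values no longer align with their rows
    out = list(values)
    random.Random(seed).shuffle(out)
    return out

def substitute(name, pool=("Alex", "Sam", "Jordan", "Casey")):
    # substitution: replace a real value with a fake one, chosen
    # deterministically so the same input always maps to the same output
    digest = hashlib.sha256(name.encode()).digest()
    return pool[digest[0] % len(pool)]

def number_variance(x, pct=0.1, seed=None):
    # number variance: perturb a numeric value by up to +/- pct
    return x * (1 + random.Random(seed).uniform(-pct, pct))

def scramble(value, seed=7):
    # character scrambling: irreversibly reorder the characters of a string
    chars = list(value)
    random.Random(seed).shuffle(chars)
    return "".join(chars)
```

Note that each routine is one-way: given only the masked output, the original record cannot be recovered, which is precisely what distinguishes masking from encryption.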
Data privacy enablement, or data anonymization, is a type of information sanitization in which personally identifiable data is removed from datasets, by applying encryption or masking techniques, to preserve the privacy of a data subject. After anonymization, datasets no longer contain any personally identifiable information. Thus, the risk of unintended disclosure of personal data is significantly reduced, and the datasets comply with regulatory requirements. In view of the growing demand for artificial intelligence (AI) in decision-making, businesses are seeking AI-based models that can work on anonymized data, leveraging the benefits of AI while adhering to privacy controls.
When masking or encryption techniques are used for data anonymization, the original data patterns and interrelationships are lost, so models developed on data anonymized with these traditional techniques cannot provide meaningful insights; for numerical fields in particular, the loss of patterns makes the data unsuitable for analytical modelling. Only a few techniques circumvent this problem. Two that effectively support anonymization for numerical data analytics are: (1) Fully Homomorphic Encryption (FHE) and (2) Differential Privacy (DP). The former is an encryption technique, while the latter is a perturbation method. Both of these emerging techniques comply with privacy regulations and allow analytical models to be developed on anonymized data such that the model outputs are similar to those obtained from the original data.
Understanding the FHE and DP Models
Fully Homomorphic Encryption allows users to perform numerical computations directly on encrypted data and generate output in encrypted form; when the result is decrypted, it matches the output that the same operations would have produced on the original plaintext data. FHE is the strongest encryption technique of its kind, allowing the evaluation of arbitrary circuits composed of multiple types of gates with unbounded depth. It currently supports the arithmetic operations of addition and multiplication, and research is underway to enable further computations on encrypted data. At present, only a limited number of machine learning (ML) models can work on encrypted data. In practice, the multiplicative depth of circuits is FHE's primary limitation: deep computations require substantial memory and computational time.
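Production FHE relies on dedicated libraries (for example, Microsoft SEAL or IBM HElib), whose APIs are too involved to reproduce here. As a small self-contained illustration of the homomorphic principle described above, the sketch below implements the Paillier cryptosystem, which is only partially homomorphic (it supports addition of ciphertexts, not multiplication), but it shows how computation performed entirely on encrypted values decrypts to the correct plaintext result. The key sizes are toy-scale for readability and provide no real security:

```python
import random
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def paillier_keygen(p=61, q=53):
    # toy primes; real deployments use primes of 1024+ bits
    n = p * q
    lam = lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)          # modular inverse of lambda mod n
    return (n,), (lam, mu, n)     # public key, private key

def encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if gcd(r, n) == 1:
            break
    # c = (n+1)^m * r^n mod n^2
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    n2 = n * n
    # L(x) = (x - 1) // n, then scale by mu
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

# homomorphic addition: multiplying ciphertexts adds the plaintexts
pub, priv = paillier_keygen()
c1, c2 = encrypt(pub, 15), encrypt(pub, 27)
c_sum = (c1 * c2) % (pub[0] ** 2)
print(decrypt(priv, c_sum))  # 42, computed without ever decrypting c1 or c2
```

A fully homomorphic scheme extends this idea to both addition and multiplication, which is what allows arbitrary circuits to be evaluated on ciphertexts.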
Differential Privacy uses data perturbation, together with associated methods for limiting statistical disclosure, to control privacy risks. DP collects and shares aggregated information about user data while maintaining the privacy of individual users. Data anonymized with DP preserves aggregate statistics similar to those of the original data. Since most AI algorithms base their internal calculations on aggregated data, AI models applied to DP-anonymized data achieve accuracy similar to models built on the original data. The main usage areas of DP are real-time dashboarding and analytical modeling.
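A standard DP perturbation method is the Laplace mechanism: noise calibrated to the query's sensitivity and a privacy budget epsilon is added to an aggregate before release. The sketch below (function names are illustrative; production systems would use a maintained DP library) applies it to a count query, whose sensitivity is 1 because adding or removing one individual changes the count by at most one:

```python
import math
import random

def laplace_noise(scale, rng=random):
    # sample from Laplace(0, scale) via the inverse-CDF transform
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(records, predicate, epsilon, rng=random):
    # Laplace mechanism: a count query has sensitivity 1, so noise with
    # scale 1/epsilon yields epsilon-differential privacy
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [23, 35, 45, 52, 61, 29, 41]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0)
# noisy is close to the true count of 4, but any single individual's
# presence or absence is statistically masked
```

Smaller epsilon means more noise and stronger privacy; the analyst trades accuracy of the released aggregate against the privacy guarantee.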
FHE and DP – Pros, Cons, and the Way Ahead
The FHE technique is stronger in terms of data privacy, since there is no practical way to recover the original data without the key. However, DP scores higher in terms of parameter usability, computation support, memory requirements, and algorithm support. Considerable research is currently being conducted on DP, and new methods and techniques are being developed for handling different types of data. It is noteworthy that leading technology companies are already using DP for privacy compliance and AI modeling.