Enterprise IT teams have always needed functionally usable, adequate data for testing newly built business applications or changes to existing ones. Until recently, enterprises could conveniently copy production data into lower IT environments for such testing. This method was inexpensive and made it easy to create curated datasets for testing precise business scenarios. However, with the global emergence of data protection regulations, enterprises must now protect personally identifiable and sensitive data and ensure its privacy. Some regulations require enterprises to clearly define the intent behind collecting and processing personal data; enterprises can no longer use data indiscriminately. Hence, enterprises began looking seriously at data masking, the practice of replacing sensitive pieces of information with fictitious yet realistic values.
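To make the masking idea concrete, here is a minimal sketch in Python. The value list, field names, and hashing scheme are illustrative assumptions, not any particular product's method; real masking tools ship curated value lists and many more transformation rules.

```python
import hashlib

# Hypothetical value list; a real masking tool would ship curated ones.
FAKE_NAMES = ["Alex Morgan", "Priya Nair", "Sam Chen", "Dana Ortiz"]

def mask_value(original: str, value_list: list) -> str:
    """Replace a sensitive value with a fictitious but realistic one.

    Hashing the original keeps the substitution deterministic, so the
    same input always maps to the same fake value across tables, which
    preserves referential integrity without storing a lookup table.
    """
    digest = hashlib.sha256(original.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(value_list)
    return value_list[index]

# Non-sensitive fields pass through unchanged; sensitive ones are masked.
record = {"customer_name": "John Smith", "balance": 1523.75}
masked = {**record,
          "customer_name": mask_value(record["customer_name"], FAKE_NAMES)}
```

Note that, as the article goes on to observe, this approach still needs the original sensitive value as input, which is precisely the limitation synthetic data generation avoids.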
In the last few years, data masking has gained widespread acceptance as a credible privacy-enabling technology. While data masking has had its share of successes across various privacy-safe data provisioning use cases, it still requires access to the original sensitive data in order to transform it into a privacy-safe version. There are also questions about the risk that bad actors could decipher the original data using sophisticated mathematical techniques.
This is where Synthetic Data Generation is emerging as another worthy privacy-enabling technology.
Synthetic data generation refers to software automatically generating the required data with minimal user input. Typically, a synthetic data generator requires: (1) the metadata of the data store for which synthetic data is to be generated; (2) a mechanism for building realistic yet fictitious values, e.g. value lists or regular expressions; and (3) an understanding of all data relationships, both those defined at the database level and those managed in application code. Beyond such user inputs, synthetic data generation can leverage ML/AI for intelligent metadata and relationship discovery.
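The three inputs above can be sketched in a few lines of Python. The schema, column names, and `'#'`-for-digit pattern convention below are illustrative assumptions; a real generator would discover the metadata and support full regular expressions.

```python
import random
import string

random.seed(42)  # reproducible sample run

# (1) Hypothetical metadata for a target table, with (2) value lists and
# simple patterns ('#' stands for a random digit) to build fictitious values.
SCHEMA = {
    "customer_id": {"type": "sequence"},  # primary key
    "name": {"type": "value_list",
             "values": ["Alex Morgan", "Priya Nair", "Sam Chen"]},
    "phone": {"type": "pattern", "pattern": "###-###-####"},
}

def from_pattern(pattern: str) -> str:
    """Expand a pattern, replacing each '#' with a random digit."""
    return "".join(random.choice(string.digits) if ch == "#" else ch
                   for ch in pattern)

def generate_rows(schema: dict, n: int) -> list:
    rows = []
    for i in range(1, n + 1):
        row = {}
        for col, spec in schema.items():
            if spec["type"] == "sequence":
                row[col] = i
            elif spec["type"] == "value_list":
                row[col] = random.choice(spec["values"])
            elif spec["type"] == "pattern":
                row[col] = from_pattern(spec["pattern"])
        rows.append(row)
    return rows

customers = generate_rows(SCHEMA, 5)

# (3) A child table reuses the generated keys, so the parent-child
# relationship holds even though it may only be enforced in application code.
orders = [{"order_id": i,
           "customer_id": random.choice(customers)["customer_id"]}
          for i in range(1, 11)]
```

The design choice worth noting is the third input: generating each table in isolation would produce orphaned foreign keys, so the generator must thread already-generated key values through dependent tables.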
The idea of machine-crafted data has been around for some years, and it now seems to be coming of age. Here are some drivers that may propel synthetic data generation towards greater adoption:
Analytical model testing: AI and ML are increasingly taking center stage as enablers of intelligent business processes. In analytics, data with specific characteristics is required to train the models being constructed. Such data must be adequate, and generated quickly, to deliver value to the business with agility. Enterprises face a paucity of required data for two reasons: (1) regulatory constraints may prevent access to potentially confidential data; (2) data with the desired characteristics may simply not exist. For instance, the data may need to exhibit a particular distribution of values for a given variable, or demonstrate a specific relationship between the distribution patterns of different variables in the analysis. There may be no readily available data, production or anonymized, that meets such criteria. These scenarios call for synthesizing data quickly and automatically through synthetic data generation.
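As a minimal sketch of this use case, the snippet below synthesizes a variable with a prescribed (right-skewed, log-normal) distribution and a second variable deliberately correlated with it. The variable names, distribution parameters, and linear relationship are all hypothetical choices for illustration, not domain facts.

```python
import random

random.seed(0)  # reproducible sample

n = 10_000

# Hypothetical requirement: a right-skewed claim-amount variable.
# lognormvariate draws from a log-normal distribution, which is
# positive-valued and skewed to the right.
claim_amount = [random.lognormvariate(7.0, 0.8) for _ in range(n)]

# A second variable with an engineered relationship to the first:
# settlement time grows with claim size, plus Gaussian noise.
settle_days = [0.002 * x + random.gauss(0, 5) for x in claim_amount]
```

The point is that when no production or anonymized dataset exhibits the distribution or inter-variable relationship a model needs, such data can simply be specified and generated.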
Greenfield application testing: When a new business application is built, there is no relevant existing data to mask and use for testing. Synthetic data generation can automate the creation of functionally usable, privacy-safe data for testing new applications.
Integrated application testing: When multiple applications are upgraded concurrently and consistently in an integrated fashion, a downstream application may depend on data produced by an upstream application. If the upstream enhancements are not yet ready, the downstream application may need to simulate the kind of data the upstream application is expected to produce after the upgrade. Here, synthetic data generation can support enterprise-wide, consistent testing of changes to interdependent applications.
Extreme sensitivity of applications: Certain business applications are simply too confidential for their data to be used even as input to a data masking activity. For instance, data from applications in areas such as national security, defense, genetics, healthcare, research and atomic power is extremely confidential. For generating privacy-safe test data for such applications, synthetic data generation is a viable alternative to data masking.
Consent constraints: In some scenarios, privacy-safe data must be provisioned for a business purpose for which the enterprise has not secured appropriate consent from its customers. Since soliciting consent is an involved and time-consuming activity, synthetic data generation can be explored as a faster option for data generation.
Synthetic data generation is evidently creating interest. It may not be a universal solution for every data generation scenario: it needs ML/AI-driven intelligence, must handle real-world complexity in generating interrelated data, and should preferably produce data appropriate to a particular domain. Nevertheless, it is an exciting technology that helps where other privacy-enabling technologies are unable to deliver. Today, synthetic data generation may need to coexist with data masking; in the future, the two could increasingly converge into a more holistic data generation solution.