

The Masking vs. Synthetic Data Debate

November 30, 2017

With more and more consumers’ personal data entering the digital domain, there is renewed focus on cybersecurity, privacy, and the use of personally identifiable information in systems development and testing. The General Data Protection Regulation (GDPR) has rekindled the masking vs. synthetic data debate. Quality Engineering (QE) practitioners understand the need for anonymous data but are divided on how it should be sourced: should synthetic data be created, or should production data be masked? In this post, I will share my perspective on this debate.

Until recently, some businesses simply used production data for testing, with some tweaking to obfuscate personally identifiable customer information such as names or addresses. Data provisioning amounted to sourcing data from production, since the alternative meant creating it through manual key-ins or functional automation in the test environment – a task involving long setup timelines. With the advent of test data management (TDM) tools, it is now possible to mask data sourced from production or to create synthetic data on demand for testing.

Test data strategies for sourcing the right data vary from selecting and anonymizing data from production to creating the data synthetically within the test environment. A large banking client, for instance, maintains a pre-masked test data bed – a golden copy containing various data combinations sourced from production and anonymized. A test data portal helps them mine and identify the right data from the golden copy, and when the data is not available, it is synthetically generated by the TDM tool. Understanding the pros and cons of each approach is important; it helps decide what percentage of data should be sourced from production and what percentage created synthetically.

Creating referentially intact subsets of data, rendered anonymous through masking, has become the de facto practice for sourcing test data today. The advantage this solution offers is the ease of automating data sourcing, with personally identifiable information (PII) profiled and masked to render datasets anonymous.
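To make the idea concrete, here is a minimal sketch of deterministic masking in Python. It assumes a simple in-memory representation of two related tables; the key, field names, and token format are illustrative, not any particular TDM tool’s approach. Because the same input always produces the same token, a foreign key such as customer_id masks identically everywhere it appears, keeping the subset referentially intact.

```python
import hmac
import hashlib

# Hypothetical masking secret; in practice this would be managed outside
# the test environment so masked values cannot be reversed by testers.
MASK_KEY = b"replace-with-a-managed-secret"

def mask_value(value: str, field: str) -> str:
    """Deterministically pseudonymize a PII value.

    The same (field, value) pair always yields the same token, so a key
    like customer_id masks identically in every table it appears in.
    """
    digest = hmac.new(MASK_KEY, f"{field}:{value}".encode(), hashlib.sha256)
    return digest.hexdigest()[:12]

# Two related "tables" sharing customer_id as a foreign key.
customers = [{"customer_id": "C1001", "name": "Jane Doe", "email": "jane@example.com"}]
transactions = [{"txn_id": "T9001", "customer_id": "C1001", "amount": 42.50}]

for row in customers:
    row["customer_id"] = mask_value(row["customer_id"], "customer_id")
    row["name"] = "CUST_" + mask_value(row["name"], "name")
    row["email"] = mask_value(row["email"], "email") + "@masked.example"

for row in transactions:
    # Same function, same key: the masked customer_id still matches the
    # masked customers table, preserving referential integrity.
    row["customer_id"] = mask_value(row["customer_id"], "customer_id")
```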

However, the definition of PII is an evolving one. Regulations such as the GDPR have been quick to point out that technology captures much more than names and addresses. They demand that these newer categories of PII also be anonymized, which means even implemented masking solutions must be revalidated to check whether additional fields need to be masked.
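One way to approach such revalidation is a profiling pass over the schema. The sketch below is a toy example, assuming column names are available and using an illustrative pattern list that a team would maintain as regulations evolve; it is not how any specific TDM tool profiles PII.

```python
import re

# Illustrative pattern list, including newer GDPR-era identifiers such as
# device IDs and location data; a real list would be maintained per regulation.
PII_PATTERNS = [
    r"name", r"address", r"email", r"phone",
    r"ip_address", r"device_id", r"geo|location",
]

def flag_pii_columns(columns):
    """Return columns whose names match any known PII pattern."""
    return [c for c in columns
            if any(re.search(p, c, re.IGNORECASE) for p in PII_PATTERNS)]

schema = ["customer_name", "txn_amount", "device_id", "billing_address", "sku"]
print(flag_pii_columns(schema))  # ['customer_name', 'device_id', 'billing_address']
```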

Synthetic data offers a more practical solution here. Because all the data is created within the test region, there is no longer a need to validate compliance with regulatory requirements. Today’s TDM tools are equipped to automate the creation of synthetic data. Why, then, do organizations not create synthetic data for all their testing needs? The answer lies in the inherent complexity of the data required for a good end-to-end test – what we term ‘fit for purpose’ data. For data to be useful for testing, it needs to be:

  • A representative set that comprises every possible kind of production scenario
  • Referentially intact, so that data created is present across related tables within the data model – for instance, when transactions are created synthetically, the corresponding customer information should be present in the customer tables and the products billed in the transaction should be present in the product tables (see the sketch after this list)
  • Composed of production-like business scenarios – incorrect combinations created within data sources can lead to issues during testing, delaying the QE cycle while time is spent evaluating whether an issue was due to data or an application defect
  • Consistent with the business processing that data undergoes during an end-to-end flow
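The referential-integrity requirement, in particular, is easy to get wrong. Below is a minimal sketch, assuming a toy three-table model of products, customers, and transactions; the table layout, field names, and volumes are illustrative only. Generating parent rows first and drawing child foreign keys from them makes the data referentially intact by construction.

```python
import random

random.seed(7)  # reproducible test bed

# Parent tables first, so child rows can only reference rows that exist.
products = [{"product_id": f"P{i:03d}", "price": round(random.uniform(5, 500), 2)}
            for i in range(1, 21)]
customers = [{"customer_id": f"C{i:04d}", "segment": random.choice(["retail", "premium"])}
             for i in range(1, 101)]

def make_transaction(txn_id: int) -> dict:
    """Create a transaction that is referentially intact by construction:
    its customer and product keys are drawn from the parent tables."""
    product = random.choice(products)
    qty = random.randint(1, 5)
    return {
        "txn_id": f"T{txn_id:05d}",
        "customer_id": random.choice(customers)["customer_id"],
        "product_id": product["product_id"],
        "quantity": qty,
        # Amount derived from price and quantity, keeping the record
        # consistent with a simple business rule rather than random.
        "amount": round(product["price"] * qty, 2),
    }

transactions = [make_transaction(i) for i in range(1, 1001)]
```

Extending the same principle across every related data source in a real end-to-end test bed is where the business and application knowledge discussed next becomes essential.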

Hence, while it is easy to create synthetic data for unit testing, getting it right for integrated testing requires a strong strategy and good business and application knowledge. Organizations are also often too hard pressed for time to dedicate the effort required to build a comprehensive synthetic data solution.

Amidst this debate, some project teams tread the middle path of creating test data through functional automation. This approach lets teams create data in the test environment with minimal knowledge of the backend data sources, with the added advantage that only valid business combinations are created. Enticing as it is, it might not always be able to create data within the short sprint timelines of an Agile project.
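As an illustration, data creation through functional automation typically amounts to driving the application’s own front end, as in this Selenium sketch. The URL, element locators, and field values are hypothetical; the point is that the application itself writes consistent rows to every backend table, so only valid business combinations result.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical account-opening page; URL and locators are illustrative.
driver = webdriver.Chrome()
try:
    driver.get("https://test.example.com/accounts/new")
    driver.find_element(By.ID, "customer-name").send_keys("Test Customer 001")
    driver.find_element(By.ID, "account-type").send_keys("Savings")
    driver.find_element(By.ID, "submit").click()
    # The application populates all related backend tables itself, so the
    # created data is valid by construction – at the cost of per-record
    # execution time, which is what strains short sprint timelines.
finally:
    driver.quit()
```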

A prudent approach to TDM strikes the right balance between a strategy that generates the best possible data, security that protects privacy, and automation that expedites data generation. Instead of choosing one approach over the other, I would recommend a carefully strategized combination: masked production data, augmented with synthetic data to avoid identification by inference or to create conditions not yet available in production. Before deploying your test data strategy, ensure that it addresses the regulatory guidelines for your domain. There is no one-size-fits-all solution for test data management – you must choose your test data strategy, but choose wisely.

Sarika Jeevan is a test data consultant with the Test Data Center of Excellence at Tata Consultancy Services. She specializes in providing strategic solutions for Test Data Management implementations. With 13 years of experience in quality assurance, she has worked with clients in diverse domains, including retail, banking, credit cards, and payments, across multiple geographies. She has extensive project management experience and was instrumental in creating, shaping, and managing testing teams prior to joining the Test Data CoE.