Synthetic data versus data masking

Coping with test data and GDPR in Software Quality Processes

Should we use synthetic data or data masking? We are all searching for the answer to this question. But why is this question relevant? There’s only one real reason: privacy regulations! This can be GDPR, PCI, HIPAA or any other privacy regulation. We want to know which of these technologies is workable, because if there weren’t any privacy regulations we’d probably still use full-size copies of production databases.

The impact of privacy regulations on the software quality process

So what is the impact of these regulations on the software quality process? How can we cope with these privacy regulations in our quality process? We at DATPROF know a lot about these privacy regulations. There is one thing you should remember after you read this blog, and that is:

You can’t use copies of production databases in a non-production environment.

Unfortunately, you simply can’t anymore. At least not the way we all did in 2017. If you’re dealing with data protection legislation, it is no longer so simple to use copies of production in a test environment. So how does this impact the software quality process? Production data is probably the best test data to use, because we can be quite certain that if the software works on a copy of production, it will also work in a production environment. At least that’s what you might expect. But unfortunately we just can’t use copies anymore.

Coping with data protection legislation in Software Quality

So how can we cope with these privacy rules and regulations and the quality of our software? Well, there are only two real options left with the privacy regulations in the back of our minds:

  1. Generating synthetic test data
  2. Masking test data

Or a combination of these two. We don’t see any other possibility right now that can help you go to production with the highest possible software quality. So if you’re dealing with privacy regulations you should think about using one of these strategies or technologies. But what are the pros and cons of these technologies?

The pros of synthetic test data

  • Using less data
  • Perfectly aligned with your test cases
  • No risk of data leakage
  • Limited dependencies
  • Savings on storage costs and, for example, licenses

If you’re able to generate synthetic test data, you’ll have less test data in your environment and it will be perfectly aligned with your test cases. This only holds if you start with an empty database. Otherwise it is not really a pro but a disadvantage, because your test database grows with the extra data.
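
To make "perfectly aligned with your test cases" a bit more concrete, here is a minimal Python sketch that generates exactly one record per test case into an otherwise empty data set. The `Customer` fields, the test case names and the age values are hypothetical and only serve as an illustration; they are not taken from any real system.

```python
import random
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: int
    name: str
    age: int


# Hypothetical test cases: each one needs a single customer of a given age.
TEST_CASE_AGES = {
    "minor_is_rejected": 17,
    "adult_is_accepted": 18,
    "senior_discount_applies": 65,
}


def generate_customers(seed=42):
    """Create one synthetic customer per test case, and nothing more."""
    rng = random.Random(seed)  # seeded so every test run gets the same data
    first_names = ["Alex", "Sam", "Robin", "Kim"]
    customers = {}
    for customer_id, (test_case, age) in enumerate(TEST_CASE_AGES.items(), start=1):
        customers[test_case] = Customer(customer_id, rng.choice(first_names), age)
    return customers


customers = generate_customers()
for test_case, customer in customers.items():
    print(test_case, customer)
```

The data set holds three rows because there are three test cases. That is the "using less data" advantage, and it is also why it only works when the database starts out empty.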

Tip of the iceberg

Unfortunately, there are some cons to synthetically generating test data. In many cases, when we start to think about synthetic test data generation, we only think about the tip of the iceberg. Initially, generating synthetic test data sounds pretty easy. But problems quickly arise, because there is also the bottom part of the iceberg, and this is far more complicated.

For example: what about generating synthetic test data for that other system you need? How do you get the test data into your database? Can it be imported via the database or only via the front end of a system? Are you allowed to do so? For example: Salesforce only allows you to use the front end. You can’t communicate directly with a database. And what about the second system or the third one? You probably don’t have just one system, you probably have multiple systems.

So we need to think about generating synthetic test data for more than one system. And if the first system allows you to import test data, that doesn’t mean it is possible in the second or third system. So generating synthetic test data becomes more and more complicated.

The disadvantages of synthetic test data

When you think about generating all attributes with synthetic test data generators, it is already a pretty difficult job. In many cases even a simple data model consists of 1,000 tables with, for example, 15 columns each. So if you want to generate synthetic test data for this system, you already need 15,000 generators (1,000 x 15) to cover all the necessary attributes of your system. So if you start thinking about synthetic test data generation, you’ll need to know:

  1. How many attributes your data model has (the data model, not the database)
  2. The functional requirements of your systems
  3. Data quality issues
  4. Historical data
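
The first point, counting attributes, can be sketched in a few lines of Python. This is only an illustration under the assumption that you need one generator per column. The `example_model` dictionary and the `count_generators` function are hypothetical stand-ins for a real data model.

```python
def count_generators(data_model):
    """Return the number of column-level generators a data model needs.

    Assumes one generator per column; data_model maps a table name
    to the list of its column names.
    """
    return sum(len(columns) for columns in data_model.values())


# The example from the text: 1,000 tables with 15 columns each.
example_model = {
    f"table_{i}": [f"col_{j}" for j in range(15)]
    for i in range(1000)
}

total_generators = count_generators(example_model)
print(total_generators)  # 15000
```

In practice the column lists would come from the schema of each system involved, and the count says nothing yet about the functional requirements, data quality issues or historical data from the other three points.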

When to use synthetic test data generation?

The only reason we can imagine for using synthetic test data is if you’re building a whole new system or application. In that case you don’t have any test data to mask, so you’ll need to synthetically generate test data. Or maybe in a situation where there is only one system under test. But if you want to be certain about the software quality, generating the right test data is highly time-consuming.

Another way of using synthetically generated test data is when you’re not sure whether your data masking is sufficient. Then you could decide to generate new names or bank account numbers, for example, just to be sure this data is safe. In DATPROF Privacy this synthetic test data generator is built in. That means that when you’ve connected your database, you can easily generate values for a certain column in a certain table. When you use DATPROF Privacy for masking and generating, you’re sure it is functionally and technically consistent over multiple systems.
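
To illustrate what "consistent over multiple systems" can mean, here is a minimal Python sketch of deterministic value replacement. It is not how DATPROF Privacy works internally. It is a generic technique (a keyed hash that picks a replacement from a list), and the name list, the key and the `synthetic_name` function are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical replacement values; a real list would be far larger.
FAKE_LAST_NAMES = ["Jansen", "Smith", "Garcia", "Müller", "Rossi", "Novak"]


def synthetic_name(original, secret_key):
    """Replace an original name with a generated one, deterministically.

    The same original value and key always give the same replacement,
    so the value stays consistent in every system where it occurs.
    """
    digest = hmac.new(secret_key, original.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_LAST_NAMES)
    return FAKE_LAST_NAMES[index]


key = b"demo-key-not-for-real-use"  # assumption: kept secret and shared per project

# The same customer appears in two systems, for example a CRM and billing.
crm_value = synthetic_name("de Vries", key)
billing_value = synthetic_name("de Vries", key)
print(crm_value == billing_value)  # True
```

Because the replacement depends only on the original value and the key, records that belong together before masking still belong together afterwards. With a short list like this one, different originals will often map to the same replacement, which may or may not be acceptable for your tests.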
