Synthetic test data versus data masking
Coping with GDPR in Software Quality
Should we use synthetic test data or masked test data? Many teams are searching for the answer to this question. But why is it relevant? There’s really only one reason: privacy regulations, whether that is the GDPR or any other privacy regulation. We want to know which of these technologies is workable, because if there weren’t any privacy regulations we’d probably still use full-size copies of production databases.
The impact of privacy regulations on the software quality process
So what is the impact of these regulations on the software quality process? How can we cope with these privacy regulations in our quality process? At DATPROF we know a lot about these privacy regulations. There is one thing you should remember after reading this blog:
You can’t use copies of production databases in a non-production environment.
Unfortunately, you simply can’t anymore. At least not as easily as we all did back in 2017. If you have to comply with the GDPR, it is no longer simple to use copies of production in a test environment. So how does this impact the software quality process? Production data is probably the best test data to use, because we can be quite certain that if the software works on a copy of production, it will probably also work in a production environment. At least, that’s what you might expect. But unfortunately we can’t simply use copies anymore.
Coping with GDPR in Software Quality
So how can we cope with these privacy regulations and the quality of our software? Well, there are only two real options left with the privacy regulations in the back of our minds:
- Generating synthetic test data
- Masking test data
We don’t see any other possibility right now that can help you go to production with the highest possible software quality. So if you’re dealing with privacy regulations, you should think about using one of these strategies or technologies. But what are the pros and cons of these technologies?
The pros of synthetic test data
- Using less data
- Perfectly aligned with your test cases
- No risk of data leakage
- Limited dependencies
- Savings on storage costs and, for example, licenses
If you’re able to generate synthetic test data, you’ll have less test data in your environment and it is perfectly aligned with your test cases. This only holds if you start with an empty database. Otherwise it is not really a pro but a disadvantage, because your test database grows with the extra data.
Tip of the iceberg
Unfortunately there are some cons to synthetically generating test data. In many cases, when we start to think about synthetic test data generation, we only think about the tip of the iceberg. Initially, generating synthetic test data sounds pretty easy. But problems quickly arise, because there is also the part of the iceberg below the surface, and that part is far more complicated.
For example: what about generating synthetic test data for that other system you need? How do you get the test data into your database? Can it be imported via the database, or only via the front end of a system? Are you allowed to do so? Salesforce, for example, does not let you communicate directly with its database; you have to go through the front end. And what about the second system, or the third one? You probably don’t have just one system, you probably have multiple systems.
So we need to think about generating synthetic test data for more than one system. And even if the first system allows you to import test data, that doesn’t mean it is possible in the second or third system. This is how generating synthetic test data becomes complicated.
The disadvantages of synthetic test data
It gets even more complicated, because generating all attributes with synthetic test data generators is a pretty difficult job in itself. In many cases even a simple data model consists of 1,000 tables with, for example, 15 columns each. So if you want to generate synthetic test data for this system, you need 15,000 generators (1,000 x 15) to cover all the necessary attributes of your system. So if you start thinking about synthetic test data generation, you’ll need to know:
- How many attributes your data model has (not database)
- The functional requirements of your systems
- Data quality issues
- Historical data
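To make the 1,000 x 15 arithmetic above concrete, here is a minimal sketch in Python. The data model, the table and column names, and the helper functions are all made up for illustration; a real data model would be read from your own system, and each column would need a generator that respects its functional rules rather than the random placeholder shown here.

```python
import random
import string

# Hypothetical data model: table name -> list of column names.
# The 1,000 x 15 shape mirrors the example in the text; the names are invented.
data_model = {
    f"table_{t:04d}": [f"col_{c:02d}" for c in range(15)]
    for t in range(1000)
}


def count_generators(model):
    """Count one generator per attribute (column) across all tables."""
    return sum(len(columns) for columns in model.values())


def random_text_generator(length=8, seed=None):
    """Placeholder generator: returns a function that yields random lowercase text.

    A real generator would also have to satisfy functional requirements,
    data quality quirks and historical data, which is where the effort goes.
    """
    rng = random.Random(seed)

    def generate():
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

    return generate


total = count_generators(data_model)
print(total)  # 15000
```

Even in this toy form, every one of those attributes needs its own generator definition, which is why the count grows so quickly.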
When to use synthetic test data generation?
The only reason we can imagine for using synthetic test data is that you’re building a whole new system or application. In that case you don’t have any test data to mask, so you’ll need to generate test data synthetically. Another might be a situation where there is only one system under test. But if you want to be certain about the software quality, generating the right test data is highly time-consuming.
Test Data Masking
Protect privacy sensitive data in non-production databases, comply with legislation and prevent data leaks in QA environments