Synthetic Test Data Generation
What is synthetic test data and how is it created?
Using privacy sensitive production data for software testing is not only outdated, it is also restricted by privacy laws and standards such as GDPR, PCI DSS and HIPAA. But for the best software tests you need test data that is “production-like”, right? So how do you make sure your data sets are representative on the one hand, but not traceable to a natural person on the other? The answer: data masking and synthetic test data generation. A combination of these two methods works best.
What is synthetic test data?
Synthetic test data is ‘fake’ or ‘dummy’ data that can be used for development and testing. It is not based on real, existing information: it is artificially created with the help of algorithms. In short, there are two main reasons to generate synthetic test data: 1) to replace privacy sensitive information, or 2) to meet specific needs or conditions that may not be found in the production data.
Synthetic data is also called fake data, dummy data, mock data or example data. We just call it synthetic data. And what we mean by synthetically generated test data is data:
- that is derived from a seed file;
- that is randomly generated; or
- that is generated based on logic.
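The three kinds of generation listed above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool works: the seed values, the column names and the `example.test` domain are all made up for the example.

```python
import csv
import io
import random

# Hypothetical seed data; in practice this would be a CSV file on disk.
SEED_CSV = "first_name\nAnna\nBram\nChloe\nDaan\n"


def load_seed(text):
    """Read the values of the first column from CSV seed text."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [row[0] for row in reader if row]


def from_seed(rng, seed_values):
    """Derived from a seed file: pick one of the listed values."""
    return rng.choice(seed_values)


def random_number(rng, low, high):
    """Randomly generated: any integer in the inclusive range."""
    return rng.randint(low, high)


def email_from_name(first_name, customer_id):
    """Generated based on logic: built from other fields of the same row."""
    return f"{first_name.lower()}.{customer_id}@example.test"


def make_row(rng, seed_values, customer_id):
    """Combine the three kinds of generation into one synthetic record."""
    name = from_seed(rng, seed_values)
    return {
        "customer_id": customer_id,
        "first_name": name,
        "age": random_number(rng, 18, 90),
        "email": email_from_name(name, customer_id),
    }


rng = random.Random(42)  # fixed seed so the test set is reproducible
names = load_seed(SEED_CSV)
rows = [make_row(rng, names, customer_id) for customer_id in range(1, 6)]
```

Seeding the random generator is a deliberate choice here: a test that fails on generated data is much easier to investigate when the same data can be generated again.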
Data masking with synthetic data
With data masking you can implement masking rules like shuffle, redact and blank to mask your data. But in some instances these masking rules aren’t sufficient to create data that is untraceable to a natural person. In those cases you may decide to use synthetically generated test data as part of your masking project. You can then replace privacy sensitive data like names, email addresses and bank account numbers with synthetic test data. This also helps you align your test data with your test cases.
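To make the difference concrete, here is a minimal Python sketch of the three masking rules next to a synthetic replacement. The function names, the sample records and the `example.test` domain are assumptions for illustration only; they are not taken from DATPROF Privacy.

```python
import random


def shuffle_column(rows, column, rng):
    """Shuffle: redistribute the existing values of one column over the rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def redact(value, keep=4, fill="*"):
    """Redact: keep the first `keep` characters and hide the rest."""
    return value[:keep] + fill * max(len(value) - keep, 0)


def blank(_value):
    """Blank: remove the value entirely."""
    return None


def synthetic_email(row_number):
    """Synthetic replacement: a new value with no link to the original."""
    return f"user{row_number}@example.test"


# Made-up records standing in for production data.
rows = [
    {"last_name": "Jansen", "iban": "NL00FAKE0000000001", "phone": "0600000001", "email": "j@old.example"},
    {"last_name": "Bakker", "iban": "NL00FAKE0000000002", "phone": "0600000002", "email": "b@old.example"},
    {"last_name": "Visser", "iban": "NL00FAKE0000000003", "phone": "0600000003", "email": "v@old.example"},
]

rng = random.Random(1)
masked = shuffle_column(rows, "last_name", rng)
masked = [
    {
        **row,
        "iban": redact(row["iban"]),
        "phone": blank(row["phone"]),
        "email": synthetic_email(number),
    }
    for number, row in enumerate(masked, start=1)
]
```

Note that after shuffling, every original last name is still present in the table, only on a different row. That is the kind of case where replacing the column with synthetic values, as done for `email` here, gives stronger protection.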
Synthetic data generators within DATPROF Privacy
- Random string
- Random date/time
- Random number
- Random decimal number
- Sequential numbers
- And more…
- Male First name
- Female First name
- Last name
- Country Code
- And more…
- BSN (Dutch Social Security Number)
- Currency Code
- Currency Symbol
- And more…
- Random value from seed file (Pick values from a custom CSV seed file)
- Regular expression (Generate values based on a regular expression)
- And more…
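Some of these generators have to respect a checksum. As an illustration, the sketch below generates numbers that pass the 11-test used for a Dutch BSN: nine digits, weighted 9, 8, 7, 6, 5, 4, 3, 2 and -1, with a total that is divisible by 11. The function names are hypothetical and this is not the DATPROF implementation. A number that passes the check is only syntactically valid, and it could coincide with a number that has really been issued.

```python
import random

# Weights of the 11-test for a nine-digit BSN; the last digit counts negatively.
WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def passes_eleven_test(bsn):
    """Return True if `bsn` is nine digits, not all zeros, and passes the 11-test."""
    if len(bsn) != 9 or not bsn.isdigit() or bsn == "000000000":
        return False
    total = sum(weight * int(digit) for weight, digit in zip(WEIGHTS, bsn))
    return total % 11 == 0


def generate_bsn(rng):
    """Generate a random nine-digit string that passes the 11-test."""
    while True:
        first_eight = [rng.randint(0, 9) for _ in range(8)]
        partial = sum(weight * digit for weight, digit in zip(WEIGHTS, first_eight))
        # The ninth digit has weight -1, so it must equal the partial sum modulo 11.
        check_digit = partial % 11
        # A remainder of 10 cannot be written as one digit; draw again.
        if check_digit < 10 and any(first_eight):
            return "".join(str(digit) for digit in first_eight) + str(check_digit)


rng = random.Random(2024)  # fixed seed so the generated set is reproducible
sample = [generate_bsn(rng) for _ in range(5)]
```

The same pattern, random body plus computed check digit, applies to other identifiers with a checksum, although the weights and modulus differ per identifier.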
Generate data from scratch
Although a lot is said about using synthetic data as a masking technique, the need for data generation from scratch should not be forgotten. If you’re developing an app for a brand new system in which you don’t have any data yet, there’s nothing to mask. But you still need test data to check whether the app works with ‘production-like’ data. Or you simply need volumes of data that you don’t have. In these cases you can generate test data. You decide what kind of data (columns, tables) you need, and with a synthetic data generation tool you fill these tables with representative, real-looking data.
Most database specialists know how to write test data, but it takes up too much time to do this manually on a regular basis. That is why the demand for synthetic test data is growing. Test data creation should be a means to an end, not a goal in itself.
Synthetic test data can be made with a test data generator. There are some free test data generators that can be found with a simple search on the internet. For a simple job, such as generating a dozen first names, this is a great option. However, when you have a table with multiple columns that also has relationships with other tables, the task quickly becomes unreliable or even impossible with a free mock data generator.
Generating test data in itself is not the most complicated part. Algorithms do this for us. What makes it more challenging is making sure that the data continues to behave properly within a database, so that it can be used for good tests. Just like data masking, generation needs planning and careful configuration, specifically when defining primary key start values or any other unique constraint implemented within a table. Therefore you need one of the licensed synthetic data generation tools on the market. In general, these tools have many more capabilities and provide technical and functional consistency, which is indispensable for good development and testing work.
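The constraints mentioned above can be illustrated with a small Python sketch on an in-memory SQLite database. The two tables, their columns and the `next_key` helper are hypothetical; a real schema has many more constraints to take into account, which is exactly where the configuration effort goes.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # let SQLite enforce the relationship
conn.executescript("""
CREATE TABLE customer (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    id           INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customer(id),
    amount_cents INTEGER NOT NULL
);
-- One row that already exists before generation starts.
INSERT INTO customer (id, email) VALUES (1, 'existing@example.test');
""")


def next_key(conn, table):
    """Primary key start value: one past the current maximum id."""
    # The table name is interpolated, so only pass trusted, hard-coded names.
    (max_id,) = conn.execute(f"SELECT COALESCE(MAX(id), 0) FROM {table}").fetchone()
    return max_id + 1


rng = random.Random(7)  # fixed seed so the generated set is reproducible

# Parent rows first: new keys start after the existing ones, and each
# email contains the key, so the UNIQUE constraint cannot be violated.
first_customer = next_key(conn, "customer")
new_ids = list(range(first_customer, first_customer + 100))
conn.executemany(
    "INSERT INTO customer (id, email) VALUES (?, ?)",
    [(cid, f"customer{cid}@example.test") for cid in new_ids],
)

# Child rows second: every order points at a customer that really exists.
first_order = next_key(conn, "orders")
conn.executemany(
    "INSERT INTO orders (id, customer_id, amount_cents) VALUES (?, ?, ?)",
    [(first_order + i, rng.choice(new_ids), rng.randint(100, 50_000)) for i in range(300)],
)
conn.commit()
```

The order of the inserts matters: generating the parent table before the child table is what keeps the foreign key valid. With more tables, the generation order has to follow the dependency chain in the same way.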
How to generate test data for your database
When you’ve decided to use synthetically generated data for testing, you’ll need to know how to generate (large amounts of) data that fits your database. With DATPROF Privacy that is very easy, since it’s not only a data masking tool but also a data generation tool. It delivers realistic, high-quality test data in both content and volume.
Once you’ve connected DATPROF Privacy to your database, it’s easy to add a generation function to your masking template, just like any other function, and generate data for that column in your database. You can also create or add new columns to generate synthetic data from scratch.
A great advantage of this approach is that all existing relationships between the tables remain unchanged. Your complex data structure remains functionally and technically consistent, but you use synthetic data instead of privacy sensitive production data, or as an addition to your existing data. Of course we also support the generation of test data over a chain of systems.
Try it yourself – 14 days for free
Mask privacy sensitive data and generate synthetic test data with DATPROF Privacy. Try 14 days for free. No credit card required.
Important note! The trial version of DATPROF Privacy does not give access to the data generation part by default. If you’re interested in synthetic data generation, please contact firstname.lastname@example.org after you’ve downloaded the trial version.
What is synthetic test data?
Synthetic test data is generated – fake – data that can and may be used for software testing. It doesn’t contain privacy sensitive information since it is not real.
What are the pros of synthetic test data?
- Using less data
- Perfectly aligned with your test cases
- No risk of data leakage
- Limited dependencies
- Savings on storage costs and, for example, licenses
What are the cons of synthetic test data?
You need to keep in mind all necessary attributes for your system. You need to know how many attributes your data model (not database) has, the functional requirements of your systems, data quality issues, historical data and so on.