Synthetic data generation

Using privacy-sensitive production data for software testing and quality assurance is not only outdated, it also conflicts with privacy laws and standards such as GDPR, HIPAA, and PCI DSS. But for the best software tests, you need test data that is “production-like”, right?

So how do you make sure the data sets in your test environment are representative, yet not traceable to a natural person? The answer: data masking and synthetic test data generation, preferably in combination.

What is synthetic test data?

Synthetic test data is ‘fake’ or ‘dummy’ data that can be used for developing and testing applications. It is not based on real data or existing information: it is artificially created with the help of algorithms. There are two main reasons to generate synthetic test data: 1) to replace privacy-sensitive information, or 2) to meet specific needs or conditions that may not be found in the production data.

Synthetic data is also called fake data, dummy data, mock data, or example data. We simply call it synthetic data. By synthetically generated test data we mean data:

  • that is derived from a seed file;
  • that is randomly generated; or
  • that is generated based on logic.
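
The three kinds of generation listed above can be illustrated with a small Python sketch. It is only an illustration, not how any particular tool works: the seed values, the column names, and the helper functions are all assumptions made up for this example.

```python
import csv
import io
import random

# Hypothetical seed data; in practice this would be a CSV file on disk.
SEED_CSV = "first_name\nAnna\nBram\nChloe\nDaan\n"


def load_seed(text):
    """Read the values of the single-column seed CSV into a list."""
    return [row["first_name"] for row in csv.DictReader(io.StringIO(text))]


def from_seed(rng, seed_values):
    """Derived from a seed file: pick one of the prepared values."""
    return rng.choice(seed_values)


def random_string(rng, length=8):
    """Randomly generated: a string of random lowercase letters."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(rng.choice(alphabet) for _ in range(length))


def email_from_name(first_name, row_number):
    """Generated based on logic: build a value from other column values."""
    return f"{first_name.lower()}.{row_number}@example.com"


rng = random.Random(42)  # fixed seed so repeated runs give the same data
names = load_seed(SEED_CSV)

rows = []
for row_number in range(1, 4):
    name = from_seed(rng, names)
    rows.append({
        "id": row_number,
        "first_name": name,
        "token": random_string(rng),
        "email": email_from_name(name, row_number),
    })

for row in rows:
    print(row)
```

Seeding the random number generator is a design choice: it makes a generated data set reproducible, which helps when a failing test has to be rerun against the same data.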

Data masking with synthetic data

With data masking, you can apply masking rules such as shuffle, redact, and blank. In some cases, however, these rules aren’t sufficient to create data that is untraceable to a natural person. You can then use a dummy data generator to create synthetic test data as part of your masking project, replacing privacy-sensitive data like names, email addresses, and bank account numbers. This also helps you align your test data with your test cases.
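
To make the three masking rules concrete, here is a minimal Python sketch. The exact behaviour of shuffle, redact, and blank in DATPROF Privacy is not described in this article, so the functions below are one plausible interpretation; the sample rows and function names are hypothetical.

```python
import random


def shuffle_column(rows, column, rng):
    """Shuffle: redistribute the existing values of a column over the rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)  # a value may happen to land on its original row
    return [{**row, column: value} for row, value in zip(rows, values)]


def redact_column(rows, column, keep=1, mask_char="x"):
    """Redact: keep the first `keep` characters and mask the rest."""
    return [
        {**row, column: row[column][:keep] + mask_char * (len(row[column]) - keep)}
        for row in rows
    ]


def blank_column(rows, column):
    """Blank: replace every value in the column with an empty value."""
    return [{**row, column: None} for row in rows]


# Hypothetical sample data.
customers = [
    {"id": 1, "last_name": "Jansen", "phone": "0612345678"},
    {"id": 2, "last_name": "Bakker", "phone": "0687654321"},
    {"id": 3, "last_name": "Visser", "phone": "0611223344"},
]

rng = random.Random(1)
shuffled = shuffle_column(customers, "last_name", rng)
redacted = redact_column(customers, "last_name")
blanked = blank_column(customers, "phone")
```

The sketch also shows why these rules are sometimes not enough: shuffling keeps every real last name in the data set, and redaction keeps the first letter and the length. Replacing the value with generated data removes that link.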

Synthetic data generators within DATPROF Privacy

Basic 

  • Random string
  • Random date/time
  • Random number
  • Random decimal number
  • Sequential numbers
  • Color
  • Color code
  • And more…

Names

  • Brand
  • Company
  • Male First name
  • Female First name
  • Last name
  • Location
  • Country Code
  • City
  • Street
  • Country
  • And more…

Business

  • BSN (Dutch Social Security Number)
  • SSN (US Social Security Number)
  • IBAN
  • Currency Code
  • Currency Symbol
  • Military rank
  • Job/profession
  • And more…

Advanced

  • Random value from seed file (Pick values from a custom CSV seed file)
  • Regular expression (Generate values based on a regular expression)
  • And more…
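
Some of the generators listed above, such as BSN and IBAN, have to produce values that pass a checksum, otherwise the application under test may reject them. The following Python sketch shows the two publicly documented checks: the eleven-test for a BSN and the mod-97 check digits of an IBAN. It is an illustration only and says nothing about how DATPROF Privacy implements its generators; the bank code "TEST" and all function names are assumptions.

```python
import random
import string

# Weights of the BSN eleven-test; the last digit counts negatively.
BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def is_valid_bsn(bsn):
    """Eleven-test: the weighted digit sum must be divisible by 11."""
    if len(bsn) != 9 or not bsn.isdigit():
        return False
    return sum(w * int(d) for w, d in zip(BSN_WEIGHTS, bsn)) % 11 == 0


def random_bsn(rng):
    """Draw eight random digits and derive a ninth that passes the test."""
    while True:
        first8 = [rng.randint(0, 9) for _ in range(8)]
        check = sum(w * d for w, d in zip(BSN_WEIGHTS, first8)) % 11
        # A remainder of 10 has no single-digit check value, so draw again.
        # The all-zero number is skipped as well.
        if check < 10 and any(first8):
            return "".join(map(str, first8)) + str(check)


def _to_number(text):
    """Replace letters by numbers (A=10 ... Z=35) and read the result as an integer."""
    return int("".join(str(int(ch, 36)) for ch in text))


def iban_check_digits(country, bban):
    """Check digits = 98 minus (BBAN + country code + '00') modulo 97."""
    remainder = _to_number(bban + country + "00") % 97
    return f"{98 - remainder:02d}"


def is_valid_iban(iban):
    """Move the first four characters to the end; the number modulo 97 must be 1."""
    return _to_number(iban[4:] + iban[:4]) % 97 == 1


def random_dutch_iban(rng, bank_code="TEST"):
    """Dutch layout: NL, 2 check digits, 4-letter bank code, 10-digit account number."""
    account = "".join(rng.choice(string.digits) for _ in range(10))
    bban = bank_code + account
    return "NL" + iban_check_digits("NL", bban) + bban


rng = random.Random(2024)
print(random_bsn(rng))
print(random_dutch_iban(rng))
```

Values generated this way are formally valid but random, so a generated number can coincide with one that exists in reality. The sketch covers only the IBAN-level checksum, not any bank-specific rules for the account number itself.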

Generate data from scratch

Although much is said about using synthetic data as a masking technique, the need to generate data from scratch should not be forgotten. If you’re developing an app for a brand-new system that doesn’t hold any data yet, there’s nothing to mask. You still need test data to check whether the app works with ‘production-like’ data. Or you simply need volumes of data that you don’t have. In these cases, you can generate test data with a synthetic data generation tool: you decide what kind of data (columns, tables) you need, and the tool fills these tables with representative, real-looking data.
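
As a rough illustration of generating from scratch, the Python sketch below creates an empty table and fills it with generated rows. It uses an in-memory SQLite database so that it runs anywhere; the table layout, the name lists, and the date range are assumptions chosen for the example.

```python
import random
import sqlite3
from datetime import date, timedelta

# Hypothetical value lists; a real project would use larger seed files.
FIRST_NAMES = ["Anna", "Bram", "Chloe", "Daan", "Eva"]
LAST_NAMES = ["Jansen", "de Vries", "Bakker", "Visser"]


def random_date(rng, start, end):
    """Pick a random date between start and end (both inclusive)."""
    return start + timedelta(days=rng.randint(0, (end - start).days))


def fill_customers(conn, row_count, rng):
    """Create the customer table and fill it with generated rows."""
    conn.execute(
        "CREATE TABLE customer ("
        " id INTEGER PRIMARY KEY,"
        " first_name TEXT NOT NULL,"
        " last_name TEXT NOT NULL,"
        " birth_date TEXT NOT NULL)"
    )
    rows = [
        (
            row_id,
            rng.choice(FIRST_NAMES),
            rng.choice(LAST_NAMES),
            random_date(rng, date(1950, 1, 1), date(2005, 12, 31)).isoformat(),
        )
        for row_id in range(1, row_count + 1)
    ]
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
fill_customers(conn, 1000, random.Random(7))
count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(count, "rows generated")
```

Changing `row_count` is all it takes to scale the volume up, which is the main advantage over writing test rows by hand.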

Test data generation tools

Most database specialists know how to write test data, but doing this manually on a regular basis takes too much time. That is why the demand for synthetic test data (and test data generation tools) is growing. Test data creation should be a means to an end, not a goal in itself.

Synthetic test data can be made with a test data generator tool. Some free test data generators can be found with a simple search on the internet. For a simple job, such as generating a dozen first names, this is a great option. However, when you have a table with multiple columns that also have relationships with other tables, the job quickly becomes unreliable or outright impossible with a free mock data generator.

Generating test data in itself is not the most complicated part. Algorithms do this for us. The challenge is making sure that the data continues to behave properly within a database, so that it can be used for good tests. Just like data masking, generation needs planning and careful configuration, specifically when defining Primary Key start values or any other unique constraint implemented within a table. For that, a licensed synthetic data generation tool is usually the better choice. In general, these tools offer many more capabilities and provide technical and functional consistency, which is indispensable for good development and testing work.
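
The point about Primary Key start values and unique constraints can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a description of any tool: the two tables, the `next_key` helper, and the e-mail pattern are all hypothetical, and SQLite stands in for a real test database.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce foreign keys
conn.executescript("""
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount      REAL NOT NULL
    );
    -- one row that is already present before generation starts
    INSERT INTO customer VALUES (1, 'existing@example.com');
""")


def next_key(conn, table):
    """Primary Key start value: one above the highest key already in use."""
    # The table name comes from this script itself, never from user input.
    query = f"SELECT COALESCE(MAX(id), 0) + 1 FROM {table}"
    return conn.execute(query).fetchone()[0]


def generate(conn, customer_count, orders_per_customer, rng):
    """Add customers and orders without violating keys or constraints."""
    first_customer = next_key(conn, "customer")
    customer_ids = list(range(first_customer, first_customer + customer_count))
    # Deriving the e-mail from the key keeps the generated values unique.
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(cid, f"generated{cid}@example.com") for cid in customer_ids],
    )
    # Child rows only reference customer ids that were just inserted.
    parents = [cid for cid in customer_ids for _ in range(orders_per_customer)]
    first_order = next_key(conn, "orders")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (first_order + n, cid, round(rng.uniform(5, 500), 2))
            for n, cid in enumerate(parents)
        ],
    )
    conn.commit()
    return customer_ids


new_ids = generate(conn, 5, 3, random.Random(3))
orphans = conn.execute(
    "SELECT COUNT(*) FROM orders o"
    " LEFT JOIN customer c ON c.id = o.customer_id"
    " WHERE c.id IS NULL"
).fetchone()[0]
print(new_ids, "orphaned orders:", orphans)
```

Even in this two-table case the order of operations matters: parents before children, and key ranges chosen after looking at what already exists. With dozens of related tables, this bookkeeping is what makes manual generation error-prone.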

How to generate test data for your database

When you’ve decided to use synthetically generated data for testing, you’ll need to know how to generate (large amounts of) data so that it fits your database. With DATPROF Privacy that is very easy, since it’s not only a data masking tool but also a test data generation tool. It delivers realistic, high-quality test data in the content and volume you need.

Once you’ve connected DATPROF Privacy to your database (it supports all major relational databases, such as SQL Server, Oracle, DB2, and many more), you add a generation function to a column in your masking template like any other function, and it generates data for that column in your database. You can also add new columns and fill them with random data from scratch.

A great advantage of this approach is that all existing relationships between the tables remain unchanged. Your complex data structure remains functionally and technically consistent, but you use synthetic data instead of privacy-sensitive production data or as an addition to your existing data. Of course, we also support the generation of test data over a chain of systems.

FAQ

What is synthetic test data?

Synthetic test data is generated – fake – data that can and may be used for software testing. It doesn’t contain privacy-sensitive information since it is not real.

What is synthetic test data generation?

Synthetic test data generation is the process of random data creation (from scratch or to replace existing data) with the help of a test data generation tool.

How to generate synthetic test data?

Synthetic data can be generated manually or with the help of a synthetic data generation tool. The latter is the best option if you need volumes of data that you don’t have.

What are the pros of synthetic test data?

  • Using less data
  • Perfectly aligned with your test cases
  • No risk of data leakage
  • Limited dependencies
  • Savings on storage costs and, for example, licenses

What are the cons of synthetic test data?

You need to keep in mind all the necessary attributes for your system. You need to know how many attributes your data model (not database) has, the functional requirements of your systems, data quality issues, historical data, and so on.