5 Reasons to start subsetting

Reasons for data subsetting projects

December 2, 2015 | Maarten Urbach

Clients have several reasons for starting with data subsetting. In this blog I want to share these insights. This is just a selection of the most frequent reasons; different companies will have their own.

To start analyzing the reasons, we need a common understanding of what we call test data subsets. A test data subset is a smaller, referentially intact set of data that is extracted from a ‘production’ database and loaded into a non-production environment. More information about data subsetting can be found on our solutions page.
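To make this definition concrete, here is a minimal sketch in Python using an in-memory SQLite database. The `customers` and `orders` tables, their contents, and the helpers `make_production`, `extract_subset` and `count_orphans` are hypothetical and exist only for illustration; a real subsetting tool has to deal with many more tables and relations.

```python
import sqlite3

# Hypothetical two-table schema: every order references one customer.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL
);
"""


def make_production():
    """Build a tiny stand-in for a 'production' database (made-up data)."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, f"customer-{i}") for i in range(1, 101)],
    )
    # 200 orders spread evenly, so every customer has exactly two orders.
    db.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(n, (n - 1) % 100 + 1, 10.0 * n) for n in range(1, 201)],
    )
    db.commit()
    return db


def extract_subset(prod, customer_ids):
    """Copy the chosen customers and only their orders into a new database."""
    sub = sqlite3.connect(":memory:")
    sub.execute("PRAGMA foreign_keys = ON")  # reject rows that break a reference
    sub.executescript(SCHEMA)
    marks = ",".join("?" * len(customer_ids))
    # Parent rows first, then the child rows that reference them.
    parents = prod.execute(
        f"SELECT id, name FROM customers WHERE id IN ({marks})", customer_ids
    ).fetchall()
    sub.executemany("INSERT INTO customers VALUES (?, ?)", parents)
    children = prod.execute(
        f"SELECT id, customer_id, amount FROM orders WHERE customer_id IN ({marks})",
        customer_ids,
    ).fetchall()
    sub.executemany("INSERT INTO orders VALUES (?, ?, ?)", children)
    sub.commit()
    return sub


def count_orphans(db):
    """Number of orders whose customer is missing from the same database."""
    return db.execute(
        "SELECT COUNT(*) FROM orders o "
        "LEFT JOIN customers c ON c.id = o.customer_id "
        "WHERE c.id IS NULL"
    ).fetchone()[0]
```

Calling `extract_subset(make_production(), list(range(1, 11)))` yields a database with 10 customers and their 20 orders, and `count_orphans` returns 0 for it: the subset is smaller than production, yet every order still points to an existing customer.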

So what are the top 5 reasons why data subsetting projects are started?

1. Non-production environments are growing 3 times as fast as production

Many organizations decide that non-production environments, such as development, test and acceptance, shouldn’t grow anymore. For example, they decide that non-production gets a limited amount of storage space. Because of this decision you’ll need to use your data storage and infrastructure more efficiently.

The need for storage is increasing, especially with trends like the ‘internet of things’ and big data. Currently, when production grows by 1 terabyte of data, non-production grows by 3, because we copy the production database to acceptance, test and development databases. So to manage and reduce the data in non-production, organizations start a data subsetting project.
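The arithmetic behind this can be sketched in a few lines of Python. The three environments come from the text; the function name and the 10% subset ratio are made-up assumptions for illustration, not measured figures.

```python
def non_production_growth_tb(production_growth_tb, environments=3, subset_ratio=1.0):
    """Growth across all non-production environments, in terabytes.

    subset_ratio=1.0 means every environment receives a full copy of production;
    a lower value means every environment receives a subset of that relative size.
    """
    return production_growth_tb * environments * subset_ratio


# Full copies to acceptance, test and development: 1 TB in production becomes 3 TB.
full_copies = non_production_growth_tb(1.0)

# Hypothetical 10% subsets in each environment: roughly 0.3 TB in total.
with_subsets = non_production_growth_tb(1.0, subset_ratio=0.10)
```

How small a subset can be while still covering the test cases differs per organization, so the ratio is something to determine for your own data.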

2. The generation of test data doesn’t result in valid test cases

Organizations often generate or manually create their own test cases or test data. Using this synthetic test data has pros and cons. For example, it is very useful when developing a new function or adding new products to an application. But it has its limitations when:

  • test data is generated from scratch
  • test data is created manually for a data model with over 500 tables.

Why? For starters, you’ll want your highly educated developers and testers to do something more useful. Secondly, generating test data with the same variation that a production database has built up over its history takes a lot of creativity. Think, for example, of changed telephone numbers, bank account numbers, etc.

The most important reason for organizations to subset data instead of using synthetic test data is that they want to trust the test data. A smaller set of production data is more reliable than generated test data.

3. It is too intensive to create synthetic test data for a data model with over 1,000 tables

Having a large data model, for example over 1,000 tables, is a great reason to start using subsetting technology. Why? Because generating useful test data for such a data model is challenging, and manually inserting test data in such an environment is even worse.

Generating or manually creating synthetic test data is feasible when an organization has fewer than 200 tables. I wouldn’t enjoy it, but at this size it is possible. With more than 500 tables, generating can still be done, but creating useful test data gets more difficult. As the number of tables grows, generating or creating test data gets harder and harder. It can probably be done, but the results aren’t credible. For organizations with large data models subsetting technology can make the difference. And the test data is useful!

4. Test automation

Lately more clients ask us to create test data subsets for test automation. Many organizations already use test automation or have started thinking about it. Implementing test automation is a step towards a more mature test organization.

Many organizations choose a tool, implement it and start using it! Here we are with test automation… Leaving aside the saying about a fool with a tool, many organizations discover later on that they don’t know which data can be used for test automation.

Then they need some sort of data to automate their tests. To be able to use their tools, we see organizations implementing the following solutions:

  • We are going to generate test data, with all its pros and cons;
  • We use a copy of production; the biggest con of using a copy is that it is less efficient because of the large data set.

So a test data subset could be an ideal asset for test automation: less test data, more results.


5. Shorten the idle time

As a last reason: many organizations test batch processes, and a batch process can normally take 24 hours or more. One of the reasons for this long run time is the use of a copy of production to test the batch. Using test data subsets in this process has an effect straight away. So creating a subset of production will result in an improvement.

Maybe you will recognize some of these reasons, maybe you have other reasons. I am curious, and I hope this will help you and your organization! If you have any questions, don’t hesitate to contact us.
