How useful is a subset?

How useful is a subset of production for testing and development?


Many customers ask me: “How should we create a subset?” and “How usable is a subset compared to a copy of our production database?”

The concept of data subsetting is surprisingly simple: take a consistent part of a database and transfer it to another database. That’s all. Of course, actual data subsetting isn’t that simple. Selecting the right data for the job is especially tricky, whether the job is testing or development. Why? Because you need to filter data. The complexity lies in gathering all the right data to create a dataset that is consistent across all tables and also fulfills the testers’ needs.
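To make "consistent" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two-table schema (customers and orders) and the function names are assumptions made up for illustration; they do not come from any particular subsetting product. The idea is simply that once a set of parent rows is chosen, every child row that refers to them comes along, and nothing else does.

```python
import sqlite3

# Hypothetical schema for illustration: orders refer to customers.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL NOT NULL
);
"""

def make_db():
    """Create an empty in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def copy_consistent_subset(source, target, customer_ids):
    """Copy the chosen customers plus every order that refers to them."""
    ids = sorted(customer_ids)
    if not ids:
        return (0, 0)
    marks = ",".join("?" for _ in ids)
    # Parent rows go first, so each order's foreign key has a target.
    customers = source.execute(
        f"SELECT id, name FROM customers WHERE id IN ({marks})", ids
    ).fetchall()
    target.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)", customers
    )
    orders = source.execute(
        f"SELECT id, customer_id, amount FROM orders "
        f"WHERE customer_id IN ({marks})", ids
    ).fetchall()
    target.executemany(
        "INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)", orders
    )
    target.commit()
    return (len(customers), len(orders))

# Small made-up "production" database.
source = make_db()
source.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Cho")],
)
source.executemany(
    "INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 2, 40.0), (12, 3, 15.5), (13, 3, 99.0)],
)
source.commit()

target = make_db()
copied = copy_consistent_subset(source, target, {1, 3})
print(copied)  # (customers copied, orders copied)
```

A real schema has many more tables and relationships in both directions, which is exactly where the filtering gets hard; the sketch only shows the principle for a single parent-child pair.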

Current datasets?

At the moment, most organisations use production copies for test and development. When asked why, they often use arguments like ‘This is the easiest way of working’, ‘Only production data contains all test cases’ or ‘We can only test thoroughly using production data’. These arguments might be valid in some (test) cases, but there are reasons not to use full production copies:

  • The available time-to-market is getting shorter, because ‘The Business’ demands it and the lifecycle of software is shortening as well.
  • To me it feels that methods like Scrum, Agile or DevOps mostly aim to deliver the right software faster. However, faster delivery places higher demands on the environments that support it. Large (sometimes huge) production copies don’t help in achieving this.
  • Your production environment grows over time, and the storage needed in non-production environments grows several times as fast if you keep using full copies of production. For example, going from a production size of 1 terabyte to 2 terabytes results in 2 TB in Acceptance + 2 TB in Test + 2 TB in Development = 6 TB in total outside of production, which is 3 TB of extra storage for 1 TB of production growth.
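The storage arithmetic in the last point can be checked with a few lines of Python. This is only a sketch: the function name is made up, and it assumes three non-production environments (Acceptance, Test and Development) that each hold one full copy of production.

```python
def non_production_storage_tb(production_tb, full_copies=3):
    """Total storage outside production when every environment holds a full copy."""
    return production_tb * full_copies

before = non_production_storage_tb(1)  # three copies of a 1 TB database
after = non_production_storage_tb(2)   # three copies of a 2 TB database
increase = after - before

print(f"before: {before} TB, after: {after} TB, increase: {increase} TB")
```

With three full copies, every terabyte added in production adds three terabytes elsewhere, so the total outside production goes from 3 TB to 6 TB.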

Without question, most organisations don’t need all the data they have stored in their non-production environments, and it’s costing them money.

Filling the subset

So how do we create a useful subset? First of all, I think we need to change the way we look at (test) data. We need to look at data the same way science does. In research, using samples is common sense: it just isn’t feasible to interview the complete population of the USA or Holland. Of course, the sample needs to be representative! Based on that sample, something can be said or concluded about the complete population. We should approach (test) data in the same way: carefully choose test data as a representative sample and use it. Most of the time this needs some tweaking, so experiment (test) and learn from the results!

Many of our customers use filters based on a number of well-known test cases, complemented if needed with a random percentage of the production data on top of that. The core of the subset is therefore filled with known test cases, based on knowledge coming from the testers and developers. So in the end, developers and testers influence their own test data. Often this results in a smaller dataset. With the use of Subset we see that it’s possible to reduce a production database of 6 terabytes to 60 gigabytes!
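The filter described above (known test cases first, then a random percentage on top) can be sketched in a few lines of Python. The function name, its parameters and the fixed seed are assumptions for illustration only; this is not how any specific tool implements its filters.

```python
import random

def select_subset_ids(all_ids, known_case_ids, random_fraction, seed=0):
    """Return the known test-case IDs plus a random share of the remaining IDs."""
    if not 0 <= random_fraction <= 1:
        raise ValueError("random_fraction must be between 0 and 1")
    everything = set(all_ids)
    # Only keep known cases that actually exist in production.
    known = set(known_case_ids) & everything
    remaining = sorted(everything - known)
    # Round down, so the random top-up never exceeds the requested share.
    extra_count = int(len(remaining) * random_fraction)
    # A fixed seed keeps the subset reproducible between refreshes.
    rng = random.Random(seed)
    extra = rng.sample(remaining, extra_count)
    return known | set(extra)

# Hypothetical example: 100 customers, 3 of them are known test cases.
production_ids = range(1, 101)
known_cases = {1, 2, 3}
subset_ids = select_subset_ids(production_ids, known_cases, random_fraction=0.1)
print(len(subset_ids), "of", len(production_ids), "customers selected")
```

The selected IDs are only the starting point: all related rows in other tables still have to follow them to keep the subset consistent.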
