Data subsetting

Shifting towards agile software testing with data subsets

We all strive to shorten the time-to-market of our software. With this goal in mind we automate tests, develop in agile and/or scrum teams, adopt Continuous Delivery, and so on. All these methods can be successful, but their success depends on your existing architecture and infrastructure. We need to innovate the way we use our current infrastructure: the DTAP architecture we use today hasn't changed significantly since its inception in the nineties, while software development has changed a great deal, requiring us to innovate our non-production environments and their architecture.

What is data subsetting?

Test data subsetting is extracting a smaller, referentially intact set of data from a 'production' database into a non-production environment. Many customers ask us: "How should we create a subset?" and "How usable is a subset compared to my copy of a production database?" The concept of data subsetting is surprisingly simple: take a consistent part of a database and transfer it to another database. That's all. In practice, though, it isn't that simple: selecting the right data for the job, whether that job is testing or development, is tricky, because you need to filter data. The complexity lies in gathering all the right data to create a dataset that is consistent across all tables and also fulfills the tester's needs.

Current datasets

At the moment most organizations use production copies for test and development. When asked, these organizations often use arguments like: 'This is the easiest way of working', 'Only production data contains all test cases' or 'We can only thoroughly test using production data'. These arguments might be valid in some test cases, but there are good reasons not to use full production copies:

  • Time-to-market is getting shorter, because 'The Business' demands it and the lifecycle of software keeps shrinking.
  • Methods like Scrum, Agile and DevOps mostly aim to deliver the right software faster. Faster delivery, however, places higher demands on the environments that support it, and large (sometimes huge) production copies aren't helpful in achieving this.
  • Your production environment grows over time, and if you keep using full copies of production, the storage needed in non-production environments grows along with it, multiplied by the number of environments. For example: growing from 1 terabyte to 2 terabytes in production means 2 TB in Acceptance + 2 TB in Test + 2 TB in Development = 6 TB of storage outside of production, an increase of 3 TB.

Without question, most organizations don’t need all the data they have stored in their non-production environment and it’s costing them money.
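The storage arithmetic above can be sketched as a quick back-of-the-envelope calculation (the environment names are illustrative):

```python
def nonprod_storage_tb(prod_tb, environments=("acceptance", "test", "development")):
    """Storage consumed outside production when every non-production
    environment holds a full copy of the production database."""
    return {env: prod_tb for env in environments}

before = nonprod_storage_tb(1)  # production at 1 TB
after = nonprod_storage_tb(2)   # production grown to 2 TB

total = sum(after.values())
growth = total - sum(before.values())
print(total, growth)  # 6 TB outside production, an increase of 3 TB
```

Every terabyte added in production costs one more terabyte in each environment that holds a full copy.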

With the use of subsets:

  • The need for data storage decreases (sometimes by more than 90%)
  • Idle times are significantly reduced
  • Test and development turnarounds are under much tighter control
  • Developers influence the data they get

Also read: 5 reasons to start subsetting


Creating a subset

So how do we create a useful subset? First of all, I think we need to change the way we look at (test) data. We need to look at data the same way science does. In research, using samples is common practice; it simply isn't feasible to interview the entire population of the USA or Holland. Of course, the sample needs to be representative! Based on that representative sample, something can be said or concluded about the complete population. We should approach (test) data in the same way: carefully choose test data as a representative sample and use it. Most of the time this needs some tweaks, but experiment (test) and learn from the results!

Many of our customers use filters based on a number of well-known test cases, complemented where needed with a random percentage of the production data on top of that. The core of the subset is thus filled with known test cases, based on knowledge coming from the testers and developers, so in the end developers and testers influence their own test data. Often this results in a much smaller dataset: with subsets we have seen a production database of 6 terabytes reduced to 60 gigabytes!
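A minimal sketch of that filtering approach follows; the IDs, fraction and row shape are purely illustrative, not DATPROF's API:

```python
import random

def build_subset(rows, known_ids, extra_fraction=0.01, seed=42):
    """Core of the subset: rows matching well-known test cases,
    topped up with a random sample of the remaining production rows."""
    core = [r for r in rows if r["id"] in known_ids]
    rest = [r for r in rows if r["id"] not in known_ids]
    rng = random.Random(seed)  # seeded, so the subset is reproducible
    extra = rng.sample(rest, int(len(rest) * extra_fraction))
    return core + extra

production = [{"id": i} for i in range(100_000)]
subset = build_subset(production, known_ids={1, 42, 999}, extra_fraction=0.001)
print(len(subset))  # 102: 3 known test cases + 99 random rows
```

The known test cases guarantee the data the team relies on is present; the random top-up keeps the subset representative of the rest of production.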

Tutorial: how to subset test data

What is subsetting? Where should you start? What things should you keep in mind?

Test data preparation tools

With DATPROF Subset you can extract specific selections out of production databases and make them directly available within your test environments. DATPROF Subset selects data from a full-size production database, called the "Source database" in DATPROF Subset, and transfers it to a copy (test) database, called the "Target database". By applying filters while filling the Target database, this database ends up smaller than the Source database, which enables faster and cheaper testing.

Data extraction methods

The main method DATPROF Subset uses is to access the data via one central table in the database, called the start table. Other data is extracted based upon its foreign key relations with the start table.

During deployment of the subset project only a connection between source and target database is needed. No data passes through DATPROF Subset. This has a favourable impact on performance.
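As a hypothetical sketch of the start-table approach (the schema representation and helper below are my own illustration, not DATPROF's implementation): rows are selected from the start table by a filter, and related tables are pulled in by following foreign key relations.

```python
from collections import deque

def extract_subset(schema, data, start_table, predicate):
    """schema maps child_table -> (fk_column, parent_table);
    data maps table -> list of row dicts, each with an 'id' key.
    Returns the set of row ids to keep per table."""
    keep = {start_table: {r["id"] for r in data[start_table] if predicate(r)}}
    queue = deque([start_table])
    while queue:
        parent = queue.popleft()
        for child, (fk, parent_table) in schema.items():
            if parent_table != parent:
                continue
            # keep child rows whose foreign key points at a kept parent row
            ids = {r["id"] for r in data[child] if r[fk] in keep[parent]}
            if ids - keep.get(child, set()):
                keep.setdefault(child, set()).update(ids)
                queue.append(child)
    return keep

schema = {"orders": ("customer_id", "customers")}
data = {
    "customers": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}],
    "orders": [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 2}],
}
kept = extract_subset(schema, data, "customers",
                      lambda r: r["name"].startswith("A"))
print(kept)  # {'customers': {1}, 'orders': {10}}
```

Filtering only the start table and letting the relations do the rest is what keeps the resulting dataset referentially intact.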

Data preparation examples

As an example, consider three tables: there are 100 customers, 1000 orders and 10000 order_details in the source database. Every customer has 10 orders and every order has 10 related order_details. Only 10 customers have a name starting with 'A'.

If the customers table is the start table and we only select customers whose name starts with an 'A', the result is a target database with 10 customers, 100 orders and a total of 1000 order_details.

The result is a target database that is 90% smaller than the source database, while retaining data consistency.
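These figures can be reproduced in a small self-contained sketch; here an in-memory SQLite database stands in for the source, and the sub_* tables play the role of the target (the table names follow the example above, everything else is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(id));
CREATE TABLE order_details (id INTEGER PRIMARY KEY,
                            order_id INTEGER REFERENCES orders(id));
""")
# 100 customers, 10 of which have a name starting with 'A'
for c in range(100):
    name = ("A" if c < 10 else "B") + f"customer{c}"
    cur.execute("INSERT INTO customers VALUES (?, ?)", (c, name))
    for o in range(10):                      # 10 orders per customer
        oid = c * 10 + o
        cur.execute("INSERT INTO orders VALUES (?, ?)", (oid, c))
        for d in range(10):                  # 10 order_details per order
            cur.execute("INSERT INTO order_details VALUES (?, ?)",
                        (oid * 10 + d, oid))

# Subset: filter the start table, then follow the foreign keys
cur.executescript("""
CREATE TABLE sub_customers AS
    SELECT * FROM customers WHERE name LIKE 'A%';
CREATE TABLE sub_orders AS
    SELECT * FROM orders
    WHERE customer_id IN (SELECT id FROM sub_customers);
CREATE TABLE sub_order_details AS
    SELECT * FROM order_details
    WHERE order_id IN (SELECT id FROM sub_orders);
""")
counts = [cur.execute(f"SELECT COUNT(*) FROM sub_{t}").fetchone()[0]
          for t in ("customers", "orders", "order_details")]
print(counts)  # [10, 100, 1000]
```

The target holds 1110 of the original 11100 rows, a 90% reduction, and every order still points at a customer that exists.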

Test data management

DATPROF Subset provides features to save the defined filters, rules and functions, so they can be deployed repeatedly. This creates a controlled subset process. After data subsetting, the data in the target database can be used for testing purposes, but it can also serve as the basis for anonymizing the data using DATPROF Privacy. Both subsetting and masking can be automated and integrated into your CI/CD pipeline with the help of DATPROF Runtime.


Extract small reusable subsets from large complex databases with DATPROF Subset. Try 14 days for free.
No credit card required.
