Data subsetting

All you need to know about data extraction and test data preparation

We all strive to shorten the time-to-market of our software. With this goal in mind we automate tests, develop in agile and/or Scrum teams, adopt Continuous Delivery, and so on. All of these methods can be successful – can. Their success depends on your existing architecture and infrastructure, which means we need to innovate the way we use that infrastructure. The DTAP architecture we use today hasn’t changed significantly since its inception in the nineties. Software development, however, has changed a great deal, requiring us to innovate our non-production environments and their architecture.

1. Test data preparation

What is data subsetting?

Test data subsetting is extracting a smaller, referentially intact set of data from a ‘production’ database into a non-production environment. Many customers ask us: “How should we create a subset?” and “How usable is a subset compared to my copy of a production database?” The concept of data subsetting is surprisingly simple: take a consistent part of a database and transfer it to another database. That’s all. Of course, in practice it isn’t that simple. Selecting the right data for the job is tricky, whether it’s for testing or development, because you need to filter the data. The complexity lies in gathering exactly the right rows to create a dataset that is consistent across all tables and also fulfills the testers’ needs.

Current datasets

At the moment most organizations use production copies for test and development. When asked, these organizations often give arguments like: ‘This is the easiest way of working’, ‘Only production data contains all test cases’ or ‘We can only thoroughly test using production data’. These arguments might be valid in some (test) cases, but there are good reasons not to use full production copies:

  • Time-to-market keeps getting shorter, because ‘the business’ demands it and software lifecycles are shortening as well.
  • Methods like Scrum, Agile or DevOps mostly aim to deliver the right software faster. However, faster delivery places higher demands on the environments that support it, and large (sometimes huge) production copies don’t help.
  • Your production environment grows over time, and if you keep using full copies, the storage needed in non-production environments grows several times as fast. For example: growing from a production size of 1 terabyte to 2 terabytes means 2 TB in Acceptance + 2 TB in Test + 2 TB in Development = 6 TB of storage outside of production, a 3 TB increase for a 1 TB production increase (see the sketch after this list).
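
A minimal sketch of this storage arithmetic; the environment names are the usual DTAP stages and the 10% subset ratio at the end is an illustrative assumption, not a measured figure:

```python
production_tb = 2.0                                    # production grew from 1 TB to 2 TB
environments = ["Development", "Test", "Acceptance"]

full_copies_tb = production_tb * len(environments)     # 2 + 2 + 2 = 6 TB outside production
subsets_tb = production_tb * 0.10 * len(environments)  # 10% subsets: 0.6 TB in total

print(f"Full copies outside production: {full_copies_tb:.1f} TB")
print(f"10% subsets outside production: {subsets_tb:.1f} TB")
```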

Without question, most organizations don’t need all the data they have stored in their non-production environments, and it’s costing them money.

With the use of subsets:

  • The need for data storage is reduced (sometimes by more than 90%)
  • Idle times are significantly reduced
  • Test and development turnarounds are under tighter control
  • Developers and testers influence the data they get

→ 5 reasons to start subsetting

Creating a subset

So how do we create a useful subset? First of all, I think we need to change the way we look at (test) data and treat it the way science does. In research, working with samples is common practice; it simply isn’t feasible to interview the entire population of the USA or Holland. Of course, the sample needs to be representative! Based on that sample, something can be said or concluded about the whole population. We should approach (test) data in the same way: carefully choose test data as a representative sample and use it. Most of the time this needs some tweaking, but experiment (test) and learn from the results!

Many of our customers use filters based on a number of well-known test cases, complemented where needed with a random percentage of production data on top of that. The core of the subset is thus filled with known test cases, built on knowledge coming from the testers and developers, so in the end developers and testers influence their own test data. This usually results in a much smaller dataset: with DATPROF Subset we have seen a 6 terabyte production database reduced to 60 gigabytes!
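
A hypothetical sketch of how such a filter can be composed: a fixed set of well-known test case keys supplied by testers, topped up with a small random sample of the remaining production keys. The IDs, key range and percentage below are made up for illustration; DATPROF Subset configures filters through its own interface rather than hand-written scripts.

```python
import random

known_test_customers = {1001, 1002, 1003}   # keys of well-known test cases
all_customer_ids = range(1, 100_001)        # stand-in for the production keys
random_top_up_pct = 0.01                    # complement with ~1% random data

random.seed(42)
random_top_up = {
    cid for cid in all_customer_ids
    if cid not in known_test_customers and random.random() < random_top_up_pct
}
subset_keys = known_test_customers | random_top_up
print(f"Subset drives {len(subset_keys)} of {len(all_customer_ids)} customers")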

Tutorial: how to generate test data

What is subsetting? Where should you start? What things should you keep in mind?

2. Test data preparation tools

With DATPROF Subset you can extract specific selections from production databases and make them directly available in your test environments. DATPROF Subset selects data from a full-size production database, which is called the “Source database” in DATPROF Subset. This data goes into a copy (test) database, called the “Target database”. By applying filters while filling the Target database, the Target ends up smaller than the Source, which enables faster and cheaper testing.

3. Data extraction methods

The main method DATPROF Subset uses is to access the data via one central table in the database, called the Start table. Other data is extracted based on its foreign key relations with the Start table.

During deployment of the Subset project, only a connection between the Source and Target database is needed. No data passes through DATPROF Subset itself, which has a favourable impact on performance.
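
To make the Start table principle concrete, here is a toy illustration rather than how DATPROF Subset itself works: “customers” acts as the Start table and “orders” follows via its foreign key. The schemas and filter are assumptions, both tables are assumed to exist in source.db and target.db, and (unlike the real tool, which transfers data directly between Source and Target) this sketch pulls the rows through the client purely for clarity.

```python
import sqlite3

src = sqlite3.connect("source.db")
tgt = sqlite3.connect("target.db")

# 1. Apply the filter to the Start table in the Source database.
customers = src.execute(
    "SELECT id, name FROM customers WHERE name LIKE 'A%'"
).fetchall()
customer_ids = [row[0] for row in customers]

# 2. Extract child rows through their foreign key to the Start table.
placeholders = ",".join("?" * len(customer_ids))
orders = src.execute(
    f"SELECT id, customer_id, total FROM orders "
    f"WHERE customer_id IN ({placeholders})",
    customer_ids,
).fetchall()

# 3. Load the referentially consistent result into the Target database.
tgt.executemany("INSERT INTO customers (id, name) VALUES (?, ?)", customers)
tgt.executemany("INSERT INTO orders (id, customer_id, total) VALUES (?, ?, ?)", orders)
tgt.commit()
```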

4. Data preparation examples

Consider an example with three tables: there are 100 customers, 1,000 orders and 10,000 order lines in the source database. Every customer has 10 orders and every order has 10 related order_details. Only 10 customers have a name starting with ‘A’.

If the customers table is the Start table and we only select customers whose name starts with an ‘A’, this results in a target database with 10 customers, 100 orders and a total of 1,000 order_details.

The result is a target database that is 90% smaller than the source database, while retaining data consistency.
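
The counts above can be verified with a small, self-contained script. The table names and generated data mirror the example; SQLite is used only to keep the sketch runnable and is not part of the product.

```python
import sqlite3

# 100 customers with 10 orders each and 10 order_details per order;
# 10 customers have a name starting with 'A'.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id));
    CREATE TABLE order_details (id INTEGER PRIMARY KEY,
                                order_id INTEGER REFERENCES orders(id));
""")
for c in range(100):
    name = ("A" if c < 10 else "B") + f"_customer_{c}"
    db.execute("INSERT INTO customers VALUES (?, ?)", (c, name))
    for o in range(10):
        order_id = c * 10 + o
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, c))
        db.executemany(
            "INSERT INTO order_details VALUES (?, ?)",
            [(order_id * 10 + d, order_id) for d in range(10)],
        )

# Subset: Start table = customers, filter = name LIKE 'A%', children via the FKs.
print(db.execute("""
    SELECT COUNT(DISTINCT c.id), COUNT(DISTINCT o.id), COUNT(d.id)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    JOIN order_details d ON d.order_id = o.id
    WHERE c.name LIKE 'A%'
""").fetchone())  # -> (10, 100, 1000): 90% smaller, still consistent
```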

5. Data extraction tips

DATPROF Subset lets you save the filters and functions so they can be deployed repeatedly, creating a controlled subset process. After subsetting, the data in the target database can be used for testing, and it can also serve as the basis for anonymization with DATPROF Privacy.

Want to try yourself?

With our 14-day free trial you can try DATPROF Subset yourself. Generate and subset your own database and see how easy it can be to subset your test data.

Click on the download button and start today!

 
