Test Data Management
The major improvement in test data availability
Test data management (TDM) is the new big challenge we face in software development and quality. Test data must be highly available and easy to refresh if you want to improve the time to market of your software. New legislation such as the GDPR makes TDM even more of a challenge. So how do you want your test data to be managed?
Why test data management?
There are multiple reasons to start with test data management. The two we hear most often are:
- We need to do something with data masking or test data generation because of the GDPR or other privacy laws
- We want to go to market faster, but the large environments are holding us back
Let's start with the first reason: the GDPR or, more generally, privacy laws. Because of privacy laws, many organizations start to think about test data masking or synthetic test data generation. The importance of test data masking lies in the fact that 15% of all bugs that are found are data related. These data-related issues are caused by, for example, poor data quality. Because masking works on the data that is already there, it keeps the records behind these 15% data-related issues in your test data set, so these bugs are found and solved before you go to production.
Privacy laws and test data
Traditionally we make copies of production and use these for software development. So we make copies in non-production environments like development, test and acceptance. Because of privacy laws like the GDPR, you can no longer use these copies of production. We see many organisations turn to two solutions:
- Generating synthetic test data
- Masking, anonymizing, depersonalizing or pseudonymizing your (test) data
We believe in the best of both worlds. Not everything can be solved with synthetic test data. Real test data engineers will tell you that, with the current state of technology, synthetic test data alone can’t resolve all of the problems. Why? Because the problem with synthetic data is, in many cases, the complexity of the systems and the number of applications that need to be compliant. Generating test data for a single system of more than 100 tables is already time consuming. So generating test data across multiple systems is … terrible. But when you are developing, not every kind of data is available in production. So you’ll probably always need some synthetically generated test data.
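As an illustration of where synthetic data fits, the sketch below generates a handful of customer records for a feature that does not exist in production yet. It is a minimal example under stated assumptions, not a description of any particular tool: the field names, the name lists and the `NEW_PLAN` contract type are all hypothetical.

```python
import random
from datetime import date, timedelta

# Hypothetical value lists; a real generator would use far richer sources.
FIRST_NAMES = ["Anna", "Bram", "Chloe", "Daan", "Eva"]
LAST_NAMES = ["Jansen", "Smith", "Garcia", "Nguyen", "Kowalski"]


def generate_customers(count, seed=42):
    """Return `count` synthetic customer rows, reproducible for a given seed."""
    rng = random.Random(seed)  # seeded so every refresh yields the same data
    earliest_birth = date(1950, 1, 1)
    customers = []
    for customer_id in range(1, count + 1):
        customers.append({
            "id": customer_id,
            "first_name": rng.choice(FIRST_NAMES),
            "last_name": rng.choice(LAST_NAMES),
            "birth_date": earliest_birth + timedelta(days=rng.randrange(20000)),
            # Hypothetical contract type that is still under development,
            # so no production record could contain it yet.
            "contract_type": "NEW_PLAN",
        })
    return customers


if __name__ == "__main__":
    for row in generate_customers(3):
        print(row)
```

Even this toy example covers only one table; keeping foreign keys and business rules consistent across a hundred tables is where the effort described above comes from.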
Masking test data, as we like to call it (you may also call it depersonalizing, scrambling or anything else), solves the two challenges mentioned above. Technically, pseudonymization is different from masking or anonymizing. With data masking you mask the data that is already there. So you don’t have to generate test data for multiple tables and columns: you mask data that already exists. That means even the terrible data quality issues are preserved and the strange test cases remain. And you are then able to mask other systems in the same way.
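One common way to mask existing values is to replace them deterministically, so the same input always gets the same masked output. The sketch below shows the idea with a keyed hash; it is an assumption about how masking could be done, not the method of any specific product. The key, the replacement name list and the function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the test environment.
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical pool of replacement names.
FAKE_LAST_NAMES = ["Jansen", "Smith", "Garcia", "Nguyen", "Kowalski", "Okafor"]


def _digest(value: str) -> bytes:
    """Keyed hash of a value: same input and key always give the same digest."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()


def mask_name(name: str) -> str:
    """Replace a name with a fake one, chosen deterministically."""
    if not name:
        return name  # keep empty values empty: the data quality issue survives
    index = int.from_bytes(_digest(name)[:4], "big") % len(FAKE_LAST_NAMES)
    return FAKE_LAST_NAMES[index]


def mask_email(email: str) -> str:
    """Mask an e-mail address while keeping its shape, even if it is malformed."""
    if not email:
        return email
    local, separator, _domain = email.partition("@")
    token = "user_" + _digest(local).hex()[:10]
    # An address without "@" stays without "@", so the odd test case remains.
    return token + separator + ("example.com" if separator else "")


if __name__ == "__main__":
    print(mask_name("de Vries"), mask_email("jan.devries@corp.nl"))
    print(mask_email("not-an-email"))
```

Because the replacement depends only on the input and the key, running the same functions against a second system maps the same person to the same masked values, which keeps records linkable across systems.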
Go to market faster with your software
The second reason why test data is getting more important is going to market faster. The software development market is changing rapidly. Every organization we talk to nowadays discusses the opportunity of going to market faster. To reach this goal, organizations talk about DevOps, Agile and all kinds of software development methods. But in many cases the technical infrastructure is holding them back. The soft side of agile is “pretty” easy. The hard part is infrastructure. We all still live in a waterfall era, because test databases cannot cope with the number of agile teams.
Nowadays it sometimes takes 5 to 7 working days before a test database is refreshed! In today’s fast-paced software delivery that should be unacceptable. To make things even worse, the same refresh takes more than three people. So a lot of time is wasted in your fast software delivery pipeline. Getting in control of your test data, reducing the time needed for a refresh and aligning test data with the project are critical for success.
We traditionally use full copies of the production database and deploy these in a DTAP environment. Workable, but sizing is an issue. The data you store probably won’t decrease in size; it will grow. And we also see that for one production application there are multiple copies in test environments. The other major challenge is that teams frustrate each other, because in many cases you don’t have as many databases as you have teams. You probably have more teams than databases. So test cases can be ruined before a team has had the chance to execute its test. You need to get this under control, especially if you want to start with continuous integration or continuous deployment.
Test data provisioning
It is important that it becomes easy to deploy test data to different environments. In the current state of your ‘test data management’ you probably go to a database administrator and ask whether a refresh of the database can be arranged. But the database administrator is busy and doesn’t really have the time, or doesn’t want to make the time, to refresh your test environment. So the first minutes are already wasted.
So as a software development team you want to manage your refreshes and your test cases yourself, either directly or via a test data management team. You want a self-service portal for refreshing the test environments.
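To give a feel for what a self-service refresh could do under the hood, here is a minimal sketch that restores a team’s test database from a prepared “golden” copy. It uses SQLite and Python’s built-in `sqlite3` backup call purely for illustration; the table name, the sample rows and the `refresh` function are hypothetical, and a real portal would call the tooling of your own database platform.

```python
import sqlite3


def build_golden_copy() -> sqlite3.Connection:
    """Create a small, already prepared data set (hypothetical content)."""
    golden = sqlite3.connect(":memory:")
    golden.execute("CREATE TABLE test_cases (id INTEGER PRIMARY KEY, status TEXT)")
    golden.executemany(
        "INSERT INTO test_cases VALUES (?, ?)",
        [(1, "open"), (2, "open"), (3, "blocked")],
    )
    golden.commit()
    return golden


def refresh(golden: sqlite3.Connection, team_db: sqlite3.Connection) -> None:
    """Overwrite the team's database with the golden copy."""
    team_db.rollback()       # discard uncommitted work; backup needs no open transaction
    golden.backup(team_db)   # copies every page of the golden copy into team_db


if __name__ == "__main__":
    golden = build_golden_copy()
    team_db = sqlite3.connect(":memory:")
    refresh(golden, team_db)

    # A test run manipulates the data ...
    team_db.execute("DELETE FROM test_cases")
    team_db.commit()

    # ... and the team restores it without waiting for anyone.
    refresh(golden, team_db)
    print(team_db.execute("SELECT COUNT(*) FROM test_cases").fetchone()[0])
```

The point of the sketch is the workflow, not the technology: the team triggers the refresh itself, and the smaller the golden copy, the faster that refresh completes.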
When is the moment?
What’s also troublesome is that there aren’t enough databases available for the number of teams. So what happens is that multiple teams or people are working in the same database.
Software quality engineers are all searching for the same test cases and executing tests against the system. The problems occur during test execution: a colleague has already used the same test case and the test data has already been manipulated. The test case has become useless and the search starts all over again. Once more, time is wasted finding the right test cases and test data.
So in this case you want test data to be easily refreshed, so you are able to use your selected test case again. And what if you had your own test database? This last part can be achieved when you start with test data subsetting. Creating smaller test databases gives you the ability to have multiple environments. Teams don’t need a 100% full copy of a production database. They only need the important test cases, and at a minimum the data behind the 15% data-related issues. In that case you are able to create really small test databases. These are easy to deploy, highly available and easy to refresh.
There is a perfect solution to this problem. At first you probably won’t like it, but please keep reading. A simple solution for the problem is: giving every team their own database. Problem solved, nobody messes up the test cases and no time is wasted. But with this solution we introduce a new problem, because now we probably are facing a storage problem.
With data subsetting you are able to fix this problem. With data subsetting you only extract the test cases and test data that you need, so you end up with a smaller copy of production. Sometimes this is only 1% of a full-size copy. This results in some major benefits in test data management:
- Reducing the DTAP environments drastically
- Enable teams to deliver software to market as fast as possible because you have flexible sets of test data
- Restoring a small database takes very little time
- Teams don’t frustrate each other (no wasted time for your highly educated people)
- Reduces risks of data related software faults
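The core idea of subsetting can be sketched in a few lines: pick the records you need from a parent table and copy them together with their dependent rows, so the subset stays referentially intact. The example below is a minimal sketch using an in-memory SQLite database; the `customers` and `orders` tables, the generated rows and the `subset` function are hypothetical, and a real subsetting tool has to follow many more relationships.

```python
import sqlite3


def build_source() -> sqlite3.Connection:
    """Stand-in for a production copy: 100 customers with 5 orders each."""
    src = sqlite3.connect(":memory:")
    src.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            amount REAL
        );
    """)
    src.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, f"customer-{i}") for i in range(1, 101)],
    )
    src.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(i, (i % 100) + 1, 10.0 * i) for i in range(1, 501)],
    )
    src.commit()
    return src


def subset(src: sqlite3.Connection, customer_ids: list) -> sqlite3.Connection:
    """Copy the selected customers and all of their orders into a new database."""
    dst = sqlite3.connect(":memory:")
    # Recreate the same table definitions in the target database.
    for (ddl,) in src.execute("SELECT sql FROM sqlite_master WHERE type = 'table'"):
        dst.execute(ddl)
    marks = ",".join("?" for _ in customer_ids)
    customers = src.execute(
        f"SELECT * FROM customers WHERE id IN ({marks})", customer_ids
    ).fetchall()
    # Follow the foreign key so no order ends up without its customer.
    orders = src.execute(
        f"SELECT * FROM orders WHERE customer_id IN ({marks})", customer_ids
    ).fetchall()
    dst.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    dst.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    dst.commit()
    return dst


if __name__ == "__main__":
    small = subset(build_source(), [1, 2, 3])
    print(small.execute("SELECT COUNT(*) FROM customers").fetchone()[0])
    print(small.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
```

In practice the list of selected records would come from the test cases a team actually needs, including the awkward ones that expose data-related bugs.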
So getting in control of test data is getting more important. With the help of subsetting technology you’re able to deploy smaller sized sets of test data to an environment. These are flexible and sizing isn’t an issue anymore. You could easily deliver not only one but multiple test databases to your teams, so every team could have their own test database.
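The storage argument is simple arithmetic. The sketch below compares giving every team a full copy with giving every team a 1% subset; the 2,000 GB production size and the eight teams are hypothetical numbers chosen only to make the calculation concrete.

```python
def total_storage_gb(full_copy_gb, teams, subset_percent=100):
    """Storage needed when every team gets its own database of the given size."""
    per_team_gb = full_copy_gb * subset_percent / 100
    return per_team_gb * teams


if __name__ == "__main__":
    full_copy_gb = 2000  # hypothetical size of one production copy
    teams = 8            # hypothetical number of teams

    print(total_storage_gb(full_copy_gb, teams))                    # full copies
    print(total_storage_gb(full_copy_gb, teams, subset_percent=1))  # 1% subsets
```

With these assumed numbers, eight full copies need 16,000 GB, while eight 1% subsets need 160 GB in total, which is less than a tenth of a single full copy.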