4 questions to find out if test data management is for you
12 October, 2021 | Maarten Urbach
When could test data management be of help in your software test project? Answer the following four questions to find out.
In this article, the order of the questions does not matter, with one exception: you should always answer the first question first.
That question is:
1. Do you use personal data during software testing?
This is actually the simplest question: if the answer is yes, you will have to think about a test data process. There are some ifs and buts, but in roughly 90% of cases you will have to start such a process.
If the answer to this question is “no”, you can still think about a test data process, but for other reasons. These are discussed later in this article.
If the answer is yes, then there are some other questions to ask. Important questions are:
- In how many systems is personal data stored?
- Is there a single core system and peripheral systems?
- How does our chain work?
- What data do we want to protect?
There are many other questions that can be asked and a look at our data masking project plan is also helpful.
2. Are there many and/or large databases in use?
Many and large are relative terms. However, larger organizations with 1,000 or more employees often have multiple large databases. It is worth knowing how many databases are used for non-production purposes and how large they are. We often see that larger organizations have at least three non-production domains in addition to their production domain: development, testing and acceptance. For core systems, more than one test and development environment is often made available as well. Unfortunately for these organizations, core systems tend to be quite large.
Example: an organization has one production environment, one acceptance environment, three test environments and three development environments. The production environment is 2.5 terabytes in total. Seven non-production environments of 2.5 terabytes each add up to 17.5 terabytes, which is 700% of the production size!
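The arithmetic behind this example can be checked with a few lines of Python. This is a minimal sketch that assumes every non-production environment is a full copy of production; the function name and the dictionary keys are our own and purely illustrative.

```python
def non_production_footprint(production_tb, environment_counts):
    """Return (total non-production TB, percentage of production size).

    Assumes each non-production environment is a full copy of production.
    """
    copies = sum(environment_counts.values())
    total_tb = copies * production_tb
    return total_tb, 100 * total_tb / production_tb


# Figures taken from the example: 1 acceptance, 3 test, 3 development.
total_tb, pct = non_production_footprint(
    2.5, {"acceptance": 1, "test": 3, "development": 3}
)
print(f"{total_tb} TB non-production, {pct:.0f}% of production")
```

Changing the counts or the production size shows quickly how the footprint grows with every extra full copy.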
And this is true for only one core system. In larger organizations there are several core systems with the corresponding size.
This naturally results in high storage costs. And storage is not the only cost item: various database vendors also charge license fees for test environments and higher environments.
If any of these apply to your situation, it’s worth doing further research into test data management. Techniques are available that can ensure that less data is needed in these environments.
3. How easily available is test data?
In general, there are three common situations regarding the availability of test data when testing software:
- No test data is available, so this still needs to be arranged
- Test data is available, but it is not representative of production
- Test data is available and it is representative (whether anonymized or not)
For the first and second situation there is, in our opinion, the same ‘solution’: generating or refreshing the test data. In the second situation, environments with test data are available, but they are not always representative: there is test data, but it is of poor quality or unusable. Sometimes an environment has been around for ages (months, if not years) and has accumulated pollution and test data that has become unusable. Or the environment is filled with self-made test data (the Donald Ducks of this world) that does not match your wishes and needs. Ultimately, the goal is not to have test data, but to have test data that is representative of what should work in production. In short: generating a data set or refreshing your test data set is a useful action in this situation.
If test data has to be made available, you can often choose between two options. The first is to generate a test dataset yourself. The second is to refresh the test database by requesting a new copy of production. Depending on what you want to achieve, an (anonymized) copy of a production database is often the best option: you then know that the test data matches the production situation and is therefore representative. The disadvantage is that the process to get a copy of production has to be started. This is a tedious, mainly procedural, process because software testers often cannot start it themselves. Most testers have to submit a request to management, who forward it elsewhere in the organization, where it is passed on to the employee who has to arrange the copy. Various studies also show that many people (3-8) are involved in this process. As a result, a lot of time is lost on procedural steps that could be eliminated.
That leaves the technical element of the refresh itself. The nice thing about technology is that it can be automated. In short: with a smart test data portal, these processes can be started by software testers themselves. Then only the duration of the refresh remains. This lead time can be brought down to very acceptable levels with smart subset technologies and a smart architecture, certainly in comparison with the current working method.
An alternative solution is to use (synthetic) test data generation. The advantage of this is that you don’t have to change a major process. The disadvantage is that the representativeness of the test data will always remain questionable.
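To make the trade-off concrete, here is a minimal sketch of synthetic test data generation using only the Python standard library. The record fields, the name lists and the function name are made up for illustration and do not come from any particular tool.

```python
import random
from datetime import date, timedelta

# Made-up sample values; a real generator would use far richer sources.
FIRST_NAMES = ["Anna", "Bram", "Chloe", "Daan", "Eva"]
LAST_NAMES = ["Jansen", "de Vries", "Bakker", "Visser"]


def generate_customers(count, seed=42):
    """Generate reproducible synthetic customer records (hypothetical fields)."""
    rng = random.Random(seed)  # fixed seed so every run yields the same data
    start = date(1950, 1, 1)
    records = []
    for customer_id in range(1, count + 1):
        records.append({
            "id": customer_id,
            "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
            # Uniformly spread birth dates, roughly 1950 through 2004.
            "birth_date": start + timedelta(days=rng.randrange(0, 20000)),
        })
    return records


customers = generate_customers(1000)
```

Note that everything here is drawn uniformly at random. Real production data rarely looks like that, which is exactly why the representativeness of generated data remains questionable.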
In summary: if making test data available is tedious or complicated, it is certainly worth assessing the process and investigating where improvements can be made. Sometimes it’s low-hanging fruit, sometimes it’s more impactful. But there is certainly a business case to be made if availability is a challenge.
4. How many software teams work in the organization?
Another interesting question, the answer to which indicates whether test data management can improve the test process. Actually, the question should be more specific: how many software teams are working on one test database at the same time? Does each team have access to its own test environment? Or are there more teams than test environments or databases? The vast majority of organizations have fewer test environments available than they have teams, resulting in conflicts.
If several teams in different projects need the same application with the associated test data, this almost inevitably leads to problems. As a result, the application is insufficiently tested, which can ultimately lead to production disruptions.
In an ideal world, each software tester has his or her own system under test and his or her own test database. Why do we see that happening so little? This is often due to technical barriers and the cost-increasing effects. We have already mentioned the most important technical obstacles in the previous section: the size of the database and the time it takes to make a database available. The cost-increasing effects are not hard to explain: if every tester gets his or her own 100% copy of production, the costs will be significantly higher.
But then you also have to ask the question: do you need a 100% copy of production to be able to test? Ultimately, testing is a risk-mitigating measure. You want to be sure that when the software goes into production, it does what it is supposed to do. A DTAP (development, test, acceptance, production) setup is often put in place for this, so that possible errors are caught at every step of the process. In that sense, software testing is sometimes not very different from saying something about an entire population by means of a sample. With a sample we can say, with a certain degree of certainty, that it is representative of reality. The remaining uncertainty is the risk, and we can make that risk bigger or smaller through the size of the sample.
Imagine you work for a government institution and a system holds, for example, 17,000,000 Dutch people. With a margin of error of 1% and a confidence level of 99% you need a test database filled with roughly 16,600 records, which is only about 0.1%! It can be even less if you widen the margin of error and lower the confidence level. For example: with a margin of error of 2% and a confidence level of 95%, you are talking about only 2,401 records.
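Sample sizes like these can be reproduced with the standard formula for estimating a proportion (Cochran's formula with a finite population correction). The sketch below assumes the most conservative proportion of 0.5; the function name and parameters are our own and only serve as illustration. Calculators round the z-value differently, so the outcome for the 99% case can differ by a percent or so from one tool to the next.

```python
import math
from statistics import NormalDist


def sample_size(population, margin_of_error, confidence, proportion=0.5):
    """Cochran's sample size with finite population correction.

    proportion=0.5 is the most conservative assumption (largest sample).
    """
    # Two-sided z-value for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinitely large population.
    n0 = z**2 * proportion * (1 - proportion) / margin_of_error**2
    # Correct for the finite population size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


strict = sample_size(17_000_000, margin_of_error=0.01, confidence=0.99)
relaxed = sample_size(17_000_000, margin_of_error=0.02, confidence=0.95)
print(strict, relaxed)
```

With these inputs the stricter setting comes out a little under 16,600 records and the relaxed setting at 2,401. Keep in mind that this formula says something about estimating a single proportion; it is a rule of thumb for sizing a test set, not a guarantee that every edge case is covered.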
In short: if you are able to extract test data from a copy of production in a smart way, there is a huge advantage to be gained. By ‘smart’ we mean that you can’t just select a few rows from your tables and put them in your test environment. All relationships between tables must of course be preserved in order to perform your tests properly. In addition, you may also want certain edge cases and possibly polluted data in your test dataset, and you have to find those first. Insight into your data (model) is therefore certainly important here.
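What preserving relationships means in practice can be sketched as follows. This is a simplified illustration with a hypothetical two-table schema (customers and orders) in an in-memory SQLite database, not a description of how any specific subsetting product works: parent rows are sampled first, and only the child rows that reference them are extracted.

```python
import random
import sqlite3


def build_demo_source():
    """Create a tiny in-memory 'production' database (hypothetical schema)."""
    src = sqlite3.connect(":memory:")
    src.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            amount REAL
        );
        """
    )
    src.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, f"customer-{i}") for i in range(1, 101)],
    )
    # 500 orders, spread evenly: every customer gets exactly 5 orders.
    src.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(i, (i % 100) + 1, float(i)) for i in range(1, 501)],
    )
    return src


def extract_subset(src, customer_count, seed=0):
    """Sample parent rows, then pull only the child rows that reference them."""
    rng = random.Random(seed)
    all_ids = [row[0] for row in src.execute("SELECT id FROM customers ORDER BY id")]
    chosen = rng.sample(all_ids, customer_count)
    marks = ",".join("?" * len(chosen))  # one placeholder per sampled id
    customers = src.execute(
        f"SELECT id, name FROM customers WHERE id IN ({marks})", chosen
    ).fetchall()
    orders = src.execute(
        f"SELECT id, customer_id, amount FROM orders WHERE customer_id IN ({marks})",
        chosen,
    ).fetchall()
    return customers, orders


source = build_demo_source()
customers, orders = extract_subset(source, customer_count=10)
print(len(customers), "customers,", len(orders), "orders")
```

Every extracted order points to an extracted customer, so the subset stays consistent. A real data model has many more tables and relationship chains, and for large samples you would typically join against a temporary table of selected keys instead of building a long IN list. Edge cases would also have to be selected deliberately rather than left to chance.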
Test data management is often considered when there is a need for anonymized test data. However, test data management involves much more than just data anonymization. It also helps manage and distribute test data. So not only if you work with personal data, but also if there are fewer test environments or databases than test teams, it is worthwhile to further explore the possibilities of test data management.