What is test data?

Let’s talk about test data; there are some important skills everyone should learn. Healthcare organizations, insurance companies, financial institutions, government institutions, and other corporate organizations all need data to develop software and applications and to test their quality. In most cases, however, their production data contains personal and privacy-sensitive information, and the databases are often huge and therefore inconvenient for testing. That’s where test data comes in. But what is it, and how is it created?

Content

Test data definition
The creation of test data
Test data preparation
Test data management

The definition of test data

“Data needed for test execution.”

That’s the short definition. A slightly more detailed description is given by the International Software Testing Qualifications Board (ISTQB):

“Data created or selected to satisfy the execution preconditions and input content required to execute one or more test cases.”

A lot of attention goes to development models and to testing methods such as security testing, performance testing, and regression testing. Agile testing and test automation are also hot topics these days. How to handle the data you need for testing software (automated or not) is addressed far less often. That is quite strange, since software development and testing stand or fall on carefully prepared test data. You can’t use just any data or a random test case. To test a software application effectively, you need a good, representative data set. The ideal test set identifies all the application errors with the smallest possible data set. In short, you need a relatively small test data set that is realistic, valid, and versatile.

How to create test data

Data can be 1) created manually, 2) created with test data generation tools, or 3) retrieved from an existing production environment. The data set can consist of synthetic (fake) data, but preferably it consists of representative (real) data with good coverage of the test cases. For security reasons, real data should of course be masked first. This provides the best software quality, and that is ultimately what we all want.
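To make the third option more concrete, here is a minimal sketch of masking a production record before it is used for testing. The field names (`name`, `email`, `balance`), the `mask_record` helper, and the key are all hypothetical; the sketch assumes a keyed hash (HMAC-SHA256 from the Python standard library) is an acceptable way to produce pseudonyms, which is only one of several masking approaches.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be kept outside the test environment.
MASKING_KEY = b"example-key-not-for-real-use"

def pseudonym(value: str, prefix: str) -> str:
    """Return a deterministic stand-in for a sensitive value."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:10]}"

def mask_record(record: dict) -> dict:
    """Mask the personal fields of one record; other fields are kept as-is."""
    masked = dict(record)
    masked["name"] = pseudonym(record["name"], "name")
    masked["email"] = pseudonym(record["email"], "user") + "@example.test"
    return masked

production_row = {"name": "Alice Jansen", "email": "alice@corp.example", "balance": -12.50}
test_row = mask_record(production_row)
```

Because the pseudonym is deterministic, the same customer gets the same masked value in every table, so relationships between records survive the masking. The negative balance, which is exactly the kind of realistic case a test needs, is left untouched.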

So beware of dummy data generated by, for example, a random name generator or a credit card number generator. These generators give you sample data that offers no challenge to the software being tested. Synthetic data can, of course, still be used to enrich and/or mask your test database.
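The difference between generated dummy data and challenging data can be illustrated with a small sketch. Everything here is hypothetical: `random_name` stands in for a typical name generator, `initials` is an invented function under test, and the hand-picked list is just an example of values that tend to show up in real data.

```python
import random
import string

def random_name(rng: random.Random) -> str:
    """Typical generator output: one plain ASCII word of moderate length."""
    length = rng.randint(4, 10)
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return letters.capitalize()

# Hand-picked values that resemble what production data actually contains.
CHALLENGING_NAMES = [
    "",                  # missing value
    "O'Brien",           # apostrophe (quoting and escaping)
    "Zoë van der Berg",  # non-ASCII character and multi-part surname
    "  Smith ",          # stray whitespace
    "X" * 300,           # longer than most column definitions
]

def initials(name: str) -> str:
    """Hypothetical function under test: first letter of each word, upper-cased."""
    return "".join(word[0].upper() for word in name.split())

rng = random.Random(0)
random_results = {len(initials(random_name(rng))) for _ in range(100)}
curated_results = [initials(name) for name in CHALLENGING_NAMES]
```

A hundred random names only ever exercise the single-word path, while five curated values cover the empty, multi-word, and whitespace cases.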

“The ideal test data identifies all the application errors with the smallest possible data set.”

Test data challenges in software testing

Preparing data is a very time-consuming phase in software testing. Various studies indicate that 30-60% of a tester’s time goes to searching for, maintaining, and generating data for testing and development. The main reasons are the following:

Testing teams do not have access to the data sources
Delay in giving production data access to the testers by developers
Large volumes of data
Data dependencies/combinations
Long refreshment times

1. Testing teams do not have access to the data sources

With the GDPR, PCI, HIPAA, and other data security regulations in place, access to data sources is limited, and only a few employees can reach them. The advantage of this policy is that it reduces the chance of a data breach. The disadvantage is that test teams depend on others, which leads to long waiting times.

2. Delay in giving production data access to the testers by developers

Agile is not yet used everywhere. In many organizations, multiple teams and users work on the same project and therefore on the same databases. Besides causing conflicts, this means the data set changes often and no longer contains the right (up-to-date) data when it’s the next team’s turn to test the application.

3. Large volumes of data

Compiling data from a production database is like searching for a needle in a haystack. You need the special cases to perform good tests, and they are hard to find when you have to dig through dozens of terabytes.
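One way to pull the special cases out of a large table is to keep a single representative row per combination of the attributes that matter. The sketch below uses an in-memory SQLite database; the `orders` table, its columns, and the choice of `status` and `country` as the interesting attributes are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, country TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (status, country, amount) VALUES (?, ?, ?)",
    [
        ("paid", "NL", 10.0),
        ("paid", "NL", 25.0),
        ("paid", "DE", 40.0),
        ("refunded", "NL", -10.0),
        ("paid", "NL", 5.0),
        ("cancelled", "DE", 0.0),
    ],
)

# Keep one representative row per (status, country) combination.
subset = conn.execute(
    """
    SELECT id, status, country, amount
    FROM orders
    WHERE id IN (SELECT MIN(id) FROM orders GROUP BY status, country)
    ORDER BY id
    """
).fetchall()
```

Six rows shrink to four, and the rare `refunded` and `cancelled` cases are guaranteed to be in the subset instead of being buried under the common `paid` rows.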

4. Data dependencies/combinations

Most data values depend on other data values in order to be recognized. These dependencies make preparing the test cases a lot more complex and therefore more time-consuming.
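A simple form of such a dependency is a foreign key: an order is meaningless without the customer it points to. The following sketch, again on a hypothetical two-table SQLite schema, shows one way to select a few orders and automatically bring along the customer rows they depend on. The `subset_with_parents` helper is an illustration, not a general-purpose subsetting tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 3, -2.0), (12, 3, 9.0);
    """
)

def subset_with_parents(conn, order_ids):
    """Return the chosen orders plus every customer row they depend on."""
    marks = ",".join("?" for _ in order_ids)
    orders = conn.execute(
        f"SELECT id, customer_id, amount FROM orders WHERE id IN ({marks}) ORDER BY id",
        order_ids,
    ).fetchall()
    customer_ids = sorted({row[1] for row in orders})
    marks = ",".join("?" for _ in customer_ids)
    customers = conn.execute(
        f"SELECT id, name FROM customers WHERE id IN ({marks}) ORDER BY id",
        customer_ids,
    ).fetchall()
    return customers, orders

customers, orders = subset_with_parents(conn, [11, 12])
```

Real schemas have many more levels of dependency, which is exactly why this step becomes time-consuming when done by hand.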

5. Long refreshment times

Most testing teams do not have the means to refresh the test database themselves. That means they have to ask the DBA for a refresh. Some teams have to wait days or even weeks before the refresh is done.
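As a rough idea of what a self-service refresh could look like, the sketch below restores a test database from a known-good "golden" copy in a single call. It assumes SQLite and uses the standard library's `Connection.backup` method; the `accounts` table and the `refresh` helper are hypothetical, and a real setup would depend heavily on the database platform in use.

```python
import sqlite3

def refresh(golden: sqlite3.Connection) -> sqlite3.Connection:
    """Return a fresh in-memory copy of the golden data set."""
    fresh = sqlite3.connect(":memory:")
    golden.backup(fresh)  # copies the full contents of golden into fresh
    return fresh

# Hypothetical golden data set, prepared once.
golden = sqlite3.connect(":memory:")
golden.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
golden.execute("INSERT INTO accounts VALUES (1, 100.0)")
golden.commit()

test_db = refresh(golden)
test_db.execute("UPDATE accounts SET balance = 0")  # a test run dirties its copy
test_db.commit()

test_db = refresh(golden)  # one call restores the known state
restored_balance = test_db.execute("SELECT balance FROM accounts").fetchone()[0]
```

The point of the design is that the tester, not the DBA, triggers the refresh, and the golden copy is never modified by a test run.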

Test data needs in software testing

There are many ways to test software code or the end product: from unit testing to acceptance testing, and from manual testing to a fully automated framework. Every software testing method has its own specific demands and needs regarding test data. Whether you perform black box or white box testing, functional or integration testing, data sets are what you need in your test environment.

How to prepare test data for testing: Test Data Management (TDM)

Because TDM can be complex and expensive, some organizations stick to old habits. The test teams (have to) accept that:

Data isn’t refreshed often (or ever);
It doesn’t contain all the data quality issues present in production;
A high percentage of bugs/faults in test cases is related to the data.

That is a shame and totally unnecessary: TDM doesn’t have to be complex, and high-quality test data pays for itself. Simple techniques help you save a lot of time and money. In addition, good test data ensures good tests and therefore high-quality software.

Here are some tips that may help:

1. Identify the source of the problem

Before you can fix a problem, you need to understand its cause. Is your data incomplete, inconsistent, biased, or noisy? Is it generated by a flawed process or a poorly designed system? Is it outdated or irrelevant? By diagnosing the root cause of your data issues, you can avoid wasting time and resources on ineffective solutions.
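A quick profile of the data set is often enough to answer these questions. The sketch below counts duplicate keys, missing required fields, and outdated rows in a list of records. The `profile` helper, the field names, and the cut-off date are assumptions for illustration only.

```python
from collections import Counter
from datetime import date

def profile(rows, key, required, date_field, stale_before):
    """Count common data problems in a list of dict records."""
    key_counts = Counter(row.get(key) for row in rows)
    return {
        "rows": len(rows),
        "duplicate_keys": sum(n - 1 for n in key_counts.values() if n > 1),
        "missing": {
            field: sum(1 for row in rows if row.get(field) in (None, ""))
            for field in required
        },
        "stale": sum(1 for row in rows if row[date_field] < stale_before),
    }

rows = [
    {"id": 1, "email": "a@example.test", "updated": date(2024, 5, 1)},
    {"id": 2, "email": "", "updated": date(2019, 1, 15)},
    {"id": 2, "email": None, "updated": date(2024, 6, 3)},
]
report = profile(rows, "id", ["email"], "updated", date(2022, 1, 1))
# report == {"rows": 3, "duplicate_keys": 1, "missing": {"email": 2}, "stale": 1}
```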

2. Clean up your data

Once you know what’s wrong with your data, you can start cleaning it up. This may involve removing duplicates, filling in missing values, correcting errors, or transforming variables. Depending on the size and complexity of your data, you may need to use specialized tools or techniques, such as data wrangling, data imputation, or data augmentation. You may also need to consult domain experts or subject matter specialists to ensure that your data reflects the real world.
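For small data sets, a few lines of plain Python may already cover the basics of removing duplicates and filling in missing values. In this sketch the `clean` helper, the "first row wins" rule for duplicates, and the `UNKNOWN` default are all assumptions; whether a default value is an acceptable fill is a question for a domain expert.

```python
def clean(rows, key, defaults):
    """Drop rows with a repeated key (first one wins) and fill missing fields."""
    seen = set()
    cleaned = []
    for row in rows:
        if row[key] in seen:
            continue  # duplicate key: keep only the first occurrence
        seen.add(row[key])
        fixed = dict(row)
        for field, default in defaults.items():
            value = fixed.get(field)
            if isinstance(value, str):
                value = value.strip()  # remove stray whitespace
            fixed[field] = default if value in (None, "") else value
        cleaned.append(fixed)
    return cleaned

rows = [
    {"id": 1, "country": " NL "},
    {"id": 1, "country": "NL"},
    {"id": 2, "country": None},
]
result = clean(rows, "id", {"country": "UNKNOWN"})
# result == [{"id": 1, "country": "NL"}, {"id": 2, "country": "UNKNOWN"}]
```

The input list is not modified, so the original records stay available for comparison.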

3. Generate synthetic data

If your data is too small, too biased, or too sensitive to share, you may need to mask it and/or generate synthetic data that mimics the characteristics of your real data. However, you need to be careful not to introduce new biases or artifacts that may affect your results.
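As a minimal sketch of generating synthetic values that mimic the real data, the code below draws categories with the same relative frequencies as an existing column and then measures how far the synthetic frequencies drift from the real ones. The `synthesize` and `frequency_gap` helpers and the sample statuses are hypothetical, and matching a single column's frequencies is only a first check against introduced bias, not a guarantee.

```python
import random
from collections import Counter

def synthesize(real_values, n, seed=0):
    """Draw n synthetic values with roughly the same category frequencies."""
    counts = Counter(real_values)
    categories = sorted(counts)
    weights = [counts[c] for c in categories]
    rng = random.Random(seed)  # seeded so the data set is reproducible
    return rng.choices(categories, weights=weights, k=n)

def frequency_gap(real_values, synthetic_values):
    """Largest absolute difference in relative frequency for any category."""
    real, synth = Counter(real_values), Counter(synthetic_values)
    return max(
        abs(real[c] / len(real_values) - synth[c] / len(synthetic_values))
        for c in set(real) | set(synth)
    )

real_statuses = ["paid"] * 7 + ["refunded"] * 2 + ["cancelled"]
synthetic_statuses = synthesize(real_statuses, 1000)
gap = frequency_gap(real_statuses, synthetic_statuses)
```

A small `gap` suggests the synthetic column has not drifted far from the real distribution; a large one is a signal to look for bias introduced by the generator.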

4. Collaborate with others

Sometimes, the best way to improve your data is to work with others who have complementary skills and perspectives. This may involve collaborating with data engineers, data analysts, data scientists, or business stakeholders who can help you gather, process, or interpret your data.

In conclusion, dealing with bad quality test data can be frustrating, but it’s not a hopeless situation. By using the right tools, techniques, and collaborations, you can turn your data into a valuable asset that improves your testing outcomes and your business outcomes.

FAQ

What is the definition of test data?

Short: “Data used for testing purposes.” A slightly more detailed description is given by the International Software Testing Qualifications Board (ISTQB): “Data created or selected to satisfy the execution preconditions and inputs to execute one or more test cases.”

How is test data created?

Data can be created 1) manually, 2) by using data generation tools, or 3) it can be retrieved from an existing production environment.

What does the ideal test data do?

The ideal test data identifies all the application errors with the smallest possible data set.
