Using production data for testing
And how it affects high-quality software development
Many organizations have a test or QA environment that is connected to a test/QA data source: a database with test data. Some of these test databases contain fake data made up by QA engineers, produced either by hand or by self-built scripts. This may seem outdated, but it still happens a lot. The method causes problems, though: many production issues stem from a lack of realistic test data. Dummy data doesn’t contain every data issue present in production, which may result in poor or even useless test results.
Testing with production data
To ensure software is of the highest quality possible, you’ll need to keep the test environment as “in sync” as possible with production. That’s why many QA teams copy production data to a test environment or the QA data sources to catch more (preferably all) edge cases and issues. But there are a few things to consider regarding working with production data:
- Does the data contain privacy-sensitive information? If so, you need to mask, filter or simply remove this data due to privacy regulations.
- Can the test environment handle that much data? If not, you need to filter the data or select certain cases.
- What happens when you need a new copy of production and it overwrites the earlier changes? Will it break your tests? You would need some sort of refresh option.
- Are there dependencies between data? Then you’d need to test all possible circumstances or settings.
Risk of using production data for testing
The points above show that testing with (a copy of) production data is not as easy as it sounds. In fact, simply copying production data to a test environment can be very risky, because it may contain privacy-sensitive or personally identifiable information. When your systems contain personal data, you need to keep data protection in mind. Storage and database license costs can also become a serious issue. If you make multiple copies of production (one for every test team), the total size quickly gets out of control and the bill runs high.
But does that mean you can’t use production data in a test environment?
Using production data in a test environment is possible, as long as you take compliance and sizing into account. Each of these problems has a solution: masking privacy-sensitive data with DATPROF Privacy and subsetting data with DATPROF Subset.
Mask production data
You don’t want to risk a fine for breaking privacy laws like the GDPR. With DATPROF Privacy you can anonymize your test data. It masks or scrambles sensitive data so that it can no longer be traced to a person. For example, you can shuffle first and last names, blank fields, generate synthetic data such as new SSNs or bank account numbers, create your own masking rules, and more. It also keeps the data consistent across multiple applications and databases.
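To make the masking operations named above concrete, here is a minimal Python sketch of three of them: shuffling a column, blanking a field, and replacing a value deterministically so that it stays consistent across tables. It is not DATPROF Privacy's implementation or API. The function names, the column names and the `SECRET_KEY` value are all hypothetical, chosen only for illustration.

```python
import hashlib
import hmac
import random

# Hypothetical key; in practice it would be stored outside the test environment.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value, key=SECRET_KEY):
    """Map a value to a stable token: the same input always gives the same output.

    Because the mapping is deterministic, the same person gets the same token
    in every table or database masked with the same key.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def shuffle_column(rows, column, seed=0):
    """Return new rows in which one column's values are permuted across rows."""
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def blank_column(rows, column):
    """Return new rows in which one column is emptied."""
    return [{**row, column: None} for row in rows]


def mask_customers(rows):
    """Apply the three example rules to a list of customer dicts."""
    masked = shuffle_column(rows, "last_name", seed=42)
    masked = blank_column(masked, "phone")
    return [{**row, "ssn": pseudonymize(row["ssn"])} for row in masked]
```

Calling `mask_customers` on a list of dicts with `last_name`, `phone` and `ssn` keys returns masked copies and leaves the input rows unchanged. Using a keyed hash is one possible way to get consistency across systems; other approaches, such as a lookup table of generated values, would work as well.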
Subset production data
With its patented algorithm, DATPROF Subset extracts specific selections (even less than 1%) from the production database. You can specify and filter which data you want to include in your subset. You can add extra filters, transform data with column expressions and add extra dependencies or custom foreign keys. This way the subset contains the issues present in production, but storage is no longer a problem. With subsets, you can give every test team a test data set of its own. Working with small subsets instead of full copies also improves performance and refresh time.
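The core idea of subsetting, filtering a parent table and then keeping only the child rows that reference it, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not DATPROF Subset's patented algorithm: it handles a single foreign key between two in-memory tables, and the table and column names (`customers`, `orders`, `customer_id`) are hypothetical.

```python
def subset_with_children(parents, children, keep_parent,
                         parent_key="id", foreign_key="customer_id"):
    """Select parent rows with a filter, then keep only the matching child rows.

    Following the foreign key keeps the subset referentially intact:
    every child row that is kept still points at a parent row that is kept.
    """
    kept_parents = [row for row in parents if keep_parent(row)]
    kept_ids = {row[parent_key] for row in kept_parents}
    kept_children = [row for row in children if row[foreign_key] in kept_ids]
    return kept_parents, kept_children
```

For example, `subset_with_children(customers, orders, lambda c: c["country"] == "NL")` keeps only the Dutch customers and their orders. A real schema has many tables and dependency chains, so the same step would have to be repeated along every foreign key.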
Production data vs test data
Fake data alone won’t get you to high-quality software development. It doesn’t contain the production data issues and edge cases you want to discover. So should you use production data in the test environment? Absolutely. The advantages of testing with real data from production systems are great, but only on the condition that you mask and subset the data before you use it for test and dev. Otherwise, you’ll run into trouble with privacy regulations and/or storage. Best practices in short: manage your systems, their security, and automation for optimal test data management (TDM).
Learn how to manage your test data
What is production data?
Production data is information that is persistently stored and used to conduct day-to-day business tasks and processes.
Can I use production data for testing?
Yes, you can, but only if you mask the privacy-sensitive data to comply with regulations and standards such as GDPR, PCI DSS and HIPAA.
Why not just generate synthetic data?
Fake data alone won’t help you create high-quality software. It doesn’t contain the production data issues you want to discover.
What are the risks of using production data for testing?
Production systems in many cases contain personally identifiable information. This personal data needs protection: it may not be used as-is for things like development and testing. If you do so, you risk data leakage. Besides, because of its size, production data can add significantly to the bill.
What are the advantages of using production data for testing?
Real production data contains the edge cases that you need for your tests. If your tests succeed using production data, you can be much more confident that the software will work in production. But you need to at least mask the data to comply with the GDPR.