Hi, I’m Bert, and welcome to my biweekly blog. Here, I share my thoughts, ideas, and insights about the ever-evolving field of test data management, and how we can stay ahead of technological developments and make the right decisions. Together with the DATPROF team, I’ve been working on a test data platform designed to empower test managers, testers, and DevOps teams to work smarter and more collaboratively. Our platform supports critical processes like data analysis, virtualization, subsetting, synthetic data generation, and anonymization, helping development teams deliver high-quality software faster.
This week, I want to talk about generative AI and its role in creating synthetic test data—a topic that no test manager or tester can afford to ignore. As Chief of Product, I constantly evaluate emerging technologies to determine their relevance and potential impact on our platform. Generative AI is no exception, and I find myself asking: What can AI-generated synthetic test data do for us? What are its limitations?
Let’s explore these questions. Below, you’ll find my blog post broken down by topic:
Bert Nienhuis – Chief Product Officer
When to embrace new technology?
There’s no one-size-fits-all answer for deciding when to adopt new technology. Like most product professionals, I’m intrigued by innovations in AI and their potential applications. However, experience has taught me that new technologies often take years before their true value—and limitations—become clear.
AI-generated synthetic test data is a promising development. While I’m optimistic about its potential, I remain cautious, focusing on whether it genuinely enhances our platform and adds tangible value for users.
New technology is often embraced quickly and then goes looking for a problem to solve. At DATPROF, we start with the problem and then choose the best technology to solve it. From what I have seen, generative AI is incredibly powerful at creating content that doesn’t need to be perfect. Generating text, images, or videos in the creative world doesn’t carry the same consequences as going live with a new version of a banking system for thousands of clients. Ensuring you have the proper test data is essential for validating critical applications and minimizing the risk of issues once they go live in production.
Another thing to consider: new technology is often promising, but not yet scalable. I’ve seen too many AI synthetic data generation solutions that might work on one or two tables, but fail to scale when it comes to supplying test data for an entire ERP system, or for multiple systems consistently.
My advice: Stay informed, test the waters, but don’t rush to integrate new technology without clear evidence of its benefits.
The golden rule: why good, representative test data matters
Effective testing relies on high-quality, realistic, and representative test data. A compact and versatile dataset should be sufficient to identify all application errors, minimizing unnecessary data while maximizing accuracy.
You can run as many tests as you like, but it doesn’t matter unless your data is right.
While synthetic data, including AI-generated synthetic data, can create valid test datasets, it introduces concerns about trustworthiness. Testers need confidence that their data accurately reflects real-world scenarios without introducing errors or uncertainties. When using AI-generated data, it’s vital to understand how the data was created and whether it meets your quality standards.
Why production data might still be a better choice than AI-generated test data
Using production data for testing isn’t ideal: it raises compliance risks under GDPR and CCPA, and full production copies are often inefficient to provision and refresh.
However, in some cases, production data offers an advantage: transparency. Since this data has already been processed by the system under test, it’s easier to rule out test data issues when debugging.
In contrast, AI-generated synthetic data may produce scenarios that are theoretically possible but highly unlikely in practice. Or worse, it may produce test data that looks realistic but could never occur in production. Without a clear, auditable process to validate this data, testers could waste valuable time chasing phantom bugs caused by unrealistic test cases.
Proceed with caution: advice on using AI-generated synthetic test data
Here’s my main takeaway: be cautious with AI-generated synthetic test data.
Imagine you have a 50-terabyte Oracle database with approximately 1,000 tables and 20,000 attributes; the connections between the data are not always clear.
While it has the potential to generate realistic-looking datasets, the process is rarely straightforward. Complex models trained on production data may create outputs that mimic real data without being transparent or fully reliable. For example, validating why certain data points were generated—or why others were omitted—can be a daunting task, especially for large, intricate datasets. Imagine trying to generate synthetic data for a large Oracle database with over 1,000 tables and 20,000 attributes. Without clear documentation of the AI’s decision-making process, you risk undermining the reliability of your test environment.
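To make that concrete, here is a minimal sketch of the kind of plumbing a tester ends up writing to verify just one of those connections: a foreign-key spot check between two synthetic tables. The table names, column names, and CSV exports here are hypothetical, not part of any specific generator or of our platform; in a schema with 1,000 tables you would need a check like this for every relationship the generator is supposed to preserve.

```python
# Minimal sketch: a referential-integrity spot check on synthetic output.
# The tables (customers, orders) and the key column (customer_id) are
# hypothetical examples; a real schema would need this for every foreign key.
import pandas as pd

def check_foreign_key(child: pd.DataFrame, parent: pd.DataFrame,
                      child_col: str, parent_col: str) -> pd.DataFrame:
    """Return child rows whose foreign-key value has no matching parent row."""
    valid_keys = set(parent[parent_col].dropna())
    return child[~child[child_col].isin(valid_keys)]

# Hypothetical exports produced by a synthetic data generator under evaluation.
customers = pd.read_csv("synthetic_customers.csv")   # parent table
orders = pd.read_csv("synthetic_orders.csv")         # child table

orphans = check_foreign_key(orders, customers, "customer_id", "customer_id")
if not orphans.empty:
    print(f"{len(orphans)} synthetic orders reference non-existent customers")
```

Multiply that single check by every foreign key, business rule, and cross-system dependency in an ERP landscape and it becomes clear why an auditable, explainable generation process matters more than realistic-looking output.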
Want to dive deeper into this topic? Read our more in-depth article on generative AI for test data generation.
Bert’s biweekly
I write a biweekly blog for test managers, testers and DevOps teams about test data solutions. Want to stay updated? Hit the subscribe button 👉