In my most recent post I evaluated AI-generated synthetic test data. This is a follow-up post in which I want to dive deeper into its appeal and its limitations. For those who are new to the topic: synthetic test data generated by AI models is a ‘recent’ development in the software testing industry. Here’s a brief explanation.

AI-generated synthetic test data is produced by AI models and aims to provide an alternative to using production data in testing environments. Generally, there are two ways AI models generate test data:

  1. Training an AI model on production data 
  2. Using generative AI to create data that mimics production data

While synthetic test data may sound like the ideal solution—realistic, quick, and cost-effective—it’s far from perfect. The reality is that generating high-quality test data with AI is often complex, time-consuming, and expensive. Moreover, the process involves significant challenges related to transparency, reliability, and compliance with privacy regulations like GDPR and CCPA. 

Let’s look at why AI-generated test data, at the moment, often falls short of being a truly viable solution.

Bert Nienhuis – Chief Product Officer

The appeal of AI models and their limitations

AI models can indeed produce test data that resembles production data, especially when trained on actual production datasets. But there’s a catch. The best results come from feeding the AI model real production data, which is in direct conflict with privacy laws and data protection standards.

Why this is problematic:

  • Transparency and reliability issues: these complex AI systems are often “black boxes,” meaning we don’t fully understand how they produce the data. Without this clarity, it’s hard to guarantee quality and consistency. 
  • Legal noncompliance: training AI models on production data violates core principles of privacy legislation. Regulations like GDPR and CCPA demand explicit consent and restrict the use of personal data for secondary purposes. 

This fundamental conflict is where the promise of AI-generated test data begins to unravel. 

Why AI often falls short in compliance 

Let’s address the big question: Can AI-generated test data be compliant? 

There are two answers to this question. The first answer is: ‘it could’. If a generative AI model produces data that closely resembles production data without ever using actual production data or data containing personal information, it would appear to be compliant under the current legal framework.

But if the AI model is trained on production data containing personal information, the answer is a resounding no. Privacy laws like GDPR and CCPA mandate consent, transparency, and strict limitations on the use of personal data (1,2). Using anonymized production data might seem like a workaround, but doesn’t it just add complexity?

If the data is already anonymized, why not use it directly for testing instead of adding another layer of complexity by generating synthetic data? 

 

So what do we end up with? A method that is not only most often non-compliant but also less efficient and more labor-intensive than necessary. Even when AI-generated test data avoids production data altogether, such as through user-defined rules or generative AI, significant challenges remain:

  • Lack of scalability: Generative AI struggles to produce consistent, high-quality test data across complex systems. 
  • Opaque processes: Without understanding how the data is created, it’s impossible to fully trust its accuracy or reliability. 
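
As a rough illustration of the ‘user-defined rules’ route mentioned above, the sketch below builds customer records from explicit rules and a seeded random generator, without touching production data. All field names, value lists, date ranges, and the `generate_customers` helper are hypothetical and chosen only for the example.

```python
import random
from datetime import date, timedelta

# Hypothetical rules for a "customer" table. Every field name and value
# list below is an illustrative assumption, not taken from a real schema.
FIRST_NAMES = ["Anna", "Bram", "Chloe", "Daan"]
LAST_NAMES = ["Jansen", "Smith", "Visser", "Brown"]
COUNTRIES = ["NL", "DE", "US"]
EARLIEST_BIRTH = date(1940, 1, 1)
LATEST_BIRTH = date(2000, 12, 31)


def generate_customer(rng: random.Random, customer_id: int) -> dict:
    """Build one synthetic customer purely from the explicit rules above."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    # Rule: birth dates fall between EARLIEST_BIRTH and LATEST_BIRTH, inclusive.
    span_days = (LATEST_BIRTH - EARLIEST_BIRTH).days
    birth_date = EARLIEST_BIRTH + timedelta(days=rng.randint(0, span_days))
    return {
        "id": customer_id,
        "first_name": first,
        "last_name": last,
        # example.com is a reserved domain, so no real mailbox is addressed.
        "email": f"{first}.{last}.{customer_id}@example.com".lower(),
        "birth_date": birth_date.isoformat(),
        "country": rng.choice(COUNTRIES),
    }


def generate_customers(count: int, seed: int) -> list[dict]:
    """Generate `count` customers; the same seed yields the same rows."""
    rng = random.Random(seed)
    return [generate_customer(rng, i) for i in range(1, count + 1)]


if __name__ == "__main__":
    for row in generate_customers(3, seed=42):
        print(row)
```

Because every rule is visible in the code and the seed makes runs repeatable, the output is easy to audit. A handful of hand-written rules like these will not capture the relationships between tables in a complex system, though, which is where the scalability concern comes in.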

Are these methods solving the problem or creating new ones?

The goal of synthetic test data is admirable: creating compliant, high-quality data without relying on production data. But the methods we have today fall short: 

  • Training AI models on production data directly conflicts with privacy regulations. 
  • Generative AI lacks the scalability and transparency needed for reliable test data generation. 

This raises a critical question: ‘Aren’t these methods worse than the problem we are trying to solve?’

In summary

AI-generated test data may be a step in the right direction, but it’s not the fast, easy, or compliant solution it’s often portrayed to be. Whether through training on production data or using generative AI, the current methods fail to deliver on scalability, compliance, and simplicity. 

At DATPROF, we believe in exploring innovative solutions while staying firmly rooted in compliance and practicality. Want to dive deeper into the complexities of generative AI for test data? Check out our detailed article on the topic. 

Sources

  1. Regulation – 2016/679 – EN – gdpr – EUR-Lex. (n.d.). https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng
  2. Synthetic data. (2025, January 28). European Data Protection Supervisor. https://www.edps.europa.eu/press-publications/publications/techsonar/synthetic-data_en

About Bert

I write for test managers, testers and DevOps teams about test data solutions and how we can stay ahead of technological developments and make the right decisions. Want to stay updated?

Hit the subscribe button 👉

Thanks for reading, good luck and until next time 👋