Test Data Architecture
How to make a test data architecture that works
January 21, 2020 | Maarten Urbach
Being an architect means developing something according to the needs and requirements of your client. Requirements for a new house could, for example, be: it needs to be rock solid because you live in an area with earthquakes. Or you may want it to float (we’re Dutch). Should it be a single storey house or do you want multiple floors? How much time do you have and what is your budget? All these kinds of questions are relevant when building a house. The same applies to building a test data architecture. You should ask yourself the same questions from a test data perspective before you start building or designing a test data platform.
In general we all have the same problems: time, performance and speed, or rather the lack of all of these. Delivering software fast is important, so the faster test data is delivered, the better. That’s why, in designing or building your test data platform, test data should always be delivered as fast as possible!
Single storey house – A simple test data architecture
In a ‘simple’ IT-world you might be designing a single storey ‘test data’ house. Let’s compare it for now with an IT-environment with a single source production database. So this will be a simple house, right? But it can quickly become more complex when the requirement is that it should be completely climate neutral. The same challenges occur with test data.
In a simple world with a single source database it should be easy to manage your test data, and easier to deploy your test data to QA or Dev environments. Certainly in the early stages of development (unit tests), your test data requirements are – in most cases – not really complex.
But later on in your tests the need for more representative, and thus more complex, test data increases. For example, when you’re executing regression tests or integration tests, the need for more complex and representative test data increases significantly, as the graph below shows.
Test data architecture as a single storey house
So how to manage test data in this perspective? How can you make test data easily available in a simple IT world? The development of a test data architecture can be quite easy. Of course, bottlenecks will arise along the way. But start at the beginning and ask yourself: what’s wrong with the current test data house?
There could be a number of bottlenecks in the design of your current test data architecture, for example:
- Compliance: you don’t comply with regulations;
- Representativeness: the currently available test data has been outdated for several years;
- It takes a lot of time to create one test database (the same goes for refreshing one);
- It takes a lot of time to find the right test data (that fits your test cases).
Most of these bottlenecks can be resolved with an easier way of creating or deploying new Dev or QA databases. We often see organizations struggle with refreshing a test database. Refreshing a database takes a lot of time: procedural time and technical time. As a result, dev and test engineers don’t touch the test database, because “at least we have one database right now”.
Ask yourself: “What’s wrong with the current test data architecture?”
By improving the speed and finding a way to effortlessly create a new (refreshed) QA database, you should already solve some of these bottlenecks. It could make a real impact on your test data management.
The bottleneck of time
Speed is important, as we mentioned before. How can test data be delivered with the least amount of effort? Well, the size and complexity of the production database won’t help in effortless delivery of test data. Restoring a large database can take a very long time, even counting only the technical time. Restoring and backing up 10 terabytes, for example, can take up to 12 hours (source: StackExchange).
These numbers don’t sound strange to us. We’ve seen these kinds of numbers and averages in many occasions, especially in corporate environments. And in these environments you’ll probably have to work with even larger databases.
Unfortunately it can get even worse, because research shows that getting the right test data can take up to multiple days! Sometimes even weeks. In some cases there are procedural bottlenecks as well: it can take quite some time to get all the right ‘yesses’.
Subsets for effortless delivery of test data
Especially in early stage development you want easy to deploy and reusable test data. You execute a test (and use test data), find some results, get a fix, and test it again, but… preferably with exactly the same test data as your initial test. Otherwise, if the fix appears to work, are you sure that it really works? Maybe it only passed because the problem was data related. If you want to know whether the fix worked, you should use exactly the same test data.
But restoring and backing up a database of 10 terabytes, or waiting a long time to get back to the original starting point, is not ‘desirable’, not at all. So… having a small test data set available would make life much, much easier. Backing up and restoring a small test data set is something that can be done in minutes. And then you’re able to execute against the same subset.
But how do you make sure that exactly the same test data set is available, from an architectural perspective? Two options we often see are:
1) Create a backup from an earlier created subset. This backup can be restored easily and effortlessly within minutes, because the subset is, for example, only 100 GB instead of 10 TB.
2) Start a new subset process from the same source. Note that the source shouldn’t have changed in the meantime.
In both test data architectures you’re able to execute a test multiple times with exactly the same test data set. Personally, I would prefer the “backup restore strategy”, because in that case you’re absolutely sure that your test data set remains the same over time.
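The backup restore strategy can be sketched in a few lines. This is a minimal illustration, not a real subsetting tool: it uses SQLite and a plain file copy as stand-ins for your database and its backup tooling, and all file and table names are hypothetical.

```python
import os
import shutil
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
SUBSET_DB = os.path.join(workdir, "subset.db")       # hypothetical small subset
SUBSET_BACKUP = os.path.join(workdir, "subset.bak")  # snapshot to restore from

# Stand-in for a real subsetting process: a small, known data set.
con = sqlite3.connect(SUBSET_DB)
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
con.commit()
con.close()

# Back up the subset once; copying ~100 GB instead of 10 TB is what
# makes this fast enough to do between test runs.
shutil.copyfile(SUBSET_DB, SUBSET_BACKUP)

# A test run mutates the data...
con = sqlite3.connect(SUBSET_DB)
con.execute("DELETE FROM customers WHERE id = 2")
con.commit()
con.close()

# ...and restoring the backup resets it to the exact original starting point.
shutil.copyfile(SUBSET_BACKUP, SUBSET_DB)
con = sqlite3.connect(SUBSET_DB)
rows = con.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
con.close()
```

After the restore, `rows` contains both original customers again, so a re-test of the fix starts from exactly the same data as the first run.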
A test data set for integration tests
Now we end up at the integration tests. How are you able to manage test data for integration tests? During integration tests more dependent applications are introduced. So how do you manage a test data set consistently over multiple applications for your integration tests?
Important for integration tests is getting your test data consistent over multiple databases/systems. This is the case when you mask or generate data, and when you subset or virtualize test databases: you need consistent test data over multiple systems. You could of course use full-size non-anonymized copies for integration tests. If compliance is a requirement you’ll mask the database and you’re done. But the ‘speed of delivery’ problem remains.
Important is: getting test data consistent over multiple databases/systems.
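One common way to keep masked data consistent over multiple systems is deterministic masking: the same input always produces the same pseudonym, so masking two databases independently still yields matching values. A minimal sketch, assuming a shared secret key and an HMAC-based pseudonym (the key and the `CUST-` format are made up for illustration):

```python
import hashlib
import hmac

# Hypothetical shared masking key; in practice this would be a managed secret,
# because anyone holding it can reproduce the pseudonyms.
MASKING_KEY = b"test-data-masking-key"

def mask_customer_id(customer_id: str) -> str:
    """Deterministically pseudonymize an identifier.

    Because the mapping is deterministic, masking the CRM database and the
    billing database in separate runs still produces the same pseudonym for
    the same customer, keeping the test data consistent across systems.
    """
    digest = hmac.new(MASKING_KEY, customer_id.encode(), hashlib.sha256)
    return "CUST-" + digest.hexdigest()[:10].upper()

same_a = mask_customer_id("NL-123456")   # masked in system A
same_b = mask_customer_id("NL-123456")   # masked independently in system B
other = mask_customer_id("NL-654321")    # a different customer
```

Here `same_a == same_b`, while `other` differs, which is exactly the consistency property integration tests need.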
Reset test data for integration tests
What happens if an integration test fails? You want to reset the test data to the original values, just like you did during the regression tests. How do you manage that? You could refresh the database from the original (production) source. But this will (probably) take a lot of time, and therefore it is probably not the best choice.
If you can’t (or don’t want to) refresh the database from the original source, you’ll be searching for a test data case that looks the same. But this might be a problem, because it only looks the same. Is it really the same? In our experience, especially in large corporate core systems, data can be really complex.
There can be quite a few data quality challenges (also read our blog “Using production data for testing“). Due to these data quality problems it takes a lot of time to find (somewhat) comparable test data. Research shows that searching for test data can take up to 75% of all your test time.
So it would be interesting to be able to reset your test data to the original values, and for this reason using subsets could be very helpful. In that case you’re able to reset data to exactly the same starting point as before, and this can be done in a minimal amount of time. The challenge for integration tests is getting a subset of test data consistent over multiple systems.
How to get test data consistent for integration testing
There are a few ways to get test data consistent over multiple systems. You could use something like service virtualization: you virtualize the services your system under test depends on, and you may need to populate test data into these virtualized services.
You could also subset related systems, so that the data in your test database corresponds to the test data in the system under test.
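Subsetting related systems boils down to picking your subset in one leading system and then driving every related system from the same keys. A small sketch with two in-memory SQLite databases standing in for two hypothetical related systems (a CRM and an order system; all table names are made up):

```python
import sqlite3

# System A: a CRM holding the customers.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob"), (3, "Carol")])

# System B: an order system referencing the same customer identifiers.
orders = sqlite3.connect(":memory:")
orders.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
orders.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(10, 1), (11, 2), (12, 3), (13, 1)])

# Step 1: select the subset in the leading system (here: customers 1 and 2).
subset_ids = {row[0] for row in
              crm.execute("SELECT id FROM customers WHERE id IN (1, 2)")}

# Step 2: drive the subset of every related system from those same keys,
# so both test databases describe the same set of customers.
subset_orders = [row for row in
                 orders.execute("SELECT id, customer_id FROM orders ORDER BY id")
                 if row[1] in subset_ids]
```

Every row in `subset_orders` belongs to a customer that is also in the CRM subset, which is what makes the test data usable for an integration test across both systems.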
Building your test data architecture
Hopefully this blog will help you in developing your test data architecture. It is a rather simplified view of reality, but hopefully it can help you in designing a proper ‘house’ and in achieving great test data availability. Good luck, and if you’ve got any questions, please don’t hesitate to contact one of our experts or one of our partners.