How to mask test data
What is data masking? Where should you start? What things should you keep in mind?
Data masking or anonymization revolves around altering the data in a way that it remains useful for testing and development, but the identification of a person becomes almost impossible. This article will explain the basics of anonymization and what you can do today to start moving towards anonymized non-production environments to protect privacy sensitive data. So, how to mask test data?
Data insight and analytics
First thing you need to do is discover whether you have personal data in your databases at all. If you do, how sensitive is this data? The sensitivity and the rules related to the sensitivity vary from country to country. A name in itself is not that sensitive as a person’s address. The sensitivity isn’t (in most cases) in identifying data. The sensitivity comes with what we call characteristic or descriptive data. For example, whether some has an illness or is €500.000 in debt makes the data valuable and sensitive. Knowing that somebody is called John Doe and that he lives in Amsterdam is (mostly) public information. A mere search on Google will reveal this information easily. What you want to do is keep the descriptive data, but cut the link with the actual person. This is done by changing the identifying data. So where to start?
Start by identifying the systems that contain personal data. When you know what systems contain personal data, then you can get into more detail. What data does this particular system contain and what do we want to do with it? What data needs protection, encryton What action to take depends on a couple of things. First is the information security policy. Most organizations have such a policy. Some policies prescribe the baseline for data that should be anonymized. On the other hand you have the needs of the testing community.
Also read: Data analysis in software testing
Data masking techniques
When there has been determined what data needs to be anonymized, you can start specifying how it should be anonymized. The development of the masking template starts. What techniques are you going to use? DATPROF Privacy has some built in masking functions you can start out with:
The most used built-in function is the shuffle. A shuffle takes the distinct values of one or more columns and rearranges them randomly. For example, by shuffling first and last names separately, you get new first name / last name combinations.
The blank function is self-explanatory. The blank removes (blanks) a column. This leaves no data, so this is only usable for columns not used in testing.
The scramble function replaces characters by x and numbers by 1. This function leaves no recognizable data, so the scramble too gives a result which can’t be used by testers.
The value lookup uses a reference table as input to anonymize the values in a table. The function needs a reference key, i.e. a customer id, to find the right data. This function is commonly used as part of a setup that keeps data consistent. Most of the times this setup also uses a translation table.
A random lookup also uses a reference table, but uses it in a different way. A random lookup replaces values by randomly selecting data from another table. This can be useful if you want to add test cases to existing data. For example, your data doesn’t have any diacritics and you want to add these to the first name data. Then you can use a reference table comprised of all different names, including those with diacritics, and use this as lookup.
First day in month / in year
Most people do not realize that a birthdate combined with a postal code is very identifying. This first date function makes it possible to change the date of birth to the first of the month or year. By doing this, there is less variation and therefore it is harder to find a specific person.
The above mentioned functions will not work in all situations. To add some extra flexibility you can use the custom expression function. This gives you the possibility to make your own functions. Whether this is the composition of an email address or something more advanced, the custom expression lets you do everything you can do in the SELECT of a SQL Statement.
Learn more about masking data
Synthetic data generation
Next to the standard masking functions, DATPROF Privacy also has built-in synthetic data generators which replace the existing privacy sensitive data with synthetically generated (fake, dummy) data. It depends on your test needs if you want to use masking functions, generate synthetic data or a combination of these to anonymize your test data.
Also read: Synthetic test data versus data masking
Data masking project plan
A masking plan is critical for a successful data security project and that’s what this document is designed to help you with. Download the whitepaper for free!
mask test data End-to-end
In today’s databases some values are stored more than once. The complexity starts when (test) data should consistently be masked over multiple systems. For example, a person’s name might be stored in the customer table as well as in the billing table. Data masking becomes challenging when multiple applications or sources should to be masked. For end-to-end testing it is vital that data is masked in the same order in the sources and applications.
To enable this, DATPROF Privacy can save the translation of an anonymization to a separate table. This feature can be found in the function editor, under the tab Translation table. Here you can enable or disable the creation of a translation table. When enabled, you can select in which schema and under what name you want to save the table (i.e. TT_FIRST_NAME, TT as in Translation Table).
A translation table keeps a copy of the old value (i.e. the original first name) and the new value (i.e. the shuffled first name) of an anonymization function. It also adds the primary key value(s) of the anonymized table. These keys can be used in other functions to find the right anonymized value in the translation table, so another table can be anonymized in the same manner.
Using a translation table
A translation table is often used as input for a value lookup. A translation table enables consistent anonymization throughout a database or chain of databases. It is imperative that the key you use is available in both systems and/or tables. A primary key isn’t always the right key for this, which is why DATPROF Privacy allows you to designate a ‘translation key’. This is a virtual key; no actual constraints will be created in the database but any columns designated as translation key will be added to the translation table. Social security numbers and account numbers, for instance, are good candidates for a translation key.
Using a translation table can be straightforward but it is also possible to combine multiple translation tables into one view or table. For example, you have multiple translation tables as a result of setting multiple functions on a customer table; a first name shuffle, a last name shuffle and a function which generates a new social security number. All of the resulting translation tables will have the same key: the primary key of the customer table and any translation keys you may have defined. Using these keys and a script you can create a table or view which encompasses all of the translation tables. Such a table or view is very useful later on when you apply the exact same anonymization elsewhere in your database using just one function, instead of three.
Your translation tables contain the original values. We often advise clients to treat translation tables as if they contain production data. To minimize the risk, you could place any translation tables in a separate schema with a separate privilege scheme. Going one step further, you could anonymize data on one database and distribute test sets from there, rather than having developers directly access potentially sensitive data.
Another way to mask test data consistently over multiple systems or (cloud) applications is with deterministic data masking. With deterministic masking a value in a column is replaced with the same value whether in the same row, the same table, the same database/schema and between instances/servers/database types. Thanks to deterministic masking, no translation tables are needed anymore.
Test data management
Full control of test data is the next step after compliance. Thanks to technology and tools like DATPROF Runtime we are able to control and automate test data easily nowadays. By giving each test team their own (secure) environment (subsetted to the right size), management is less burdened with the masking and refresh process. A complete test data management platform is great for efficiency and performance. On premise or cloud based – it’s the strategy that helps build and release that application even faster and provide business growth.
Also read: Test data management
What is data masking?
Data masking or anonymization revolves around altering the data in a way that it remains useful for testing and development, but the identification of a person becomes almost impossible.
What data masking techniques are there?
Shuffle, blank, scramble are the best known and simplest techniques. More ingenious masking techniques are lookups, custom expressions and replacing data with synthetically generated (fake) data.
What is deterministic data masking?
With deterministic masking a value in a column is replaced with the same value whether in the same row, the same table, the same database/schema and between instances/servers/database types. This way you can easily mask the data consistently over multiple systems.
Start your 14 day free trial
Mask privacy sensitive data and generate synthetic test data with DATPROF Privacy. Try 14 days for free. No credit card required.