How to mask test data
What is data masking? Where should you start? What things should you keep in mind?
Anonymization means altering data so that it remains useful for testing while identifying a specific person becomes practically impossible. This article explains the basics of anonymization and what you can do today to start moving towards anonymized non-production environments.
The first thing to do is determine whether your databases contain personal data at all. If they do, how sensitive is that data? Sensitivity, and the rules attached to it, vary from country to country. A name by itself is not as sensitive as a person's address, and in most cases the sensitivity does not lie in the identifying data at all. It comes from what we call characteristic or descriptive data: knowing that someone has an illness or is €500.000 in debt is what makes the data valuable and sensitive. Knowing that somebody is called John Doe and that he lives in … is (mostly) public information; a simple Google search will reveal it. What you want to do is keep the descriptive data but cut the link with the actual person, and this is done by changing the identifying data. So where to start?
Start by identifying the systems that contain personal data. Once you know which systems those are, you can get into more detail: what data does this particular system contain, and what do we want to do with it? What action to take depends on a couple of things. The first is the information security policy; most organizations have one, and some policies prescribe a baseline for which data should be anonymized. On the other hand, you have the needs of the testing community.
Data masking techniques
Once you have determined which data should be anonymized, you can start specifying how it should be anonymized. What techniques are you going to use? DATPROF Privacy has some built-in scrambling functions you can start out with:
The most used built-in function is the shuffle. A shuffle takes the distinct values of one or more columns and rearranges them randomly. For example, by shuffling first and last names separately, you get new first name / last name combinations.
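A column shuffle can be sketched in plain Python. This is an illustrative simplification, not DATPROF's actual implementation: each listed column is shuffled independently, so the real first name / last name combinations are broken up while the set of values per column stays intact.

```python
import random

def shuffle_columns(rows, columns, seed=None):
    """Shuffle the values of each listed column independently,
    so real combinations (e.g. first + last name) are broken up."""
    rng = random.Random(seed)
    masked = [dict(row) for row in rows]      # don't mutate the input
    for col in columns:
        values = [row[col] for row in rows]
        rng.shuffle(values)
        for row, value in zip(masked, values):
            row[col] = value
    return masked

people = [
    {"first": "John", "last": "Doe"},
    {"first": "Jane", "last": "Smith"},
    {"first": "Maria", "last": "Garcia"},
]
masked = shuffle_columns(people, ["first", "last"], seed=42)
```

Note that every value still occurs exactly once per column after the shuffle; only the combinations change.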
The blank function is self-explanatory. The blank removes (blanks) a column. This leaves no data, so this is only usable for columns not used in testing.
The scramble function replaces letters with "x" and digits with "1". This leaves no recognizable data, so, like the blank, the scramble gives a result that can't be used by testers.
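A minimal sketch of such a scramble, assuming letters map to "x" and digits to "1" while punctuation and spaces are preserved (the exact character handling in DATPROF Privacy may differ):

```python
def scramble(value: str) -> str:
    """Replace every letter with 'x' and every digit with '1';
    other characters (spaces, punctuation) are kept as-is."""
    return "".join(
        "x" if ch.isalpha() else ("1" if ch.isdigit() else ch)
        for ch in value
    )

print(scramble("John Doe, 42"))  # prints "xxxx xxx, 11"
```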
The value lookup uses a reference table as input to anonymize the values in a table. The function needs a reference key (e.g. a customer id) to find the right data. This function is commonly used as part of a setup that keeps data consistent; most of the time such a setup also uses a translation table.
A random lookup also uses a reference table, but uses it in a different way. A random lookup replaces values by randomly selecting data from another table. This can be useful if you want to add test cases to existing data. For example, your data doesn’t have any diacritics and you want to add these to the first name data. Then you can use a reference table comprised of all different names, including those with diacritics, and use this as lookup.
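A random lookup can be sketched as follows. This is a simplified illustration under the assumption that the reference table is just a list of candidate values, here a hypothetical name list that includes diacritics:

```python
import random

def random_lookup(rows, column, reference_values, seed=7):
    """Replace each value in `column` with a value drawn at random
    from a reference list, e.g. a name list that includes diacritics."""
    rng = random.Random(seed)
    return [{**row, column: rng.choice(reference_values)} for row in rows]

names_with_diacritics = ["José", "Zoë", "François", "Søren"]
rows = [{"id": 1, "first": "John"}, {"id": 2, "first": "Jane"}]
masked = random_lookup(rows, "first", names_with_diacritics)
```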
First day in month / in year
Most people do not realize that a birthdate combined with a postal code is highly identifying. The first day function makes it possible to change a date of birth to the first of the month or the first of the year. This reduces variation, making it harder to single out a specific person.
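The idea behind this generalization is simple, as a short Python sketch shows:

```python
from datetime import date

def first_of_month(d: date) -> date:
    """Generalize a birthdate to the first day of its month."""
    return d.replace(day=1)

def first_of_year(d: date) -> date:
    """Generalize a birthdate to the first day of its year."""
    return d.replace(month=1, day=1)

print(first_of_month(date(1984, 7, 23)))  # prints 1984-07-01
print(first_of_year(date(1984, 7, 23)))   # prints 1984-01-01
```

After this transformation, at most twelve (or one) distinct birthdates remain per year, so a birthdate plus postal code no longer narrows the search down to a single person.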
The above-mentioned functions will not work in all situations. For extra flexibility you can use the custom expression function, which lets you build your own functions. Whether it is composing an email address or something more advanced, the custom expression lets you do anything you can do in the SELECT clause of a SQL statement.
In today’s databases some values are stored more than once. The complexity starts when (test) data must be masked consistently across multiple systems. For example, a person’s name might be stored in the customer table as well as in the billing table. Data masking becomes challenging when multiple applications or sources need to be masked. For end-to-end testing it is vital that data is masked in the same way across all sources and applications.
To enable this, DATPROF Privacy can save the translation of an anonymization to a separate table. This feature can be found in the function editor, under the Translation table tab. Here you can enable or disable the creation of a translation table. When enabled, you can select the schema and name under which to save the table (e.g. TT_FIRST_NAME, TT as in Translation Table).
A translation table keeps a copy of the old value (e.g. the original first name) and the new value (e.g. the shuffled first name) of an anonymization function. It also stores the primary key value(s) of the anonymized table. These keys can be used in other functions to find the right anonymized value in the translation table, so that another table can be anonymized in the same manner.
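The mechanism can be sketched in Python. This is a simplified illustration, not DATPROF's implementation: the shuffle records an old → new mapping per primary key, and that mapping is then reused to mask a second table consistently.

```python
import random

def shuffle_with_translation(rows, key, column, seed=0):
    """Shuffle `column` and record old -> new per key value,
    mimicking a translation table (e.g. TT_FIRST_NAME)."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    translation = {}   # key value -> {"old": ..., "new": ...}
    masked = []
    for row, new_value in zip(rows, values):
        translation[row[key]] = {"old": row[column], "new": new_value}
        masked.append({**row, column: new_value})
    return masked, translation

def apply_translation(rows, key, column, translation):
    """Mask a second table consistently via the translation table."""
    return [{**row, column: translation[row[key]]["new"]} for row in rows]

customers = [
    {"customer_id": 1, "first": "John"},
    {"customer_id": 2, "first": "Jane"},
    {"customer_id": 3, "first": "Maria"},
]
billing = [
    {"customer_id": 2, "first": "Jane", "amount": 120},
    {"customer_id": 1, "first": "John", "amount": 80},
]
masked_customers, tt_first = shuffle_with_translation(
    customers, "customer_id", "first", seed=3
)
masked_billing = apply_translation(billing, "customer_id", "first", tt_first)
```

After both steps, every customer id carries the same masked name in both tables, which is exactly what end-to-end testing requires.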
Using a translation table
A translation table is often used as input for a value lookup. A translation table enables consistent anonymization throughout a database or chain of databases. It is imperative that the key you use is available in both systems and/or tables. A primary key isn’t always the right key for this, which is why DATPROF Privacy allows you to designate a ‘translation key’. This is a virtual key; no actual constraints will be created in the database but any columns designated as translation key will be added to the translation table. Social security numbers and account numbers, for instance, are good candidates for a translation key.
Using a translation table can be straightforward, but it is also possible to combine multiple translation tables into one view or table. For example, you may have multiple translation tables as a result of applying multiple functions to a customer table: a first name shuffle, a last name shuffle and a function that generates a new social security number. All of the resulting translation tables will have the same key: the primary key of the customer table plus any translation keys you may have defined. Using these keys and a script, you can create a table or view that encompasses all of the translation tables. Such a table or view is very useful later on, because you can apply the exact same anonymization elsewhere in your database using just one function instead of three.
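Conceptually, combining translation tables is a merge on the shared key. A minimal sketch, assuming each per-column translation table has already been reduced to a key → new value mapping (the keys 101 and 102 below are made-up example customer ids):

```python
def combine_translations(translations):
    """Merge several per-column translation tables (key -> new value)
    into one mapping: key -> {column: new value}."""
    combined = {}
    for column, table in translations.items():
        for key, new_value in table.items():
            combined.setdefault(key, {})[column] = new_value
    return combined

combined = combine_translations({
    "first_name": {101: "Jane", 102: "John"},
    "last_name":  {101: "Smith", 102: "Doe"},
})
print(combined[101])  # prints {'first_name': 'Jane', 'last_name': 'Smith'}
```

In a database you would achieve the same result with a view that joins the translation tables on their shared key columns.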
Your translation tables contain the original values. We often advise clients to treat translation tables as if they contain production data. To minimize the risk, you could place any translation tables in a separate schema with a separate privilege scheme. Going one step further, you could anonymize data on one database and distribute test sets from there, rather than having developers directly access potentially sensitive data.
Frequently asked questions
Some questions come up on a regular basis. We have collected these questions and answer them in our FAQ – Data Anonymization DATPROF Privacy article.
Want to try yourself?
With our 14-day free trial you can try DATPROF Privacy yourself. Mask, scramble or blank your own database and see how easy it can be to anonymize your test data.
Click on the download button and start today!
Mask your privacy sensitive data and use it for development and testing.