Data Masking – the basics

Anonymization means altering data so that it remains useful for testing while identifying a person becomes nearly impossible. This document explains the basics of anonymization and what you can do today to start moving towards anonymized non-production environments.

The first thing to do is determine whether you have personal data at all. If you do, how sensitive is this data? Sensitivity, and the rules related to it, vary from country to country. A name in and of itself is not that sensitive, and neither is an address. In most cases the sensitivity isn’t in the identifying data; it comes with what we call characteristic or descriptive data. For example, whether someone has an illness or is €500,000 in debt makes the data valuable and sensitive. Knowing that somebody is called John Doe and that he lives in … is (mostly) public information; a simple Google search will reveal it. What you want to do is keep the descriptive data, but cut the link with the actual person. This is done by changing the identifying data. So where to start?

Start by identifying the systems that contain personal data. Once you know which systems these are, you can go into more detail: what data does a particular system contain, and what do you want to do with it? The action to take depends on a couple of things. The first is the information security policy. Most organizations have one, and some policies prescribe a baseline for the data that should be anonymized. The second is the needs of the testing community.

Once you have determined which data should be anonymized, you can start specifying how it should be anonymized. Which techniques are you going to use? DATPROF Privacy has some built-in scrambling functions you can start out with:

Shuffle

The most used built-in function is the shuffle. A shuffle takes the distinct values of one or more columns and rearranges them at random. For example, by shuffling first and last names separately, you get new combinations of first and last names.
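
To make the idea concrete, here is a minimal Python sketch of a shuffle. It is not how DATPROF Privacy implements the function; the name `shuffle_column`, the `seed` parameter and the sample rows are all hypothetical. As a simplification it permutes every value of the column across the rows rather than working on the distinct values only.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Return a copy of rows with the values of one column redistributed at random.

    Hypothetical helper for illustration only: it permutes all values of the
    column, so every original value still occurs exactly as often as before.
    """
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    # Build new row dicts so the input rows are left untouched.
    return [{**row, column: value} for row, value in zip(rows, values)]


customers = [
    {"id": 1, "first_name": "Anna", "last_name": "Jansen", "debt": 500},
    {"id": 2, "first_name": "Piet", "last_name": "Bakker", "debt": 0},
    {"id": 3, "first_name": "Eva", "last_name": "Smit", "debt": 12000},
    {"id": 4, "first_name": "Tom", "last_name": "Visser", "debt": 250},
]

# Shuffle first and last names separately to get new name combinations.
masked = shuffle_column(customers, "first_name", seed=1)
masked = shuffle_column(masked, "last_name", seed=2)
```

The descriptive column (`debt` in this made-up table) stays on its original row, while the names attached to it are rearranged.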

Blank

The Blank function is self-explanatory: it removes (blanks) the contents of a column. This leaves no data, so it is only usable for columns that are not used in testing.

Scramble

The scramble replaces characters with x and numbers with 1. This function leaves no recognizable data, so the scramble, too, gives a result that testers cannot use.
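
The Blank and Scramble functions are simple enough to sketch in a few lines of Python. This is an illustration under assumptions, not the product's code: the function names are hypothetical, blanking is modelled as returning `None`, and the sketch assumes that punctuation and whitespace are left as they are, which the description above does not specify.

```python
def blank(value):
    """Blank: discard the value entirely (modelled here as None, an assumption)."""
    return None


def scramble(value):
    """Scramble: replace every letter with 'x' and every digit with '1'.

    Assumption: characters that are neither letters nor digits are kept.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("1")
        elif ch.isalpha():
            out.append("x")
        else:
            out.append(ch)
    return "".join(out)


scrambled = scramble("John Doe 42")  # "xxxx xxx 11"
```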

Value lookup

The value lookup uses a reference table as input to anonymize the values in a table. The function needs a reference key, e.g. a customer ID, to find the right data. This function is commonly used as part of a setup that keeps data consistent. Most of the time this setup also uses a translation table.
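
The following Python sketch shows the principle of a keyed lookup. It is a hedged illustration, not DATPROF Privacy's implementation: `value_lookup`, the `orders` rows and the `translation` table are hypothetical, and leaving rows without a matching key unchanged is an assumption.

```python
def value_lookup(rows, reference, key, columns):
    """Overwrite `columns` in each row with the values of the reference row
    that has the same `key` value (hypothetical helper for illustration)."""
    by_key = {ref[key]: ref for ref in reference}
    result = []
    for row in rows:
        ref = by_key.get(row[key])
        if ref is None:
            # Assumption: rows without a match are copied unchanged.
            result.append(dict(row))
        else:
            result.append({**row, **{col: ref[col] for col in columns}})
    return result


orders = [
    {"order_id": 10, "customer_id": 1, "customer_name": "Anna Jansen"},
    {"order_id": 11, "customer_id": 2, "customer_name": "Piet Bakker"},
    {"order_id": 12, "customer_id": 1, "customer_name": "Anna Jansen"},
]

# Translation table: the anonymized value to use for each customer ID.
translation = [
    {"customer_id": 1, "customer_name": "Eva Smit"},
    {"customer_id": 2, "customer_name": "Tom Visser"},
]

masked_orders = value_lookup(
    orders, translation, key="customer_id", columns=["customer_name"]
)
```

Because the replacement is found by key, the same customer receives the same anonymized name in every row and in every table that uses the same translation table, which is what keeps the data consistent.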

Random lookup

A random lookup also uses a reference table, but in a different way: it replaces values with data selected at random from another table. This can be useful if you want to add test cases to existing data. For example, your data doesn’t have any diacritics and you want to add these to the first name data. You can then use a reference table containing all kinds of names, including those with diacritics, as the lookup.
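
A random lookup can be sketched in Python as follows. The function name, the `seed` parameter and the sample names are hypothetical, and the sketch only illustrates the principle rather than the product's behavior.

```python
import random


def random_lookup(rows, column, reference_values, seed=None):
    """Replace `column` in every row with a value picked at random from
    `reference_values` (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    return [{**row, column: rng.choice(reference_values)} for row in rows]


people = [
    {"id": 1, "first_name": "Anna"},
    {"id": 2, "first_name": "Piet"},
    {"id": 3, "first_name": "Eva"},
]

# Reference values that include names with diacritics as extra test cases.
reference_names = ["Zoë", "René", "Jürgen", "Anna", "Tom"]

masked_people = random_lookup(people, "first_name", reference_names, seed=7)
```

Unlike the value lookup, no key is involved, so the same person may receive a different name in each table or on each run unless the seed is fixed.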

First day in month / in year

Most people do not realize that a birthdate combined with a postal code is highly identifying. The First day in month / in year function changes the birthdate to the first day of its month or year. This leaves less variation in the data, which makes it harder to find a specific person.
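
In Python terms, the date truncation described here could look like the sketch below. The two function names are hypothetical; `date.replace` is part of the standard library.

```python
from datetime import date


def first_day_in_month(d):
    """Keep year and month, move the day to the 1st."""
    return d.replace(day=1)


def first_day_in_year(d):
    """Keep the year only, move the date to 1 January."""
    return d.replace(month=1, day=1)


birthdate = date(1984, 7, 23)
by_month = first_day_in_month(birthdate)  # date(1984, 7, 1)
by_year = first_day_in_year(birthdate)    # date(1984, 1, 1)
```

Truncating to the month keeps ages accurate to within a month; truncating to the year removes more variation at the cost of precision.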

Custom expression

The functions above will not work in all situations. For extra flexibility you can use the Custom Expression, which lets you build your own functions. Whether it is composing an email address or something more advanced, the Custom Expression lets you do everything you can do in the SELECT of a SQL statement.
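
As an illustration of the kind of expression meant here, the Python sketch below composes an email address from the name columns with a SQL expression. It runs against an in-memory SQLite database, so the syntax (`||` for concatenation, `lower()`) is SQLite's and may differ on your target database. The table, column names and the `example.com` domain are hypothetical, and the sketch does not show how the expression is entered in DATPROF Privacy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (1, "Anna", "Jansen", "a.real.address@provider.test"),
        (2, "Piet", "Bakker", "another.real.one@provider.test"),
    ],
)

# Any expression that is valid in a SELECT list can serve as the new value.
EMAIL_EXPRESSION = "lower(first_name) || '.' || lower(last_name) || '@example.com'"

conn.execute(f"UPDATE customer SET email = {EMAIL_EXPRESSION}")
emails = [row[0] for row in conn.execute("SELECT email FROM customer ORDER BY id")]
conn.close()
```

If the name columns are anonymized first, an expression like this keeps the email address consistent with the new names.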
