Protect personally identifiable information with data masking in non-production databases, comply with legislation and prevent data leaks in QA environments
Nowadays, more and more organizations use dozens of databases and applications for their processes. It's common to copy those databases for uses other than the primary process. Most organizations create multiple copies of their production systems for purposes like development, testing, acceptance, training and outsourcing. Many of these systems contain personally identifiable information (PII) or business-critical and privacy-sensitive data.
Data masking or anonymization revolves around altering the data in a way that it remains useful for testing and development, but the identification of a person becomes almost impossible. This article will explain the basics of anonymization and what you can do today to start moving towards anonymized non-production environments to protect privacy sensitive data. So, how to mask test data?
What is Data masking?
“Masking or obfuscating data is the process of transforming original data using masking techniques to comply with data security and privacy regulations.”
This definition is comparable to the one on Wikipedia, but we think that you execute this process in order to become compliant. That's why we include compliance with laws and rules (like GDPR, PCI and HIPAA).
Different terms are used interchangeably for data masking, like data anonymization or data obfuscation. For convenience, we use the term data masking.
Data masking is the process of hiding personal data. The main reason is to ensure that the data cannot be traced back to a specific person. There are different methods and techniques for masking data. A distinction can also be made between dynamic data masking and static data masking. The method or technique you choose depends on the type of data you want to mask.
Scrambled data in testing
Anonymizing or scrambling production data in non-production environments is becoming more and more common. You still have your full data set containing 'normal' data, but in the masked data all sensitive values are modified so they cannot be linked to the original individual.
Why should I mask my data?
There are several reasons why organizations start masking their data:
- It is a solution to risks like data leakage, data loss and data breaches;
- When data is masked, it helps you comply with laws and regulations;
- Protect sensitive data. Masked data is essential for protecting and securing data against competition;
- It helps to have representative data for software development and quality purposes (and training, Business Intelligence or marketing).
The process isn't just blanking data fields; it is transforming PII into data that keeps its characteristics but cannot be traced back to a person.
The advantages and benefits
Masking techniques offer several advantages, but the key reason organizations look to data masking is to reduce their data security vulnerability. Protecting customers and citizens is increasingly regulated: new data regulations are created and existing ones are updated. Obfuscating personal information ensures that software development and test teams can access the data with reduced risk.
Masking data is an operation in itself and it needs some attention, especially when you have complex data. The first challenge is making private data untraceable while keeping it characteristic of production, so that it remains usable for testing. The second challenge is keeping masked data consistent across multiple systems and databases. The third is coping with triggers, constraints, business rules and indexes while executing the transformations.
How to become compliant with data protection (GDPR)?
Copying a database means that you now have to secure not one database but, for example, ten. That's why most governments have enacted data privacy laws to protect customers and citizens from wrongdoing. If you do not protect their information, you risk the following:
- Not complying with laws and European Union directives concerning data security
- Exposure of personal information to unauthorized users
- Image loss because of bad publicity when data is leaked
- Customers that terminate their relationship because of a lack of trust in security
Privacy sensitive data
When is information privacy sensitive? A name, for example, is personal but not necessarily privacy sensitive. A place of residence isn't private either; it is public information. A financial situation (like a huge debt) or a disease makes data sensitive. In this example, by separating names, city, disease and debt, the data can no longer be traced back to a specific person and is therefore not identifiable anymore.
How to mask data?
The first thing you need to do is discover whether you have personal information in your databases at all. If you do, how sensitive is this data? The sensitivity and the rules related to it vary from country to country. A name in itself is not as sensitive as a person's address. In most cases the sensitivity isn't in the identifying data; it comes with what we call characteristic or descriptive data. For example, whether someone has an illness or is €500,000 in debt makes the data valuable and sensitive. Knowing that somebody is called John Doe and that he lives in Amsterdam is (mostly) public information. A mere search on Google will reveal this information easily. What you want to do is keep the descriptive data, but cut the link with the actual person. This is done by changing the identifying data. So where to start?
Start by identifying the systems that contain personal data. Once you know which systems contain personal data, you can go into more detail. What data does this particular system contain and what do we want to do with it? What data needs protection or encryption? What action to take depends on a couple of things. First is the information security policy. Most organizations have such a policy. Some policies prescribe the baseline for data that should be anonymized. On the other hand you have the needs of the testing community.
What kind of masking methods are there?
Once you have determined which information should be masked or anonymized, you can choose the method you want to use. In general, we see two data masking technologies to anonymize data, namely synthetic data generation and data masking (or data obfuscation). Data masking uses functions like data shuffling, scrambling and others. You can distinguish the two techniques by stating that masking is the reuse or modification of data in the databases and that generation is the creation of data that does not yet exist.
The synthetic data generation approach can be used in two ways:
- Fill empty (new) databases using synthetically generated data from scratch
- Replace privacy sensitive information with synthetically generated data
When you already have existing data and databases, the big advantage of the latter approach is that the schemas and structures of the original data are preserved when replacing sensitive data with synthetically generated data.
Once it has been determined what data needs to be anonymized, you can start specifying how it should be anonymized. The development of the masking template starts. What anonymization rules are you going to use? DATPROF Privacy has some built-in masking functions you can start out with:
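As an illustration of replacing sensitive values while preserving structure, here is a minimal Python sketch. It is not how DATPROF Privacy generates data; the column names, the tiny name lists and the `synthesize` helper are all hypothetical and only show the idea that keys and non-sensitive columns stay untouched.

```python
import random

# Hypothetical sample values; a real generator would draw from far larger sets.
FIRST_NAMES = ["Alex", "Sam", "Robin", "Kim"]
LAST_NAMES = ["Smith", "Jones", "Brown", "Taylor"]


def synthesize(row, rng):
    """Return a copy of row with the sensitive columns replaced by generated values."""
    return {
        **row,
        "first_name": rng.choice(FIRST_NAMES),
        "last_name": rng.choice(LAST_NAMES),
        # Nine random digits as a stand-in for a generated identification number.
        "ssn": "".join(rng.choice("0123456789") for _ in range(9)),
    }


rng = random.Random(42)  # seeded so that a run can be reproduced
customers = [
    {"id": 1, "first_name": "John", "last_name": "Doe", "ssn": "123456789", "debt": 500000},
    {"id": 2, "first_name": "Maria", "last_name": "Jansen", "ssn": "987654321", "debt": 0},
]
synthetic_customers = [synthesize(row, rng) for row in customers]
```

The descriptive column (`debt`) and the key (`id`) are copied unchanged, so the structure of each record is the same before and after.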
The most used built-in function is the shuffle. A shuffle takes the distinct values of one or more columns and rearranges them randomly. For example, by shuffling first and last names separately, you get new first name / last name combinations.
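The idea of a shuffle can be sketched in a few lines of Python. This is an illustration only, not the DATPROF Privacy implementation: the `shuffle_column` helper and the sample rows are hypothetical, and for simplicity it rearranges all values of a column rather than only the distinct ones.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Return a copy of rows with the values of one column randomly rearranged."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


customers = [
    {"id": 1, "first_name": "John", "last_name": "Doe"},
    {"id": 2, "first_name": "Maria", "last_name": "Jansen"},
    {"id": 3, "first_name": "Ahmed", "last_name": "Bakker"},
]

# Shuffle first and last names separately to get new name combinations.
masked = shuffle_column(customers, "first_name", seed=1)
masked = shuffle_column(masked, "last_name", seed=2)
```

Because a shuffle only rearranges existing values, the set of names in the column stays the same. A random rearrangement can also leave an individual row unchanged, which is worth keeping in mind for small tables.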
The blank function is self-explanatory. The blank removes (blanks) a column. This leaves no data, so this is only usable for columns not used in testing.
The scramble function replaces characters by x and numbers by 1. This function leaves no recognizable data, so the scramble too gives a result which can’t be used by testers.
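The blank and scramble functions are simple enough to sketch in Python. The helpers below are hypothetical illustrations; in particular, leaving spaces and punctuation untouched in `scramble` is an assumption, since the description above only mentions characters and numbers.

```python
def blank(rows, column):
    """Return a copy of rows with the given column emptied."""
    return [{**row, column: None} for row in rows]


def scramble(value):
    """Replace every letter with 'x' and every digit with '1'.

    Assumption: whitespace and punctuation are kept as they are.
    """
    return "".join(
        "1" if ch.isdigit() else "x" if ch.isalpha() else ch
        for ch in value
    )


rows = [{"id": 1, "note": "Call back on Monday", "address": "Main Street 42"}]
blanked = blank(rows, "note")
scrambled = scramble(rows[0]["address"])  # "xxxx xxxxxx 11"
```

Both results keep the shape of the table, but neither leaves anything a tester could recognize, which is why they suit columns that are not used in testing.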
The value lookup uses a reference table as input to anonymize the values in a table. The function needs a reference key, e.g. a customer id, to find the right data. This function is commonly used as part of a setup that keeps data consistent. Most of the time this setup also uses a translation table.
A random lookup also uses a reference table, but uses it in a different way. A random lookup replaces values by randomly selecting data from another table. This can be useful if you want to add test cases to existing data. For example, your data doesn’t have any diacritics and you want to add these to the first name data. Then you can use a reference table comprised of all different names, including those with diacritics, and use this as lookup.
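The difference between the two lookups can be sketched in Python. This is a simplified illustration, not the product's implementation: the reference table is modelled as a plain dictionary, and the function names, column names and sample values are hypothetical.

```python
import random


def value_lookup(rows, column, key, reference):
    """Replace column with the value stored in reference under each row's key.

    A missing key raises KeyError, so an unmasked value never slips through.
    """
    return [{**row, column: reference[row[key]]} for row in rows]


def random_lookup(rows, column, candidates, seed=None):
    """Replace column with a value picked at random from candidates."""
    rng = random.Random(seed)
    return [{**row, column: rng.choice(candidates)} for row in rows]


customers = [
    {"customer_id": 101, "first_name": "John"},
    {"customer_id": 102, "first_name": "Maria"},
]

# Value lookup: the customer id decides which replacement each row gets.
reference = {101: "Alex", 102: "Robin"}
looked_up = value_lookup(customers, "first_name", "customer_id", reference)

# Random lookup: any name from the list may be used, including diacritics.
names_with_diacritics = ["Zoë", "René", "Søren", "Anaïs"]
randomized = random_lookup(customers, "first_name", names_with_diacritics, seed=7)
```

The value lookup gives the same result every time for the same key, which is what makes it suitable for keeping data consistent. The random lookup trades that consistency for variety in the test data.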
First day in month / year
Most people do not realize that a birthdate combined with a postal code is very identifying. This first date function makes it possible to change the date of birth to the first of the month or year. By doing this, there is less variation and therefore it is harder to find a specific person.
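A minimal Python sketch of this date generalization, using only the standard library. The `first_day` helper and its `granularity` parameter are hypothetical names chosen for illustration.

```python
from datetime import date


def first_day(value, granularity="month"):
    """Move a date to the first day of its month or of its year."""
    if granularity == "month":
        return value.replace(day=1)
    if granularity == "year":
        return value.replace(month=1, day=1)
    raise ValueError("granularity must be 'month' or 'year'")


birth_date = date(1985, 7, 23)
by_month = first_day(birth_date)          # date(1985, 7, 1)
by_year = first_day(birth_date, "year")   # date(1985, 1, 1)
```

Age-based test cases keep working because the year is preserved, while everyone born in the same month (or year) now shares the same date.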
The above mentioned functions will not work in all situations. To add some extra flexibility you can use the custom expression function. This gives you the possibility to make your own functions. Whether this is the composition of an email address or something more advanced, the custom expression lets you do everything you can do in the SELECT of a SQL Statement.
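In the product a custom expression is written as a SQL expression. To keep the examples in this article in one language, the sketch below shows the same idea in Python: a new email address is composed from the already masked name columns. The `compose_email` helper and the `example.com` domain are assumptions made for illustration.

```python
def compose_email(first_name, last_name, domain="example.com"):
    """Build an email address from (already masked) name columns."""
    local_part = f"{first_name}.{last_name}".lower().replace(" ", "")
    return f"{local_part}@{domain}"


masked_row = {"first_name": "Maria", "last_name": "van Dijk"}
masked_row["email"] = compose_email(masked_row["first_name"], masked_row["last_name"])
# masked_row["email"] is now "maria.vandijk@example.com"
```

Deriving the address from the masked names keeps the record internally consistent, so the email no longer reveals the original person.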
Next to the standard masking functions, DATPROF Privacy also has built-in synthetic data generators which replace the existing privacy sensitive data with synthetically generated (fake, dummy) data. It depends on your test needs if you want to use masking functions, generate synthetic data or a combination of these to anonymize your data.
Mask test data end-to-end
In today's databases some values are stored more than once. The complexity starts when (test) data has to be masked consistently across multiple systems. For example, a person's name might be stored in the customer table as well as in the billing table. Data masking becomes challenging when multiple applications or sources have to be masked. For end-to-end testing it is vital that data is masked in the same way in all sources and applications.
To enable this, DATPROF Privacy can save the translation of an anonymization to a separate table. This feature can be found in the function editor, under the tab Translation table. Here you can enable or disable the creation of a translation table. When enabled, you can select in which schema and under what name you want to save the table (e.g. TT_FIRST_NAME, TT as in Translation Table).
A translation table keeps a copy of the old value (i.e. the original first name) and the new value (i.e. the shuffled first name) of an anonymization function. It also adds the primary key value(s) of the anonymized table. These keys can be used in other functions to find the right anonymized value in the translation table, so another table can be anonymized in the same manner.
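The mechanism can be sketched in Python with lists of dictionaries standing in for tables. This is an illustration under assumptions, not the product's internals: the helper names, the `old_value` and `new_value` column names and the sample data are hypothetical, and the original and masked rows are assumed to be in the same order.

```python
def build_translation_table(original_rows, masked_rows, key, column):
    """Record the key, the old value and the new value for every masked row."""
    return [
        {key: original[key], "old_value": original[column], "new_value": masked[column]}
        for original, masked in zip(original_rows, masked_rows)
    ]


def apply_translation(rows, key, column, translation_table):
    """Mask another table by looking up each row's key in the translation table."""
    new_by_key = {entry[key]: entry["new_value"] for entry in translation_table}
    return [{**row, column: new_by_key[row[key]]} for row in rows]


customers = [
    {"customer_id": 1, "first_name": "John"},
    {"customer_id": 2, "first_name": "Maria"},
]
masked_customers = [  # e.g. the result of a first name shuffle
    {"customer_id": 1, "first_name": "Maria"},
    {"customer_id": 2, "first_name": "John"},
]
tt_first_name = build_translation_table(
    customers, masked_customers, "customer_id", "first_name"
)

billing = [
    {"invoice": "A-17", "customer_id": 2, "first_name": "Maria"},
    {"invoice": "A-18", "customer_id": 1, "first_name": "John"},
]
masked_billing = apply_translation(billing, "customer_id", "first_name", tt_first_name)
```

After this step the billing table shows the same masked first name for each customer as the customer table does.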
Using a translation table
A translation table is often used as input for a value lookup. A translation table enables consistent anonymization throughout a database or chain of databases. It is imperative that the key you use is available in both systems and/or tables. A primary key isn’t always the right key for this, which is why DATPROF Privacy allows you to designate a ‘translation key’. This is a virtual key; no actual constraints will be created in the database but any columns designated as translation key will be added to the translation table. Social security numbers and account numbers, for instance, are good candidates for a translation key.
Using a translation table can be straightforward but it is also possible to combine multiple translation tables into one view or table. For example, you have multiple translation tables as a result of setting multiple functions on a customer table; a first name shuffle, a last name shuffle and a function which generates a new social security number. All of the resulting translation tables will have the same key: the primary key of the customer table and any translation keys you may have defined. Using these keys and a script you can create a table or view which encompasses all of the translation tables. Such a table or view is very useful later on when you apply the exact same anonymization elsewhere in your database using just one function, instead of three.
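As a rough sketch of such a script, the Python below merges three translation tables that share the same key into one combined table. In practice this would more likely be a SQL view; the helper name, the column names and the sample values are hypothetical.

```python
def combine_translation_tables(key, tables):
    """Merge translation tables that share a key into one row per key.

    tables maps a target column name to its translation table.
    """
    combined = {}
    for column, table in tables.items():
        for entry in table:
            row = combined.setdefault(entry[key], {key: entry[key]})
            row[column] = entry["new_value"]
    return [combined[k] for k in sorted(combined)]


tt_first_name = [
    {"customer_id": 1, "old_value": "John", "new_value": "Maria"},
    {"customer_id": 2, "old_value": "Maria", "new_value": "John"},
]
tt_last_name = [
    {"customer_id": 1, "old_value": "Doe", "new_value": "Jansen"},
    {"customer_id": 2, "old_value": "Jansen", "new_value": "Doe"},
]
tt_ssn = [
    {"customer_id": 1, "old_value": "123456789", "new_value": "900000001"},
    {"customer_id": 2, "old_value": "987654321", "new_value": "900000002"},
]

tt_customer = combine_translation_tables(
    "customer_id",
    {"first_name": tt_first_name, "last_name": tt_last_name, "ssn": tt_ssn},
)
```

This sketch deliberately carries only the new values into the combined table, so the combined table itself holds no original data. Whether the old values are needed there depends on how the lookups elsewhere are keyed.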
Your translation tables contain the original values. We often advise clients to treat translation tables as if they contain production data. To minimize the risk, you could place any translation tables in a separate schema with a separate privilege scheme. Going one step further, you could anonymize data on one database and distribute test sets from there, rather than having developers directly access potentially sensitive data.
Another way to mask test data consistently over multiple systems or (cloud) applications is with deterministic data masking. With deterministic masking a value in a column is replaced with the same value whether in the same row, the same table, the same database/schema and between instances/servers/database types. Thanks to deterministic masking, no translation tables are needed anymore.
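One common way to get this behaviour is to derive the replacement from a keyed hash of the original value. The Python sketch below uses HMAC-SHA256 from the standard library for that purpose. It is an assumption about how deterministic masking can be built, not a description of the DATPROF Privacy algorithm, and the name list and secret are placeholders.

```python
import hashlib
import hmac

# Hypothetical replacement values; a real list would be much longer.
FIRST_NAMES = ["Alex", "Sam", "Robin", "Kim", "Noor", "Jamie"]


def deterministic_mask(value, candidates, secret):
    """Pick a replacement that depends only on the input value and the secret."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]


secret = b"replace-with-a-real-secret"  # placeholder, keep the real one out of test environments
in_crm = deterministic_mask("John", FIRST_NAMES, secret)
in_billing = deterministic_mask("John", FIRST_NAMES, secret)
# in_crm and in_billing are identical, without any shared translation table.
```

The same input always yields the same output as long as the secret and the candidate list are the same on every system. With a short candidate list, different original values can map to the same replacement.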
When you apply data scrambling rules to data you'll end up with representative though unrecognizable data. There are many techniques which can be used.
Data masking best practices
There are a number of data masking best practices. On a data level, the question is which (PII) data should be masked. When have you masked enough data to become compliant, while keeping the data representative enough for the test organization to use it as test data?
There are also some data masking challenges to keep in mind. What's important is knowing where data is stored. If you know where and how data is stored, you're able to deploy data masking rules. An important best practice on a data level is: do something with date of birth and postal area. Research shows that if these remain unchanged, a person is still fairly easy to identify.
On an organization level you’re able to discuss where data masking is executed. It is important that it is as secure as possible. Preferably we see it happening in a staging area for example.
Tips for data masking
There are several recommendations to be given; maybe the most important one is: try to start simple. We see many organizations blowing up the data masking project. It can turn into a big operation, and it probably will, but doing nothing is even worse. Just start in a simple manner and improve along the way. Even if your first masking run isn't 100% perfect, it is better than nothing!
Some other recommendations: start by analyzing where data is stored and discuss the masking rules with the CISO (Chief Information Security Officer) or DPO (Data Protection Officer). Tell them that replacing data with only 'xxxxxx' isn't going to help the business. Just discover where common ground can be found. And if you'd like some help in discovering data masking software, don't hesitate to contact us.
DATPROF follows the software lifecycle of the database vendors. We want to make sure you can anonymize data in the application of your choice. That's why we support all major relational databases, as shown in the table below. If your platform isn't listed, it doesn't mean that we don't support it; in most cases we find a way to make it work (or we develop additional support for database scrambling).
|Database|Supported versions|
|Oracle|11.2 and above|
|Microsoft SQL Server|2008|
|DB2 LUW|10.5 and above|
|DB2 for i|7.2, 7.3|
|PostgreSQL|9.5, 9.6, 10.5, 11, 11.2, 11.6, 12, 12.1|
* Check the Powershell module remarks
Data masking tools
Many organizations and companies have at least one or more environments containing private data they want or need to mask and protect. This can be in the cloud, on premise or in a flat file. To change or transform this data in a consistent way, specific masking solutions can be used.
There are many, many data masking tools, also called data obfuscation tools. We distinguish ourselves through the ease of use of our product DATPROF Privacy and through our customization service. Every customer and every masking need is different, demanding different approaches. Therefore, every organization needs a custom-made template, which we help to develop in the initial PoC phase. The ease of use of the database masking tool allows the customer to make changes and develop a template of their own.
Try for free
Mask privacy sensitive data and generate synthetic test data with DATPROF Privacy. Try 14 days for free. No credit card required.
What is data masking?
Data masking is the process of hiding personal or privacy sensitive data. The main reason is to ensure that the data cannot be traced back to a specific person.
Why data masking?
To protect personally identifiable information, data needs to be anonymized before it is used for purposes like testing and development.
How to mask data?
Data can be masked with the help of masking rules (shuffle, blank, scramble) and synthetic data generation. A good data masking tool combines several techniques to build a proper masking template.
What data masking techniques are there?
Shuffle, blank, scramble are the best known and simplest techniques. More ingenious masking techniques are lookups, custom expressions and replacing data with synthetically generated (fake) data.
When is data privacy sensitive?
A name is personal, but not privacy sensitive. The city that you live in is also not privacy sensitive. It is public information. But the fact that you have a huge debt or a disease makes your data privacy sensitive.
What is deterministic data masking?
With deterministic masking a value in a column is replaced with the same value whether in the same row, the same table, the same database/schema and between instances/servers/database types. This way you can easily mask the data consistently over multiple systems.