Protect personally identifiable information with data masking in non-production databases, comply with legislation and prevent data leaks in QA environments
Nowadays, more and more organizations use dozens of databases and applications for their processes. It’s common to copy those databases for uses other than the primary process. Most organizations create multiple copies of their production systems for purposes such as development, testing, acceptance, training and outsourcing. Many of these systems contain personally identifiable information (PII) or business-critical and privacy-sensitive data. How do you deal with such challenges? In this solution article we inform you about test data masking in a broad sense.
1. Data masking definition
“Masking or obfuscating data is the process of transforming original data using masking techniques to comply with data security and privacy regulations.”
This definition is comparable to the one on Wikipedia, but we think that you’ll execute this process to become compliant. That’s why we include compliance with laws and regulations (like GDPR, PCI and HIPAA).
Several terms are used interchangeably with data masking, such as data anonymization and data obfuscation. For convenience, we use the term data masking.
Data masking is the process of hiding personal data. The main reason is to ensure that the data cannot be traced back to a specific person. There are different methods for masking data. The method you choose depends on the type of data you want to mask.
Scrambled data in testing
Anonymizing or scrambling production data in non-production environments is becoming more and more common. You still have your full data set containing ‘normal’ data, but all sensitive values are modified so that they cannot be linked to the original individual.
2. Why data masking?
Organizations mask data to comply with laws and regulations or to secure data (for example against competitors). Data is mostly obfuscated for non-production purposes like software development, software quality, training, business intelligence or marketing. The process isn’t just blanking data fields; it transforms PII into data that cannot be traced back to a person but keeps the characteristics of the original.
The advantages and benefits
Masking techniques offer several advantages, but the key reason for organizations to start is to reduce their exposure to data security risks. The protection of customers and citizens is increasingly regulated, and new data regulations are being created or updated. Applying obfuscation to personal information ensures that software development and test teams can access the data with reduced risk.
Masking data is an operation in itself and it needs some attention. The first challenge is making private data irreversible while keeping it as characteristic of production as possible (quality): untraceable, yet still usable for testing. The second challenge is creating masked data that is consistent across multiple systems and databases. The third is coping with triggers, constraints, business rules and indexes while executing the transformations.
3. Data protection with GDPR
Copying a database means that you now have to secure not one database but, for example, ten. That’s why most governments have enacted data privacy laws to protect customers and citizens from wrongdoing. If you don’t protect their information, you risk the following:
- Non-compliance with laws and European Union directives concerning data security
- Exposure of personal data to unauthorized users
- Image loss because of bad publicity when data is leaked
- Customers that terminate their relationship because of a lack of trust in security
Privacy sensitive data
When is information privacy sensitive? A name, for example, is personal but not necessarily privacy sensitive. A place of residence isn’t private either; it is public information. A financial situation (like a huge debt) or a disease makes data sensitive. In this example, by separating names, city, disease and debt, the data cannot be traced back to a specific person and is therefore no longer identifiable.
Once you have determined which information should be masked or anonymized, you can choose your data masking technique within DATPROF Privacy. A common procedure is to shuffle data like first and last names to get new first / last name combinations. Another procedure is to blank a column that you don’t need for testing. That way, private data and all its risks are literally removed. Scrambling is another commonly used method to make data unrecognizable: it replaces letters with x and digits with 1.
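The three techniques described above (shuffle, blank, scramble) can be illustrated with a minimal Python sketch. This is not how DATPROF Privacy implements them; the function names, the column names and the sample rows are assumptions made up for illustration, and the data is held in plain Python dictionaries rather than in a database.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Return new rows with the values of `column` redistributed among the rows."""
    rng = random.Random(seed)  # a seed makes the run repeatable
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def blank_column(rows, column):
    """Return new rows with `column` emptied (set to None)."""
    return [{**row, column: None} for row in rows]


def scramble_value(text):
    """Replace letters with 'x' and digits with '1'; keep other characters."""
    return "".join(
        "1" if ch.isdigit() else "x" if ch.isalpha() else ch
        for ch in text
    )


def scramble_column(rows, column):
    """Return new rows with `column` scrambled character by character."""
    return [{**row, column: scramble_value(row[column])} for row in rows]


# Hypothetical sample data, invented for this example.
customers = [
    {"first_name": "Anna", "last_name": "Jansen", "phone": "06-1234", "notes": "VIP"},
    {"first_name": "Bram", "last_name": "de Vries", "phone": "06-5678", "notes": "late payer"},
    {"first_name": "Carla", "last_name": "Bakker", "phone": "06-9012", "notes": ""},
]

masked = shuffle_column(customers, "last_name", seed=42)
masked = blank_column(masked, "notes")
masked = scramble_column(masked, "phone")  # "06-1234" becomes "11-1111"
```

Note that a plain shuffle can leave some rows with their original value, especially in small tables, and that scrambling removes most of the test value of a field. Which technique fits depends on how the column is used in testing.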
Synthetic data generation
Another masking approach is generating synthetic data. This approach can be used in two ways:
- Fill empty (new) databases using synthetically generated data from scratch
- Replace privacy sensitive information with synthetically generated data
When you already have existing data and databases, the big advantage of the latter approach is that the schemas and structures of the original data are preserved when sensitive data is replaced with synthetically generated data. Deterministic masking ensures that the same original value is consistently replaced with the same generated value, regardless of which application, platform or system the data is in.
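One possible way to get this kind of consistency is to derive the replacement from a keyed hash of the original value. The sketch below assumes that approach; it is an illustration, not a description of how DATPROF Privacy works internally. The pool of synthetic names, the secret key, the `deterministic_pick` helper and the two sample rows are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical pool of synthetic replacement values.
SYNTHETIC_LAST_NAMES = [
    "Avery", "Brook", "Carver", "Dale", "Ellis", "Frost", "Gale", "Hart",
]

# Assumption: the key is kept secret and is not available in test environments.
SECRET_KEY = b"replace-with-a-secret-kept-outside-test-environments"


def deterministic_pick(value, secret_key, pool):
    """Map `value` to an entry of `pool`.

    The same value and key always yield the same entry, so every system
    that is masked with the same key gets the same replacement.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


# The same person appears in two hypothetical systems.
crm_row = {"customer_id": 17, "last_name": "Jansen"}
billing_row = {"invoice_id": 903, "customer_last_name": "Jansen"}

masked_crm = {
    **crm_row,
    "last_name": deterministic_pick(crm_row["last_name"], SECRET_KEY, SYNTHETIC_LAST_NAMES),
}
masked_billing = {
    **billing_row,
    "customer_last_name": deterministic_pick(
        billing_row["customer_last_name"], SECRET_KEY, SYNTHETIC_LAST_NAMES
    ),
}

# Both systems now hold the same synthetic name for this customer.
print(masked_crm["last_name"] == masked_billing["customer_last_name"])
```

With a small pool, different original values can map to the same replacement, so a column that must stay unique needs a larger pool or a different strategy.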
When you apply data scrambling rules to data, you’ll end up with representative though unrecognizable data. There are many techniques which can be used.
There are a number of best practices for data masking.
At the data level, the question is which (PII) data should be masked. When have you masked enough data to be compliant while keeping the test data representative enough for the test organization to use?
It is important to know where data is stored. If you know where and how data is stored, you’re able to apply data masking rules. An important best practice at the data level is: do something with date of birth and postal area. Research shows that if these remain the same, you’re still fairly identifiable.
At the organization level, you can discuss where data masking is executed. It is important that this is as secure as possible. Preferably it happens in a staging area, for example.
There are several recommendations to give; maybe the most important one is: try to start simple. We see many organizations blowing up the data masking project. Just start in a simple manner and improve the data masking rules along the way. It may well turn into a big operation, but doing nothing is even worse. So even if your first masking run isn’t 100% perfect, it is better than nothing!
Some other recommendations: start by analyzing where data is stored and discuss the masking rules with the CISO (Chief Information Security Officer) or DPO (Data Protection Officer). Tell them that replacing data with only ‘xxxxxx’ isn’t going to help the business. Find out where common ground exists. And if you’d like some help, don’t hesitate to contact us.
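What “doing something” with date of birth and postal area could look like is sketched below. It assumes generalization as the technique: the birth date is reduced to the year and the postal code to its first characters. This is one option among several (shifting dates by an offset is another), and the function names, the `keep` parameter and the sample record are hypothetical.

```python
from datetime import date


def generalize_birth_date(birth_date):
    """Keep only the year of birth by moving the date to 1 January."""
    return birth_date.replace(month=1, day=1)


def generalize_postal_code(postal_code, keep=2):
    """Keep the first `keep` characters; replace the rest (except spaces) with 'X'."""
    head, tail = postal_code[:keep], postal_code[keep:]
    return head + "".join(ch if ch == " " else "X" for ch in tail)


# Hypothetical record, invented for this example.
person = {"birth_date": date(1984, 7, 23), "postal_code": "9401 HJ"}

masked = {
    **person,
    "birth_date": generalize_birth_date(person["birth_date"]),    # date(1984, 1, 1)
    "postal_code": generalize_postal_code(person["postal_code"]),  # "94XX XX"
}
```

Generalizing keeps the age group and the region usable for testing. If tests depend on exact ages or valid postal codes, a different rule may be needed.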
DATPROF follows the software lifecycle of the database vendors. We want to make sure you can anonymize data in the application of your choice. That’s why we support all major relational databases, as shown in the table below. If your platform isn’t listed, it doesn’t mean that we don’t support it; in most cases we find a way to make it work (or we develop additional support for database scrambling).
|Database|Supported versions|
|Oracle|11.2 and above|
|Microsoft SQL Server|2008|
|DB2 LUW|10.5 and above|
|DB2 for i|7.2, 7.3|
|PostgreSQL|9.5, 9.6, 10.5, 11, 11.2, 11.6, 12, 12.1|
* Check the PowerShell module remarks
Many organizations and companies have one or more environments containing private data they want or need to mask and protect. This can be in the cloud, on premises or in a flat file. To change or transform this data in a consistent way, specific masking tools can be used.
There are many, many database masking tools. We distinguish ourselves through the ease of use of our product DATPROF Privacy and our customization service. Every customer and every masking need is different. Therefore, every organization needs a custom-made template, which we help to develop in the initial PoC phase. The ease of use of the database masking tool allows customers to make changes and develop templates of their own.
What is data masking?
Data masking is the process of hiding personal or privacy sensitive data. The main reason is to ensure that the data cannot be traced back to a specific person.
Why data masking?
To protect personally identifiable information, data needs to be anonymized before it is used for purposes like testing and development.
How to mask data?
Data can be masked with the help of masking rules (shuffle, blank, scramble) and synthetic data generation. A good data masking tool combines several techniques to build a proper masking template.
When is data privacy sensitive?
A name is personal, but not privacy sensitive. The city that you live in is also not privacy sensitive. It is public information. But the fact that you have a huge debt or a disease makes your data privacy sensitive.