Inspect privacy sensitiveness, size & quality
How valuable would it be to have insight into your organization’s data sources? Would you treat a source differently if you only needed 10% of its data? Does your source contain privacy sensitive data, and if so, where is it stored?
Time to get insight into your data!
Do you know your data?
Data landscapes generally contain many different data sources, such as CRM, ERP or HRM systems. Most organizations have multiple systems and multiple copies of these systems. Do you have any idea how privacy sensitive the different sources in your environment are? Some of them probably contain privacy sensitive information, and some more than others. Because you don’t know for sure, it’s very important to inspect these sources and find out whether they indeed contain privacy sensitive information and where it is stored.
Software teams use DTAP (Development, Testing, Acceptance, Production) environments for development and testing purposes. We see that multiple copies of the different sources exist across these environments. So now it’s not only the production data that you have to worry about, it’s also all of its copies – sometimes even up to fifteen copies of a single system.
If you look at a CRM or ERP system, for instance, you’ll see a lot of copies. One copy could be for testing purposes, another for acceptance purposes, and there can be several of each.
Systems tend to change over time, in size for example. We continually see sources increase in size. Your CRM system might grow by several percent every couple of months, and that means its copies in the different environments grow as well. Ask yourself: do you understand your data? Or is it a black box to you?
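To make the effect of copies and growth concrete, here is a minimal sketch of the arithmetic. The function name and all parameters are our own illustrative assumptions, and it assumes every copy is a full copy that grows at the same rate as production.

```python
def total_footprint_gb(production_gb, copies, growth_per_period, periods):
    """Estimate the total size of a source plus all its full copies.

    Assumes (hypothetically) that each copy is a complete copy of
    production and that growth compounds once per period.
    """
    grown = production_gb * (1 + growth_per_period) ** periods
    # Production itself counts once, plus one full-size instance per copy.
    return grown * (1 + copies)


# Illustrative input only: a 100 GB source with 15 copies, growing 3%
# every two months, projected over a year (6 periods).
estimate = total_footprint_gb(100, copies=15, growth_per_period=0.03, periods=6)
print(f"Estimated footprint after one year: {estimate:.0f} GB")
```

In practice copies are often refreshed at different moments, so treat the result as a rough upper bound rather than a measurement.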
Explore the source
Occasionally we see a source that is not privacy sensitive, is stable in size (not growing) and exists as just a single copy. That is nice and straightforward, and often no action is needed. But more often we see the need to explore these sources. We explore a source by generating statistics about it and then doing data profiling. This data profiling gives you an understanding of the level of sensitivity of the different tables and the information they hold.
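As a rough illustration of what profiling for sensitivity can look like, the sketch below checks which share of the values in each column matches a pattern that suggests personal data. The patterns, the threshold and the function names are hypothetical assumptions for this example; real profiling needs rules tuned to each source and locale.

```python
import re

# Hypothetical patterns, deliberately simple; not a complete rule set.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s()-]{7,}$"),
}


def sensitivity_hits(values):
    """Return the fraction of non-empty values matching each pattern."""
    texts = [str(v).strip() for v in values if v not in (None, "")]
    if not texts:
        return {label: 0.0 for label in PATTERNS}
    return {
        label: sum(1 for t in texts if pattern.match(t)) / len(texts)
        for label, pattern in PATTERNS.items()
    }


def flag_sensitive_columns(table, threshold=0.8):
    """Map column name to the best-matching label, if it passes the threshold."""
    flagged = {}
    for column, values in table.items():
        label, share = max(sensitivity_hits(values).items(), key=lambda kv: kv[1])
        if share >= threshold:
            flagged[column] = label
    return flagged


sample = {
    "contact": ["a@example.com", "b@example.org", ""],
    "status": ["open", "closed"],
}
print(flag_sensitive_columns(sample))
```

A column that passes the threshold is only a candidate; someone who knows the source should still confirm whether it really holds personal data.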
When we generate statistics, we are really looking for striking, unusual data in sources. This can be anything from remarkably long names to special characters or negative salaries. It is always interesting to see how data sources are filled with different types of data. If you explore your sources, you’ll probably discover anomalies that you would never have thought of yourself, such as names (privacy sensitive data) in comment fields.
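The kind of statistics described here can be sketched in a few lines. The example below is an assumption-laden illustration, not a description of any particular tool: it takes rows as plain dictionaries and reports, per column, the counts that tend to reveal long values, special characters and negative numbers.

```python
import re


def profile_column(values):
    """Return simple statistics for the raw values of one column."""
    non_null = [v for v in values if v is not None and v != ""]
    texts = [str(v) for v in non_null]
    numbers = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(texts)),
        "max_length": max((len(t) for t in texts), default=0),
        # Anything other than word characters, whitespace and . , ' -
        "special_chars": sum(1 for t in texts if re.search(r"[^\w\s.,'-]", t)),
        "negatives": sum(1 for n in numbers if n < 0),
    }


def profile_table(rows):
    """Group row dictionaries by column and profile each column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: profile_column(vals) for name, vals in columns.items()}


rows = [
    {"last_name": "Jansen", "salary": 3200},
    {"last_name": "O'Neil#", "salary": -150},
    {"last_name": None, "salary": 4100},
]
for column, stats in profile_table(rows).items():
    print(column, stats)
```

Values that stand out in such a report, like the negative salary or the stray `#` in a name above, are the starting point for a closer look.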
Get in control
Data keeps changing. That means you need some kind of reporting in place: time-stamped information about the state of a data source at a certain point in time. The availability of statistics and data profiling gives you control over the source. Being in control enables you to develop a strategy on how to handle the source.
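One simple way to picture such reporting is a time-stamped snapshot that can be compared with an earlier one. The structure and names below are hypothetical; the row counts would come from whatever statistics you already collect.

```python
from datetime import datetime, timezone


def take_snapshot(source_name, row_counts, taken_at=None):
    """Record the state of a source (here: row counts per table) with a timestamp."""
    moment = taken_at or datetime.now(timezone.utc)
    return {
        "source": source_name,
        "taken_at": moment.isoformat(),
        "row_counts": dict(row_counts),
    }


def compare_snapshots(old, new):
    """Return the change in row count per table between two snapshots."""
    tables = set(old["row_counts"]) | set(new["row_counts"])
    return {
        table: new["row_counts"].get(table, 0) - old["row_counts"].get(table, 0)
        for table in sorted(tables)
    }


before = take_snapshot("crm", {"customers": 1000})
after = take_snapshot("crm", {"customers": 1100, "notes": 50})
print(compare_snapshots(before, after))
```

Keeping these snapshots over time shows which tables grow, which stay stable, and when a new table appears.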
Test data insight and control over the source are needed for every test data management process. You can use them as input for your data masking efforts to anonymize the source. Or you can decide how to reduce the size of the source, in a subsetting process for example. They are also useful for dealing with data quality or other data-related questions. It’s valuable to understand the sensitivity and to use these insights to create a strategy around the expansion of the source: will it continue to grow or is it stable in size?
With data insight you have the option of cleaning the source – you can get rid of all anomalies. But if you’re anonymizing a privacy sensitive source (for test data management purposes), you want to do the exact opposite. In that case you want to make sure that all data quality issues are kept intact after anonymizing, so that you can test with data that is as ‘production like’ as possible. For example: a distinctive last name can already be a privacy issue in itself. Another issue can be that ‘interesting’ data remains in your databases after a data migration. These issues can require different approaches and different masking rules. So it’s really about knowing your data and getting in control.
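To illustrate what keeping data quality issues intact can mean, the sketch below masks a value while preserving its length, letter case, digits-versus-letters layout and any special characters. It is a simplified assumption of how such a rule could look, not a production-grade anonymization method: the function name and the secret are hypothetical, a character can occasionally map to itself, and non-ASCII letters are left unchanged.

```python
import hashlib
import string


def mask_preserving_shape(value, secret="demo-secret"):
    """Replace ASCII letters and digits deterministically; keep everything else.

    Length, case, punctuation and spacing survive, so anomalies such as
    remarkably long names or stray special characters remain testable.
    """
    digest = hashlib.sha256((secret + value).encode("utf-8")).digest()
    masked = []
    for index, char in enumerate(value):
        byte = digest[index % len(digest)]
        if char.isascii() and char.isalpha():
            alphabet = string.ascii_uppercase if char.isupper() else string.ascii_lowercase
            masked.append(alphabet[byte % 26])
        elif char.isascii() and char.isdigit():
            masked.append(string.digits[byte % 10])
        else:
            # Punctuation, whitespace and non-ASCII characters are kept as-is.
            masked.append(char)
    return "".join(masked)


print(mask_preserving_shape("O'Neil-Smith 3rd"))
```

Because the same input and secret always give the same output, a value that appears in several tables is masked consistently across them.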
If this information (statistics, profiling) is available and you know where interesting data is stored in your source, then you have insight and can define actions to improve it. It also creates value for building subset and masking templates. So if you want to open the black box, you need insight into your test data.
Also watch our corresponding webinar “How do you explore your privacy sensitive data sources?”