DATPROF Privacy: Basic Online Training
Data Masking & Synthetic Data Generation
Instructions:
- Watch both training videos.
- Complete the short quiz located below the training videos.
- After submitting the quiz form, scroll down for your score.
- If you successfully pass the test you will qualify to receive and share your DATPROF Basic certificate on LinkedIn. In the event that your score falls below 80%, you will be granted the opportunity to retake the quiz after a review of the training material.
Intro to the DATPROF Platform
Basic Training Video DATPROF Privacy
Video transcript
Hi everyone and welcome to today’s video in which we’ll be taking a look at the basics of DATPROF Privacy, our very own data masking and synthetic data generation tool.
With it, you, the user, can create high-quality automated test data across your entire organization.
In this video, we’ll start by setting up a database connection within Privacy. After this, we’ll import all the necessary meta data, define a few basic functions on our data, quickly going over conditions, consistency, and dependencies, and finally, execute a run. Without further ado, let’s switch over to Privacy.
Alright, after having successfully installed Privacy and opening it for the first time, we can start creating a project. To do this, we’ll press the New Project button here. Supply a name for our project. Let’s call it MyTestProject.
A quick query if I want to overwrite a project that already exists with this name, which I do. Once we have created a project, we’ll be landed in the project settings page. This is a page that allows us to connect to a database, using the edit connection button.
Import the meta data that we want to use during our project creation and sync meta data as our data model changes over time to make sure that the project remains up to date.
Let’s press the edit connection button. Here, we see the connection editor. This has two main portions. On the left hand side, we have all of the saved connections. So connections we’ve used before and that we have elected to save. And on the right hand side we have the project connections. This is a selector where we can select a database type and then supply values in order to connect to the database. Depending on which database we choose, the values we need change depending on what the database expects us to supply.
Normally, we would supply values like the host and port of the database here and then we’re able to test the database connection, save it as a saved database connection on the left hand side here, or clear our selection and create a new connection. For now, let’s select a database connection I’ve already saved in the past. We can also press the test button here, which will confirm that we can connect to the database I’ve configured, which is good. Then we’ll use this connection.
After this, we need to import the meta data that we need to use in the project. Pressing the import meta data button, opens the import wizard which shows us an overview of all the tables that we were able to pull from the database. You can see here we have a customers table, locations table, etcetera. On the right hand side, we have a similar overview that is currently empty. All the tables in this overview are tables that are currently inside the project. What we can do is select a table from the left hand side, then press the single arrow to move it into our project.
If we were to store this, we would have a project with a single table in it. The contacts table. We can also select multiple using the shift key and moving them around with the single button. Or, select them from the right hand side and move them back. Using the double button, it’s also possible to move all of the tables from one side to the other side. For instance, move all into the project or move them all back. For now, we’ll move all the tables into the project, except for one that I’ve intentionally created to leave out. The my_numbers table.
Press next. Now we have a similar menu where, instead of tables, we’re importing foreign key definitions. Which are important to maintain the validity of the database. Let’s import all of these. Press next. Now we get a short overview of the process of importing these definitions. Press next. See there we no issues. Can also check the logging by folding open this button. Everything seems to have gone fine. Press finish. Let’s say the data model changes over time. For instance, we add columns to tables in the database, or we create tables or drop tables. What we can do is we can press the sync meta data button in order to, once again, connect to the database. Where we can now compare what’s known in the database to what’s known in the project.
The my_numbers table that we’ve left out earlier, is still in the database. We still detect it when we want to sync our meta data. If we were to press next here, we can elect to import it and include it to the project. We can also do this for changed tables or new foreign keys or changed foreign keys. We don’t need to do that for now. Let’s press cancel.
Alright, at this point, we’ve created a project, we’ve connected to the database, we have all the meta data that we want to use. We can start handing on over to the masking portion of Privacy now, or we can start defining some functions in order to alter our data. As we open the masking menu, we can see that the tables we’ve imported in the previous step are included in an overview on the left hand side.
If we’ve imported tables across multiple schemas, we can use the drop down box in order to select which schema we want to view. Likewise, if we want to search for various tables inside of the selection, we can use the search bar above.
Let’s select the customers table here. Once we do, you can see that the overview in the center of this screen now displays all the information about the various columns that are inside of this table. Not only does this show the name of the various columns, but also the data type and any other information you might need in order to define functions.
Starting off, let’s select our bank column. Within the contacts of our data model, the bank column is the column that contains the information about the names of banks that our customers use. So this might be something like Lloyds Bank, Deutsche Bank or the ING Bank.
In order to add a function, we simply select the column and press the add function button, and press shuffle. Doing this opens the function editor. The function editor is a vehicle for modifying any functions you define, and changes depending on which function you select. For instance, for the shuffle function, we simply select which columns we want to shuffle, and then, based on that selection, a bank that was formerly ING Bank for a specific customer, might be shuffled around to become something like Deutsche Bank or Lloyds Bank. Vice versa, other banks also change and shuffle around. This makes it so that the banks that are in the data set after we masked, are no longer retraceable to customers.
We can also quickly go over, while we’re looking at this function, is condition. A condition is a defined clause, a SQL clause you can enter, that allows you to filter the results of any function to only trigger on those rows. For instance, if I were to say here: where customer ID is, for instance, equal to 2000, then only one row would be masked in this entire data set. This can be desirable if you’re using multiple functions on a single column. Or if you know that you only want to mask certain rows or if you know that certain rows are important to mask and others are not so important. For now, let’s just press OK.
After this, you can start looking at a data field, for instance, the date of birth. This is also information that we don’t want to remain in our test database. Selecting the date_of_birth column, press the add function and then use the fixed day in the same month or year function. This is a very simple function. What this does is it simply looks at dates in a column and then resets them back to the same day in the same month. Press OK.
And lastly, we want to change the first name of customers that are in this data set. We select the first_name column, press the add function and now we specify a generation function. As you can see in the function editor, there are a wide array of various different generators that we can select from. This may seem daunting at first, but for now, let’s just pick a simple generator and go over how this works. Let’s select the dinosaur generator. What this does is it goes over all the first name column values that are in our data set, and one by one replaces them with random dinosaur names. For instance, a customer that was Joey originally, might become a Brontosaurus. Or a customer who’s name is Fred, might become a Spinosauros. While we’re looking at this, it might be good to go over deterministic masking as well.
There are two various ways of generating. Firstly, we can do random generation, which means that every single time we execute this template, the generation is ran. So, Joey might become a Spinosaurus in the first deployment, but then if we reset the database and execute again, Joey might become a Megoraptor. Which may be undesirable. What we can also do is we can mask deterministically. What this means, is that we define a Salt Value, which is then used for hashing.
For instance, we can just replace this with MySalt. And once we do, this means that all the original values that are identical and have the same salt, will always be masked in the same way. So if we have for instance Joey in this database, and we execute the first run, he’ll become a Brontosaurus. But Joey will also become a Brontosaurus on the second run, and the third run, and so on. And this even goes across multiple various database types. So for instance if we were to use the same data model on an Oracle database, and on a MySQL database. Or for instance a PostgreSQL database. As long as we have the value Joey, and we’re using the same generator, across all of these databases, no matter how many times we deploy, Joey will always become a Brontosaurus, which may be desirable.
Lastly, let’s quickly go over the dependencies. Dependencies simply indicate that we want certain functions before or after other functions. Likewise, with the condition, this may be desirable if we have multiple functions on a single column. So let’s simply just define a condition and save this.
So we can see here on the left hand side, and on the right hand side, all of the functions we’ve defined before. So let’s say for instance, we want to execute the fixed day in month on the date of birth column first, and then we want to do the first name generator, where we generate dinosaur names, and afterwards we want to shuffle the banks. Press OK.
So now that we have a defined project with a few functions we’re ready to deploy and actually modify the data in the database.
For this, there are two options, we can either deploy the template directly in privacy or we can generate a package and execute this project in run time, executing projects within runtime is a few advantages over executing privacy. Primarily that the runtime application can be distributed to multiple users, which allows you to distribute the ability to create test data to end users or multiple administrators.
On top of this, the runtime environment allows you to organize and maintain your existing projects through things like version control and the grouping of multiple projects under a single group or environments to start, we need to generate a package to do this. We can simply press the project menu button in privacy, the top left corner and generate for run time and press generate. So as you can see automatically the file explorer opens and it shows us the generated file. The zip file is the template file that we’ll be using.
As we switch over to runtime, we can select the group that we’re using here. MS SQL server and then the environment we’re using, which is a database connection. So let’s go for the SQL server 2022 production. And once I enter this, you can see that we already have our project uploaded here. So we have my test project version 1.0 0.0.
So what we can do either for a new application, we can press the install application button or we can also upgrade an existing application into a new version. So let’s upgrade over installing. Once you press upgrade, we get a short selection of all the various applications that we’ve already uploaded.
But what we’ll do is we’ll open the file explorer and we’ll drag and drop our file into run time and let’s make this version two, then we press upload and upgrade and now we can go to the settings tab see if all the settings are correct.
This is a place where you can edit various execution variables and other things so that you can change the behavior of an execution. This all looks fine to me. So I press save changes and now we can execute a run to do this. We simply press start and then select mask.
So once this starts going, you’ll see that we output the logging of the various steps that we’ll be taking such as the various system steps of creating tables, using tables and executing the actual functions. So you’ll see us go through here any moment. And as these steps go by, we can open up the individual steps and see what the longing for these steps were. So here we have one step which I know will fail the safety deal action, which I’ve intentionally made a small error in.
So in a moment, this will fail. And then we can see what logging is being outputted to see why this is being caused. So if you click on the step, we get shown the logging for various steps and we can see what went wrong and then we can decide what to do with this. So we can either retry this module or we can ignore it for. Now, let’s just ignore this and go through.
Now, you can see other modules are being processed and because this takes a while, let’s cut to when the run is finished. And then we can go over some of the logging information.
All right, the run is done. And now we have all the various information about how this run is proceeded. So as you can see, 184 modules were completed without any issue, one of them resulted in an error and a few of them were ignored. So once the run is done, we can check these various log steps for any information about how the run is completed. But we can also open things like the deployment logging to see all the individual steps that we’ve taken during the execution and we can open the audit reports.
So let’s take a look at this. The audit report is a short file that includes all of the audit information about this run. So we can see how many rows are affected for various functions, whether every step was successful. And this is kind of a key part of maintaining this kind of administrative side of things in order to make data masking, not only functional but also provable to things like security officers or people who are in charge of the data ownership.
So this has been a short introduction to both privacy and runtime. And even though we’ve gone through it in a pretty fast pace, this should give you all of the information that you need. In order to get started masking data. On top of this, there’s a lot more information available both in our documentation and other video footage. So please feel free to take a look at that and try and absorb as much of that in order to get the most out of using our tools.