Settings
When starting a new project, the Settings screen is where we configure the foundational details of the Analyze project, such as selecting the database to connect to and choosing the analysis profiles to apply. Additionally, the Settings screen allows you to import metadata from the source database and synchronize it if any changes have occurred since the project was created.
Since we need to know which database we’re connecting to, the logical starting point is the Connection Editor. To access it, simply click Edit Connection.
Connection Editor
To analyze a database or specific parts of it, a connection to the database is required. This connection facilitates importing and synchronizing metadata and is essential for executing Analyze functions on the database.
Datprof Analyze supports the following types of databases:
Oracle Database
Microsoft SQL Server
DB2 for Linux, Unix & Windows
DB2 for iSeries
MySQL/MariaDB
The Connection Editor allows you to define, select, and save database connections. On the right-hand side of this screen, the Project Connection section provides an overview of the database connection details used for the current project.
Managing your sensitive data when contacting DATPROF
It’s entirely possible that when depending on our staff for support of our systems, the topic of transferring your project file to DATPROF comes up. Luckily, due to the way we design our software we don’t handle the data within your database, and a project doesn’t contain any sensitive data about your datasets. However, because it’s possible to store and save the connection data to databases in our software a project file can contain privacy or security risk bearing connection data. Therefore, if you would like us to look at one of your project from a support perspective we strongly urge you to delete any existing connection data in the Connection Editor.
As this is information we have no desire to have within our system, and could contain sensitive information, we will prune this information in any event this is attached erroneously to protect our data privacy conformity. Nevertheless, it falls to the submitter of any ticket to be wary of privacy sensitive data and remove it where possible.
Project Connection
Different database types require specific variables to establish a connection, so the user must select the desired database type from the dropdown menu at the top of this section. Below is a brief overview of the required connection data for each database type. Once you’ve reviewed and confirmed your connection details, you can click "Use This Connection" to apply the current configuration and close the dialog, or click "Cancel" to discard any changes.
Test
After specifying the connection parameters, clicking this button will attempt to establish a connection to the database. If successful, a notification titled "Connection Established" will appear. If the connection test fails, Analyze will provide a detailed error message explaining what went wrong.
Save
Once the connection parameters have been specified, clicking this button will add the defined connection details to the Saved Connections section of the Connection Editor. This feature is particularly useful for users who frequently work with the same database across multiple projects. After saving, selecting an entry from the Saved Connections will automatically fill in the connection data in the Project Connection tab. You can then simply test the connection and use it for your project.
It’s important to note that the password is not saved by default, so the user will need to enter the password each time Analyze is started. However, this can be changed by ticking the "Save password" option.
Where are my saved connections stored?
Once a connection is saved, it is written to a text file named "Analyze_connections.lst" in the same local environment where Analyze is installed. Typically, this file is located in the following directory:
”<Windows User>\AppData\Local\DATPROF”
You can easily share this file with collaborators or replace it as needed, which can speed up the process of connecting to the same database, especially when multiple databases are in use.
New
Clicking this button will clear any connection data currently entered in the Project Connection tab, allowing you to start fresh and enter new connection details without having to manually remove each field.
Oracle
Username: The user name to login.
Password: The password.
Type: Direct allows the user to enter the individual data to connect to a database. Connect Descriptor allows the user to paste a connection string to be used to validate the connection instead.
Host: Hostname or IP-address of the database server.
Port: TCP Port.
SID / Service: Unique database identification.
Microsoft SQL Server
Host: Hostname or IP-address of the database server.
Port: Port of the SQL Server database.
By default this is 1433.
When empty, the default port will be used.Instance: If a named instance is used, fill in this field with the name of the instance. This property is optional, then the default instance on the host is used.
Database: Name of the database
Use Windows Authentication: Use your windows authentication to log on to the SQL Server database.
When Windows Authentication is left unchecked:
Username: The user name to login.
Password: The password.
PostgreSQL
Username: the username used on the postgreSQL database for authentication.
Password: the password used on the postgreSQL database for authentication.
Host: the hostname or IP address of the postgreSQL database.
Port: the port on which the database is available. By default for PostgreSQL databases this is port 5432.
Database: the name of database which you'd like to use in this project. This should be identical to the name in the actual database.
DB2 for Linux, Unix and Windows
Host: Hostname or IP-address of the database server.
Port: TCP port on which the database server is listening.
Database: Name of the target database.
Username: The user name to login.
Password: The password.
MySQL / MariaDB
Username: the username used on the MySQL/MariaDB database for authentication.
Password: the password used on the MySQL/MariaDB database for authentication.
Host: the hostname or IP address of the MySQL/MariaDB database.
Port: the port on which the database is available.
Database: the name of database which you'd like to use in this project. This should be identical to the name in the actual database.
Configuration, Profiles and Comments
Now that we've gone over most of the basic functionality, let's take a look at the Configuration tab, the Profiles tab, and Comments.
Configuration
Analyze attempts to maximize performance by running multiple parallel processes. This is a way to execute multiple different processes at the same time by opening multiple connections to the database. This is a value you can tweak yourself. The more parallel processes, the faster Analyze will run. Excessively high numbers of parallel processes can cause some unwanted behavior during execution of templates depending on the specifications of your local database and server hardware. By default this is set at 8. For most use-cases, this is plenty. If you experience issues during deployment (Deadlocks are a good example), a good first step can be to turn down the number of parallel processes to see if this impacts the deployment of a template.
Setting maximum parallel processes to 1
If you set the maximum allowed parallel processes to 1 the performance of any project will tank considerably, as this is an incredibly inefficient way of performing tasks. Therefore, we don't advise ever doing this aside for if you need to test certain performance related issues.
Profiles
Profiles are expressions which allow you to identify your data. What this means in layman's terms is that you create a function which checks a certain condition, and if this condition is true it will mark any location in the database (per table column) where this condition is true. This may sound daunting at first, but once you understand the core reasoning behind these profiles, these become self-explanatory. Profiling is primarily a tool made to identify the location of (sensitive) data, and not meant to address data quality issues. This is an important distinction to make. Using profiles allows us to see where potential problems lay in our data. After this, it becomes easier to resolve such issues manually through SQL commands or other means, or report the findings to others. As such, profiles are also an easy way to visualize to a layman which database columns have potential issues.
In order to start executing our profiles we need to know (roughly) what our database looks like. This can be done by Gathering statistics. This is explained in the chapters to come. Once this is done, we can start taking a look at how to configure our profiles. Below is an image of what a typical profile configuration might look like.
Let's take a look at the various fields and how to enter them to get the most out of their usage.
Profile name
This is the actual name given to a profile. This is also the text that will later be displayed on any column that fulfills this profile's criteria. As such, it's important to give profiles a short, descriptive name that explains what a fulfilled condition means on a column.
Examples of profile names
Let's imagine we're trying to build a profile that checks whether a column contains IBAN (European bank account) numbers. An example of a bad name would be something like "Regex1" or "Profile1", as this name showing up on a specific table doesn't actually tell the user anything. A good name would something along the lines of "IBAN_PRESENT". This conveys to the user in a glance what's wrong with a marked column without having to check the profiles tab.
Type
There are three types of profiles present within Analyze, each described briefly below.
Values
The type Values checks whether one of the given values is exactly present in the database. An important distinction here is that this only returns as True when the exact value is found in a column. This is mostly used for columns where you know exactly what can be entered, or have data that is recurring. Examples of this would be gender, multiple-choice data fields, etc etc.
Examples of values
If we configure our profile to check for value spaghetti, it will go through any column in search of rows that contain spaghetti, and only spaghetti. If a column contains the defined value plus additional information, it does not match. For instance, if this profile would check a column with content "spaghetti bolognese", it would return as False.
Contains
The type Contains checks whether (part) of the definition is present in any column. This does not require an exact match. If you're looking for a specific value which always starts with a specific string of text or number series, this can help you sort through large text fields quickly.
Examples of contains
If we configure our profile to check for contains spaghetti, it will go through any column in search of rows that contain spaghetti anywhere in the field, even if other data is also present in the field. If a column contains the defined value plus additional information, it will still match. For instance, if this profile would check a column with content "spaghetti bolognese", it would return as True.
Regex
Regex (Regular Expression) is a way to search within text strings, and is a widely used standard which can be consulted on http://www.rexegg.com/. Using regular expression allows a user to define very specific functions which check for certain patterns within columns.
Definition
A definition is either one of multiple values to check against, or a regular expression function. For Values and Contains, when supplying multiple strings to check against, it is important to realize that Analyze expects a few things:
Commas between values to be checked for.
No spaces between values or commas. In order to use multiple values follow this example → Value1,Value2,Value3,Value4
No spaces at the end of a collection of values.
For Regex type profiles this is where you paste your regular expression.
Percentage
A function has to be supplied a percentage. This percentage is the amount of positive returns within a specific column before the profile is applied to it. The usage for this is to determine how much of a specific occurrence is considered significant. For instance; it can be reasonably expected that for certain profiles, it's not interesting to mark a column as faulty if only 2% of columns contain matching values.
Examples of percentages
Imagine we define a profile which checks whether a column contains "spaghetti". A configuration of that would look like the following:
If we run profiling on a database, it will check all columns whether at least 40% of all rows of that column contain "spaghetti" anywhere in it. If this is the case, a profile will be applied to that column.
If a column named Italian_Recipes contains 1000 rows, and 350 of them contain the word "spaghetti", then the profile is not applied. This is because 350/1000 rows = 35%, which is less than the 40% we specified.
If the same column contained 400 rows that did contain "spaghetti", it would apply the profile to this column. If the same columns contained 500, 800, or even all rows, it would also apply the profile.
Description
The description field allows you to enter a description for what your profile is meant to do or other information regarding it. This is a simple text field.
Comment
Here you can enter your comments to keep track of the work you have done on any given project.
Importing Metadata
Using the button Import metadata (or using the top-left menu option Project → Import and synchronize → Import metadata wizard), a wizard for importing meta data of the specified database is started. Importing the metadata through this wizard consists of three steps:
Selecting which tables and to import (and from which schema)
Selecting which foreign keys to import;
Starting the metadata import.
Step 1: Selecting the schemas and tables
During this first step, the user can select the tables to be imported. The screen consists of two halves.
The left part shows the metadata available in the database. In the drop-down box (top left) the scheme containing the tables can be selected. In the list below the user can select one or more tables which will be imported.
The right part shows the selected tables to imported.
With the single arrow ( > ), the selected tables are added to the list of tables to import. The double arrow ( >> ) will add all tables from the selected schema in the list of tables to import.
The Next button moves the wizard to the next step, selecting all tables moved to the right-hand side as tables to be imported.
Step 2: Selecting relations (foreign keys)
The next dialog allows the user to select the foreign key relations (FKs) to import. This dialog has the same format as the previous dialog. The list of available FK’s are derived from the selected tables. Again, the user can select one or more FK’s and add them to the list FK’s to import.
Normally, it's desirable to import all foreign key relations.
The Next button moves to the next step.
Step 3: Importing the specified metadata
The last dialog of the wizard allows the user to actually import the selected metadata (tables and FKs). By clicking on Start, the application will retrieve the selected metadata from the database.
The progress of the import can be monitored with the Progress Bar.
By clicking on Logging, the application shows the logging during import. This logging also provides input for resolving any errors that occurred during the import.
Synchronizing metadata
The Metadata Synchronization Wizard helps the user to process any changes to the data model of the target database, ensuring that anonymization can be executed on a changed data model. Synchronization can take place through a DME file or an ODBC database connection (Project Connection).
In the example below we use the ODBC Database connection.
The following screen shows a summary of the differences between the metadata from the target database and the metadata already imported into the application.
Changes may include tables, columns and Foreign Keys. In time they can be added, changed or deleted.
The Synchronization wizard does not distinguish between self-defined Foreign Keys and removed Foreign Keys from the database. They are both marked for deletion in the wizard.
By clicking next, a dialog will be opened for each type of change, where the user can decide whether he wants to process the change in his project.
These dialogs have the same format as the dialogs for importing metadata.
The last screen allows the user to process the selected changes in his project.
The progress of the synchronization can be monitored with the Progress Bar.