File Masking

DATPROF Runtime not only enables data masking within databases but also supports the masking of external data files. This functionality allows you to apply the same masking rules used in database environments to (un)structured files, ensuring consistent and secure handling of sensitive information across your entire data landscape.
Using a masking template, Runtime interprets the structure of the file and applies the configured masking rules to the relevant data fields.
File masking is typically used in scenarios where:
Data files are exchanged between systems or teams.
Sensitive content is exported for testing, development, or analysis.
Compliance requires anonymization of personal or confidential information outside the database.
This functionality can be fully automated and integrated into your Runtime workflows, providing a streamlined approach to secure data handling across both databases and files.

Getting Started
This section helps you prepare your environment for creating and running DATPROF File applications. Once your environment is set up and directories are selected, you can proceed to Creating an Application.
Setting up the Environment
Before you can begin building a File Masking application, you'll need to create a group and add an environment. For step-by-step instructions on creating groups and environments, refer to the Groups and Environments section of the Runtime documentation.
Directory Settings
Runtime requires Input, Translation, and Output directories for File Masking. You can verify the directory paths by clicking the “Validate” button. Currently, only local or network files can be masked.
Support for Azure and AWS storage will be added in future releases.
Once the group and environment are in place, you can proceed with defining the file structure, selecting masking rules, and configuring the masking template for your application.

Creating an Application
Once you’ve set up a group and environment, you can start installing or building an application. Do this by clicking “Install application” in the Environment overview and then selecting “Create a new application”. This will open the application editor:
The interface of the Application editor is quite intuitive, but we’ll walk you through the key features to ensure you get the most out of it. Start by entering the following information:
Name: Enter a unique and descriptive name for your application. This name will be used to identify your application within the DATPROF Runtime interface.
Version: Specify the version of the application you’re creating. Versioning helps you manage updates and track changes to your application over time.
File Type: Select the file type that you need to process. Currently, Parquet, CSV, XML and JSON Line (JSONL) files are supported.
Support for additional file formats will be introduced in future versions of DATPROF Runtime.
Once you’ve entered the application details, you’re ready to add file patterns and masking functions!

Parquet Files
Runtime masks fields in Parquet files using Spark SQL expressions to locate, filter, and transform specific columns. This distributed SQL engine efficiently handles large-scale data, enabling conditional logic and complex transformations for flexible masking.
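As an illustration only (Runtime does not expose its engine as user code), the hypothetical PySpark sketch below shows the kind of Spark SQL transformation this implies; the file path and column names are invented for the example.

```python
# A minimal sketch, not Runtime's actual implementation: it shows how a
# Spark SQL expression can conditionally transform a column in a Parquet
# file. File paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parquet-mask-sketch").getOrCreate()

df = spark.read.parquet("input/CUSTOMERS_10k.parquet")

# Conditional logic in Spark SQL: blank the email unless the row is flagged.
masked = df.withColumn(
    "EMAIL",
    F.expr("CASE WHEN KEEP_ORIGINAL THEN EMAIL ELSE NULL END"),
)

masked.write.mode("overwrite").parquet("output/CUSTOMERS_10k.parquet")
```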
File Pattern
The file pattern is used to specify which files should be masked by the application. This pattern tells Runtime how to locate the target files in the file system or data source.
You can use the following approaches when defining your pattern:
Exact file names: Use the full file name if it’s always the same.
Example: CUSTOMERS_10k.parquet
Wildcard patterns: Use wildcards to match files with dynamic elements, such as timestamps, sequence numbers, or environment identifiers.
Example: CUSTOMERS_*.parquet
Wildcards allow flexibility in identifying files without needing to update the application for every filename change.
Tip: Ensure your pattern is specific enough to avoid unintentionally matching unrelated files, especially in directories containing multiple file types or data sources.
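To sanity-check a pattern before configuring it, you can simulate glob-style wildcard matching with Python's fnmatch; this assumes Runtime's wildcards follow the usual glob convention, where * matches any sequence of characters.

```python
# Illustrative check of wildcard matching, assuming glob-style semantics.
from fnmatch import fnmatch

print(fnmatch("CUSTOMERS_20240101.parquet", "CUSTOMERS_*.parquet"))  # True
print(fnmatch("ORDERS_10k.parquet", "CUSTOMERS_*.parquet"))          # False
```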
CSV Files
Runtime masks fields in CSV files using Spark SQL expressions to locate, filter, and transform specific columns. This distributed SQL engine efficiently handles large-scale data, enabling conditional logic and complex transformations for flexible masking.

File Pattern
CSV files require some additional configuration. While some default settings are already pre-filled, you may need to adjust the following options depending on your file:
Exact file names: Use the full file name if it’s always the same.
Example: CUSTOMERS_10k.csv
Wildcard patterns: Use wildcards to match files with dynamic elements, such as timestamps, sequence numbers, or environment identifiers.
Example: CUSTOMERS_*.csv
Custom date format: Define the date format used in the CSV so values are interpreted correctly.
Custom timestamp format: Specify how timestamps are written in the file.
Field delimiter: Choose the character (e.g., comma, semicolon, tab) that separates fields.
Has header row: Enable this option if the first row contains column names instead of data.
Advanced Settings

Comment prefix character: Lines starting with this character will be ignored. Leave blank to disable.
Empty string representation: Defines how empty values are represented in the file.
Encoding/Charset: Specifies the character encoding of the file. Common choices include UTF-8 and ISO-8859-1.
Escape character: Character used to escape quotes inside quoted fields. Default: \.
Ignore leading whitespace: Trims whitespace at the beginning of each field.
Ignore trailing whitespace: Trims whitespace at the end of each field.
Infer column data types: Automatically detects and assigns data types (e.g., integer, double, date) for each column.
Malformed rows handling: Defines how malformed rows should be handled:
PERMISSIVE (default): Loads all rows. Missing fields are set to null, and corrupt records are placed in _corrupt_record.
FAILFAST: Stops immediately if any malformed row is encountered.
DROPMALFORMED: Skips rows with schema mismatches without raising errors.
Null value string: Text string to be interpreted as null (e.g., null or NA).
Quote character: Character used for quoting field values, typically ".
Unescaped quote handling: Controls how the parser handles values with unescaped quotes:
STOP_AT_CLOSING_QUOTE: Treats unescaped quotes as part of a quoted value; parsing continues until a matching closing quote is found.
BACK_TO_DELIMITER: Treats the value as unquoted; collects characters until the next delimiter or line ending.
STOP_AT_DELIMITER: Similar to BACK_TO_DELIMITER, but stops at the first delimiter or line end.
SKIP_VALUE: Skips the problematic value and replaces it with the configured nullValue.
RAISE_ERROR: Immediately stops processing and raises an error.
Use multiline fields: Enables support for field values that span multiple lines (e.g., large text blobs wrapped in quotes).
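These option names and values, including the PERMISSIVE/FAILFAST/DROPMALFORMED modes and the unescaped-quote strategies, correspond closely to Spark's CSV reader options. The sketch below maps them onto PySpark under that assumption, for instance to reproduce or test a configuration outside Runtime; paths are hypothetical.

```python
# Assumption-based mapping of the CSV settings above onto Spark's CSV
# reader options; illustrative only, not Runtime itself.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-options-sketch").getOrCreate()

df = (
    spark.read.format("csv")
    .option("header", True)                    # Has header row
    .option("sep", ";")                        # Field delimiter
    .option("quote", '"')                      # Quote character
    .option("escape", "\\")                    # Escape character
    .option("encoding", "UTF-8")               # Encoding/Charset
    .option("comment", "#")                    # Comment prefix character
    .option("nullValue", "NA")                 # Null value string
    .option("inferSchema", True)               # Infer column data types
    .option("ignoreLeadingWhiteSpace", True)   # Ignore leading whitespace
    .option("ignoreTrailingWhiteSpace", True)  # Ignore trailing whitespace
    .option("mode", "PERMISSIVE")              # Malformed rows handling
    .option("unescapedQuoteHandling", "STOP_AT_CLOSING_QUOTE")
    .option("multiLine", True)                 # Use multiline fields
    .option("dateFormat", "yyyy-MM-dd")        # Custom date format
    .load("input/CUSTOMERS_10k.csv")
)
```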
JSONL (JSON Line) Files
Runtime masks fields in JSON Line files using Spark SQL expressions to locate, filter, and transform specific columns. This distributed SQL engine efficiently handles large-scale data, enabling conditional logic and complex transformations for flexible masking.
Important note on attribute ordering:
When processing JSONL files, the order of attributes in each JSON object cannot be guaranteed: attributes are written out alphabetically. For example, {"name": "Delmer", "id": 93601} may be written back as {"id": 93601, "name": "Delmer"}. While this does not affect how data reading software interprets the file, it may appear unusual when inspecting the file manually.

File Pattern
When working with JSONL (JSON Lines) files, you can configure the following options to ensure proper parsing and interpretation:
Exact file names: Use the full file name if it’s always the same.
Example: CUSTOMERS_10k.JSONL
Wildcard patterns: Use wildcards to match files with dynamic elements, such as timestamps, sequence numbers, or environment identifiers.
Example: CUSTOMERS_*.JSONL
Multi-line JSON: Enable this if your JSON is formatted across multiple lines or represented as a JSON array. This is required for pretty-printed JSON files.
Date format: Defines how dates are parsed, using Spark datetime patterns. Default: yyyy-MM-dd
Timestamp format: Defines how timestamps with timezone information are parsed. Default: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
Timestamp without timezone format: Defines how timestamps without timezone information are parsed. Default: yyyy-MM-dd'T'HH:mm:ss[.SSS]
Advanced Settings

These advanced options provide greater flexibility when working with JSONL files, especially when handling non-standard JSON or malformed input:
Allow comments in JSON: Accept Java-style comments (e.g., //, /* ... */) inside JSON documents.
Allow leading zeros in numbers: Permits numbers with leading zeros (e.g., 00123). Note: not part of standard JSON.
Allow single quotes for strings: Accepts 'single-quoted' strings instead of the standard "double-quoted" strings.
Allow unquoted field names: Accepts field names without quotes (e.g., {name: "John"}). Note: not part of standard JSON.
Corrupt record column name: Defines the column name where DATPROF stores corrupt rows when using PERMISSIVE mode.
Encoding: Character encoding of the file (e.g., UTF-8).
Handle malformed rows: Determines how to process corrupt or malformed JSON rows:
PERMISSIVE (default): Loads all rows, placing corrupt data into the defined corrupt record column.
FAILFAST: Stops immediately when a malformed row is detected.
DROPMALFORMED: Skips malformed rows without raising an error.
Locale: Defines the locale used to parse dates and timestamps.
Parse primitives as strings: When enabled, parses numbers and booleans as strings. Useful for schema consistency.
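As with CSV, these settings mirror the options of Spark's JSON reader; a minimal sketch under that assumption, with hypothetical paths:

```python
# Assumption-based mapping of the JSONL settings above onto Spark's JSON
# reader options; illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonl-options-sketch").getOrCreate()

df = (
    spark.read.format("json")
    .option("allowComments", True)             # Allow comments in JSON
    .option("allowNumericLeadingZeros", True)  # Allow leading zeros in numbers
    .option("allowSingleQuotes", True)         # Allow single quotes for strings
    .option("allowUnquotedFieldNames", True)   # Allow unquoted field names
    .option("columnNameOfCorruptRecord", "_corrupt_record")  # Corrupt record column
    .option("encoding", "UTF-8")               # Encoding
    .option("mode", "PERMISSIVE")              # Handle malformed rows
    .option("locale", "en-US")                 # Locale
    .option("primitivesAsString", True)        # Parse primitives as strings
    .option("dateFormat", "yyyy-MM-dd")        # Date format
    .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")
    .load("input/CUSTOMERS_10k.jsonl")
)
```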
XML Files
Runtime masks fields in XML files using XPath 3.1 expressions to locate elements or attributes to transform. XPath expressions are provided relative to the context element and can target any depth in the XML structure. This approach allows flexible and precise masking for complex XML hierarchies.

File Pattern
When working with XML files, you can configure the following options to ensure proper parsing and interpretation:
Exact file names: Use the full file name if it’s always the same.
Example: CUSTOMERS_10k.xml
Wildcard patterns: Use wildcards to match files with dynamic elements, such as timestamps, sequence numbers, or environment identifiers.
Example: CUSTOMERS_*.xml
Context Element: In XML, there is no inherent concept of rows like in databases or CSV files. To logically split your XML file into rows, each representing a single record, you must define a context element using an XPath expression. This tells the engine which node should be treated as one logical unit for processing (see the sketch after this list).
For example, in the XML structure below:
<pop:PopulationRegistry>
  <pop:Person>...</pop:Person>
  <pop:Person>...</pop:Person>
</pop:PopulationRegistry>
You can specify the context element either as //pop:Person, meaning “every <pop:Person> element is one row”, or as the more explicit /pop:PopulationRegistry/pop:Person.
This context element is crucial because all masking functions applied to the file will be evaluated relative to this node.
Timestamp Format: Define the timestamp format for parsing timestamp strings with datetime masking functions. Default: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
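To see what “one row per context element” means in practice, the sketch below iterates the <pop:Person> context elements of a sample file with lxml. lxml implements XPath 1.0, which is enough for these expressions (Runtime itself uses XPath 3.1); the file name population.xml is hypothetical.

```python
# Illustrative only: iterate the context elements of a sample file.
# "population.xml" is a hypothetical file using the namespace from the
# example above.
from lxml import etree

NS = {"pop": "http://example.org/population"}
tree = etree.parse("population.xml")

# Each matched node is treated as one logical "row" for masking.
for person in tree.xpath("//pop:Person", namespaces=NS):
    print(person.get("id"))
```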
Adding Masking Functions
Let’s start with a simple case where we want to replace all last names inside <pop:LastName> elements.
We’ll use this XML file as a demo:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="person-style.xsl"?>
<!-- Complex XML Example: Fictive Population Registry -->
<pop:PopulationRegistry
    xmlns:pop="http://example.org/population"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://example.org/population population.xsd"
    version="2.1">
  <pop:Person id="93601">
    <pop:Name>
      <pop:FirstName>Delmer</pop:FirstName>
      <pop:MiddleName>
        <![CDATA[Chris]]>
      </pop:MiddleName>
      <pop:LastName>Fenner</pop:LastName>
    </pop:Name>
    <pop:BirthInfo>
      <pop:DateOfBirth>2005-02-11</pop:DateOfBirth>
      <pop:PlaceOfBirth>New York</pop:PlaceOfBirth>
      <pop:Country code="USA">United States</pop:Country>
      <pop:Gender>M</pop:Gender>
    </pop:BirthInfo>
    <pop:ContactInfo>
      <pop:Email type="personal">DFenner@live.com</pop:Email>
      <pop:Email type="personal">Delmer.Fenner@hotmail.com</pop:Email>
      <pop:Email type="work">Delmer.Fenner@shell.com</pop:Email>
      <pop:Phone type="mobile">+44-7911-123456</pop:Phone>
    </pop:ContactInfo>
    <pop:Documents>
      <pop:Passport SSN="437-02-2223" issuedBy="USA" issueDate="2010-05-20" expiryDate="2025-05-20"/>
    </pop:Documents>
    <pop:Notes>
      <!-- Special notes regarding the person -->
      <pop:Note xml:lang="en">Subject relocated to "Canada" in 2019.</pop:Note>
      <pop:Note xml:lang="fr">
        <![CDATA[Réside à Montréal depuis 2019.]]>
      </pop:Note>
    </pop:Notes>
    <pop:Salary>45393</pop:Salary>
  </pop:Person>
  <pop:Person id="93602">
    <pop:Name>
      <pop:FirstName>Irvin</pop:FirstName>
      <pop:LastName>Mcgrew</pop:LastName>
    </pop:Name>
    <pop:BirthInfo>
      <pop:DateOfBirth>1948-08-29</pop:DateOfBirth>
      <pop:PlaceOfBirth>Berlin</pop:PlaceOfBirth>
      <pop:Country code="DE">Germany</pop:Country>
      <pop:Gender>M</pop:Gender>
    </pop:BirthInfo>
    <pop:MaritalStatus status="married" since="2002-06-15"/>
    <pop:Employment>
      <pop:Occupation>Systems Analyst</pop:Occupation>
      <pop:Employer name="Capital One Financial Corp." id="E1003"/>
    </pop:Employment>
    <pop:Documents>
      <pop:Passport SSN="26-98-1475" expiryDate="2028-05-25" issueDate="2011-05-25" issuedBy="USA"/>
    </pop:Documents>
    <pop:Salary>45897</pop:Salary>
  </pop:Person>
  <pop:Person id="93605">
    <pop:Name>
      <pop:FirstName>Bonnie</pop:FirstName>
      <pop:LastName>Wiese</pop:LastName>
    </pop:Name>
    <pop:BirthInfo>
      <pop:DateOfBirth>2000-12-25</pop:DateOfBirth>
      <pop:PlaceOfBirth>Amsterdam</pop:PlaceOfBirth>
      <pop:Country code="NL">Netherlands</pop:Country>
      <pop:Gender>F</pop:Gender>
    </pop:BirthInfo>
    <pop:MaritalStatus status="married" since="2002-06-15"/>
    <pop:Employment>
      <pop:Occupation>Lieutenant</pop:Occupation>
      <pop:Employer name="Marine" id="E1003"/>
    </pop:Employment>
    <pop:Documents>
      <pop:Passport SSN="523-54-3453" issuedBy="USA" issueDate="2011-05-25" expiryDate="2028-05-25"/>
    </pop:Documents>
    <pop:Salary>58957</pop:Salary>
  </pop:Person>
  <pop:TotalSalary>150247</pop:TotalSalary>
</pop:PopulationRegistry>
Click Add Masking Function.
Select the function Last name generator.
Define the XPath expression to target the element:
.//pop:LastName
This is a relative path from the context (<pop:Person>); it finds <pop:LastName> even if it is nested inside <pop:Name>.
Alternatively:
./pop:Name/pop:LastName
This is more explicit, but requires you to type the full hierarchy.

Because the context is <pop:Person>, you can use a relative XPath starting with . (dot).
Conditional Masking
Elements
Sometimes you only want to mask values under certain conditions.
Example: Generate new first names for males only
Add a new masking function.
Select the function First name (male).
Define the XPath:
.//pop:FirstName[../../pop:BirthInfo/pop:Gender = 'M']
.//pop:FirstName → finds all first names.
[condition] → only selects those that match the condition.
../../ → navigates up two levels (from <pop:FirstName> to <pop:Person>).
/pop:BirthInfo/pop:Gender → drills down into <pop:Gender>.
= 'M' → applies only when the gender is male.
This way, male first names will be replaced with synthetic male names, while female first names remain unchanged; see the sketch below. You can easily clone this function and adjust it to = 'F' to handle female first names.
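You can dry-run such a conditional XPath outside Runtime; here is a sketch with lxml against the demo file above, saved as the hypothetical population.xml (XPath 1.0 in lxml is sufficient for this expression; Runtime uses XPath 3.1).

```python
# Evaluate the conditional XPath relative to each context element.
from lxml import etree

NS = {"pop": "http://example.org/population"}
tree = etree.parse("population.xml")

for person in tree.xpath("//pop:Person", namespaces=NS):
    for name in person.xpath(
        ".//pop:FirstName[../../pop:BirthInfo/pop:Gender = 'M']", namespaces=NS
    ):
        print(name.text)  # prints Delmer and Irvin; Bonnie (F) is not matched
```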

Attributes
You can also apply conditions using attributes.
Example: Mask Only Personal Email Addresses
Add a Custom expression masking function
Use the following XPath expression to find Email elements where type = 'personal':
.//pop:Email[@type = 'personal']
@type → means “look at the attribute named type”. This will filter <pop:Email> elements where type="personal" (see the sketch below).
If you wanted to filter on an attribute at a different level, for example the id attribute in <pop:Person id="93601">, you could use:
.//pop:Email[../../@id = '93601']
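The same dry-run approach works for attribute predicates, again with the hypothetical population.xml and lxml:

```python
# Select only <pop:Email> elements whose type attribute is "personal".
from lxml import etree

NS = {"pop": "http://example.org/population"}
tree = etree.parse("population.xml")

for email in tree.xpath(".//pop:Email[@type = 'personal']", namespaces=NS):
    print(email.text)  # DFenner@live.com, Delmer.Fenner@hotmail.com
```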
Custom Expressions
In some cases, you want to generate a new value based on multiple other values.
Example: Constructing an email address from first and last name
concat(substring(.//pop:FirstName, 1, 1), '.', .//pop:LastName, '@datprof.com')
Explanation:
The concat function combines multiple values into a single string. In this example, it takes four arguments to construct an email address:
substring(.//pop:FirstName, 1, 1) → extracts the first letter of the first name.
'.' → inserts a literal dot.
.//pop:LastName → appends the last name.
'@datprof.com' → appends the domain.

Result: For the name “Delmer Fenner”, the output is: D.Fenner@datprof.com
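You can verify the expression against the demo file; in this sketch, XPath 1.0 in lxml evaluates concat and substring the same way, and population.xml is a hypothetical copy of the demo file.

```python
# Evaluate the concat expression relative to one context element.
from lxml import etree

NS = {"pop": "http://example.org/population"}
tree = etree.parse("population.xml")

person = tree.xpath("//pop:Person[@id='93601']", namespaces=NS)[0]
expr = ("concat(substring(.//pop:FirstName, 1, 1), '.', "
        ".//pop:LastName, '@datprof.com')")
print(person.xpath(expr, namespaces=NS))  # D.Fenner@datprof.com
```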
Problem: This example uses the original names. To generate emails from masked names, we need dependencies.
Dependencies
The File Masking engine allows you to determine the execution order between functions.
Example:
Mask first names.
Mask last names.
Generate a new email address based on these masked names.
Because the email function is defined after the other two, it automatically uses the masked values.
So Delmer Fenner → Gerry Jagger, and the generated email becomes G.Jagger@datprof.com.
You can change the execution order by dragging functions using the drag indicator in front of the function name.

Value Lookup
Sometimes you want to replace values with predefined translations from a file (CSV, Parquet, JSON).
Example: Replace SSNs using TT_SSN.csv
customer_id,ssn_masked
93601,"11-22-3333"
93602,"22-33-4444"
93605,"33-33-5555"
Configuration steps:
XPath: point to the SSN attribute in <pop:Passport> using .//pop:Passport/@SSN.
File format and location: define the format (CSV in this case) and the location of the lookup file.
Input mapping: join the XML to the CSV by linking the @id attribute in <pop:Person id="93601"> to the customer_id column in the CSV file. In this case we can directly use @id because the context is already //pop:Person.
Output mapping: choose ssn_masked to replace the SSN value.
Important: The lookup file must be accessible to the DATPROF Runtime application.
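Conceptually, the input and output mappings form a key lookup: the join key read from the XML selects a row in the lookup file, and the output column supplies the replacement value. A minimal Python sketch of that idea (Runtime performs this join internally):

```python
# Conceptual sketch of a value lookup using the TT_SSN.csv example above.
import csv

with open("TT_SSN.csv", newline="") as f:
    # customer_id is the input mapping; ssn_masked is the output mapping.
    lookup = {row["customer_id"]: row["ssn_masked"] for row in csv.DictReader(f)}

print(lookup["93601"])  # 11-22-3333
```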

Calculating Totals
When masking data such as a salary, it can be necessary to recalculate a sum. In the example file above, each person has their own salary, and at the bottom there is a total salary element.
Example: Mask salaries and recalculate total salary
Add a Random whole number masking function
Use the XPath expression to find Salary:
.//pop:Salary
Enter a minimum of 40000 and a maximum of 50000
Save and close the function
Add a Custom expression masking function
Use the following XPath expression to find pop:TotalSalary (the relative . (dot) is not used, because TotalSalary is at a higher level than the context node pop:Person):
//pop:TotalSalary
For the Expression, use the following XPath to sum all pop:Salary elements:
sum(//pop:Salary)
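You can check an aggregate expression against the demo file outside Runtime; a sketch with lxml (XPath 1.0's sum() over a node-set behaves the same here), assuming the demo file is saved as population.xml:

```python
# Evaluate the aggregate expression over the whole document.
from lxml import etree

NS = {"pop": "http://example.org/population"}
tree = etree.parse("population.xml")

print(tree.xpath("sum(//pop:Salary)", namespaces=NS))  # 150247.0 before masking
```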
Masking and Generator functions
Currently, File Masking in Runtime supports 39 masking functions. These functions allow you to apply a wide range of masking techniques, such as character replacement, data encryption, or randomization, to ensure that sensitive information is properly protected while maintaining the integrity and usability of the dataset. Whether you’re masking personal identifiers (PII), financial data, or other confidential information, DATPROF Runtime’s masking functions enable fine-grained control over how data is transformed.
This section will introduce you to the available masking functions for File Masking.
Masking
Masking Functions | Description |
---|---|
Constant value | Creates a constant value in the generated dataset that is identical for every field. For example, entering MyFavoriteValue here will generate ‘MyFavoriteValue’ for every field in the resulting dataset. |
Date/time modifier | This function changes the existing date to a fixed day in the same month, or to a fixed day in the first month of the same year. With this change, in most cases the new values remain functionally viable. |
Value lookup | With this function the replacement value will be obtained from a lookup file or translation file. |
Blank | This function will NULL the selected field(s). |
Custom expression | The Spark SQL custom expression should resolve to a value or contain a function that returns a value. |
Sequential number | Generates a sequential number for every field that starts at a specified value and increments by the step value per field. Additionally, you can define a Padding for your generated integers: a fixed width to which each generated integer is padded with leading zeros. For example, a padding of 3 with a start of 8 and a step of 2 will generate the following: 008 → 010 → 012 |
Sequential date/time | Identical to the Random Date/Time generator except that this generator creates a sequential datetime for every field. Supplying a maximum datetime is optional. A number to increment the starting date is required, and can only accept whole numbers. Any unit of time to increment by can be chosen, from seconds to years. |



Basic Generators
Basic Generators | Description |
---|---|
Random date/time | Generates a random Date/Time value per field between a specified minimum and maximum datetime that corresponds with the underlying field’s datatype. |
Random decimal number | Generates a random decimal number between a supplied minimum and maximum value for every field. Using the Scale setting the decimal accuracy can be defined. For instance, a generator with a minimum value of 0 and a maximum value of 1000 using a scale of 4 might generate 132.4202. |
Random whole number | Generates a random integer between a supplied minimum and maximum value for every field. |
Random string | Generates a random string of lower- and uppercase letters for every field. The minimum and maximum length for strings can be defined. |


Business Generators
Business Generators | Description |
---|---|
Credit card account number | Generates a random credit card account number. Using Issuer(s), you must define one or multiple issuers to determine which syntax the generated account numbers adhere to. |
IBAN (International Bank Account Number) | Generates a valid IBAN number for every field. Using Country Code(s) you can specify which country codes you’d like your resulting IBAN codes to use. |
Currency code | Generates a three letter currency code for every field. |
Currency symbol | Generates a currency symbol for every field. |
User agent | Generates user agents per field, for example: Mozilla/4.0 (compatible; MSIE 5.13; Mac_PowerPC), Opera/8.53 (Windows NT 5.2; U; en). |
A-Number/GBA Number (Dutch township security number) | Generates a 10 digit GBA number per field. |
Genre | Generates a media genre for every field. |
BSN Number (Dutch citizen service number) | Generates a valid Dutch social security number per field. |
SSN (US Social Security Number) | Generates an SSN for every field. You can specify which format is used to separate the resulting numbers. |
Job | Generates a profession for every field. Using the Language(s) you can specify which language(s) you want your resulting job names generated in. |
Military rank | Generates a military rank name for every field. Using Department(s) you can specify which branches of the armed forces you’d like to include in your resulting dataset. |


Advanced Generators
Advanced Generators | Description |
---|---|
Regular expression | Generates values based upon a regular expression. The syntax for the regular expression used in Runtime/Privacy is specific to the package we use, so please refer to the specifications here. |
Name Generators
Name Generators | Description |
---|---|
Brand | The Brand generator creates a unique brand name for each field in your dataset. It is particularly useful when working with data that requires distinct and realistic brand names. |
Color code | Generates a color hexcode (Both three-digit shorthand and six-digit full length hexcodes) per field. |
Color | Generates a random color per field. (Ex. Baby Blue, Sky Blue, Soft White) |
Dinosaur | The Dinosaur generator generates a Dinosaur name for each field in your dataset. |
First name | The First name generator creates a unique first name for each field in your dataset. It is particularly useful when working with data that requires distinct and realistic first names. |
Full name | The Full name generator creates a unique full name for each field in your dataset. |
First name (female) | The First name (female) generator creates a unique female first name for each field in your dataset. |
First name (male) | The First name (male) generator creates a unique male first name for each field in your dataset. |
Last name | The Last name generator creates a unique last name for each field in your dataset. |
Random word | The Random word generator generates a random word for each field in your dataset. |


Location Generators
Location Generators | Description |
---|---|
City | Generates a random city name per field. Here, you can specify for which countries you’d like to generate random names using the Countries drop-down menu. |
Company | The Company generator generates a random company name for each field in your dataset. |
Country | Generates a random country name per field. The Language(s) option specifies in which language the country will be written. Multiple options can be enabled simultaneously. |
Street | Generates a random existing street name per field. The Countries option allows you to specify for which country street names will be generated. |
Two letter country code | Generates a random two-letter country code per field. |
Three letter country code | Generates a random three-letter country code per field. |



Advanced Settings
Condition
The Condition tab enables you to apply filters to the selected fields based on specific values. This allows you to define precise masking rules that apply only when certain conditions are met.
For example, in our demonstration, we’ll use the First Name (Female) generator to mask the selected field only when the gender field contains the value "F". This ensures that the masking function is applied selectively, preserving the integrity of other records while maintaining realistic and consistent data.


Translation
You can choose to store the result of a data masking function in a translation file. This is particularly useful for implementing consistent data masking across files; see the sketch below. Translation files store the old and the new value for each field value in the file.
The Translation tab inside a masking function lets you create and name translation files:

Note: Provide a comma-separated list of columns that will be used as keys in the translation file (e.g. id, customerId).
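The point of a translation file is repeatability: the same original value always maps to the same masked value, within a file and across later files that reuse the translation. A minimal sketch of the idea in Python; the storage format shown is illustrative, not Runtime's on-disk layout.

```python
# Conceptual sketch: a translation table guarantees repeatable masking.
translations: dict[str, str] = {}  # original value -> masked value

def translate(value: str, generate) -> str:
    """Return the stored masked value, generating and storing it once."""
    if value not in translations:
        translations[value] = generate(value)
    return translations[value]

# The same input yields the same output once a translation is stored.
print(translate("Fenner", lambda v: "Jagger"))  # Jagger
print(translate("Fenner", lambda v: "XXXXX"))   # still Jagger
```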
Comment
The Comment tab allows you to add internal notes or documentation related to your File Masking functions. This section is not used during execution but serves as a helpful space to provide context, instructions, or other information for yourself or team members.
Typical uses for the Comment tab include:
Describing the purpose or scope of the masking function.
Documenting important changes or version history.
Adding notes for future maintenance or handover.
Listing assumptions, known limitations, or special handling rules.
Providing clear and concise comments can improve collaboration and make it easier to manage and troubleshoot the application over time, especially in environments where multiple team members are involved.
Deploy and Run
Once you have created an application, you can deploy and run it to start masking your files. DATPROF Runtime provides two options: running the application directly or integrating it into automated workflows using the API.
Deploy Now
The Deploy Now option allows you to run your File Application immediately. This is the fastest way to check that your configuration is correct and that the masking process runs as expected.
Click “Start...” next to your application.
On the “Deploy now” tab, select Masking to start the application.
Monitor the execution and review the output in the target directory.

API Usage
For integration with other tools, scheduling, or automation, you can deploy and run your file masking applications through the DATPROF Runtime API. With the API, you can:
Trigger masking jobs from external systems.
Automate test data provisioning as part of CI/CD pipelines.

For more information on how to use DATPROF Runtime’s API, please refer to the Using the API section.