General Functionality

This chapter describes the functionality of File Masking in more detail.

Creating a masking configuration

Before a run can be started a configuration file must be created. This configuration file specifies which which parts of the input file must be masked.

Configuration files are placed in the ‘conf’ folder. The following examples showing the different masking functions available.

List of available functions:

Function	Explanation
blank	Sets the value of the matching element to the empty string
scramble	Replaces any 'normal' character with 'x', and any digit with '1'
random-number	Replaces the value with a random number
shuffle	Makes a list of all (unique) values for this path in this file. This list is then shuffled and used as lookup table. Each value is then substituted with the value from this lookup table
random-iban	Replaces the value with a random IBAN bank account number
replace	Replaces the value with the new value if and only if the match property matches the current value.
random-datetime	Replaces a date or date-time field with a (partly) randomized date or date-time
lookup	Replaces a particular set of values with a new set of values using a lookup table
sum	aggregate function adds all (integer) values of all the xml element matching the "source" path and puts the result in all xml elements matching the "target" path.
count	Aggregate function counts the number of xml elements matching the "source" path and puts the result in all the xml element matching the "target" path.

The configuration file uses a special kind of JSON syntax called _HOCON_. This is a kind of JSON syntax but much more relaxed in the syntax it accepts.
See HOCON (Human-Optimized Config Object Notation) for a complete description of the configuration syntax.

There are four sections in the configuration file:

license - To specify the license.
format - To specify the file format.
functions - Specifying the masking functions.
db - Optional properties to specify database connection.

Example configuration file:

CODE

license = "DPFPRV-1.0-invalid_license-PRV-000-00000000-000000000000-000000"

format.type: "csv"
format.encoding = "utf-8"
format.csv: {
    fieldSeparator = "auto"
    recordSeparator = "auto"
    hasHeader = false
}

functions: [
    {
        function: "blank"
        columns: [3]
    }
    {
        function: "shuffle"
        columns: [4]
	}
]

As the supported file types (xml, csv and flr) are very different, the configuration of the masking functions need different properties depending on the file type.

XML

For xml files, (almost) each function must specify the 'path' property. The 'path' property uses a (subset) of XPath expressions to specify the xml element in the input file on which to apply this function.

The following syntax is accepted for 'path' properties:

`/node1`	matches with the direct child node named "node1"
`//node1`	matches with the direct or indirect child node named "node1"
`/node1/@attr1`	matches with an attribute "attr1" on node "node1"
`/node1/[@attr1]`	matches with an element named "node1" that has an attribute "attr1"
`/node1[text()='abc']`	matches with an element named "node1" that has content equal to "abc"
`/node1[@attr1='abc']`	matches with an element named "node1" that has an attribute "attr1" with value "abc"

For example, to specify to scramble only 'Jaap' and not 'Arie' in the following XML:

XML

<persons>
   <person id=123>
     <name>Jaap</name>
   </person>
   <person id=321>
     <name>Arie</name>
   </person>
</persons>

Use the following configuration:

CODE

{
  function: "scramble"
  path: "//persons/person[@id='123']/name"
}

It is also possible to select an element/attribute for which a sibling element has a particular condition.
For this to work we need to specify two additional properties:

base - specifies a base element
conditions - specifies a list of sub-paths beneath "base" that must (all) be true (i.e. match or exist)

When specifying the 'base' property, the 'path' property will now also be a sub-path beneath 'base'.

Example configuration:

CODE

{
  function: "scramble"
  base: "/persons/person"
  conditions: [ "/id[text()='2']" ]
  path: "/name"
}

Used with XML:

XML

<persons>
  <person>
    <id>1</id>
    <name>John</name>
  </person>
  <person>
    <id>2</id>
    <name>Boris</name>
  </person>
  <person>
    <id>3</id>
    <name>Gerry</name>
  </person>
</persons>

Will result in only scrambling "Boris".

CSV

For csv files, each function must specify the 'columns' property. The columns property specifies to which column indexes this function must be applied.
The column index must be in the range: 1 <= i <= n; where i is the column index and n is the number of columns.

Example configuration to scramble all values in the 3th and 5th columns:

CODE

{
  function: "scramble"
  columns: [3, 5]
}

FLR

For flr files, each function must specify a 'match' property and an 'update' property.

match

The 'match' property specifies to which records this function should be applied.

The simplest match is checking if a string occurs at a certain position in the record.
The following 'match' checks if the string "01" occurs at the start of the record. The "start" is the index
where the field starts (1 is the first position), "length" is the length (in characters) and "pattern" specifies a regular-
expression (in this case it is just the string "01").

CODE

match: { field: { start=1, length=2, pattern="01" } }

A more sophisticated match can include multiple fields combined with "and" and/or "or".
Each "and" or "or" property specifies a list of elements that can in turn be a "field", an "and" or an "or".
In the example below, this function will be applied to a record if it starts with "01" OR if it starts with
"02" AND the field at position 5 contains "AA".

CODE

match: {
  or: [
    { field: { start=1, length=2, pattern="01" } }
    {
      and: [
        field: { start=1, length=2, pattern="02" }
        field: { start=5, length=2, pattern="AA" }
      ]
    }
  ]
}

update

The 'update' property specifies the fields to update (and how).

The 'update' property specifies to which fields this function should be applied and how the field value
should be interpreted. The function will be applied to each field individually.
The 'update' property is a list of elements, where each element must contain a "field". Each "field" must
contain the following properties:

start - at which character position this field starts (starting from 1)
length - the length of the field
padding - either "left" or "right"
paddingChar - the padding character (must be exactly 1 character).

Before the function is applied to the value, the padding characters are removed from the field value.
After the function is applied, the new value will be extended with the padding characters until the
specified length.

Complete example scrambling the (numeric) field starting at position 123 with length 10 only for records starting with string "AA":

CODE

{
  function: "scramble"
  match: { field: { start=1, length=2, pattern="AA" } }
  update: [
    { field: { start=123, length=10, padding="left", paddingCharacter="0" } }
  ]
}

Detailed explanation of available functions

The masking functions are specified as an array of objects where each object defines one masking function.
Depending on the file type (xml, csv, flr) different properties need to be specified for the masking functions.

The general properties of each function are described below.

blank

The "blank" function simply sets the value of the matching element to the empty string.
The "blank" function does not have any properties of its own.

scramble

The "scramble" function replaces any normal character with 'x', and any digit with '1'.

Properties:

preserveCase - If "preserveCase" is "true", then "Abc" becomes "Xxx", otherwise it becomes "xxx".

random-number

The "random-number" function replaces the value with a random number between the given min and max values.

Properties:

min - minimum value (optional). The value will be 0 when not specified.
max - maximum value (optional). The default value is the max. value of a long (9,223,372,036,854,775,807).

random-iban

The "random-iban" function replaces the value with a random IBAN bank account number.

This function has no properties.

random-datetime

The "random-datetime" function replaces a date or date-time field with a (partly) randomized date or date-time.

Properties:

format - the date-time format.
This specifies the date or date-time format (input and output) according to the Java 11 DateTimeFormatter class.

Example: format: "yyyy-MM-dd HH:mm:ss.SSS"

randomize - a list of field specifiers to control how each field (day, month, year etc.) should be randomized.
Each field specifies how a particular field of the date/date-time must be randomized.
Possible fields: year, month, day, hour, minute, second, epoch-day.

Each field can have modifiers to control how to randomize the field.
- No modifier means just randomize the field to any valid value.
- Fields not specified will not be randomized.
The "epoch-day" (days since 1st of January 1970) is special as it can effect other (date) fields.
Note that the randomization is applied in the given order.

The result will always be a valid date/time.

Possible modifiers:

+-N : randomize the field between [x - N, x + N] where x is the original value and N some positive number.
For example, when used in "year+-10" and the year value is 2018 then randomized value for the year field will be a year between 2008 and 2028 (inclusive).
=N : set the value for the field to N.
For example: "month=2" will make the month field always be set to 2.
If the value would be too large the value is corrected to be in the valid range to ensure the result will always be a valid date.
[A,B] : randomize the field to a value in the given range.
Example: "hour[8,17]" will randomize the hour field to a value between 8 and 17 (inclusive).

Complete example:

CODE

randomize: ["year+-10", "month=2", "day", "hour[8,17]", "minute", "second=0"]

replace

This function replaces all matching strings with the new value if and only if the match property has a match in the current value.

Properties:

search - is an un-anchored regular expression.
The value is searched with this RE. Each text that matched will be replaced by the value of the "replace" property.
It is also possible to specify a group in the RE. When specified only the text of the first group will be replaced.
See for more information about the syntax of regular expressions, the documentation of the Java 11 Pattern class.
Example: search: ".*" will match the complete value.
replace - the replacement text.
All the text that matches will be replaced with the value of the "replace" property.
Example: replace: "newvalue"

shuffle

The "shuffle" function will randomly reorder all values (of different records). To do this it makes a list of all the (unique) values (stage 1). This list is then combined with a shuffled version of the same list and is used as lookup table (all in memory). Each value is then substituted with the value from this lookup table (stage 2).

lookup

This function replaces a particular set of values with a new set of values using a lookup table. There are two flavors; inline and jdbc.
With inline the lookup table is specified in the configuration file.
With jdbc a lookup table from a RDBMS is used.

inline lookup table

In this case the lookup-table is actually a "map" property specified in the configuration file.

Properties:

type - must be "inline"
map - specifies the lookup table as a list of key-value pairs.
Each key and value field is a list of values.The number of values in each key and value must be the same.Also the match expression (different for each file type) must match with the same number of values as specified in the key and value fields.
Matching values from an input record are checked with each key, when they are equal the values will be replaced by the "value" values.

Example:

CODE

{
  function: "lookup"
  type: "inline" 
  map: [
    { key: ["orig firstname 1", "orig lastname 1"], value: ["new firstname 1", "new lastname 1"] }
    { key: ["orig firstname 2", "orig lastname 2"], value: ["new firstname 2", "new lastname 2"] }
  ]
}

jdbc lookup table

The "jdbc" lookup-table is a table in a relational database accessable using JDBC.
The following JDBC drivers are delivered with File Masking:

Oracle
SQL Server
PostgreSQL
DB2 LUW
DB2 iSeries
MySQL / MariaDB

Properties:

type - must be "jdbc"
db - Selects the JDBC connection properties specified elsewhere in the configuration file.
Example: db: "ora123" selects the JDBC properties starting with `db.ora123`.
tablename - The name of the lookup table in the database.
Example: tablename: "MY_SCHEMA.MY_LOOKUP_TABLE"
keycolumns - The list of key-columns in the lookup table to lookup the original values.
Example: keycolumns: [ "ORIG_1", "ORIG_2" ]
valuecolumns - The list of value-columns in the lookup table containing the replacement values.
Example: valuecolumns: [ "NEW_1", "NEW_2" ]

Example:

CODE

{
  function: "lookup"
  type: "jdbc"
  db: "ora123" 
  tablename: "MYLOOKUP"
  keycolumns: ["orig firstname 1", "orig lastname 1"]
  valuecolumns: ["new firstname 1", "new lastname 1"]
}

sum (only for XML files)

The "sum" aggregate function adds all (integer) values of all the xml element matching the "source" path and puts the result in all xml elements matching the "target" path.

Both "target" and "source" specify the xml element in the same way as the "path" properties of the masking functions. See syntax of the "path" property in the XML section.

Properties:

source - specifies the values to be summed
Example: source: "//amount"
target - specifies the value that will be updated with the sum.
Example: target: "//total-amount"

count (only for XML files)

The "count" aggregate function counts the number of xml elements matching the "source" path and puts the result in all the xml element matching the "target" path.

Properties:

source - specifies the values to count.
Example: source: "//transaction"
target - specifies the value that will be updated with the count.
Example: target: "//number-of-tx"