Making Test Data from Real Data

Abstract

When you need to test your application with realistic data and verify the main business processes before releasing it to end users, you often have to anonymize real user data and use it throughout development and testing.

Data depersonalization is the replacement of existing sensitive information in test or development data with programmatically generated information. Users of test, development, or training databases do not need to see the actual information as long as what they have appears real and consistent.

The program introduced here implements a simple replacement algorithm that can depersonalize both structured and unstructured data, including XML, E-mail messages, and SQL scripts.

This is the second article in the series. It demonstrates the updated user interface, which lets you flexibly set up the chain of data replacers, define the replacement rules, save the configured replacers to profile files, and much more.

Read Part I of the series: Depersonalizing Your Test Data

We will show you how to mask sensitive data, such as E-mail addresses, phone numbers, physical addresses, social security numbers, and postal codes, in XML and HTML documents, SQL scripts, and any other plain-text files. Using the DataDepersonalizer tool, you can even replace real IP addresses with realistic-looking generated data.

Let us go ahead and look at how to set up the data source, define the replacement rules, and start the depersonalization process.

Download on GitHub

Step 1 – Define Data Source

When you run the DataDepersonalizer tool, you see a tab page that lets you define both the data source directory and the destination directory, where the masked data is saved.

This page also lets you define the file encoding and specify whether the BOM (byte order mark) should be saved to the resulting text files.

There is one important option on this page: you can specify whether the text files in the source directory contain linked data, such as names, E-mail addresses, postal codes, order numbers, etc. If they do, each unique data item is replaced with the same generated value wherever it occurs. For example, if two files in the source directory contain the E-mail address "steve@domain.com", both occurrences are replaced with the same generated address. Otherwise, a new E-mail address is generated for each file.
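To make the BOM option concrete, here is a minimal Python sketch of what saving with and without a byte order mark means for UTF-8 text. It is an illustration only, not the tool's implementation; the function name `encode_text` and its parameters are hypothetical.

```python
import codecs


def encode_text(text, encoding="utf-8", write_bom=False):
    """Encode text and optionally prepend the UTF-8 byte order mark."""
    data = text.encode(encoding)
    # This sketch handles the BOM for UTF-8 only; other encodings are left as-is.
    if write_bom and codecs.lookup(encoding).name == "utf-8":
        data = codecs.BOM_UTF8 + data
    return data


without_bom = encode_text("Lastname1")
with_bom = encode_text("Lastname1", write_bom=True)
```

The two results differ only in the three leading BOM bytes, which some editors and parsers use to detect the encoding.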
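The linked-data option can be pictured as a lookup table that is either shared by all files or reset for each file. The following Python sketch illustrates the idea under stated assumptions: the class `ValueMapper`, the file names, and the choice to keep the counter running between files are all hypothetical and are not taken from the tool's source code.

```python
import itertools


class ValueMapper:
    """Maps each distinct original value to one generated replacement."""

    def __init__(self, mask, start=1):
        self.mask = mask
        self.counter = itertools.count(start)
        self.seen = {}

    def replace(self, original):
        # Reuse the replacement if this value has already been masked.
        if original not in self.seen:
            self.seen[original] = self.mask.format(next(self.counter))
        return self.seen[original]


# Hypothetical file contents: the same address appears in two files.
files = {
    "orders.txt": "steve@domain.com",
    "invoices.txt": "steve@domain.com",
}

# Linked data: one lookup table is shared by every file.
shared = ValueMapper("someuser{0}@example.com")
linked = {name: shared.replace(value) for name, value in files.items()}

# Unlinked data: the lookup table is forgotten before each file
# (assumption: the counter itself keeps running).
per_file = ValueMapper("someuser{0}@example.com")
unlinked = {}
for name, value in files.items():
    per_file.seen.clear()
    unlinked[name] = per_file.replace(value)
```

With linked data, both files end up with the same generated address; without it, each file receives its own.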

Define Data Source Example

Step 2 – Define Replacement Rules

On this page, you can define one or more data replacers, specify the order in which the replacers will mask each file from the source directory, and set up the individual data replacement rules for each replacer within the chain.
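Conceptually, a replacers chain is an ordered list of text transformations applied to each file in turn. The Python sketch below shows that idea only; the function names, the simplified patterns, and the sample XML are assumptions made for illustration and do not reflect the tool's internals.

```python
import re


def mask_emails(text):
    # Simplified e-mail pattern, sufficient for this illustration.
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "someuser@example.com", text)


def mask_phones(text):
    # Replaces the content of every <Phone> element with a fixed value.
    return re.sub(r"(<Phone>)[^<]*(</Phone>)", r"\g<1>+12345678901\g<2>", text)


def run_chain(text, replacers):
    """Apply each replacer in the configured order."""
    for replacer in replacers:
        text = replacer(text)
    return text


sample = (
    "<Customer><Email>steve@domain.com</Email>"
    "<Phone>+15550001111</Phone></Customer>"
)
masked = run_chain(sample, [mask_emails, mask_phones])
```

Each replacer receives the output of the previous one, which is why the order of the chain can matter.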

Depending on the type and structure of your data, different algorithms are needed to extract and replace sensitive values. The DataDepersonalizer tool offers a wide range of data extraction algorithms. It supports XML and HTML documents, text files that contain "name:value" pairs, and any other text-based data files. Each data extraction algorithm is implemented in a corresponding data replacer.

Suppose you have a set of XML report files from your E-commerce platform that provide information about your customers. To test your reporting system, you need to mask real data, such as E-mail addresses, phone numbers, user and company names, registration IDs, VAT IDs, etc.

You can add the following replacers to your replacement rules: "E-mail Address Replacer" and "XML Document Replacer".

The "E-mail Address Replacer" has only one parameter: the replacement mask. You can specify, e.g., a static anonymous address, such as "someuser@example.com". This value will be used for each E-mail address within your XML document. If you want the value to change dynamically, use the following format: "someuser{0}@example.com". The "{0}" placeholder is replaced with a continuously increasing number. The initial value of this number can be specified in the "Start from Number" textbox.
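The behavior of the mask and the starting number can be sketched in a few lines of Python. This is a simplified illustration, not the tool's code: the function `mask_emails`, its `start_from` parameter, the e-mail pattern, and the sample XML are assumptions, and the sketch does not try to reuse the same replacement for repeated addresses.

```python
import itertools
import re

# Deliberately simple e-mail pattern; real addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text, mask, start_from=1):
    """Replace every e-mail address, filling {0} with a running number."""
    counter = itertools.count(start_from)
    return EMAIL_RE.sub(lambda match: mask.format(next(counter)), text)


sample = "<Email>steve@domain.com</Email><Email>anna@shop.org</Email>"

# A static mask yields the same address everywhere.
static = mask_emails(sample, "someuser@example.com")

# A mask with {0} yields someuser100, someuser101, and so on.
dynamic = mask_emails(sample, "someuser{0}@example.com", start_from=100)
```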

Define E-mail Replacement Mask

The "XML Document Replacer" allows you to set up a list of "Node Name"/"Replace Mask" parameters. Let us add the following items here:

   "Phone" / "+12345678901"

   "LastName" / "Lastname{0}"

   "Company" / "Company Name {0} Inc."

   "Street" / "Lakeside Street {0}"

   "PostalCode" / "1234-567"

   "VatId" / "123456789{0}"

You can add the name of any XML node whose value you wish to replace.
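The node rules listed above can be sketched with Python's standard XML parser. This is only an approximation of the idea: the function `mask_xml`, the per-node counters, and the sample document with its "Customers" and "Customer" elements are hypothetical and are not taken from the tool.

```python
import itertools
import xml.etree.ElementTree as ET

# Node name -> replacement mask, as in the list above.
RULES = {
    "Phone": "+12345678901",
    "LastName": "Lastname{0}",
    "Company": "Company Name {0} Inc.",
    "Street": "Lakeside Street {0}",
    "PostalCode": "1234-567",
    "VatId": "123456789{0}",
}


def mask_xml(xml_text, rules, start_from=1):
    """Replace the text of every element whose tag has a rule."""
    root = ET.fromstring(xml_text)
    # Assumption: each node name has its own running number.
    counters = {name: itertools.count(start_from) for name in rules}
    for element in root.iter():
        if element.tag in rules:
            number = next(counters[element.tag])
            element.text = rules[element.tag].format(number)
    return ET.tostring(root, encoding="unicode")


sample = (
    "<Customers>"
    "<Customer><LastName>Smith</LastName><Phone>+15550001111</Phone></Customer>"
    "<Customer><LastName>Jones</LastName><Phone>+15550002222</Phone></Customer>"
    "</Customers>"
)
masked = mask_xml(sample, RULES)
```

Masks without a "{0}" placeholder, such as the phone number, produce the same value for every matching node.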

Define XML Nodes Replacement Rules

If you need to mask data in some specific format, you can add the "Regex Replacer" and specify custom regular expressions for locating and replacing sensitive data. E.g., to replace URL query string parameters, you can use the following Regex pattern: "trackingid=[0-9]{6}&sid=[0-9]{9}". This pattern matches strings that look like the following:

http://....trackingid=100021&sid=200000021

The replacement mask may look like the following: "trackingid={0:D6}&sid={0:D9}". It makes the replacer insert 6 generated digits for the "trackingid" parameter and 9 digits for the "sid" parameter.
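The pattern and mask can be tried out with a short Python sketch. It assumes that the mask uses .NET-style numeric formatting, where "D6" means at least six digits with leading zeros; the Python equivalent is written as "{0:06d}". The function `mask_query` and the sample URL are hypothetical.

```python
import itertools
import re

PATTERN = re.compile(r"trackingid=[0-9]{6}&sid=[0-9]{9}")

# Python spelling of the mask "trackingid={0:D6}&sid={0:D9}".
MASK = "trackingid={0:06d}&sid={0:09d}"


def mask_query(text, start_from=1):
    """Replace every matching query fragment with zero-padded numbers."""
    counter = itertools.count(start_from)
    return PATTERN.sub(lambda match: MASK.format(next(counter)), text)


# Hypothetical URL with a 6-digit tracking ID and a 9-digit session ID.
url = "http://shop.example/track?trackingid=100021&sid=200000021"
masked = mask_query(url)
```

Because both placeholders refer to argument 0, the same generated number fills both parameters and only the padding width differs.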

Define Custom Regular Expressions

The "Name/Value Pair Replacer" allows you to specify a list of names and a corresponding list of replacement masks using the replacement mask format described above.
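As an illustration of the name/value approach, the Python sketch below masks the values of selected names in a "name:value" text. It is a simplified sketch under assumptions: one pair per line, a colon as the separator, and a hypothetical function `mask_pairs` with per-name counters; the tool's actual parsing rules may differ.

```python
import itertools
import re

# One "name:value" pair per line; spacing around the colon is preserved.
PAIR_RE = re.compile(
    r"^(?P<name>[^:\n]+?)(?P<sep>[ \t]*:[ \t]*)(?P<value>.*)$",
    re.MULTILINE,
)


def mask_pairs(text, rules, start_from=1):
    """Replace the value of every pair whose name has a rule."""
    counters = {name: itertools.count(start_from) for name in rules}

    def substitute(match):
        name = match.group("name")
        if name not in rules:
            return match.group(0)  # leave other pairs untouched
        value = rules[name].format(next(counters[name]))
        return name + match.group("sep") + value

    return PAIR_RE.sub(substitute, text)


sample = "Login: steve\nCity: Boston\nLogin: anna\n"
masked = mask_pairs(sample, {"Login": "user{0}"})
```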

The "HTML Document Replacer" utilizes the HTML tag ID for locating personal data to be replaced.

There are two more data replacers: "Text Replacer" and "HTML Text Replacer". Both replacers allow you to replace statically defined values (called Keys) using a corresponding replacement mask.

You can leave the key list empty; in that case, the replacer replaces the same data with the corresponding values that were generated by the previous replacers.

Additionally, the "HTML Text Replacer" applies HTML entity encoding to the replaced values.
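The difference between the two text replacers can be sketched in Python as follows. This is an interpretation rather than the tool's code: the function `replace_text`, its `html_encode` flag, and the decision to encode both the key and the new value are assumptions made for illustration.

```python
import html


def replace_text(text, replacements, html_encode=False):
    """Replace each key with its value, optionally HTML-entity encoded."""
    for key, value in replacements.items():
        if html_encode:
            # Assumption: both the searched key and the new value are encoded,
            # so that "Smith & Sons" is found as "Smith &amp; Sons" in HTML.
            key, value = html.escape(key), html.escape(value)
        text = text.replace(key, value)
    return text


replacements = {"Smith & Sons": "Company <1> & Co."}

plain = replace_text("Invoice for Smith & Sons", replacements)
encoded = replace_text(
    "<p>Invoice for Smith &amp; Sons</p>", replacements, html_encode=True
)
```

Encoding the replacement keeps characters such as "<" and "&" from breaking the surrounding HTML markup.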

Step 3 – Start Depersonalization

Finally, once you have specified the data source and the replacement rules, you can start the data depersonalization process. On this page, you can define the name of the log file to which DataDepersonalizer saves information about the current progress. Simply press the "Start" button and see the results.

Note: if you use the same directory for both the data source and the destination, make sure that you have a copy of your original data.

Start Depersonalization Example

Download Source Code

Download Stable Release

Ask a Question

License Information

The DataDepersonalizer application is distributed under the GNU LESSER GENERAL PUBLIC LICENSE Version 3.

COPYING.txt

COPYING.LESSER.txt

Sergey Shirokov
Clever Components team
www.clevercomponents.com