Depersonalizing Your Test Data

Abstract

If you need to use real production data to test applications, any sensitive data should be removed before loading it into the development environment.

In this article, we introduce a depersonalization method that protects personal information. The method is implemented in C#.

The program implements a simple replacement algorithm that can depersonalize both structured and unstructured data, including XML, email messages, and SQL scripts.

Read Part II of the series: Making Test Data from Real Data

Download on GitHub

GDPR and Data Depersonalization

Under the European Union’s General Data Protection Regulation (GDPR), companies are required to prevent the accidental loss or distribution of customers’ personal data, as well as unauthorized access to it. Many companies collect and process huge quantities of customer information, which may contain a large amount of sensitive personal data.

Which data should be depersonalized? Any data anonymization approach seeks to conceal identity, i.e. any kind of identifier. Identifiers can relate to any person, alive or dead, including their dependents, ascendants, and descendants. Family names, patronymics, first names, maiden names, postal and email addresses, phone numbers, social security numbers, credit card and bank account numbers, IP addresses, etc. must be removed or replaced before the data is used by developers, testers, or any third party during the application development life cycle.

There are different methods for depersonalizing or anonymizing data: replacement, scrambling, masking, blurring, encryption, etc. Some of these methods can sometimes be reversed; others may break the integrity of structured data.
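To make the difference between two of these methods concrete, here is a minimal sketch that contrasts masking with replacement. The class and method names are hypothetical and are not taken from the attached program; the template format (a `{0}` placeholder filled with a counter) is an assumption based on how masks are used later in this article.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class DepersonalizationMethods
{
    // Masking: hide everything except the last `visible` characters.
    // The hidden characters cannot be recovered from the result.
    public static string Mask(string value, int visible = 4)
    {
        if (value.Length <= visible)
            return new string('*', value.Length);

        return new string('*', value.Length - visible)
             + value.Substring(value.Length - visible);
    }

    // Replacement: substitute a generated value built from a template and a counter.
    public static string Replace(string template, int counter)
    {
        return String.Format(template, counter);
    }
}
```

For example, `Mask("4111111111111111")` returns `************1111`, while `Replace("user{0}@example.com", 1)` returns `user1@example.com`. Masking keeps part of the original value; replacement keeps none of it.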

There are many tools on the market that successfully do all this work, including paid and open-source solutions.

This article provides a simple replacement algorithm that can depersonalize both structured and unstructured data, including XML, email messages, and SQL scripts.

Data Depersonalization in C#

The program can load and process multiple data files. These may be XML documents, email messages, or SQL scripts with INSERT/UPDATE statements.

In the case of XML, the algorithm looks for specific XML nodes, such as <Phone> and <FirstName>, composes new values, and finally replaces the nodes together with their values within the XML source:

// [C#]
private string ReplaceXmlNode(string nodeName, string replaceWithMask, string msgSource)
{
   var xmlNodes = ExtractXmlNodes(nodeName, msgSource);

   foreach (var node in xmlNodes)
   {
      // startFrom is a class-level counter that makes each generated value unique.
      var replaceWith = String.Format(replaceWithMask, startFrom++);
      msgSource = msgSource.Replace(node, String.Format("<{0}>{1}</{0}>", nodeName, replaceWith));
   }

   return msgSource;
}

private string ReplaceXmlNodes(string msgSource)
{
   if (txtXmlNodes.Lines == null || txtReplaceWith.Lines == null)
      return msgSource;

   if (txtXmlNodes.Lines.Length != txtReplaceWith.Lines.Length)
      throw new Exception("The number of XML nodes must be the same as the number of Replace With values");

   for (int i = 0; i < txtXmlNodes.Lines.Length; i++)
   {
      msgSource = ReplaceXmlNode(txtXmlNodes.Lines[i], txtReplaceWith.Lines[i], msgSource);
   }

   return msgSource;
}

Regular expressions are used for extracting the required XML nodes:

// [C#]
private string[] ExtractData(string text, string matchPattern)
{
   var regex = new Regex(matchPattern, RegexOptions.IgnoreCase);
   var matches = regex.Matches(text);
   var list = new List<string>();

   foreach (Match match in matches)
   {
      var data = match.Value;

      // Keep only the first occurrence of each matched value.
      if (!list.Contains(data))
         list.Add(data);
   }

   return list.ToArray();
}

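private string[] ExtractXmlNodes(string nodeName, string text)
{
   // Escape the node name so that characters such as '.' are matched literally.
   var escapedName = Regex.Escape(nodeName);
   string matchPattern = "<" + escapedName + ">(.*?)</" + escapedName + ">";
   return ExtractData(text, matchPattern);
}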
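The listings above depend on form controls and a class-level counter. As a rough, self-contained sketch of the same idea, the code below runs the extract-and-replace steps on a plain string. The class name `XmlNodeReplacer`, the `ref int counter` parameter, and the case-sensitive matching are assumptions made for this illustration, not the attached program's actual design.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical standalone version of the XML node replacement, for illustration only.
public static class XmlNodeReplacer
{
    // Returns the distinct <nodeName>...</nodeName> fragments found in the text.
    public static List<string> ExtractXmlNodes(string nodeName, string text)
    {
        var name = Regex.Escape(nodeName);
        var pattern = "<" + name + ">(.*?)</" + name + ">";
        var result = new List<string>();

        foreach (Match match in Regex.Matches(text, pattern))
        {
            if (!result.Contains(match.Value))
                result.Add(match.Value);
        }

        return result;
    }

    // Replaces every distinct node with a value generated from the mask and a counter.
    // Identical nodes receive the same generated value.
    public static string ReplaceXmlNode(string nodeName, string mask, string source, ref int counter)
    {
        foreach (var node in ExtractXmlNodes(nodeName, source))
        {
            var value = String.Format(mask, counter++);
            source = source.Replace(node, "<" + nodeName + ">" + value + "</" + nodeName + ">");
        }

        return source;
    }
}
```

With the mask `Name{0}` and a counter starting at 1, the input `<FirstName>Anna</FirstName><FirstName>Boris</FirstName><FirstName>Anna</FirstName>` becomes `<FirstName>Name1</FirstName><FirstName>Name2</FirstName><FirstName>Name1</FirstName>`: both occurrences of the first value map to the same replacement.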

A similar approach can be used for extracting and replacing email addresses: a regular expression locates the email addresses within the document.

IP addresses are extracted in the same way, using a different match pattern (the ExtractIpAddresses pattern).
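A minimal sketch of how such extraction might look is given below. The method names `ExtractEmails` and `ExtractIpAddresses` follow the names used in this article, but their bodies and both regular expressions are assumptions: the patterns are deliberately simplified and are not the ones used by the attached program.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical extractor, for illustration only.
public static class PatternExtractor
{
    // Simplified patterns (assumptions); they do not cover every valid address,
    // and the IPv4 pattern does not check that each octet is at most 255.
    private const string EmailPattern = @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}";
    private const string IpV4Pattern = @"\b(?:\d{1,3}\.){3}\d{1,3}\b";

    // Returns each distinct match of the pattern, in order of first appearance.
    public static List<string> ExtractDistinct(string text, string pattern)
    {
        var result = new List<string>();

        foreach (Match match in Regex.Matches(text, pattern))
        {
            if (!result.Contains(match.Value))
                result.Add(match.Value);
        }

        return result;
    }

    public static List<string> ExtractEmails(string text) => ExtractDistinct(text, EmailPattern);

    public static List<string> ExtractIpAddresses(string text) => ExtractDistinct(text, IpV4Pattern);
}
```

Because duplicates are dropped at extraction time, each distinct address is later replaced exactly once, no matter how often it appears in the document.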

The attached C# program uses a simple algorithm to generate replacement values. Each replacement is built from a template with a constant part and a variable part: the template is passed to String.Format together with a counter, and the result replaces the sensitive value:

// [C#]
private string ReplaceEmails(string msgSource)
{
   var emails = ExtractEmails(msgSource);

   foreach (var email in emails)
   {
      var depersonalizedEmail = String.Format(txtReplaceMask.Text, startFrom++);
      msgSource = msgSource.Replace(email, depersonalizedEmail);

      // Also strip the encoded form of the original address so that it does not remain in the output.
      var encodedEmail = GetEncodedEmail(email);
      msgSource = msgSource.Replace(encodedEmail, "");
   }

   return msgSource;
}

This approach preserves data integrity, because every occurrence of the same original value is replaced with the same generated value.
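One way to keep this mapping stable across several files is to remember each original value together with its replacement. The sketch below is an assumption about how that could be done, not a description of the attached program; the class name `ConsistentReplacer` and the case-insensitive comparison are hypothetical choices.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mapping helper, for illustration only.
public sealed class ConsistentReplacer
{
    // Original value -> generated replacement. Case is ignored (an assumption),
    // so "A@X.COM" and "a@x.com" are treated as the same value.
    private readonly Dictionary<string, string> map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    private readonly string mask;
    private int counter;

    public ConsistentReplacer(string mask, int startFrom)
    {
        this.mask = mask;
        this.counter = startFrom;
    }

    // Returns the same generated value every time the same original value is seen.
    public string GetReplacement(string original)
    {
        if (!map.TryGetValue(original, out var replacement))
        {
            replacement = String.Format(mask, counter++);
            map[original] = replacement;
        }

        return replacement;
    }
}
```

With the mask `user{0}@example.com` and a counter starting at 1, the first distinct address maps to `user1@example.com`, the second to `user2@example.com`, and a repeated address receives the value it was given the first time.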

Additionally, the program includes an algorithm that replaces Base64-encoded sensitive data.
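As a rough illustration of the idea, the sketch below replaces a value in both its plain and its Base64-encoded form. It assumes UTF-8 text and assumes that the value was encoded on its own; a value embedded inside a larger Base64 block would not be found this way. The class name `Base64Replacer` is hypothetical, and substituting the encoded replacement is a choice made for this sketch.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class Base64Replacer
{
    // Encodes a string as Base64, assuming UTF-8 as the text encoding.
    public static string Encode(string value)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
    }

    // Replaces both the plain and the Base64-encoded form of the original value.
    public static string Replace(string source, string original, string replacement)
    {
        source = source.Replace(original, replacement);
        return source.Replace(Encode(original), Encode(replacement));
    }
}
```

Replacing the encoded form with the encoded replacement keeps the document decodable, which may matter when the test environment parses the encoded field.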

The program described here implements the replacement approach to data depersonalization. The algorithm can be extended with more sophisticated techniques, including artificial intelligence algorithms, that identify and anonymize sensitive data while keeping it realistic for testing purposes.

Download Source Code

Download on GitHub

License Information

The DataDepersonalizer application is distributed under the GNU LESSER GENERAL PUBLIC LICENSE Version 3.

COPYING.txt

COPYING.LESSER.txt

Sergey Shirokov
Clever Components team
www.clevercomponents.com