Depersonalizing Your Test Data
GDPR and Data Depersonalization

Under the European Union's General Data Protection Regulation (GDPR), companies are obliged to prevent the accidental loss or distribution of, and unauthorized access to, customers' personal data. Many companies collect and process huge quantities of customer information, and much of it may be personal and sensitive.

Which data should be depersonalized? Every data anonymization approach seeks to conceal identity, that is, any kind of identifier. Identifiers can relate to any person, living or dead, including their dependents, ascendants, and descendants. Family names, patronymics, first names, maiden names, postal and email addresses, phone numbers, social security numbers, credit card and bank account numbers, IP addresses, and similar values must be removed or replaced before the data is used by developers, testers, or any third party during the application development life cycle.

There are different methods for depersonalizing or anonymizing data: replacement, scrambling, masking, blurring, encryption, and so on. Some of these methods are reversible in certain cases; others may break the integrity of structured data. Many tools on the market, both paid and open source, do this work successfully. This article provides a simple replacement algorithm that can depersonalize both structured and unstructured data, including XML, email messages, and SQL scripts.

Data Depersonalization in C#

The program can load and process multiple data files. The input may be XML, email messages, or SQL scripts with Insert/Update statements. For XML, the algorithm looks for specific nodes, such as <Phone> and <FirstName>, composes new values, and finally replaces the nodes, together with their values, within the XML source.
Regular expressions are used to extract the required XML nodes.
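The node extraction and replacement described above can be sketched roughly as follows. This is a minimal illustration, not the code from the DataDepersonalizer program: the class name `XmlNodeDepersonalizer`, the list of sensitive node names, and the `NodeName_N` replacement format are all assumptions, and the pattern only handles simple elements without attributes or nested children.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class XmlNodeDepersonalizer
{
    // Hypothetical list of node names treated as sensitive.
    private static readonly string[] SensitiveNodes = { "Phone", "FirstName" };

    // Builds a pattern that matches <Name>value</Name> for a simple element
    // (no attributes, no nested elements) and captures the value.
    private static Regex BuildNodePattern(string nodeName)
    {
        string name = Regex.Escape(nodeName);
        return new Regex("<" + name + @">(?<value>[^<]*)</" + name + ">");
    }

    // Replaces every sensitive node, together with its value, in the XML text.
    public static string Depersonalize(string xml)
    {
        foreach (string nodeName in SensitiveNodes)
        {
            // Remembers the replacement chosen for each original value, so
            // repeated values are always replaced with the same new value.
            var replacements = new Dictionary<string, string>();

            xml = BuildNodePattern(nodeName).Replace(xml, match =>
            {
                string original = match.Groups["value"].Value;
                if (!replacements.TryGetValue(original, out string replacement))
                {
                    // Assumed format: node name plus a running counter.
                    replacement = nodeName + "_" + (replacements.Count + 1);
                    replacements[original] = replacement;
                }
                return "<" + nodeName + ">" + replacement + "</" + nodeName + ">";
            });
        }
        return xml;
    }
}
```

With this sketch, `XmlNodeDepersonalizer.Depersonalize("<FirstName>John</FirstName><Phone>555-1234</Phone>")` returns `<FirstName>FirstName_1</FirstName><Phone>Phone_1</Phone>`. Working on the raw text with regular expressions, rather than parsing the document, leaves the rest of the source untouched, at the cost of not handling attributes or nested markup.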
A similar approach can be used to extract and replace email addresses: regular expressions locate the addresses within the document. IP addresses are extracted with a separate match pattern, the ExtractIpAddresses pattern.

The attached C# program implements a simple algorithm for generating the replacement data. It uses special templates, made up of constant and variable parts, that are applied to the sensitive data being replaced.
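As a rough illustration of the extraction patterns and templates just described, the sketch below locates email and IPv4 addresses and substitutes values built from a template. The names `TextDepersonalizer`, `EmailTemplate`, and `IpTemplate`, the two regular expressions, and the template formats are assumptions made for this example; they are not taken from the attached program. The email pattern is deliberately simplified, and the IP pattern does not validate octet ranges.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TextDepersonalizer
{
    // Simplified email pattern (an assumption, not a full RFC 5322 matcher).
    private static readonly Regex EmailPattern =
        new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}");

    // Dotted-quad IPv4 pattern; octet values are not range-checked.
    private static readonly Regex IpPattern =
        new Regex(@"\b(?:\d{1,3}\.){3}\d{1,3}\b");

    // Hypothetical templates: the literal text is the constant part and
    // {0} is the variable part, filled with a running counter.
    // Note: the IP template yields invalid addresses past 255 distinct values.
    private const string EmailTemplate = "user{0}@example.com";
    private const string IpTemplate = "10.0.0.{0}";

    // Replaces email addresses first, then IP addresses.
    public static string Depersonalize(string text)
    {
        text = ReplaceAll(text, EmailPattern, EmailTemplate);
        return ReplaceAll(text, IpPattern, IpTemplate);
    }

    private static string ReplaceAll(string text, Regex pattern, string template)
    {
        // Case-insensitive map: the same original value always receives
        // the same generated value, which keeps cross-references intact.
        var map = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        return pattern.Replace(text, match =>
        {
            if (!map.TryGetValue(match.Value, out string replacement))
            {
                replacement = string.Format(template, map.Count + 1);
                map[match.Value] = replacement;
            }
            return replacement;
        });
    }
}
```

For example, `TextDepersonalizer.Depersonalize("Mail john@acme.com from 192.168.1.20")` returns `Mail user1@example.com from 10.0.0.1`. The `example.com` domain and the `10.0.0.x` range are used here because they cannot collide with real customer data.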
This approach preserves data integrity, because every occurrence of a given value is replaced with the same generated value. Additionally, there is an algorithm that can replace Base64-encoded sensitive data.

The program implements the replacement approach to data depersonalization. The algorithm can be extended with more sophisticated techniques, including artificial intelligence algorithms, that identify and anonymize sensitive data and also make it more realistic for testing purposes.
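The article does not show how the Base64 case mentioned above is handled. One plausible reading is sketched below: find candidate Base64 runs, decode them as UTF-8, pass the decoded text through a depersonalizing function, and re-encode the result. The class name `Base64Depersonalizer`, the minimum run length of 16 characters, and the delegate parameter are assumptions for illustration. The sketch also assumes each Base64 run sits on a single line, which is not true of line-wrapped email bodies.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class Base64Depersonalizer
{
    // Candidate Base64 runs: at least 16 alphabet characters in groups of
    // four, with optional padding. The length threshold is an assumption
    // chosen to avoid matching ordinary short words.
    private static readonly Regex Base64Pattern = new Regex(
        @"(?<![A-Za-z0-9+/=])(?:[A-Za-z0-9+/]{4}){4,}" +
        @"(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?(?![A-Za-z0-9+/=])");

    // Strict decoder: throws on byte sequences that are not valid UTF-8.
    private static readonly UTF8Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // depersonalizeText is any function that cleans plain text, for example
    // a regex-based replacer for names, emails, or phone numbers.
    public static string Depersonalize(string text, Func<string, string> depersonalizeText)
    {
        return Base64Pattern.Replace(text, match =>
        {
            try
            {
                byte[] bytes = Convert.FromBase64String(match.Value);
                string decoded = StrictUtf8.GetString(bytes);
                string cleaned = depersonalizeText(decoded);

                // Leave the run untouched when nothing sensitive was found.
                return cleaned == decoded
                    ? match.Value
                    : Convert.ToBase64String(StrictUtf8.GetBytes(cleaned));
            }
            catch (FormatException)
            {
                return match.Value; // Not valid Base64 after all.
            }
            catch (ArgumentException)
            {
                return match.Value; // Decoded bytes are not valid UTF-8 text.
            }
        });
    }
}
```

Runs that fail to decode, or that decode to text with nothing to replace, are returned unchanged, so binary attachments and identifiers that merely look like Base64 are not corrupted.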
Download Source Code
License Information

The DataDepersonalizer application is distributed under the GNU Lesser General Public License, version 3.
Sergey Shirokov