Generally, anonymization means converting personal data into anonymized data by applying various anonymization techniques. The EU General Data Protection Regulation (GDPR) lays down rules on the protection of natural persons with regard to the processing of personal data. All companies that process personal data of individuals in the EU have to follow those rules.
Anonymization terms and properties
Pseudonymization – the GDPR defines pseudonymization as processing personal data in such a way that it can no longer be attributed to a specific data subject without the use of additional information. The original data can therefore still be re-identified with that additional information, which must be kept separately.
Reversibility – whether the original value can be recovered, or at least how difficult recovery is. This differs from pseudonymization because here the attacker is assumed not to have the (secret) additional information. In other words: whether the algorithm can be broken.
Repeatability – the same value is always anonymized to the same anonymized value, i.e. rerunning the anonymization always produces the same result, and a value that occurs in several tables is replaced by the same anonymized value in each of them. This is usually very important when anonymizing IDs.
Uniqueness – two different original values are anonymized into two different anonymized values. In other words, such an anonymization is an injective (one-to-one) function, i.e. a bijection between the original values and their anonymized counterparts. This is quite important when anonymizing IDs, for example.
Preserve data type – anonymized values keep their data type. For example, an integer cannot be anonymized to a string, and an anonymized timestamp must again be a valid timestamp.
Preserve length – anonymized values have the same length as the originals, or at least stay within the maximum length of the given data type. This is very important when anonymizing data for testing purposes.
Salt – an arbitrary additional string or value added to the data, usually before computing a checksum (hash). It is needed, for example, when hashing a short or otherwise bounded value such as a phone number. Anyone who knows that a hash was produced with SHA-256 from a phone number can use SHA-256 to build a translation table of all possible values. But when a salt, such as the string "M8SKC7WOFP975WUS", is added to the phone number, the SHA-256 checksum cannot practically be reverted without knowing the salt.
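The effect of a salt can be illustrated with a minimal Python sketch. It assumes a deliberately tiny value space (hypothetical 4-digit extensions instead of full phone numbers) and simply prepends the salt before hashing; the function names are made up for this illustration.

```python
import hashlib

SALT = "M8SKC7WOFP975WUS"  # example salt from the text; a real salt must stay secret


def plain_hash(value: str) -> str:
    """Unsalted SHA-256 checksum of a value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def salted_hash(value: str, salt: str = SALT) -> str:
    """SHA-256 checksum of the salt followed by the value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Translation table an attacker can precompute for a bounded value space.
# Assumption: the values are 4-digit strings "0000".."9999".
candidates = [f"{n:04d}" for n in range(10_000)]
lookup = {plain_hash(c): c for c in candidates}

leaked_plain = plain_hash("4711")
leaked_salted = salted_hash("4711")

print(lookup.get(leaked_plain))   # 4711 (unsalted hash is found in the table)
print(lookup.get(leaked_salted))  # None (salted hash is not in the table)
```

Both functions are repeatable: the same input and the same salt always give the same checksum. In practice a keyed construction such as HMAC (Python's `hmac` module) is a common alternative to plain concatenation.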
Choosing the appropriate anonymization approach and techniques depends heavily on the purpose and type of processing of personal data, e.g. running production systems, archiving, or providing data to partners or development teams. The responsibility lies with the data controller (the natural or legal person, public authority, agency ...) who determines the purposes and means of the processing of personal data.
Anonymization Techniques
Scrambling – a permutation of the letters within a value. Quite often it is possible to recover the original data.
Shuffling – permutes the values within a whole column. Example of shuffling the ID column:
id | name | | id | name
---|---|---|---|---
2 | Pierre Richard | | 5 | Pierre Richard
3 | Richard Matthew Stallman | → | 2 | Richard Matthew Stallman
5 | Donald E. Knuth | | 3 | Donald E. Knuth
In general, this technique is not repeatable, but it is a bijection.
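Shuffling a single column can be sketched in a few lines of Python. This is only an illustration: the rows are modelled as a list of dictionaries, and `shuffle_column` is a hypothetical helper name, not part of any library.

```python
import random


def shuffle_column(rows, column, rng=None):
    """Return new rows in which the values of one column are permuted."""
    if rng is None:
        rng = random.Random()
    values = [row[column] for row in rows]
    rng.shuffle(values)  # in-place permutation of the column values
    # Rebuild each row with its shuffled value; the other columns stay untouched.
    return [{**row, column: value} for row, value in zip(rows, values)]


rows = [
    {"id": 2, "name": "Pierre Richard"},
    {"id": 3, "name": "Richard Matthew Stallman"},
    {"id": 5, "name": "Donald E. Knuth"},
]

shuffled = shuffle_column(rows, "id")
for row in shuffled:
    print(row["id"], row["name"])
```

Because the values are only permuted, the set of IDs is unchanged and no two rows end up with the same ID. A random permutation may leave some values in place, and a second run will usually give a different result unless a seeded generator is passed in as `rng`.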
Randomization – simply replaces the original value with a random one.
It is clear that randomization is neither repeatable nor a one-to-one (bijective) function.
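A minimal Python sketch of randomization, assuming integer IDs are replaced by random integers from a chosen range (the function name and the range limits are illustrative only):

```python
import random


def randomize_ids(ids, low=1, high=999_999, rng=None):
    """Replace every ID with an independent random integer in [low, high]."""
    if rng is None:
        rng = random.Random()
    return [rng.randint(low, high) for _ in ids]


ids = [2, 3, 5]
first_run = randomize_ids(ids)
second_run = randomize_ids(ids)
print(first_run, second_run)  # two runs will usually differ: not repeatable

# Ten values squeezed into the range 1..3 must collide (pigeonhole principle),
# so distinct originals can receive the same anonymized value.
crowded = randomize_ids(range(10), low=1, high=3)
print(len(set(crowded)) < len(crowded))  # True
```

The second part shows why uniqueness is lost: nothing prevents two different originals from drawing the same random value, and with a small target range collisions are guaranteed.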
Encryption – uses a key to encrypt the original value. Depending on whether the data should be decryptable later, the key must either be kept secret or can be deleted immediately.
Masking – hides an important or unique part of the data behind random characters or other data. For example, a credit card number can be masked so that only a few of its digits remain visible.
The advantage of masking is that records can still be identified without exposing the actual identities.
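Masking a credit card number might look like the following Python sketch. It assumes that only the last four digits stay visible and that a fixed mask character is used instead of random characters; `mask_card_number` is a hypothetical helper, and the sample value is a widely used test card number.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "X") -> str:
    """Mask all digits except the last `visible` ones; keep separators as they are."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)  # separators and the trailing visible digits
    return "".join(masked)


print(mask_card_number("4111 1111 1111 1111"))  # XXXX XXXX XXXX 1111
```

With a fixed mask character the result is repeatable and keeps the original length, but it is not unique: every card number ending in the same four digits produces the same masked value.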
Tokenization – replaces sensitive data with non-sensitive substitutes, referred to as tokens, whose relation to the original values is usually stored in a secret mapping table. Tokenization keeps the data type and usually also the length of the data, so the result can be processed by legacy systems that are sensitive to data length and type. That is achieved by keeping specific data fully or partially visible for processing and analytics while sensitive information is kept hidden.
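The idea of a secret mapping table can be sketched in Python as follows. This is an assumption-laden illustration, not a production design: the `Tokenizer` class is hypothetical, tokens are built by replacing every digit with a random digit, and the mapping lives only in memory.

```python
import secrets

DIGITS = "0123456789"


class Tokenizer:
    """In-memory token table; a real one would be stored securely and separately."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value: str, max_attempts: int = 1000) -> str:
        # Repeatable: a value that was seen before gets its existing token.
        if value in self._token_by_value:
            return self._token_by_value[value]
        for _ in range(max_attempts):
            # Replace digits with random digits; keep separators and length.
            token = "".join(
                secrets.choice(DIGITS) if ch.isdigit() else ch for ch in value
            )
            # Unique: never reuse a token, never return the value itself.
            if token != value and token not in self._value_by_token:
                self._token_by_value[value] = token
                self._value_by_token[token] = value
                return token
        raise RuntimeError("could not find a free token")

    def detokenize(self, token: str) -> str:
        """Look up the original value; only possible with the token table."""
        return self._value_by_token[token]


tokenizer = Tokenizer()
token = tokenizer.tokenize("4111-1111-1111-1111")
print(token)
print(tokenizer.detokenize(token))  # 4111-1111-1111-1111
```

Whoever holds the token table can reverse the tokens, which is why the table is the "additional information" that has to be kept secret.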
Table with overview of anonymization techniques
technique | revertible | uniqueness | pseudonymization | repeatable | preserve data type | preserve length
---|---|---|---|---|---|---
scrambling | often yes | no | yes | depends on algorithm | yes | yes
shuffling | no | yes | no | depends on algorithm | yes | yes
randomization | no | no | no | no | yes | yes
encryption | no *) | no | yes, but that’s the purpose here | yes | no | no
masking | no | no | no | yes | mostly yes | yes
checksum | often yes | almost yes | yes | yes | no | no
salted checksum | no *) | almost yes | no | yes | no | no
tokenization | no *) | depends on algorithm | yes | yes | yes | yes
data type preserving anonymization | no *) | almost yes | no | yes | yes | yes
data type preserving unique anonymization | no *) | yes | no | yes | yes | yes
*) Additional information, such as a token table, an encryption key or a salt, must be kept secret somewhere, either permanently or temporarily.