Anonymization Terms and Techniques

Jan Štěpnička

In general, anonymization means converting personal data into anonymized data by means of various anonymization techniques. The EU General Data Protection Regulation (GDPR) lays down clear rules for the protection of natural persons with regard to the processing of personal data. All companies that process the personal data of people in the EU have to respect those rules.

Anonymization terms and properties

Pseudonymization – GDPR defines pseudonymization as processing personal data in such a way that it can no longer be attributed to a specific data subject without the use of additional information. The original data can therefore still be re-identified with the help of that additional information, which must be kept separately.

Reversibility – whether the original value can be recovered from the anonymized one, or at least how difficult that is. This differs from pseudonymization, because here we assume the attacker does not have the (secret) additional information. In other words: whether the algorithm can be broken.

Repeatability – the same value is always anonymized to the same anonymized value. That is, rerunning the anonymization always produces the same result, and a value that occurs in several tables is replaced by the same anonymized value in each of them. This is usually very important when anonymizing IDs.

Uniqueness – two different original values are always anonymized into two different anonymized values. In other words, such an anonymization is a one-to-one function, i.e. a bijection between the original values and their anonymized counterparts. This is quite important when anonymizing IDs, for example.

Preserve data type – anonymized values keep the data type of the original. For example, an integer cannot be anonymized to a string, and an anonymized timestamp must again be a valid timestamp.

Preserve length – anonymized values have the same length as the originals, or at least stay within the maximum length of the given data type. This is very important when anonymizing data for testing purposes.

Salt – an arbitrary additional string or value added to the given data, usually before computing a checksum. It is necessary, for example, when hashing a short or otherwise bounded value such as a phone number. Someone who knows that a hash was produced by SHA-256 from a phone number can compute SHA-256 for all possible phone numbers and build a translation table. But when a salt, such as the string "M8SKC7WOFP975WUS", is added to the phone number, the SHA-256 checksum cannot be reverted this way without knowing the salt.
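The translation-table attack and the effect of a salt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `checksum` and `build_lookup`, the phone number format and the tiny candidate range are hypothetical, and the salt is simply prepended to the value.

```python
import hashlib

SALT = "M8SKC7WOFP975WUS"  # example salt from the text; a real salt must stay secret


def checksum(value: str, salt: str = "") -> str:
    """Return the SHA-256 hex digest of the salt followed by the value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def build_lookup(candidates):
    """Build a translation table from unsalted hash back to the original value."""
    return {checksum(c): c for c in candidates}


# Hypothetical, tiny number range; an attacker would enumerate all possible numbers.
candidates = [f"555-01{n:02d}" for n in range(100)]
lookup = build_lookup(candidates)

phone = "555-0142"
unsalted = checksum(phone)
salted = checksum(phone, SALT)

print(lookup.get(unsalted))  # the unsalted hash is found in the table
print(lookup.get(salted))    # the salted hash is not found
```

Because the salt is fixed, the salted checksum is still repeatable: the same phone number always yields the same hash.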

The choice of an appropriate anonymization approach and techniques depends heavily on the purpose and type of processing of the personal data, e.g. running production systems, archiving, or providing data to partners or development teams. The responsibility lies with the data controller (the natural or legal person, public authority, agency ...) who determines the purposes and means of the processing of personal data.

Anonymization Techniques

Scrambling – permutes the letters of a value. Quite often, however, the original data can be recovered. Example:

Peter Sellers ---> Teepr Resells
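A minimal sketch of scrambling, assuming the letters are permuted within each word. The function name `scramble` and its `seed` parameter are hypothetical. Unlike the example above, this sketch does not restore the capitalization of the first letter.

```python
import random


def scramble(text: str, seed: int) -> str:
    """Permute the letters inside each word, using a seeded generator."""
    rng = random.Random(seed)
    words = []
    for word in text.split(" "):
        letters = list(word)
        rng.shuffle(letters)
        words.append("".join(letters))
    return " ".join(words)


original = "Peter Sellers"
scrambled = scramble(original, seed=2024)
print(scrambled)
```

With a fixed seed the result is repeatable for the same input. Every letter of the original is still present, which is one reason why scrambled data can often be reverted.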

Shuffling – permutes values within a whole column. Example of shuffling ID:

id	name		id	name
2	Pierre Richard		5	Pierre Richard
3	Richard Matthew Stallman	→	2	Richard Matthew Stallman
5	Donald E. Knuth		3	Donald E. Knuth

In general this technique is not repeatable, but it is a bijection: every original value appears exactly once in the shuffled column.
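Shuffling a single column might look like the sketch below, which uses the example table as a list of dictionaries. The function name `shuffle_column` and its `rng` parameter are assumptions made for illustration.

```python
import random

rows = [
    {"id": 2, "name": "Pierre Richard"},
    {"id": 3, "name": "Richard Matthew Stallman"},
    {"id": 5, "name": "Donald E. Knuth"},
]


def shuffle_column(rows, column, rng=None):
    """Return copies of the rows with the values of one column permuted."""
    if rng is None:
        rng = random.Random()
    values = [row[column] for row in rows]
    rng.shuffle(values)
    # Build new dictionaries so the input rows are left untouched.
    return [{**row, column: value} for row, value in zip(rows, values)]


shuffled = shuffle_column(rows, "id")
for row in shuffled:
    print(row["id"], row["name"])
```

A random permutation can leave some or even all values in their original position, so a production implementation may want to check for that.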

Randomization – simply replaces the original value with a random one. Example:

1st run: Michael Raynolds ---> 5dtZ4twxx7896avkf78ad+0p
2nd run: Michael Raynolds ---> 6shk8t9we6fgos7rthj98d

Clearly, randomization is neither repeatable nor a bijection.
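A minimal sketch of randomization. It assumes that the replacement should have the same length as the original and consist of letters and digits only. The function name `randomize` and the alphabet are hypothetical choices.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def randomize(value: str) -> str:
    """Replace the value with random characters of the same length."""
    return "".join(secrets.choice(ALPHABET) for _ in value)


first = randomize("Michael Raynolds")
second = randomize("Michael Raynolds")
print(first)
print(second)
```

Each call draws fresh random characters, so two runs on the same input almost certainly produce different results, and nothing prevents two different inputs from receiving the same output.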

Encryption – uses a key to encrypt the original value. Depending on whether we want to be able to decrypt the data later, the key must either be kept secret or can be deleted immediately.

Masking – hides an important or unique part of the data behind random characters or other data. For example, a credit card number:

9370 4442 9037 4197 ---> **** **** **** 4197

The advantage of masking is that the data can still be recognized (here by the last four digits) without exposing the actual identity.
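The credit card example can be reproduced with a short function. This is a sketch only: the name `mask_card_number` and its parameters are hypothetical, and it assumes that all digits except the last few are masked while separators are kept.

```python
def mask_card_number(number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in number)
    to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in number:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


print(mask_card_number("9370 4442 9037 4197"))  # **** **** **** 4197
```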

Tokenization – replaces sensitive data with non-sensitive substitutes, referred to as tokens; the mapping between original values and tokens is usually stored in a secret mapping table. Tokenization keeps the data type and usually also the length of the data, so tokenized data can be processed by legacy systems that are sensitive to data length and type. This is achieved by keeping specific data fully or partially visible for processing and analytics while the sensitive information stays hidden.
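The idea of a secret mapping table can be sketched as follows. The class `TokenVault` is a hypothetical in-memory stand-in for such a table. It assumes numeric values and generates random digit tokens of the same length. A real system would persist the table securely and would also have to handle the case where the token space is exhausted.

```python
import secrets


class TokenVault:
    """In-memory stand-in for the secret mapping table."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for the value, or create a new one."""
        if value in self._token_by_value:
            return self._token_by_value[value]
        while True:
            # Random digits of the same length keep data type and length.
            token = "".join(secrets.choice("0123456789") for _ in value)
            if token not in self._value_by_token and token != value:
                break
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; requires access to the vault."""
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("9370444290374197")
print(token)
print(vault.detokenize(token))
```

Because the vault remembers every mapping, tokenizing the same value again returns the same token, and only someone with access to the vault can get back to the original.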

Table with overview of anonymization techniques

technique	reversible	uniqueness	pseudonymization	repeatable	preserve data type	preserve length
scrambling	often yes	no	yes	depends on algorithm	yes	yes
shuffling	no	yes	no	depends on algorithm	yes	yes
randomization	no	no	no	no	yes	yes
encryption	no *)	no	yes, but that’s the purpose here	yes	no	no
masking	no	no	no	yes	mostly yes	yes
checksum	often yes	almost yes	yes	yes	no	no
salted checksum	no *)	almost yes	no	yes	no	no
tokenization	no *)	depends on algorithm	yes	yes	yes	yes
data type preserving anonymization	no *)	almost yes	no	yes	yes	yes
data type preserving unique anonymization	no *)	yes	no	yes	yes	yes

*) Additional information, such as a token table, an encryption key or a salt, must be kept secret somewhere, either permanently or temporarily.
