Pseudonymization replaces identifiers with tokens while keeping a reversible mapping, so the data remains personal data under GDPR. Anonymization irreversibly transforms data so individuals cannot be re-identified, placing it outside GDPR’s scope entirely. Choose pseudonymization when you need to preserve data relationships for internal analytics; choose anonymization when sharing data publicly or with untrusted parties. This guide covers implementation code for both approaches, including tokenization, k-anonymity, and differential privacy.
What GDPR Says About These Terms
The General Data Protection Regulation (GDPR) explicitly addresses both concepts in Article 4 and Article 32. Pseudonymization is defined as processing personal data in a manner that prevents attribution to a specific data subject without additional information kept separately. Anonymization goes further—the data must be irreversibly transformed so the individual can no longer be identified.
The key difference in GDPR’s eyes: pseudonymized data remains personal data. Anonymized data falls outside GDPR’s scope entirely. This distinction has significant implications for your architecture decisions.
Pseudonymization: Practical Implementation
Pseudonymization replaces identifiable information with artificial identifiers while maintaining a mapping back to the original data. This allows data utility while reducing direct attribution risk.
Tokenization Example
A common approach is tokenization—replacing sensitive values with random tokens:
import secrets
import hashlib
class Pseudonymizer:
def __init__(self, secret_key: str):
self.secret = secret_key.encode()
self.token_map = {}
def pseudonymize(self, email: str) -> str:
if email not in self.token_map:
token = secrets.token_urlsafe(16)
self.token_map[email] = token
return self.token_map[email]
def hash_pseudonymize(self, email: str) -> str:
return hashlib.pbkdf2_hmac(
'sha256',
email.encode(),
self.secret,
100000
).hex()[:24]
This approach preserves referential integrity—you can still link records across datasets using the token while hiding the original identifier.
Database-Level Pseudonymization
For database systems, consider column-level encryption or masking:
-- PostgreSQL example using pgcrypto
CREATE TABLE users_pseudonymized (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email_hash TEXT NOT NULL,
name_encrypted BYTEA,
created_at TIMESTAMP DEFAULT NOW()
);
-- Store encrypted instead of plain data
INSERT INTO users_pseudonymized (email_hash, name_encrypted)
VALUES (
encode(pgcrypto.digest('user@example.com', 'sha256'), 'hex'),
pgp_sym_encrypt('John Doe', 'encryption_key_here')
);
The original data exists only in your secure key management system. The database contains only transformed values.
Anonymization: Achieving Irreversibility
Anonymization must make re-identification impossible—even with access to all available information. This requires irreversible transformation techniques that preserve statistical utility while eliminating attribution.
K-Anonymity Implementation
A foundational technique ensures each record blends with at least k-1 others:
import pandas as pd
import numpy as np
def generalize_age(age: int, bucket_size: int = 10) -> str:
lower = (age // bucket_size) * bucket_size
return f"{lower}-{lower + bucket_size - 1}"
def generalize_zipcode(zipcode: str, digits: int = 3) -> str:
return zipcode[:digits] + "".join(["0"] * (len(zipcode) - digits))
def apply_k_anonymity(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
df = df.copy()
df['age'] = df['age'].apply(lambda x: generalize_age(x, 10))
df['zipcode'] = df['zipcode'].apply(lambda x: generalize_zipcode(x, 3))
return df
This generalization ensures records cannot be uniquely identified by combining quasi-identifiers (age, zipcode, gender).
Differential Privacy for Aggregates
When computing statistics over datasets, differential privacy adds calibrated noise:
import numpy as np
def laplace_mechanism(true_value: float, epsilon: float) -> float:
scale = 1.0 / epsilon
noise = np.random.laplace(0, scale)
return true_value + noise
def private_count(query_func, epsilon: float = 1.0) -> float:
true_count = query_func()
return laplace_mechanism(true_count, epsilon)
This guarantees that the presence or absence of any single record cannot be inferred from the output.
Key Differences at a Glance
| Aspect | Pseudonymization | Anonymization |
|---|---|---|
| GDPR Status | Personal data | Not personal data |
| Reversibility | Reversible with key | Irreversible |
| Implementation | Tokenization, encryption | Generalization, aggregation |
| Data Utility | High—maintains relationships | Lower—statistical only |
| Use Case | Analytics with reduced risk | Public releases, research |
When to Use Each Technique
Choose pseudonymization when you need to perform analytics while reducing exposure, maintain audit trails or transaction links, or enable data processing by third parties with reduced risk.
Choose anonymization when you need to share data publicly or with untrusted parties, publish research datasets, or comply with data minimization principles where later identification is unnecessary.
Implementation Considerations
Regardless of technique, consider these practical factors:
Pseudonymization requires secure key storage. Key compromise defeats the entire protection mechanism, so use hardware security modules (HSMs) or dedicated key management services.
When pseudonymizing across multiple tables, use consistent mapping to preserve relationships. An email-to-token mapping applied consistently maintains referential integrity.
Test anonymization by attempting re-identification using available auxiliary information. If records can be re-identified, the technique is insufficient.
Document pseudonymization keys and anonymization methods. This supports accountability requirements under GDPR Article 30.
Pseudonymization provides a practical balance for most developer scenarios—operational capability is maintained while data subject risk is demonstrably reduced. Anonymization becomes necessary when data leaves a controlled environment or when true data minimization is the goal. Apply both as layers in your data processing architecture based on your use case, regulatory requirements, and acceptable risk profile.
Related Reading
- Bitwarden Vault Export Backup Guide: Complete Technical.
- GDPR Joint Controller Agreement Template: A Developer Guide
- VPN Warrant Canary: What It Means and Why It Matters
Built by theluckystrike — More at zovo.one