Pseudonymization replaces identifiers with tokens while keeping a reversible mapping, so the data remains personal data under GDPR. Anonymization irreversibly transforms data so individuals cannot be re-identified, placing it outside GDPR’s scope entirely. Choose pseudonymization when you need to preserve data relationships for internal analytics; choose anonymization when sharing data publicly or with untrusted parties. This guide covers implementation code for both approaches, including tokenization, k-anonymity, and differential privacy.

What GDPR Says About These Terms

The General Data Protection Regulation (GDPR) explicitly addresses both concepts in Article 4 and Article 32. Pseudonymization is defined as processing personal data in a manner that prevents attribution to a specific data subject without additional information kept separately. Anonymization goes further—the data must be irreversibly transformed so the individual can no longer be identified.

The key difference in GDPR’s eyes: pseudonymized data remains personal data. Anonymized data falls outside GDPR’s scope entirely. This distinction has significant implications for your architecture decisions.

Pseudonymization: Practical Implementation

Pseudonymization replaces identifiable information with artificial identifiers while maintaining a mapping back to the original data. This allows data utility while reducing direct attribution risk.

Tokenization Example

A common approach is tokenization—replacing sensitive values with random tokens:

import secrets
import hashlib

class Pseudonymizer:
    def __init__(self, secret_key: str):
        self.secret = secret_key.encode()
        self.token_map = {}
    
    def pseudonymize(self, email: str) -> str:
        if email not in self.token_map:
            token = secrets.token_urlsafe(16)
            self.token_map[email] = token
        return self.token_map[email]
    
    def hash_pseudonymize(self, email: str) -> str:
        return hashlib.pbkdf2_hmac(
            'sha256',
            email.encode(),
            self.secret,
            100000
        ).hex()[:24]

This approach preserves referential integrity—you can still link records across datasets using the token while hiding the original identifier.

Database-Level Pseudonymization

For database systems, consider column-level encryption or masking:

-- PostgreSQL example using pgcrypto
CREATE TABLE users_pseudonymized (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email_hash TEXT NOT NULL,
    name_encrypted BYTEA,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Store encrypted instead of plain data
INSERT INTO users_pseudonymized (email_hash, name_encrypted)
VALUES (
    encode(pgcrypto.digest('user@example.com', 'sha256'), 'hex'),
    pgp_sym_encrypt('John Doe', 'encryption_key_here')
);

The original data exists only in your secure key management system. The database contains only transformed values.

Anonymization: Achieving Irreversibility

Anonymization must make re-identification impossible—even with access to all available information. This requires irreversible transformation techniques that preserve statistical utility while eliminating attribution.

K-Anonymity Implementation

A foundational technique ensures each record blends with at least k-1 others:

import pandas as pd
import numpy as np

def generalize_age(age: int, bucket_size: int = 10) -> str:
    lower = (age // bucket_size) * bucket_size
    return f"{lower}-{lower + bucket_size - 1}"

def generalize_zipcode(zipcode: str, digits: int = 3) -> str:
    return zipcode[:digits] + "".join(["0"] * (len(zipcode) - digits))

def apply_k_anonymity(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    df = df.copy()
    df['age'] = df['age'].apply(lambda x: generalize_age(x, 10))
    df['zipcode'] = df['zipcode'].apply(lambda x: generalize_zipcode(x, 3))
    return df

This generalization ensures records cannot be uniquely identified by combining quasi-identifiers (age, zipcode, gender).

Differential Privacy for Aggregates

When computing statistics over datasets, differential privacy adds calibrated noise:

import numpy as np

def laplace_mechanism(true_value: float, epsilon: float) -> float:
    scale = 1.0 / epsilon
    noise = np.random.laplace(0, scale)
    return true_value + noise

def private_count(query_func, epsilon: float = 1.0) -> float:
    true_count = query_func()
    return laplace_mechanism(true_count, epsilon)

This guarantees that the presence or absence of any single record cannot be inferred from the output.

Key Differences at a Glance

Aspect Pseudonymization Anonymization
GDPR Status Personal data Not personal data
Reversibility Reversible with key Irreversible
Implementation Tokenization, encryption Generalization, aggregation
Data Utility High—maintains relationships Lower—statistical only
Use Case Analytics with reduced risk Public releases, research

When to Use Each Technique

Choose pseudonymization when you need to perform analytics while reducing exposure, maintain audit trails or transaction links, or enable data processing by third parties with reduced risk.

Choose anonymization when you need to share data publicly or with untrusted parties, publish research datasets, or comply with data minimization principles where later identification is unnecessary.

Implementation Considerations

Regardless of technique, consider these practical factors:

Pseudonymization requires secure key storage. Key compromise defeats the entire protection mechanism, so use hardware security modules (HSMs) or dedicated key management services.

When pseudonymizing across multiple tables, use consistent mapping to preserve relationships. An email-to-token mapping applied consistently maintains referential integrity.

Test anonymization by attempting re-identification using available auxiliary information. If records can be re-identified, the technique is insufficient.

Document pseudonymization keys and anonymization methods. This supports accountability requirements under GDPR Article 30.

Pseudonymization provides a practical balance for most developer scenarios—operational capability is maintained while data subject risk is demonstrably reduced. Anonymization becomes necessary when data leaves a controlled environment or when true data minimization is the goal. Apply both as layers in your data processing architecture based on your use case, regulatory requirements, and acceptable risk profile.

Built by theluckystrike — More at zovo.one