Privacy Tools Guide

Pseudonymize data using deterministic encryption (same input always produces same output) to replace PII with tokens while maintaining relational integrity across tables. Store encryption keys separately from data to prevent re-identification if the database is breached. Under GDPR, pseudonymized data still requires security protections, but you can satisfy Article 32 requirements more easily, and data retention obligations become clearer since you can delete de-pseudonymization keys to permanently erase records.
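To make the deterministic property concrete, here is a minimal sketch using a keyed hash (HMAC). The function name and key value are illustrative; in production the key would come from a KMS or key file. Note that unlike true deterministic encryption, a keyed hash cannot be reversed even with the key, but it gives the same stable-token property for joins:

```python
import hashlib
import hmac

def deterministic_token(value: str, key: bytes) -> str:
    """Same input and key always produce the same token, so joins
    across tables on the tokenized column still line up."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# The key would come from a KMS or key file, never from the database itself
key = b"demo-key-held-outside-the-database"
assert deterministic_token("alice@example.com", key) == \
       deterministic_token("alice@example.com", key)
```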

Understanding Pseudonymization Under GDPR

GDPR explicitly recognizes pseudonymization in Article 4(5) as a processing safeguard. The regulation distinguishes between pseudonymized data (still considered personal data) and truly anonymized data (no longer personal data). This distinction matters because pseudonymized data remains subject to GDPR requirements, but the Article 32 security measures become significantly easier to satisfy.

The core principle involves separating direct identifiers from the data itself. When a database breach occurs, pseudonymized information provides minimal value to attackers since the meaningful identifiers are not present.

Pseudonymization vs. Anonymization

These terms are frequently confused, and the distinction carries significant legal weight:

Property                          Pseudonymization                   Anonymization
Re-identification possible?       Yes, with key/mapping              No (irreversible)
Still personal data under GDPR?   Yes                                No
Subject to GDPR?                  Yes, but with reduced obligations  No
Useful for analytics?             Yes, with careful key management   Yes
Right to erasure compliant?       Yes, by deleting keys              Built-in

True anonymization — where re-identification is irreversible — is extremely difficult to achieve in practice because datasets can often be re-identified through combination attacks. Pseudonymization is the pragmatic middle ground that GDPR explicitly endorses.
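A toy illustration of such a combination attack, with entirely hypothetical data: two releases that look harmless in isolation can be joined on shared quasi-identifiers (ZIP code, birth date, sex) to single a person out:

```python
# A "de-identified" medical extract and a public voter roll,
# each harmless alone, joined on quasi-identifiers
medical = [("02139", "1968-07-31", "F", "condition-x")]
voters = [("Jane Doe", "02139", "1968-07-31", "F")]

reidentified = [
    (name, condition)
    for zip_, dob, sex, condition in medical
    for name, vzip, vdob, vsex in voters
    if (zip_, dob, sex) == (vzip, vdob, vsex)
]
assert reidentified == [("Jane Doe", "condition-x")]
```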

Lawful Basis Implications

Using pseudonymization can broaden what you are permitted to do with data. Recital 29 of GDPR states that applying pseudonymization to personal data can reduce the risks to the data subjects and help controllers and processors meet their data protection obligations. Practically, this means pseudonymized data is more defensible when used for secondary purposes such as internal analytics, fraud detection model training, or cross-team data sharing.

Database-Level Pseudonymization Techniques

Column-Level Encryption with Application Keys

The most straightforward approach involves encrypting sensitive columns using symmetric encryption. PostgreSQL, MySQL, and other database systems provide built-in encryption functions that work well for this purpose.

-- PostgreSQL example: encrypting an email column
-- (pgp_sym_encrypt requires the pgcrypto extension)
CREATE EXTENSION IF NOT EXISTS pgcrypto;

ALTER TABLE users
ADD COLUMN email_encrypted BYTEA;

UPDATE users
SET email_encrypted = pgp_sym_encrypt(email, current_setting('app.key'));

For application-level encryption, you maintain complete control over keys:

from cryptography.fernet import Fernet

class Pseudonymizer:
    """Encrypts and decrypts values with a symmetric key loaded from disk.

    The key file must live outside the database (and outside version control).
    """

    def __init__(self, key_path):
        with open(key_path, 'rb') as f:
            self.key = f.read()
        self.cipher = Fernet(self.key)

    def encrypt_value(self, plaintext):
        return self.cipher.encrypt(plaintext.encode())

    def decrypt_value(self, ciphertext):
        return self.cipher.decrypt(ciphertext).decode()

Tokenization Through Reference Tables

Tokenization replaces sensitive values with randomly generated tokens stored in a separate mapping table. This approach provides excellent security because the token has no mathematical relationship to the original value.

CREATE TABLE token_mapping (
    token_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    sensitive_data TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- token_id is already indexed via the primary key; an index on the
-- value itself supports reverse lookups when issuing tokens
CREATE INDEX idx_sensitive_data ON token_mapping(sensitive_data);

-- Store the token instead of the actual data; ON DELETE SET NULL lets
-- erasure requests remove the mapping row without violating the FK
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email_token UUID REFERENCES token_mapping(token_id) ON DELETE SET NULL,
    name VARCHAR(255)
);

The mapping table should receive additional security protections including encryption at rest, restricted access, and audit logging.
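Later examples in this guide call store_token and retrieve_token helpers. One possible shape for them is sketched below against SQLite purely for illustration; production code would target the PostgreSQL schema above with proper connection handling:

```python
import sqlite3
import uuid

# In-memory SQLite as a stand-in for the real token_mapping table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_mapping (token_id TEXT PRIMARY KEY, "
    "sensitive_data TEXT NOT NULL)"
)

def store_token(sensitive_value: str) -> str:
    """Insert the value into the mapping table and return a fresh token."""
    token_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO token_mapping (token_id, sensitive_data) VALUES (?, ?)",
        (token_id, sensitive_value),
    )
    return token_id

def retrieve_token(token_id: str):
    """Resolve a token back to the original value, or None if erased."""
    row = conn.execute(
        "SELECT sensitive_data FROM token_mapping WHERE token_id = ?",
        (token_id,),
    ).fetchone()
    return row[0] if row else None
```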

Hash-Based Pseudonymization

For scenarios requiring consistency (such as analytics across datasets), cryptographic hashing with per-record salts provides pseudonymization while maintaining referential integrity:

import hashlib
import secrets

def pseudonymize_with_salt(value: str, salt: str) -> str:
    """Create a consistent pseudonym using a salted hash.

    Caution: for low-entropy inputs such as emails, a salted hash can be
    dictionary-attacked if the salt leaks; prefer a keyed hash (HMAC)
    with a secret held outside the database where that risk matters.
    """
    combined = f"{value}{salt}".encode('utf-8')
    return hashlib.sha256(combined).hexdigest()

def generate_salt() -> str:
    """Generate a cryptographically random 16-byte salt (hex-encoded)."""
    return secrets.token_hex(16)

Store the salt so the pseudonym can be recomputed later, ideally under stricter access controls than the hash itself; a common pragmatic layout keeps both on the record:

ALTER TABLE users
ADD COLUMN email_pseudonym VARCHAR(64),
ADD COLUMN email_salt VARCHAR(32);

Note that hash-based pseudonymization is one-way without the salt. If you need to look up a user by their original email (for login, for example), you must either retain the salt and recompute the hash for comparison, or store the token mapping separately. Hash-based approaches work best for analytics use cases where you want to count or group by a pseudonymous identifier without ever needing to resolve it back to the original.
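For instance, a lookup that retains per-record salts and recomputes the hash might look like the sketch below. find_user_by_email is a hypothetical helper, and the linear scan is exactly why per-record salts suit analytics better than high-volume lookups:

```python
import hashlib

def pseudonymize_with_salt(value, salt):
    # Same salted-hash scheme as the function above
    return hashlib.sha256(f"{value}{salt}".encode("utf-8")).hexdigest()

def find_user_by_email(records, email):
    """records: iterable of (user_id, email_pseudonym, email_salt) rows."""
    for user_id, pseudonym, salt in records:
        # Each record has its own salt, so the hash is recomputed per row
        if pseudonymize_with_salt(email, salt) == pseudonym:
            return user_id
    return None
```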

Key Management Considerations

Effective pseudonymization relies on proper key management. Keys should never be stored alongside encrypted data. Consider these practices:

Key Hierarchy: Use master keys to encrypt key-encrypting keys (KEKs), which then encrypt data-encryption keys (DEKs). This allows key rotation without re-encrypting entire databases.

Key Rotation: Implement automated key rotation schedules. Most security frameworks recommend rotating encryption keys annually at minimum, with more frequent rotation for highly sensitive data.

Key Storage: Store keys in dedicated hardware security modules (HSMs) or key management services such as AWS KMS, Google Cloud KMS, or HashiCorp Vault. Never commit keys to version control or store them in configuration files.

Key Separation Across Environments: Use entirely separate keys in development, staging, and production environments. Production keys must never exist in development environments. This prevents accidental exposure through developer tooling and log aggregation systems.
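The key-hierarchy idea can be sketched as follows. The XOR "cipher" is a deliberately toy stand-in (use AES-GCM, Fernet, or your KMS's wrap/unwrap operations in practice); only the structure of wrapping and rotation matters here:

```python
import secrets

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy stand-in for a real cipher; NOT secure, structure only."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# The data-encryption key (DEK) encrypts the actual column values
dek = secrets.token_bytes(32)
ciphertext = xor_cipher(b"alice@example.com", dek)

# The key-encryption key (KEK) wraps the DEK; only the wrapped DEK is stored
kek = secrets.token_bytes(32)
wrapped_dek = xor_cipher(dek, kek)

# Rotating the KEK means re-wrapping the DEK, not re-encrypting the data
new_kek = secrets.token_bytes(32)
wrapped_dek = xor_cipher(xor_cipher(wrapped_dek, kek), new_kek)

# Decryption path: unwrap the DEK with the current KEK, then decrypt
recovered_dek = xor_cipher(wrapped_dek, new_kek)
assert xor_cipher(ciphertext, recovered_dek) == b"alice@example.com"
```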

Implementation Patterns

On-Insert Pseudonymization

Handle pseudonymization at the application layer during data insertion:

def create_user(db_connection, email, name):
    # Keep the real email only in the mapping table; store_token
    # inserts it there and returns the token UUID
    token_id = store_token(email)

    # Insert the pseudonymized record
    cursor = db_connection.cursor()
    cursor.execute(
        "INSERT INTO users (email_token, name) VALUES (%s, %s)",
        (token_id, name)
    )
    db_connection.commit()

Batch Pseudonymization for Existing Data

When pseudonymizing existing databases, use transactional updates:

def pseudonymize_existing_users(db_connection):
    pseudonymizer = get_pseudonymizer()

    # Use separate cursors for reading and writing: re-executing on the
    # read cursor would discard its pending result set
    read_cursor = db_connection.cursor()
    write_cursor = db_connection.cursor()
    read_cursor.execute("SELECT id, email FROM users WHERE email_token IS NULL")

    batch_size = 1000
    while True:
        rows = read_cursor.fetchmany(batch_size)
        if not rows:
            break

        for user_id, email in rows:
            token_id = pseudonymizer.create_token(email)
            write_cursor.execute(
                "UPDATE users SET email_token = %s WHERE id = %s",
                (token_id, user_id)
            )

        db_connection.commit()
        print(f"Processed {len(rows)} records")

Run batch jobs during low-traffic windows and monitor for lock contention on large tables. On PostgreSQL, consider using SELECT ... FOR UPDATE SKIP LOCKED to safely parallelize the batch job across multiple workers.

Handling the Right to Erasure

GDPR Article 17 grants data subjects the right to request erasure of their personal data. Pseudonymization makes this significantly easier to implement technically: delete the mapping entry (or the encryption key) and the pseudonymized data in your main tables becomes effectively unresolvable.

For tokenization implementations:

-- Erase a user's personal data while retaining their records.
-- :token_id is the user's email_token, fetched in a prior SELECT.
-- Clear the reference first so the foreign key allows the delete.
BEGIN;

UPDATE users SET email_token = NULL WHERE id = :user_id;

DELETE FROM token_mapping WHERE token_id = :token_id;

COMMIT;

-- The users row remains, but its token no longer resolves to anything;
-- re-identification is no longer possible

Document this erasure pattern in your Records of Processing Activities (RoPA) required under GDPR Article 30. Data protection authorities expect to see a clear procedure for handling erasure requests, and a pseudonymization-based approach is straightforward to describe and audit.

Testing Your Implementation

Verify pseudonymization effectiveness through these validation steps:

Data Integrity: Confirm that original values can be recovered when using the correct key:

def verify_pseudonymization(db_connection, user_id):
    cursor = db_connection.cursor()
    cursor.execute("SELECT email_token FROM users WHERE id = %s", (user_id,))
    row = cursor.fetchone()
    if row is None or row[0] is None:
        return False

    # retrieve_token resolves the UUID through the mapping table
    original_email = retrieve_token(row[0])
    return original_email is not None

Security Testing: Attempt re-identification using compromised credentials or database access to ensure pseudonymized values remain protected. Specifically, test what an attacker who has read access to the main users table but not the token_mapping table can learn. They should see only UUIDs with no path to the original PII.

Audit Logging Verification: Confirm that access to the token mapping table is logged. Any query against the mapping table represents a de-pseudonymization event and should appear in your audit trail for later review.
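At the application layer, a thin wrapper can guarantee a log entry is written on every resolution. retrieve_token and the in-memory mapping below are hypothetical stand-ins for the real database helpers:

```python
import logging

audit_log = logging.getLogger("depseudonymization")

# Stand-in for the token_mapping table
_mapping = {"tok-1": "alice@example.com"}

def retrieve_token(token_id):
    return _mapping.get(token_id)

def audited_retrieve(token_id, requester: str):
    """Every call is a de-pseudonymization event: log who resolved what."""
    audit_log.info("de-pseudonymization token=%s by=%s", token_id, requester)
    return retrieve_token(token_id)
```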
