Pseudonymize data using deterministic encryption (same input always produces same output) to replace PII with tokens while maintaining relational integrity across tables. Store encryption keys separately from data to prevent re-identification if the database is breached. Under GDPR, pseudonymized data still requires security protections, but you can satisfy Article 32 requirements more easily, and data retention obligations become clearer since you can delete de-pseudonymization keys to permanently erase records.
Understanding Pseudonymization Under GDPR
GDPR explicitly recognizes pseudonymization in Article 4(5) as a processing safeguard. The regulation distinguishes between pseudonymized data (still considered personal data) and truly anonymized data (no longer personal data). This distinction matters because pseudonymized data remains subject to GDPR requirements, but the Article 32 security measures become significantly easier to satisfy.
The core principle involves separating direct identifiers from the data itself. When a database breach occurs, pseudonymized information provides minimal value to attackers since the meaningful identifiers are not present.
Pseudonymization vs. Anonymization
These terms are frequently confused, and the distinction carries significant legal weight:
| Property | Pseudonymization | Anonymization |
|---|---|---|
| Re-identification possible? | Yes, with key/mapping | No (irreversible) |
| Still personal data under GDPR? | Yes | No |
| Subject to GDPR? | Yes, but with reduced obligations | No |
| Useful for analytics? | Yes, with careful key management | Yes |
| Right to erasure compliant? | Yes, by deleting keys | Built-in |
True anonymization — where re-identification is irreversible — is extremely difficult to achieve in practice because datasets can often be re-identified through combination attacks. Pseudonymization is the pragmatic middle ground that GDPR explicitly endorses.
Lawful Basis Implications
Using pseudonymization can broaden what you are permitted to do with data. Recital 29 of GDPR states that applying pseudonymization to personal data can reduce the risks to the data subjects and help controllers and processors meet their data protection obligations. Practically, this means pseudonymized data is more defensible when used for secondary purposes such as internal analytics, fraud detection model training, or cross-team data sharing.
Database-Level Pseudonymization Techniques
Column-Level Encryption with Application Keys
The most straightforward approach involves encrypting sensitive columns using symmetric encryption. PostgreSQL, MySQL, and other database systems provide built-in encryption functions that work well for this purpose.
-- PostgreSQL example: Encrypting email column
ALTER TABLE users
ADD COLUMN email_encrypted BYTEA;
UPDATE users
SET email_encrypted = pgp_sym_encrypt(email, current_setting('app.key'));
For application-level encryption, you maintain complete control over keys:
import psycopg2
from cryptography.fernet import Fernet
class Pseudonymizer:
def __init__(self, key_path):
with open(key_path, 'rb') as f:
self.key = f.read()
self.cipher = Fernet(self.key)
def encrypt_value(self, plaintext):
return self.cipher.encrypt(plaintext.encode())
def decrypt_value(self, ciphertext):
return self.cipher.decrypt(ciphertext).decode()
Tokenization Through Reference Tables
Tokenization replaces sensitive values with randomly generated tokens stored in a separate mapping table. This approach provides excellent security because the token has no mathematical relationship to the original value.
CREATE TABLE token_mapping (
token_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
sensitive_data TEXT NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_token_id ON token_mapping(token_id);
-- Store token instead of actual data
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email_token UUID REFERENCES token_mapping(token_id),
name VARCHAR(255)
);
The mapping table should receive additional security protections including encryption at rest, restricted access, and audit logging.
Hash-Based Pseudonymization
For scenarios requiring consistency (such as analytics across datasets), cryptographic hashing with per-record salts provides pseudonymization while maintaining referential integrity:
import hashlib
import secrets
def pseudonymize_with_salt(value, salt):
"""Create consistent pseudonym using salted hash."""
combined = f"{value}{salt}".encode('utf-8')
return hashlib.sha256(combined).hexdigest()
def generate_salt():
"""Generate cryptographically random salt."""
return secrets.token_hex(16)
Store the salt alongside the hash for future re-identification:
ALTER TABLE users
ADD COLUMN email_pseudonym VARCHAR(64),
ADD COLUMN email_salt VARCHAR(32);
Note that hash-based pseudonymization is one-way without the salt. If you need to look up a user by their original email (for login, for example), you must either retain the salt and recompute the hash for comparison, or store the token mapping separately. Hash-based approaches work best for analytics use cases where you want to count or group by a pseudonymous identifier without ever needing to resolve it back to the original.
Key Management Considerations
Effective pseudonymization relies on proper key management. Keys should never be stored alongside encrypted data. Consider these practices:
Key Hierarchy: Use master keys to encrypt key-encrypting keys (KEKs), which then encrypt data-encryption keys (DEKs). This allows key rotation without re-encrypting entire databases.
Key Rotation: Implement automated key rotation schedules. Most security frameworks recommend rotating encryption keys annually at minimum, with more frequent rotation for highly sensitive data.
Key Storage: Store keys in dedicated hardware security modules (HSMs) or key management services such as AWS KMS, Google Cloud KMS, or HashiCorp Vault. Never commit keys to version control or store them in configuration files.
Key Separation Across Environments: Use entirely separate keys in development, staging, and production environments. Production keys must never exist in development environments. This prevents accidental exposure through developer tooling and log aggregation systems.
Implementation Patterns
On-Insert Pseudonymization
Handle pseudonymization at the application layer during data insertion:
def create_user(db_connection, email, name):
pseudonymizer = get_pseudonymizer()
# Store original in token mapping
token_id = store_token(email)
# Insert pseudonymized record
cursor = db_connection.cursor()
cursor.execute(
"INSERT INTO users (email_token, name) VALUES (%s, %s)",
(token_id, name)
)
db_connection.commit()
Batch Pseudonymization for Existing Data
When pseudonymizing existing databases, use transactional updates:
def pseudonymize_existing_users(db_connection):
pseudonymizer = get_pseudonymizer()
cursor = db_connection.cursor()
cursor.execute("SELECT id, email FROM users WHERE email_token IS NULL")
batch_size = 1000
while True:
rows = cursor.fetchmany(batch_size)
if not rows:
break
for user_id, email in rows:
token_id = pseudonymizer.create_token(email)
cursor.execute(
"UPDATE users SET email_token = %s WHERE id = %s",
(token_id, user_id)
)
db_connection.commit()
print(f"Processed {len(rows)} records")
Run batch jobs during low-traffic windows and monitor for lock contention on large tables. On PostgreSQL, consider using SELECT ... FOR UPDATE SKIP LOCKED to safely parallelize the batch job across multiple workers.
Handling the Right to Erasure
GDPR Article 17 grants data subjects the right to request erasure of their personal data. Pseudonymization makes this significantly easier to implement technically: delete the mapping entry (or the encryption key) and the pseudonymized data in your main tables becomes effectively unresolvable.
For tokenization implementations:
-- Erase a user's personal data while retaining their records
DELETE FROM token_mapping WHERE token_id = (
SELECT email_token FROM users WHERE id = :user_id
);
-- The users row remains; email_token now references a deleted mapping
-- No re-identification is possible
Document this erasure pattern in your Records of Processing Activities (RoPA) required under GDPR Article 30. Data protection authorities expect to see a clear procedure for handling erasure requests, and a pseudonymization-based approach is straightforward to describe and audit.
Testing Your Implementation
Verify pseudonymization effectiveness through these validation steps:
Data Integrity: Confirm that original values can be recovered when using the correct key:
def verify_pseudonymization(user_id):
cursor.execute("SELECT email_token FROM users WHERE id = %s", (user_id,))
token_id = cursor.fetchone()[0]
original_email = retrieve_token(token_id)
return original_email is not None
Security Testing: Attempt re-identification using compromised credentials or database access to ensure pseudonymized values remain protected. Specifically, test what an attacker who has read access to the main users table but not the token_mapping table can learn. They should see only UUIDs with no path to the original PII.
Audit Logging Verification: Confirm that access to the token mapping table is logged. Any query against the mapping table represents a de-pseudonymization event and should appear in your audit trail for later review.
Related Articles
- Gdpr Pseudonymization Vs Anonymization Explained
- Gdpr Penalties Fines Database Case Examples
- Implement Data Portability Feature For Customers Gdpr Right
- How To Anonymize User Data In Production Database For Privac
- Macos Privacy Permissions Explained Which Tcc Database.
Built by theluckystrike — More at zovo.one