The Staging Data Problem
Developers need realistic data to test effectively. The temptation is to copy production — but that puts real PII on development machines, often with weaker security.
Anonymizing with pgcomply
Preview First (Dry Run)
SELECT pgcomply.anonymize(
'users', 'email', 'email',
seed_column := 'user_id',
dry_run := true
);
Shows what would change without modifying data:
 user_id  | email (before)    | email (after)
----------+-------------------+---------------------------------
 user-001 | alice@example.com | user_a3f1b2c4@anonymous.invalid
 user-002 | bob@company.de    | user_e5d6f7a8@anonymous.invalid
Execute Anonymization
SELECT pgcomply.anonymize('users', 'email', 'email', seed_column := 'user_id', dry_run := false);
SELECT pgcomply.anonymize('users', 'name', 'person_name', seed_column := 'user_id', dry_run := false);
SELECT pgcomply.anonymize('profiles', 'street', 'address', seed_column := 'user_id', dry_run := false);
SELECT pgcomply.anonymize('profiles', 'phone', 'phone', seed_column := 'user_id', dry_run := false);
Verify No Real PII Remains
SELECT * FROM pgcomply.verify_forget('user-001');
-- All columns should show 'clean' or contain only fake data
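A plain-SQL spot check makes a good second opinion, since every fake email lands on the anonymous.invalid domain (a sketch; adjust the predicate to your own conventions):
-- Should return 0 once anonymization has run
SELECT count(*) FROM users WHERE email NOT LIKE '%@anonymous.invalid';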
Fake Data Generators
SELECT pgcomply.fake_email('seed-123'); -- user_a3f1b2c4@anonymous.invalid
SELECT pgcomply.fake_name('seed-123'); -- Clara S.
SELECT pgcomply.fake_phone('seed-123'); -- +49 150 1234567
SELECT pgcomply.fake_address('seed-123'); -- Musterstraße 42, 10115 Berlin
SELECT pgcomply.fake_ip('seed-123'); -- 10.142.87.204
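The generators are deterministic: the same seed always maps to the same fake value, which is what keeps anonymized columns consistent across tables and across refreshes. A quick check:
SELECT pgcomply.fake_email('user-001') = pgcomply.fake_email('user-001'); -- true: same seed, same value
SELECT pgcomply.fake_email('user-001') = pgcomply.fake_email('user-002'); -- false: different seeds diverge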
Workflow: Production to Staging
# 1. Dump production
pg_dump production_db > prod_dump.sql
# 2. Restore to staging
psql staging_db < prod_dump.sql
# 3. Install pgcomply and anonymize
psql staging_db -f pgcomply.sql
psql staging_db -c "SELECT pgcomply.anonymize('users', 'email', 'email', seed_column := 'user_id', dry_run := false);"
# ... repeat for all PII columns
Building an Anonymization Script
For repeatable staging refreshes, create a script:
-- anonymize_staging.sql
-- Run after restoring production dump to staging
-- Install pgcomply if not present
\i pgcomply.sql
-- Anonymize all registered PII columns
DO $$
DECLARE
  r RECORD;
BEGIN
  FOR r IN SELECT table_name, column_name, pii_type
           FROM pgcomply.pii_registry
           WHERE schema_name = 'public'
  LOOP
    BEGIN
      -- Assumes every registered table carries a user_id column to seed from;
      -- tables without one land in the WARNING branch below
      PERFORM pgcomply.anonymize(
        r.table_name, r.column_name, r.pii_type,
        seed_column := 'user_id',
        dry_run := false
      );
      RAISE NOTICE 'Anonymized %.%', r.table_name, r.column_name;
    EXCEPTION WHEN OTHERS THEN
      RAISE WARNING 'Failed to anonymize %.%: %', r.table_name, r.column_name, SQLERRM;
    END;
  END LOOP;
END;
$$;
-- Check for PII-like columns that were never registered (and thus not anonymized)
SELECT table_name, column_name, pii_type
FROM pgcomply.schema_drift()
WHERE confidence = 'high';
-- Refresh planner statistics after the bulk updates
VACUUM ANALYZE;
Automation: Nightly Staging Refresh
#!/bin/bash
# refresh_staging.sh — run via cron at 02:00 nightly
set -euo pipefail
# Ensure the dump (which holds real PII) is removed even if a later step fails
trap 'rm -f /tmp/staging_dump.dump' EXIT
# 1. Fresh dump from production
pg_dump -Fc -h prod-db production > /tmp/staging_dump.dump
# 2. Restore to staging (drop and recreate)
dropdb --if-exists staging_db
createdb staging_db
pg_restore -d staging_db /tmp/staging_dump.dump
# 3. Anonymize
psql staging_db -f anonymize_staging.sql
# 4. Log success
echo "$(date): Staging refreshed and anonymized" >> /var/log/staging_refresh.log
# 5. Clean up (the EXIT trap above also covers failure paths)
rm -f /tmp/staging_dump.dump
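The header comment mentions a 02:00 nightly run; an illustrative crontab entry for that schedule (the install path is a hypothetical example):
# m h dom mon dow  command
0 2 * * * /usr/local/bin/refresh_staging.sh >> /var/log/staging_refresh.log 2>&1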
Handling Special Cases
Auto-increment IDs: These are not PII and should not be anonymized — they are needed for foreign key relationships.
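A quick way to confirm those relationships survived a refresh is an orphan check (a sketch, assuming the users/profiles schema from the examples above):
-- Should return 0 both before and after anonymization
SELECT count(*)
FROM profiles p
LEFT JOIN users u USING (user_id)
WHERE u.user_id IS NULL;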
Timestamps: created_at and updated_at are generally not PII on their own, but combined with other data they can be identifying. Decide per table whether to jitter timestamps:
-- Optional: add random jitter to timestamps (±7 days)
UPDATE users SET created_at = created_at + (random() * 14 - 7) * INTERVAL '1 day';
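Note that jittering columns independently can leave updated_at earlier than created_at. An alternative sketch that applies one per-row offset to both columns, assuming users also carries updated_at:
-- Same offset per row, so created_at <= updated_at is preserved
UPDATE users u
SET created_at = u.created_at + j.jitter,
    updated_at = u.updated_at + j.jitter
FROM (
  SELECT user_id, (random() * 14 - 7) * INTERVAL '1 day' AS jitter
  FROM users
) j
WHERE u.user_id = j.user_id;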
JSON columns: If JSONB columns contain PII, anonymize the specific keys:
-- jsonb || overwrites matching top-level keys with the fake value
UPDATE profiles SET metadata = metadata || jsonb_build_object('real_name', pgcomply.fake_name(user_id::text));
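For keys nested deeper in the document, jsonb_set can target a path (a sketch; contact.email is a hypothetical key used for illustration):
UPDATE profiles
SET metadata = jsonb_set(metadata, '{contact,email}', to_jsonb(pgcomply.fake_email(user_id::text)))
WHERE metadata #> '{contact,email}' IS NOT NULL;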
Summary
Anonymization bridges the gap between developer needs and compliance requirements. pgcomply's deterministic approach ensures realistic, consistent fake data that preserves referential integrity. No more choosing between useful test data and GDPR compliance.