tutorials
postgresqlanonymizationstagingdevelopment

Anonymizing PostgreSQL Data for Staging and Development Environments

Create realistic but anonymous copies of production data for development. Deterministic anonymization preserves referential consistency across tables.

RL
Robert Langner
Managing Director, NILS Software GmbH · · 3 min read

The Staging Data Problem

Developers need realistic data to test effectively. The temptation is to copy production — but that puts real PII on development machines, often with weaker security.

Anonymizing with pgcomply

Preview First (Dry Run)

SELECT pgcomply.anonymize(
  'users', 'email', 'email',
  seed_column := 'user_id',
  dry_run := true
);

Shows what would change without modifying data:

 user_id  | email (before)       | email (after)
----------+----------------------+-----------------------------
 user-001 | alice@example.com    | user_a3f1b2c4@anonymous.invalid
 user-002 | bob@company.de       | user_e5d6f7a8@anonymous.invalid

Execute Anonymization

SELECT pgcomply.anonymize('users', 'email', 'email', seed_column := 'user_id', dry_run := false);
SELECT pgcomply.anonymize('users', 'name', 'person_name', seed_column := 'user_id', dry_run := false);
SELECT pgcomply.anonymize('profiles', 'street', 'address', seed_column := 'user_id', dry_run := false);
SELECT pgcomply.anonymize('profiles', 'phone', 'phone', seed_column := 'user_id', dry_run := false);

Verify No Real PII Remains

SELECT * FROM pgcomply.verify_forget('user-001');
-- All columns should show 'clean' or contain only fake data

Fake Data Generators

SELECT pgcomply.fake_email('seed-123');   -- user_a3f1b2c4@anonymous.invalid
SELECT pgcomply.fake_name('seed-123');    -- Clara S.
SELECT pgcomply.fake_phone('seed-123');   -- +49 150 1234567
SELECT pgcomply.fake_address('seed-123'); -- Musterstraße 42, 10115 Berlin
SELECT pgcomply.fake_ip('seed-123');      -- 10.142.87.204

Workflow: Production to Staging

# 1. Dump production
pg_dump production_db > prod_dump.sql

# 2. Restore to staging
psql staging_db < prod_dump.sql

# 3. Install pgcomply and anonymize
psql staging_db -f pgcomply.sql
psql staging_db -c "SELECT pgcomply.anonymize('users', 'email', 'email', seed_column := 'user_id', dry_run := false);"
# ... repeat for all PII columns

Building an Anonymization Script

For repeatable staging refreshes, create a script:

-- anonymize_staging.sql
-- Run after restoring production dump to staging

-- Install pgcomply if not present
\i pgcomply.sql

-- Anonymize all registered PII columns
DO $$
DECLARE
  r RECORD;
BEGIN
  FOR r IN SELECT table_name, column_name, pii_type
           FROM pgcomply.pii_registry
           WHERE schema_name = 'public'
  LOOP
    BEGIN
      PERFORM pgcomply.anonymize(
        r.table_name, r.column_name, r.pii_type,
        seed_column := 'user_id',
        dry_run := false
      );
      RAISE NOTICE 'Anonymized %.%', r.table_name, r.column_name;
    EXCEPTION WHEN OTHERS THEN
      RAISE WARNING 'Failed to anonymize %.%: %', r.table_name, r.column_name, SQLERRM;
    END;
  END LOOP;
END;
$$;

-- Verify no real PII remains
SELECT table_name, column_name, pii_type
FROM pgcomply.schema_drift()
WHERE confidence = 'high';

-- Reset sequences and clean up
VACUUM ANALYZE;

Automation: Nightly Staging Refresh

#!/bin/bash
# refresh_staging.sh — run via cron at 02:00 nightly

set -e

# 1. Fresh dump from production
pg_dump -Fc -h prod-db production > /tmp/staging_dump.dump

# 2. Restore to staging (drop and recreate)
dropdb --if-exists staging_db
createdb staging_db
pg_restore -d staging_db /tmp/staging_dump.dump

# 3. Anonymize
psql staging_db -f anonymize_staging.sql

# 4. Log success
echo "$(date): Staging refreshed and anonymized" >> /var/log/staging_refresh.log

# 5. Clean up
rm /tmp/staging_dump.dump

Handling Special Cases

Auto-increment IDs: These are not PII and should not be anonymized — they are needed for foreign key relationships.

Timestamps: Created_at and updated_at are generally not PII, but combined with other data they can be identifying. Decide per-table whether to jitter timestamps:

-- Optional: add random jitter to timestamps (±7 days)
UPDATE users SET created_at = created_at + (random() * 14 - 7) * INTERVAL '1 day';

JSON columns: If JSONB columns contain PII, anonymize the specific keys:

UPDATE profiles SET metadata = metadata - 'real_name' || jsonb_build_object('real_name', pgcomply.fake_name(id::text));

Summary

Anonymization bridges the gap between developer needs and compliance requirements. pgcomply's deterministic approach ensures realistic, consistent fake data that preserves referential integrity. No more choosing between useful test data and GDPR compliance.

Frequently Asked Questions

Is it legal to use production data in staging?
No. Under GDPR, any processing of personal data requires a legal basis. Copying production PII to a staging environment is processing, and 'developer convenience' is not a legal basis. You must either anonymize the data or use synthetic test data.
What is deterministic anonymization?
Deterministic anonymization means the same input always produces the same fake output. If alice@example.com becomes user_a3f1@anonymous.invalid in the users table, it becomes the same address in the orders table. This preserves foreign key relationships and JOIN consistency.
How realistic is the fake data?
pgcomply generates German-format fake data: names from a 20-name pool (Clara S., Max B., etc.), addresses with German street formats and real postal codes, phone numbers with +49 prefix, and emails in a recognizable user_hash@anonymous.invalid pattern.

Related Articles