From: | Janning Vygen <vygen(at)kicktipp(dot)de> |
---|---|
To: | Bill Moran <wmoran(at)potentialtech(dot)com> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Anonymized database dumps |
Date: | 2012-03-19 14:45:28 |
Message-ID: | 3EC70397-81C3-42E3-B29E-69F8A9A2C084@kicktipp.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Am 19.03.2012 um 13:22 schrieb Bill Moran <wmoran(at)potentialtech(dot)com>:
> In response to Janning Vygen <vygen(at)kicktipp(dot)de>:
>>
>> I am working on postgresql 9.1 and loving it!
>>
>> Sometimes we need a full database dump to test some performance issues
>> with real data.
>>
>> Of course we don't like to have sensible data like bunches of e-mail
>> addresses on our development machines as they are of no interest for
>> developers and should be kept secure.
>>
>> So we need an anonymized database dump. I thought about a few ways to
>> achieve this.
>>
>> 1. Best solution would be a special db user and some rules which fire on
>> reading some tables and replace privacy data with some random data. Now
>> doing a dump as this special user doesn't even copy the sensible data at
>> all. The user just has a different view on this database even when he
>> calls pg_dump.
>>
>> But as rules are not fired on COPY it can't work, right?
>>
>> 2. The other solution I can think of is something like
>>
>> pg_dump | sed > pgdump_anon
>>
>> where 'sed' does a lot of magical replace operations on the content of
>> the dump. I don't think this is going to work reliable.
>>
>> 3. More reliable would be to dump the database, restore it on a
>> different server, run some sql script which randomize some data, and
>> dump it again. hmm, seems to be the only reliable way so far. But it is
>> no fun when dumping and restoring takes an hour.
>>
>> Does anybody has a better idea how to achieve an anonymized database dump?
>
> I highly recommend #3. It's how we do it where I work.
>
> At first it seems like a big, slow, complicated monster, but once you've
> built the tools and have it running reliably it's very nice. Our system
> does the dumps overnight via cron (we have over 100 production databases)
> then changes the sensitive data, as well changing all the passwords to
> "password" so developers can easily log in as any account. During the
> day, the developers have access to all the sanitized dump files and can
> use them to make as many testing databases as they need. Yes, the data
> gets up to 24 hours out of date, but it's never been a problem for us.
Thanks for your response and your insights to your process. Sounds reasonable.
Regards
Janning
From | Date | Subject | |
---|---|---|---|
Next Message | Florent THOMAS | 2012-03-19 15:28:10 | Conditionnal validation for transaction |
Previous Message | Tom Lane | 2012-03-19 14:28:38 | Re: WHERE IN (subselect) versus WHERE IN (1,2,3,) |