Eymoutiers, France, Mars 5, 2020
PostgreSQL Anonymizer
is an extension that hides or replaces personally
identifiable information (PII) or commercially sensitive data from a PostgreSQL
database.
The extension supports 3 different anonymization strategies: Dynamic Masking, In-Place Anonymization and Anonymous Dumps. It also offers a large choice of Masking Functions: Substitution, Randomization, Faking, Pseudonymization, Partial Scrambling, Shuffling, Noise Addition and Generalization.
Pseudonymization functions are similar to the faking functions in the sense that they generate realistic values. The main difference is that pseudonymization is deterministic : the functions will always return the same fake value based on a seed and an optional salt.
# SELECT anon.pseudo_email('Alice','salt123');
pseudo_email
-------------------
fcadell56@ucoz.ru
# SELECT anon.pseudo_email('Alice','salt123');
pseudo_email
-------------------
fcadell56@ucoz.ru
WARNING : Pseudonymization is often confused with anonymization but in fact
they serve 2 different purposes. With pseudonymization, the real data can be
rebuilt using the pseudo data and the masking rule. If an attacker
gets access to these elements, he or she can easily re-identify some people
using brute force
or dictionary
attacks. Therefore, you should protect any
pseudonymized data with the same level of security that the original dataset.
The GDPR makes it very clear that personal data which have undergone
pseudonymization are still considered to be personal information.
The anon.dump()
function was slow and unpractical. It is now deprecated and
replace by a tool named pg_dump_anon
that you can use like the regular pg_dump
command:
$ pg_dump_anon -h localhost -U bob mydb > anonymous_dump.sql
It uses the same connections parameters that pg_dump
. The PostgreSQL
environment variables ($PGHOST, PGUSER, etc.) and .pgpass
are supported.
However the plain
format is the only supported format. The other formats
(custom
, dir
and tar
) are not supported.
This extension makes it very easy to declare masking rules. But of course when you're creating an anonymization strategy, the hard part is to scan the database model to find which columns contains direct and indirect identifiers and then decide how these identifiers should be masked.
We now provide a detect()
function that will search for common identifiers
names based on a dictionary. For now, 2 dictionaries are available: English
('en_US') and French ('fr_FR'). By default the English dictionary is used:
# SELECT anon.detect('en_US');
table_name | column_name | identifiers_category | direct
------------+----------------+----------------------+--------
customer | CreditCard | creditcard | t
customer | id | account_id | t
vendor | Firstname | firstname | t
The identifier categories are based on the HIPAA classification.
This extension is officially supported on PostgreSQL 9.6 and later.
On Red Hat / CentOS systems, you can install it from the official PostgreSQL RPM repository:
$ yum install postgresql_anonymizer12
Then add 'anon' in the shared_preload_libraries
parameter of your
postgresql.conf
file. And restart your instance.
For other systems, check out the install documentation :
https://postgresql-anonymizer.readthedocs.io/en/latest/INSTALL/
WARNING: The project is at an early stage of development and should be used carefully.
This release includes code, bugfixes and ideas from Sebastien Delobel, Sam Buckingham, Thomas Clark, Joe Auty, Pierre-Henri Dubois Amy and Olleg Samoylov.
Many thanks to them !
PostgreSQL Anonymizer is part of the Dalibo Labs initiative. It is mainly developed by Damien Clochard.
This is an open project, contributions are welcome. We need your feedback and ideas ! Let us know what you think of this tool, how it fits your needs and what features are missing.
If you want to help, you can find a list of Junior Jobs.