Re[2]: Possible solution for masking chosen columns when using pg_dump

From: Олег Целебровский <oleg_tselebrovskiy(at)mail(dot)ru>
To: Виктория Шепард <we(dot)viktory(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Julien Rouhaud <rjuju123(at)gmail(dot)com>
Subject: Re[2]: Possible solution for masking chosen columns when using pg_dump
Date: 2022-10-10 09:53:43
Message-ID: 1665395623.333547323@f393.i.mail.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hi,

I applied most of suggestions: used separate files for most of added code, fixed typos/mistakes, got rid of that pesky TODO that was already implemented, just not deleted.

Added tests (and functionality) for cases when you need to mask columns in tables with the same name in different schemas. If schema is not specified, then columns in all tables with specified name are masked (Example - pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all ids in all tables with names ‘t0’ in all existing schemas).

Wrote comments for all ‘magic numbers’

About that

>- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.

I agree with that, but I know about at least 2 other patches (both are WIP, but still) that are interacting with reading command-line options from file. And if everyone will write their own version of reading command-line options from file, it will quickly get confusing.

A solution to that problem is another patch that will put all options from file (one file for any possible options, from existing to future ones) into **argv in main, so that pg_dump can process them as if they came form command line.
 
>Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <we(dot)viktory(at)gmail(dot)com>:

>Hi,
>I took a look, here are several suggestions for improvement:

>- Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.

>- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file.

>- How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field?

>- misspelling in a word
>>/*
>>* Add all columns and funcions to list of MaskColumnInfo structures,
>>*/

>- Why did you use 256 here?
>> char* table = (char*) pg_malloc(256 * sizeof(char));
>Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol.

>- Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512.

>- It seems that you are not using temp_string
>> char   *temp_string = (char*)malloc(256 * sizeof(char));

>- Grammar issues
>>/*
>>* mask_column_info_list contains info about every to-be-masked column:
>>* its name, a name its table (if nothing is specified - mask all columns with this name),
>>* name of masking function and name of schema containing this function (public if not specified)
>>*/
>the name of its table
>   
>пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud < rjuju123(at)gmail(dot)com >:
>>Hi,
>>
>>On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
>>>
>>> Hello, here's my take on masking data when using pg_dump
>>>  
>>> The main idea is using PostgreSQL functions to replace data during a SELECT.
>>> When table data is dumped SELECT a,b,c,d ... from ... query is generated, the columns that are marked for masking are replaced with result of functions on those columns
>>> Example: columns name, count are to be masked, so the query will look as such: SELECT id, mask_text(name), mask_int(count), date from ...
>>>  
>>> So about the interface: I added 2 more command-line options: 
>>>  
>>> --mask-columns, which specifies what columns from what tables will be masked 
>>>     usage example:
>>>             --mask-columns " t1.name , t2.description" - both columns will be masked with the same corresponding function
>>>             or --mask-columns name - ALL columns with name "name" from all dumped tables will be masked with correspoding function
>>>  
>>> --mask-function, which specifies what functions will mask data
>>>     usage example:
>>>             --mask-function mask_int - corresponding columns will be masked with function named "mask_int" from default schema (public)
>>>             or --mask-function my_schema.mask_varchar - same as above but with specified schema where the function is stored
>>>             or --mask-function somedir/filename - the function is "defined" here - more on the structure below
>>
>>FTR I wrote an extension POC [1] last weekend that does that but on the backend
>>side.  The main advantage is that it's working with any existing versions of
>>pg_dump (or any client relying on COPY or even plain interactive SQL
>>statements), and that the DBA can force a dedicated role to only get a masked
>>dump, even if they forgot to ask for it.
>>
>>I only had a quick look at your patch but it seems that you left some todo in
>>russian, which isn't helpful at least to me.
>>
>>[1] https://github.com/rjuju/pg_anonymize
>>
>> 
 

Attachment Content-Type Size
masking_for_pg_dump_v2.patch text/x-diff 25.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ajin Cherian 2022-10-10 09:58:48 Re: Support logical replication of DDLs
Previous Message Bharath Rupireddy 2022-10-10 09:52:15 Re: Suppressing useless wakeups in walreceiver