Re: Re[2]: Possible solution for masking chosen columns when using pg_dump

From: Виктория Шепард <we(dot)viktory(at)gmail(dot)com>
To: Олег Целебровский <oleg_tselebrovskiy(at)mail(dot)ru>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Julien Rouhaud <rjuju123(at)gmail(dot)com>
Subject: Re: Re[2]: Possible solution for masking chosen columns when using pg_dump
Date: 2022-10-12 07:19:34
Message-ID: CAAB=fdPuaNk_O9_dcZ6Nj0ZmgpBTKEiibdGo5sz_FAFOK+Rxfw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Here is an idea of how to read masking options from a file. Please, take a
look.

пн, 10 окт. 2022 г. в 14:54, Олег Целебровский <oleg_tselebrovskiy(at)mail(dot)ru>:

> Hi,
>
> I applied most of suggestions: used separate files for most of added code,
> fixed typos/mistakes, got rid of that pesky TODO that was already
> implemented, just not deleted.
>
> Added tests (and functionality) for cases when you need to mask columns in
> tables with the same name in different schemas. If schema is not specified,
> then columns in all tables with specified name are masked (Example -
> pg_dump -t ‘t0’ --mask-columns id --mask-function mask_int will mask all
> ids in all tables with names ‘t0’ in all existing schemas).
>
> Wrote comments for all ‘magic numbers’
>
> About that
>
> >- Also it can be hard to use a lot of different functions for different
> fields, maybe it would be better to set up functions in a file.
>
> I agree with that, but I know about at least 2 other patches (both are
> WIP, but still) that are interacting with reading command-line options from
> file. And if everyone will write their own version of reading command-line
> options from file, it will quickly get confusing.
>
> A solution to that problem is another patch that will put all options from
> file (one file for any possible options, from existing to future ones) into
> **argv in main, so that pg_dump can process them as if they came form
> command line.
>
>
> Пятница, 7 октября 2022, 8:01 +07:00 от Виктория Шепард <
> we(dot)viktory(at)gmail(dot)com>:
>
> Hi,
> I took a look, here are several suggestions for improvement:
>
> - Masking is not a main functionality of pg_dump and it is better to write
> most of the connected things in a separate file like parallel.c or
> dumputils.c. This will help slow down the growth of an already huge pg_dump
> file.
>
> - Also it can be hard to use a lot of different functions for different
> fields, maybe it would be better to set up functions in a file.
>
> - How will it work for the same field and tables in the different schemas?
> Can we set up the exact schema for the field?
>
> - misspelling in a word
> >/*
> >* Add all columns and funcions to list of MaskColumnInfo structures,
> >*/
>
> - Why did you use 256 here?
> > char* table = (char*) pg_malloc(256 * sizeof(char));
> Also for malloc you need malloc on 1 symbol more because you have to store
> '\0' symbol.
>
> - Instead of addFuncToDatabase you can run your query using something
> already defined from fe_utils/query_utils.c. And It will be better to set
> up a connection only once and create all functions. Establishing a
> connection is a resource-intensive procedure. There are a lot of magic
> numbers, better to leave some comments explaining why there are 64 or 512.
>
> - It seems that you are not using temp_string
> > char *temp_string = (char*)malloc(256 * sizeof(char));
>
> - Grammar issues
> >/*
> >* mask_column_info_list contains info about every to-be-masked column:
> >* its name, a name its table (if nothing is specified - mask all columns
> with this name),
> >* name of masking function and name of schema containing this function
> (public if not specified)
> >*/
> the name of its table
>
>
> пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju123(at)gmail(dot)com
> <//e.mail.ru/compose/?mailto=mailto%3arjuju123(at)gmail(dot)com>>:
>
> Hi,
>
> On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote:
> >
> > Hello, here's my take on masking data when using pg_dump
> >
> > The main idea is using PostgreSQL functions to replace data during a
> SELECT.
> > When table data is dumped SELECT a,b,c,d ... from ... query is
> generated, the columns that are marked for masking are replaced with result
> of functions on those columns
> > Example: columns name, count are to be masked, so the query will look as
> such: SELECT id, mask_text(name), mask_int(count), date from ...
> >
> > So about the interface: I added 2 more command-line options:
> >
> > --mask-columns, which specifies what columns from what tables will be
> masked
> > usage example:
> > --mask-columns "t1.name, t2.description" - both columns
> will be masked with the same corresponding function
> > or --mask-columns name - ALL columns with name "name" from
> all dumped tables will be masked with correspoding function
> >
> > --mask-function, which specifies what functions will mask data
> > usage example:
> > --mask-function mask_int - corresponding columns will be
> masked with function named "mask_int" from default schema (public)
> > or --mask-function my_schema.mask_varchar - same as above
> but with specified schema where the function is stored
> > or --mask-function somedir/filename - the function is
> "defined" here - more on the structure below
>
> FTR I wrote an extension POC [1] last weekend that does that but on the
> backend
> side. The main advantage is that it's working with any existing versions
> of
> pg_dump (or any client relying on COPY or even plain interactive SQL
> statements), and that the DBA can force a dedicated role to only get a
> masked
> dump, even if they forgot to ask for it.
>
> I only had a quick look at your patch but it seems that you left some todo
> in
> russian, which isn't helpful at least to me.
>
> [1] https://github.com/rjuju/pg_anonymize
>
>
>
>
>

Attachment Content-Type Size
0001-C4a-pg-dump-masking-option.patch text/x-patch 65.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2022-10-12 07:19:54 Re: test_decoding assertion failure for the loss of top-sub transaction relationship
Previous Message Masahiko Sawada 2022-10-12 07:15:19 Re: Add index scan progress to pg_stat_progress_vacuum