Re: Add --include-table-data-where option to pg_dump, to export only a subset of table data

From: Carter Thaxton <carter(dot)thaxton(at)gmail(dot)com>
To: Euler Taveira <euler(at)timbira(dot)com(dot)br>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add --include-table-data-where option to pg_dump, to export only a subset of table data
Date: 2018-05-23 04:57:25
Message-ID: CAGiT_HMrELHpU2YVM=o5VOv0Q6VyD2=NcVQ6xSrPy56ose_-Uw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> How would you handle foreign keys? It seems easier to produce a dump
> that won't restore.
>

This proposal will not attempt to be smart about foreign keys or anything
like that. I don't believe that would even be expected.

> We have the existing options:
> > --include-table=table (and its -t synonym)
> > --exclude-table=table
> > --exclude-table-data=table
> >
> > I propose a new option:
> > --include-table-data-where=table:filter_clause
> >
> I remembered an old thread [1]. At that time pg_dump was not so
> decoupled from the backend. We are far from being decoupled in a way
> that someone can write his own pg_dump using only calls from a
> library. I'm not sure pg_dump is the right place to add another ETL
> parameter. We already have too much parameters that could break a
> restore (flexibility is always welcome but too much is not so good).
>

In general, I agree with your sentiment that we don't want too much
flexibility in this tool. However, this just seems like a very obvious
missing feature to me. I was frankly surprised that pg_dump didn't already
have it.

I've designed this feature so that it behaves like a more flexible version
between --exclude-table-data and --include-table. Instead of dumping the
schema and zero rows, or the schema and all of the rows, it dumps the
schema and some specific rows.

Providing "--include-table-data-where=table:false" behaves exactly like
--exclude-table-data, and "--include-table-data-where=table:true" behaves
exactly like --include-table.
It does no more or less to prevent a restore. Given that
--exclude-table-data already exists, this seems to introduce no new issues
with restore.

> > pg_dump --include-table-data-where=largetable:"created_at >=
> '2018-05-01'"
> > database_name
> >
> How would you check that that expression is correct?

The patch as already provided produces an error message and appropriate
exit code during the dump process, presenting the invalid SQL that is
produced as part of the WHERE clause.
I could see some value in refactoring it to provide error messages earlier
in the process, but it's actually not bad as is.

Every parameter could quote its value. It means that your parameter have to
> escape the
> quote in '2018-05-01'.

I don't understand. The double quotes in my example are bash shell
quotes. There is no special quote parsing in this patch. The single
quotes are part of the WHERE clause.
Note that pg_dump already uses getopt_long, so it's not required to use the
= symbol to separate option from its associated value. So, it would also
be fine to call as follows:

pg_dump --include-table-data-where "largetable:created_at >=
'2018-05-01'" database_name

Another problem is that your spec does not show
> us how you would handle tables like Foo.Bar or "foo:bar" (colon have
> to be escaped)?
>

Using a dot to separate the schema works just fine. My proposal uses the
same mechanism as --include-table, --exclude-table, and
--exclude-table-data. In fact, it even supports wildcards in those
patterns.

Your point about a colon in the table name is interesting. In all my years
of working with PostgreSQL and other databases, I've never encountered a
table name that contained a colon. Perhaps an escape character, like \:
could work. Is there another separator character you would suggest, which
is illegal in table names, but also intuitive as a separator? Maybe a
comma?

> > The filter_clause is used as the contents of a WHERE clause when querying
> > the data to generate the COPY statement produced by pg_dump.
> >
> You are forgetting about --inserts parameter. Could I use
> --include-table-data-where and --inserts?

Yes, the --inserts parameter works just fine. Perhaps I should have said
"the COPY statement or INSERT statements".

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Carter Thaxton 2018-05-23 05:18:32 Re: Add --include-table-data-where option to pg_dump, to export only a subset of table data
Previous Message Thomas Munro 2018-05-23 04:56:16 Re: [sqlsmith] Unpinning error in parallel worker