Re: New "raw" COPY format

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: jian he <jian(dot)universality(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-24 00:54:39
Message-ID: CAD21AoAZh5ZwX6dsjFqTiYZm3CM3g0qKjBixSsNv-R0DtveXFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Sat, Oct 19, 2024 at 8:33 AM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> On Sat, Oct 19, 2024, at 12:13, jian he wrote:
> > We already make RAW and can only have one column.
> > if RAW has no default delimiter, then COPY FROM a text file will
> > become one datum value;
> > which makes it looks like importing a Large Object.
> > (https://www.postgresql.org/docs/17/lo-funcs.html)
>
> The single datum value might not come from a physical column; it could be
> an aggregated JSON value, as in the example Daniel mentioned:
>
> On Wed, Oct 16, 2024, at 18:34, Daniel Verite wrote:
> > copy (select json_agg(col) from table ) to 'file' RAW
> >
> > This is a variant of the discussion in [1] where the OP does:
> >
> > copy (select json_agg(row_to_json(t)) from <query> t) TO 'file'
> >
> > and he complains that both text and csv "break the JSON".
> > That discussion morphed into a proposed patch adding JSON
> > format to COPY, but RAW would work directly as the OP
> > expected.
> >
> > That is, unless <query> happens to include JSON fields with LF/CRLF
> > in them, and the RAW format says this is an error condition.
> > In that case it's quite annoying to make it an error, rather than
> > simply let it pass.
> >
> > [1]
> > https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com
>
> In such cases, a user could perform the following:
>
> CREATE TABLE customers (id int, name text, email text);
>
> INSERT INTO customers (id, name, email) VALUES
> (1, 'John Doe', 'john(dot)doe(at)example(dot)com'),
> (2, 'Jane Smith', 'jane(dot)smith(at)example(dot)com'),
> (3, 'Alice Johnson', 'alice(dot)johnson(at)example(dot)com');
>
> COPY (SELECT json_agg(row_to_json(t)) FROM customers t) TO '/tmp/file' (FORMAT raw);
>
> % cat /tmp/file
> [{"id":1,"name":"John Doe","email":"john(dot)doe(at)example(dot)com"}, {"id":2,"name":"Jane Smith","email":"jane(dot)smith(at)example(dot)com"}, {"id":3,"name":"Alice Johnson","email":"alice(dot)johnson(at)example(dot)com"}]%
>
> > i think, most of the time, you have more than one row/value to import
> > and export?
>
> Yes, probably, but it might not be a physical row. It could be an aggregated
> one, like in the example above. When importing, it might be a large JSON array
> of objects that is imported into a temporary table and then deserialized into
> a proper schema.
>
> The need to load entire files is already fulfilled by pg_read_file(text) -> text,
> but there is no pg_write_file(), likely for security reasons.
> So COPY TO ... (FORMAT RAW) with no delimiter seems necessary,
> and then COPY FROM also needs to work accordingly.
>
> >> The refactoring is now in a separate first single commit, which seems
> >> necessary, to separate the new functionality, from the refactoring.
> > I agree.
>
> > ProcessCopyOptions
> > /* Extract options from the statement node tree */
> > foreach(option, options)
> > {
> > }
> > /* --- DELIMITER option --- */
> > /* --- NULL option --- */
> > /* --- QUOTE option --- */
> > Currently the regress test passed, i think that means your refactor is fine.
>
> I believe that a passing test indicates it might be okay,
> but a failing test definitely means it's not. :D
>
> I've meticulously refactored one option at a time, checking which code in
> ProcessCopyOptions depends on each option field to ensure the semantics
> are preserved.
>
> I think the changes are easy to follow, and it's clear that each change is
> correct when looking at them individually, though it might be more challenging
> when viewing the total change.
>
> I've tried to minimize code movement, preserving as much of the original
> code placement as possible.
>
> > in ProcessCopyOptions, maybe we can rearrange the code after the
> > foreach loop (foreach(option, options)
> > based on the parameters order in
> > https://www.postgresql.org/docs/devel/sql-copy.html Parameters section.
> > so we can review it by comparing the refactoring with the
> > sql-copy.html Parameters section's description.
>
> That would be nice, but unfortunately, it's not possible because the order of
> the option code blocks matters due to the setting of defaults in else/else
> if branches when an option is not specified.
>
> For example, in the documentation, DEFAULT precedes QUOTE,
> but in ProcessCopyOptions, the QUOTE code block must come before
> the DEFAULT code block due to the check:
>
> /* Don't allow the CSV quote char to appear in the default string. */
>
> I also believe there's value in minimizing code movement.
>
>
> >> > We already did column length checking at BeginCopyTo.
> >> > no need to "if (list_length(cstate->attnumlist) != 1)" error check in
> >> > CopyOneRowTo?
> >>
> >> Hmm, not sure really, since DoCopy() calls both BeginCopyTo()
> >> and DoCopyTo() which in turn calls CopyOneRowTo(),
> >> but CopyOneRowTo() is also being called from copy_dest_receive().
> >>
> >
> > BeginCopyTo do the preparation work.
> > cstate->attnumlist = CopyGetAttnums(tupDesc, cstate->rel, attnamelist);
> >
> > After CopyGetAttnums, the number of attributes for COPY TO cannot be changed.
> > right after CopyGetAttnums call then check the length of cstate->attnumlist
> > seems fine for me.
> > I think in CopyOneRowTo, we can actually
> > Assert(list_length(cstate->attnumlist) == 1).
> > for raw format.
>
> Right, I've changed it to an Assert instead.
>
> > src10=# drop table if exists x;
> > create table x(a int);
> > COPY x from stdin (FORMAT raw);
> > DROP TABLE
> > CREATE TABLE
> > Enter data to be copied followed by a newline.
> > End with a backslash and a period on a line by itself, or an EOF signal.
> >>> 11
> >>> 12
> >>> \.
> > ERROR: invalid input syntax for type integer: "11
> > 12
> > "
> > CONTEXT: COPY x, line 1, column a: "11
> > 12
> > "
> >
> > The above case means COPY FROM STDIN (FORMAT RAW) can only import one
> > single value (when successful).
> > user need to specify like:
> >
> > COPY x from stdin (FORMAT raw, delimiter E'\n');
> >
> > seems raw format default no delimiter is not user friendly.
>
> I have no idea if dealing with .json files that would contain newlines
> in between fields, and would therefore need to be imported "as is",
> is more common than dealing with e.g. .jsonl files where it's guaranteed
> each json value is on a single line.
>
> I think Jacob raised some valid concerns on automagically detecting
> newlines, that is how text/csv works, so I don't think we want that.
>
> Maybe the OS default EOL would be an OK default,
> if we want it to be the default delimiter, that is.
>
> I have no strong opinion here, except automagical newline detection seems
> like a bad idea.
>
> I'm fine with OS default EOL as the default for the delimiter,
> or no delimiter as the default.
>
> New patch attached.

I have one question:

From the 0001 patch's commit message:

No behavioral changes are intended; this is a pure refactoring to improve code
clarity and maintainability.

Does the reorganization of the option validation done by this patch
also help make the 0002 patch simple or small? If not much, while it
makes sense to me that introducing the CopyFormat enum is required by
the 0002 patch, I think we can discuss the reorganization part
separately. And I'd suggest the patch organization would be:

0001: introduce CopyFormat and replace csv_mode and binary fields with it.
0002: add new 'raw' format.
0003: reorganize option validations.

One benefit would be that we can remove the simple replacements like
'->binary' with '.format == COPY_FORMAT_BINARY' from the
reorganization patch, which makes it small. The 0001 and 0002 patches
that seem to be more straightforward are independent from the 0003
patch, and we can discuss how to make the option validations including
the new 'raw' format better.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Matthew Morrissette Vance 2024-10-24 01:27:59 Commutation of array SOME/ANY and ALL operators
Previous Message Michael Paquier 2024-10-23 23:36:07 Re: Set query_id for query contained in utility statement