Re: New "raw" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "jian he" <jian(dot)universality(at)gmail(dot)com>
Cc: "Tatsuo Ishii" <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-19 15:32:49
Message-ID: ddbbffb5-613c-4435-8366-3d21a5e8965f@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Oct 19, 2024, at 12:13, jian he wrote:
> We already make RAW and can only have one column.
> if RAW has no default delimiter, then COPY FROM a text file will
> become one datum value;
> which makes it looks like importing a Large Object.
> (https://www.postgresql.org/docs/17/lo-funcs.html)

The single datum value might not come from a physical column; it could be
an aggregated JSON value, as in the example Daniel mentioned:

On Wed, Oct 16, 2024, at 18:34, Daniel Verite wrote:
> copy (select json_agg(col) from table ) to 'file' RAW
>
> This is a variant of the discussion in [1] where the OP does:
>
> copy (select json_agg(row_to_json(t)) from <query> t) TO 'file'
>
> and he complains that both text and csv "break the JSON".
> That discussion morphed into a proposed patch adding JSON
> format to COPY, but RAW would work directly as the OP
> expected.
>
> That is, unless <query> happens to include JSON fields with LF/CRLF
> in them, and the RAW format says this is an error condition.
> In that case it's quite annoying to make it an error, rather than
> simply let it pass.
>
> [1]
> https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com

In such cases, a user could perform the following:

CREATE TABLE customers (id int, name text, email text);

INSERT INTO customers (id, name, email) VALUES
(1, 'John Doe', 'john(dot)doe(at)example(dot)com'),
(2, 'Jane Smith', 'jane(dot)smith(at)example(dot)com'),
(3, 'Alice Johnson', 'alice(dot)johnson(at)example(dot)com');

COPY (SELECT json_agg(row_to_json(t)) FROM customers t) TO '/tmp/file' (FORMAT raw);

% cat /tmp/file
[{"id":1,"name":"John Doe","email":"john(dot)doe(at)example(dot)com"}, {"id":2,"name":"Jane Smith","email":"jane(dot)smith(at)example(dot)com"}, {"id":3,"name":"Alice Johnson","email":"alice(dot)johnson(at)example(dot)com"}]%

> i think, most of the time, you have more than one row/value to import
> and export?

Yes, probably, but it might not be a physical row. It could be an aggregated
one, like in the example above. When importing, it might be a large JSON array
of objects that is imported into a temporary table and then deserialized into
a proper schema.

The need to load entire files is already fulfilled by pg_read_file(text) -> text,
but there is no pg_write_file(), likely for security reasons.
So COPY TO ... (FORMAT RAW) with no delimiter seems necessary,
and then COPY FROM also needs to work accordingly.

>> The refactoring is now in a separate first single commit, which seems
>> necessary, to separate the new functionality, from the refactoring.
> I agree.

> ProcessCopyOptions
> /* Extract options from the statement node tree */
> foreach(option, options)
> {
> }
> /* --- DELIMITER option --- */
> /* --- NULL option --- */
> /* --- QUOTE option --- */
> Currently the regress test passed, i think that means your refactor is fine.

I believe that a passing test indicates it might be okay,
but a failing test definitely means it's not. :D

I've meticulously refactored one option at a time, checking which code in
ProcessCopyOptions depends on each option field to ensure the semantics
are preserved.

I think the changes are easy to follow, and it's clear that each change is
correct when looking at them individually, though it might be more challenging
when viewing the total change.

I've tried to minimize code movement, preserving as much of the original
code placement as possible.

> in ProcessCopyOptions, maybe we can rearrange the code after the
> foreach loop (foreach(option, options)
> based on the parameters order in
> https://www.postgresql.org/docs/devel/sql-copy.html Parameters section.
> so we can review it by comparing the refactoring with the
> sql-copy.html Parameters section's description.

That would be nice, but unfortunately, it's not possible because the order of
the option code blocks matters due to the setting of defaults in else/else
if branches when an option is not specified.

For example, in the documentation, DEFAULT precedes QUOTE,
but in ProcessCopyOptions, the QUOTE code block must come before
the DEFAULT code block due to the check:

/* Don't allow the CSV quote char to appear in the default string. */

I also believe there's value in minimizing code movement.

>> > We already did column length checking at BeginCopyTo.
>> > no need to "if (list_length(cstate->attnumlist) != 1)" error check in
>> > CopyOneRowTo?
>>
>> Hmm, not sure really, since DoCopy() calls both BeginCopyTo()
>> and DoCopyTo() which in turn calls CopyOneRowTo(),
>> but CopyOneRowTo() is also being called from copy_dest_receive().
>>
>
> BeginCopyTo do the preparation work.
> cstate->attnumlist = CopyGetAttnums(tupDesc, cstate->rel, attnamelist);
>
> After CopyGetAttnums, the number of attributes for COPY TO cannot be changed.
> right after CopyGetAttnums call then check the length of cstate->attnumlist
> seems fine for me.
> I think in CopyOneRowTo, we can actually
> Assert(list_length(cstate->attnumlist) == 1).
> for raw format.

Right, I've changed it to an Assert instead.

> src10=# drop table if exists x;
> create table x(a int);
> COPY x from stdin (FORMAT raw);
> DROP TABLE
> CREATE TABLE
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
>>> 11
>>> 12
>>> \.
> ERROR: invalid input syntax for type integer: "11
> 12
> "
> CONTEXT: COPY x, line 1, column a: "11
> 12
> "
>
> The above case means COPY FROM STDIN (FORMAT RAW) can only import one
> single value (when successful).
> user need to specify like:
>
> COPY x from stdin (FORMAT raw, delimiter E'\n');
>
> seems raw format default no delimiter is not user friendly.

I have no idea if dealing with .json files that would contain newlines
in between fields, and would therefore need to be imported "as is",
is more common than dealing with e.g. .jsonl files where it's guaranteed
each json value is on a single line.

I think Jacob raised some valid concerns on automagically detecting
newlines, that is how text/csv works, so I don't think we want that.

Maybe the OS default EOL would be an OK default,
if we want it to be the default delimiter, that is.

I have no strong opinion here, except automagical newline detection seems
like a bad idea.

I'm fine with OS default EOL as the default for the delimiter,
or no delimiter as the default.

New patch attached.

Changes:

* Change run-time check to assert in CopyOneRowTo, since checked by caller already.

/Joel

Attachment Content-Type Size
v12-0001-Refactor-ProcessCopyOptions-introduce-CopyFormat-enu.patch application/octet-stream 31.3 KB
v12-0002-Add-raw-format-to-COPY-command.patch application/octet-stream 43.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Junwang Zhao 2024-10-19 15:41:17 Re: Using read_stream in index vacuum
Previous Message Dilip Kumar 2024-10-19 11:55:58 Re: Make default subscription streaming option as Parallel