Re: New "raw" COPY format

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-19 10:13:50
Message-ID: CACJufxG7gE4XGCQJ9UG0ki2YnyWFUhJ_QoWm0TE42QZZVyz8Hg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Oct 19, 2024 at 1:24 AM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>>
> Handling of e.g. JSON and other structured text files that could contain
> newlines, in a seamless way seems important, so therefore the default is
> no delimiter for the raw format, so that the entire input is read as one data
> value for COPY FROM, and all column data is concatenated without delimiter
> for COPY TO.
>
> When specifying a delimiter for the raw format, it separates *rows*, and can be
> a multi-byte string, such as E'\r\n' to handle Windows text files.
>
> This has been documented under the DELIMITER option, as well as under the
> Raw Format section.
>
We already make RAW and can only have one column.
if RAW has no default delimiter, then COPY FROM a text file will
become one datum value;
which makes it looks like importing a Large Object.
(https://www.postgresql.org/docs/17/lo-funcs.html)

i think, most of the time, you have more than one row/value to import
and export?

> The refactoring is now in a separate first single commit, which seems
> necessary, to separate the new functionality, from the refactoring.
I agree.

ProcessCopyOptions
/* Extract options from the statement node tree */
foreach(option, options)
{
}
/* --- DELIMITER option --- */
/* --- NULL option --- */
/* --- QUOTE option --- */
Currently the regress test passed, i think that means your refactor is fine.

in ProcessCopyOptions, maybe we can rearrange the code after the
foreach loop (foreach(option, options)
based on the parameters order in
https://www.postgresql.org/docs/devel/sql-copy.html Parameters section.
so we can review it by comparing the refactoring with the
sql-copy.html Parameters section's description.

>
> > We already did column length checking at BeginCopyTo.
> > no need to "if (list_length(cstate->attnumlist) != 1)" error check in
> > CopyOneRowTo?
>
> Hmm, not sure really, since DoCopy() calls both BeginCopyTo()
> and DoCopyTo() which in turn calls CopyOneRowTo(),
> but CopyOneRowTo() is also being called from copy_dest_receive().
>

BeginCopyTo do the preparation work.
cstate->attnumlist = CopyGetAttnums(tupDesc, cstate->rel, attnamelist);

After CopyGetAttnums, the number of attributes for COPY TO cannot be changed.
right after CopyGetAttnums call then check the length of cstate->attnumlist
seems fine for me.
I think in CopyOneRowTo, we can actually
Assert(list_length(cstate->attnumlist) == 1).
for raw format.

src10=# drop table if exists x;
create table x(a int);
COPY x from stdin (FORMAT raw);
DROP TABLE
CREATE TABLE
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 11
>> 12
>> \.
ERROR: invalid input syntax for type integer: "11
12
"
CONTEXT: COPY x, line 1, column a: "11
12
"

The above case means COPY FROM STDIN (FORMAT RAW) can only import one
single value (when successful).
user need to specify like:

COPY x from stdin (FORMAT raw, delimiter E'\n');

seems raw format default no delimiter is not user friendly.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2024-10-19 11:02:32 Re: Wrong security context for deferred triggers?
Previous Message Andrey M. Borodin 2024-10-19 08:41:50 Using read_stream in index vacuum