Re: New "raw" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Jacob Champion" <jacob(dot)champion(at)enterprisedb(dot)com>
Cc: "Tatsuo Ishii" <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-15 20:38:06
Message-ID: d3969244-c4b0-40e1-b178-781345c6008b@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 15, 2024, at 19:30, Jacob Champion wrote:
> Hi,
>
> Idle thoughts from a design perspective -- feel free to ignore, since
> I'm not the target audience for the feature:

Many thanks for looking at this!

> - If the column data stored in Postgres contains newlines, it seems
> like COPY TO won't work "correctly". Is that acceptable?

That's an interesting edge-case to think about.

Rejecting such column data, to ensure all data that can be COPY TO'd,
can be loaded back using COPY FROM, preserving the data intact,
seems attractive because it protects users against unintentional mistakes,
and all the other COPY formats are able to preserve the data unchanged,
when doing a COPY FROM of a file created with COPY TO.

OTOH, if we think of COPY TO in this case as a way of `cat`-ing
text values, it might be more pragmatic to allow it.
With `cat`-ing I mean like if in Unix doing...
cat file1.txt file2.txt > file3.txt
...there is no way to reverse that operation,
that is, to reconstruct file1.txt and file2.txt from file3.txt.

However, I thinking rejecting such column data seems like the
better alternative, to ensure data exported with COPY TO
can always be imported back using COPY FROM,
for the same format. If text column data contains newlines,
users probably ought to be using the text or csv format instead.

> - RAW seems like an okay-ish label, but for something that's doing as
> much magic end-of-line detection as this patch is, I'd personally
> prefer SINGLE (as in, "single column").

It's actually the same end-of-line detection as the text format
in copyfromparse.c's CopyReadLineText(), except the code
is simpler thanks to not having to deal with quotes or escapes.

It basically just learns the newline sequence based on the first
occurrence, and then require it to be the same throughout the file.

The same data files can be tested with the text format,
since they don't contain any escape, quote or delimiter characters.

Different newline styles detected automatically also by format text:

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_lr.data' (FORMAT text);
COPY 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_cr.data' (FORMAT text);
COPY 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_cr_lr.data' (FORMAT text);
COPY 2

The mixed newline style causes errors also for the text format:

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_1.data' (FORMAT text);
ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
CONTEXT: COPY t, line 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_2.data' (FORMAT text);
ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
CONTEXT: COPY t, line 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_3.data' (FORMAT text);
ERROR: literal carriage return found in data
HINT: Use "\r" to represent carriage return.
CONTEXT: COPY t, line 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_4.data' (FORMAT text);
ERROR: literal carriage return found in data
HINT: Use "\r" to represent carriage return.
CONTEXT: COPY t, line 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_5.data' (FORMAT text);
ERROR: literal carriage return found in data
HINT: Use "\r" to represent carriage return.
CONTEXT: COPY t, line 2

I must confess, I didn't know about this newline detection before reading the
copyfromparse.c source code.

> - Speaking of magic end-of-line detection, can there be a way to turn
> that off? Say, via DELIMITER?
> - Generic DELIMITER support, for any single-byte separator at all,
> might make a "single-column" format more generally applicable. But I
> might be over-architecting. And it would make the COPY TO issue even
> worse...

That's an interesting idea that would provide more flexibility,
though, at the cost of complicating things by overloading the meaning
of DELIMITER.

If aiming to make this more generally applicable,
then at least DELIMITER would need to be multi-byte,
since otherwise the Windows case \r\n couldn't be specified.

But I feel COPY already has quite a lot of options, and I fear it's
quite complicated for users as it is.

What I found appealing with the idea of a new COPY format,
was that instead of overloading the existing options
with more complexity, a new format wouldn't need to affect
the existing options, and the new format could be explained
separately, without making things worse for users not
using this format.

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2024-10-15 20:48:08 Re: Remove deprecated -H option from oid2name
Previous Message Bruce Momjian 2024-10-15 20:37:05 Re: Doc: typo in config.sgml