Re: New "raw" COPY format

From: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-16 16:04:50
Message-ID: CAOYmi+=4trybU1sUOTxfZE43eWcQTq=-RLMaUSgeeX2404GiUQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 15, 2024 at 1:38 PM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> However, I think rejecting such column data seems like the
> better alternative, to ensure data exported with COPY TO
> can always be imported back using COPY FROM,
> for the same format. If text column data contains newlines,
> users probably ought to be using the text or csv format instead.

Yeah. I think _someone's_ going to have strong opinions one way or the
other, but that person is not me. And I assume a contents check during
COPY TO is going to have a noticeable performance impact...

> > - RAW seems like an okay-ish label, but for something that's doing as
> > much magic end-of-line detection as this patch is, I'd personally
> > prefer SINGLE (as in, "single column").
>
> It's actually the same end-of-line detection as the text format
> in copyfromparse.c's CopyReadLineText(), except the code
> is simpler thanks to not having to deal with quotes or escapes.

Right, sorry, I hadn't meant to imply that you made it up. :D Just
that a "raw" format that is actually automagically detecting things
doesn't seem very "raw" to me, so I prefer the other name.

> It basically just learns the newline sequence based on the first
> occurrence, and then requires it to be the same throughout the file.

A hypothetical type whose text representation can contain '\r' but not
'\n' still can't be unambiguously round-tripped under this scheme:
COPY FROM will see the "mixed" line endings and complain, even though
there's no ambiguity.
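
To make the failure mode concrete, here is a minimal Python sketch of
"learn the newline sequence from the first occurrence, then require it
throughout". It is a simplified illustration only, not the code in
copyfromparse.c; the function name and the error text are hypothetical.

```python
def split_lines_autodetect(data: bytes) -> list[bytes]:
    """Split data into rows, learning the EOL style from its first occurrence.

    Any later line ending of a different style is rejected as "mixed".
    """
    eol = None          # learned line ending: b"\n", b"\r", or b"\r\n"
    rows = []
    start = 0
    i = 0
    n = len(data)
    while i < n:
        ch = data[i:i + 1]
        if ch == b"\r":
            # A CR immediately followed by LF counts as one CRLF ending.
            found = b"\r\n" if data[i + 1:i + 2] == b"\n" else b"\r"
        elif ch == b"\n":
            found = b"\n"
        else:
            i += 1
            continue
        if eol is None:
            eol = found                      # first occurrence decides
        elif found != eol:
            raise ValueError("mixed line endings")  # hypothetical message
        rows.append(data[start:i])
        i += len(found)
        start = i
    if start < n:
        rows.append(data[start:])            # final row without a terminator
    return rows
```

With this sketch, a single value `a\rb` written out with a `\n`
terminator (the bytes `a\rb\n`) is rejected on the way back in: the
embedded `\r` is seen first and taken as the line ending, so the real
terminator then looks like a second, conflicting style.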

Maybe no one will run into that problem in practice? But if they did,
I think that'd be a pretty frustrating limitation. It'd be nice to
override the behavior, to change it from "do what you think I mean" to
"do what I say".

> > - Speaking of magic end-of-line detection, can there be a way to turn
> > that off? Say, via DELIMITER?
> > - Generic DELIMITER support, for any single-byte separator at all,
> > might make a "single-column" format more generally applicable. But I
> > might be over-architecting. And it would make the COPY TO issue even
> > worse...
>
> That's an interesting idea that would provide more flexibility,
> though, at the cost of complicating things by overloading the meaning
> of DELIMITER.

I think that'd be a docs issue rather than a conceptual one, though...
it's still a delimiter. I wouldn't really expect end-user confusion.

> If aiming to make this more generally applicable,
> then at least DELIMITER would need to be multi-byte,
> since otherwise the Windows case \r\n couldn't be specified.

True.
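
As a rough illustration of the "do what I say" alternative, the Python
sketch below splits on a caller-supplied delimiter of any length
instead of detecting one. The function name is hypothetical and this
is not proposed patch code; it only shows why a multi-byte delimiter
is needed for the \r\n case.

```python
def split_rows(data: bytes, delimiter: bytes) -> list[bytes]:
    """Split data into rows on an explicit, possibly multi-byte, delimiter."""
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    rows = data.split(delimiter)
    # A trailing delimiter terminates the last row rather than
    # starting an extra empty one.
    if rows and rows[-1] == b"":
        rows.pop()
    return rows
```

Here `split_rows(b"a\rb\nc\n", b"\n")` keeps the embedded `\r` as data,
and `split_rows(b"x\r\ny\r\n", b"\r\n")` handles the Windows case,
whereas a single-byte `b"\n"` delimiter would leave a stray `\r` at the
end of each row.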

> What I found appealing with the idea of a new COPY format,
> was that instead of overloading the existing options
> with more complexity, a new format wouldn't need to affect
> the existing options, and the new format could be explained
> separately, without making things worse for users not
> using this format.

I agree that we should not touch the existing formats. If
RAW/SINGLE/whatever needed a multibyte line delimiter, I'm not
proposing that the other formats should change.

--Jacob
