From: | Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com> |
---|---|
To: | Joel Jacobson <joel(at)compiler(dot)org> |
Cc: | Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: New "raw" COPY format |
Date: | 2024-10-16 16:04:50 |
Message-ID: | CAOYmi+=4trybU1sUOTxfZE43eWcQTq=-RLMaUSgeeX2404GiUQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Oct 15, 2024 at 1:38 PM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> However, I thinking rejecting such column data seems like the
> better alternative, to ensure data exported with COPY TO
> can always be imported back using COPY FROM,
> for the same format. If text column data contains newlines,
> users probably ought to be using the text or csv format instead.
Yeah. I think _someone's_ going to have strong opinions one way or the
other, but that person is not me. And I assume a contents check during
COPY TO is going to have a noticeable performance impact...
> > - RAW seems like an okay-ish label, but for something that's doing as
> > much magic end-of-line detection as this patch is, I'd personally
> > prefer SINGLE (as in, "single column").
>
> It's actually the same end-of-line detection as the text format
> in copyfromparse.c's CopyReadLineText(), except the code
> is simpler thanks to not having to deal with quotes or escapes.
Right, sorry, I hadn't meant to imply that you made it up. :D Just
that a "raw" format that is actually automagically detecting things
doesn't seem very "raw" to me, so I prefer the other name.
> It basically just learns the newline sequence based on the first
> occurrence, and then require it to be the same throughout the file.
A hypothetical type whose text representation can contain '\r' but not
'\n' still can't be unambiguously round-tripped under this scheme:
COPY FROM will see the "mixed" line endings and complain, even though
there's no ambiguity.
Maybe no one will run into that problem in practice? But if they did,
I think that'd be a pretty frustrating limitation. It'd be nice to
override the behavior, to change it from "do what you think I mean" to
"do what I say".
> > - Speaking of magic end-of-line detection, can there be a way to turn
> > that off? Say, via DELIMITER?
> > - Generic DELIMITER support, for any single-byte separator at all,
> > might make a "single-column" format more generally applicable. But I
> > might be over-architecting. And it would make the COPY TO issue even
> > worse...
>
> That's an interesting idea that would provide more flexibility,
> though, at the cost of complicating things by overloading the meaning
> of DELIMITER.
I think that'd be a docs issue rather than a conceptual one, though...
it's still a delimiter. I wouldn't really expect end-user confusion.
> If aiming to make this more generally applicable,
> then at least DELIMITER would need to be multi-byte,
> since otherwise the Windows case \r\n couldn't be specified.
True.
> What I found appealing with the idea of a new COPY format,
> was that instead of overloading the existing options
> with more complexity, a new format wouldn't need to affect
> the existing options, and the new format could be explained
> separately, without making things worse for users not
> using this format.
I agree that we should not touch the existing formats. If
RAW/SINGLE/whatever needed a multibyte line delimiter, I'm not
proposing that the other formats should change.
--Jacob
From | Date | Subject | |
---|---|---|---|
Next Message | Nathan Bossart | 2024-10-16 16:12:53 | Re: Misleading error "permission denied for table" |
Previous Message | Nathan Bossart | 2024-10-16 15:54:45 | Re: Large expressions in indexes can't be stored (non-TOASTable) |