Re: New "raw" COPY format

From:	"Joel Jacobson" <joel(at)compiler(dot)org>
To:	"Masahiko Sawada" <sawada(dot)mshk(at)gmail(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: New "raw" COPY format
Date:	2024-11-05 03:22:13
Message-ID:	45c5f1cd-5342-4ddc-85bc-fd335035ac3b@app.fastmail.com
Lists:	pgsql-hackers

On Mon, Nov 4, 2024, at 19:34, Masahiko Sawada wrote:
> On Sat, Nov 2, 2024 at 4:08 AM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>>
>> On Fri, Nov 1, 2024, at 22:28, Masahiko Sawada wrote:
>> > As I mentioned in a separate email, if we use the OS default EOL as
>> > the default EOL in raw format, it would not be necessary to allow it
>> > to be multi-character. I think it's worth considering.
>>
>> I like the idea, but not sure I understand how it would work.
>>
>> What if a user's OS default is \n (LF) and this user wants
>> to import a Windows text file that uses \r\n (CR LF), which is a
>> multi-character EOL delimiter?
>>
>> Was your idea to make an exception for that particular EOL,
>> or to simply not support that edge case?
>
> IIUC the text and csv formats already support it. We start from the
> EOL_UNKNOWN state and guess the EOL marker while parsing the line. I
> think we can do something similar to what we do in the text and csv
> formats but we won't need to care about quotes and escapes in the raw
> format.

Ah, OK, then I see what you mean.

That's actually how the patch worked initially, but due to comments by
Jacob Champion, the magic EOL detection was removed.

I have no strong opinion; maybe it's fine, since that's how most
text editors seem to work: they detect the EOL automatically.
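
To make the behavior under discussion concrete, here is a rough sketch of
"learn the EOL from the first occurrence, then require it throughout".
It is written in Python for brevity and is not the code from the patch or
from CopyReadLineText(); the names split_lines_detect_eol and MixedEOLError
are made up for illustration, and the error handling is simplified.

```python
from typing import List, Optional


class MixedEOLError(ValueError):
    """Hypothetical error raised when the input mixes EOL styles."""


def split_lines_detect_eol(data: str) -> List[str]:
    """Split data into lines, guessing the EOL from its first occurrence.

    Starts in an "unknown" state (eol is None), learns "\n", "\r" or
    "\r\n" from the first line ending seen, and rejects any later line
    ending that differs from it.
    """
    eol: Optional[str] = None  # unknown until the first line ending
    lines: List[str] = []
    start = 0
    i = 0
    n = len(data)
    while i < n:
        ch = data[i]
        if ch in ("\r", "\n"):
            # A CR immediately followed by LF counts as one CRLF ending.
            if ch == "\r" and i + 1 < n and data[i + 1] == "\n":
                found = "\r\n"
            else:
                found = ch
            if eol is None:
                eol = found  # learn the EOL from the first occurrence
            elif found != eol:
                raise MixedEOLError(
                    f"expected {eol!r} but found {found!r} at offset {i}"
                )
            lines.append(data[start:i])
            i += len(found)
            start = i
        else:
            i += 1
    if start < n:
        # Final line without a trailing EOL.
        lines.append(data[start:])
    return lines
```

With this sketch, a file using \r\n throughout loads fine even when the
OS default is \n, while a file that switches style partway through is
rejected.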

Maybe we should then also rename the format to SINGLE, as suggested by
Jacob and Andrew, since it perhaps wouldn't be fair to call it RAW when
it does magic detection.

Below is the relevant part of the discussion earlier in this thread.

I'll await your comments on this before making any changes.

On Tue, Oct 15, 2024, at 19:30, Jacob Champion wrote:
> Hi,
>
> Idle thoughts from a design perspective -- feel free to ignore, since
> I'm not the target audience for the feature:
>
> - If the column data stored in Postgres contains newlines, it seems
> like COPY TO won't work "correctly". Is that acceptable?
> - RAW seems like an okay-ish label, but for something that's doing as
> much magic end-of-line detection as this patch is, I'd personally
> prefer SINGLE (as in, "single column").
> - Speaking of magic end-of-line detection, can there be a way to turn
> that off? Say, via DELIMITER?
> - Generic DELIMITER support, for any single-byte separator at all,
> might make a "single-column" format more generally applicable. But I
> might be over-architecting. And it would make the COPY TO issue even
> worse...
>
> Thanks,
> --Jacob

On Wed, Oct 16, 2024, at 18:04, Jacob Champion wrote:
> On Tue, Oct 15, 2024 at 1:38 PM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>> It's actually the same end-of-line detection as the text format
>> in copyfromparse.c's CopyReadLineText(), except the code
>> is simpler thanks to not having to deal with quotes or escapes.
>
> Right, sorry, I hadn't meant to imply that you made it up. :D Just
> that a "raw" format that is actually automagically detecting things
> doesn't seem very "raw" to me, so I prefer the other name.
>
>> It basically just learns the newline sequence based on the first
>> occurrence, and then requires it to be the same throughout the file.
>
> A hypothetical type whose text representation can contain '\r' but not
> '\n' still can't be unambiguously round-tripped under this scheme:
> COPY FROM will see the "mixed" line endings and complain, even though
> there's no ambiguity.
>
> Maybe no one will run into that problem in practice? But if they did,
> I think that'd be a pretty frustrating limitation. It'd be nice to
> override the behavior, to change it from "do what you think I mean" to
> "do what I say".
>
>> > - Speaking of magic end-of-line detection, can there be a way to turn
>> > that off? Say, via DELIMITER?
>> > - Generic DELIMITER support, for any single-byte separator at all,
>> > might make a "single-column" format more generally applicable. But I
>> > might be over-architecting. And it would make the COPY TO issue even
>> > worse...
>>
>> That's an interesting idea that would provide more flexibility,
>> though, at the cost of complicating things by overloading the meaning
>> of DELIMITER.
>
> I think that'd be a docs issue rather than a conceptual one, though...
> it's still a delimiter. I wouldn't really expect end-user confusion.
>
>> If aiming to make this more generally applicable,
>> then at least DELIMITER would need to be multi-byte,
>> since otherwise the Windows case \r\n couldn't be specified.
>
> True.
...
> --Jacob
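
For comparison, the "do what I say" alternative Jacob describes could look
roughly like the sketch below: the caller names the delimiter explicitly
(possibly multi-byte, so \r\n can be expressed) and nothing is guessed.
This is a Python illustration only, not a proposed implementation; the
function name split_on_delimiter is hypothetical.

```python
from typing import List


def split_on_delimiter(data: bytes, delimiter: bytes) -> List[bytes]:
    """Split data on an explicitly given, possibly multi-byte delimiter.

    No detection happens here: bytes that merely look like another EOL
    style (e.g. a bare CR when the delimiter is LF) stay in the value.
    """
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    parts = data.split(delimiter)
    # A trailing delimiter terminates the last record; it does not
    # start an additional empty one.
    if parts and parts[-1] == b"":
        parts.pop()
    return parts
```

With an explicit b"\n" delimiter, the input b"a\rb\nc\n" yields the two
values b"a\rb" and b"c", whereas detection based on the first line ending
would treat the same input as having mixed line endings.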

/Joel

In response to

Re: New "raw" COPY format at 2024-11-04 18:34:28 from Masahiko Sawada

Responses

Re: New "raw" COPY format at 2024-11-05 07:01:23 from Masahiko Sawada
