Re: Fixing backslash dot for COPY FROM...CSV

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>,pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fixing backslash dot for COPY FROM...CSV
Date: 2024-04-06 22:00:25
Message-ID: 1fba50b1-604c-44f9-b6a6-a3a81e8d0bb8@manitou-mail.org
Lists: pgsql-hackers

Tom Lane wrote:

> This is sufficiently weird that I'm starting to come around to
> Daniel's original proposal that we just drop the server's recognition
> of \. altogether (which would allow removal of some dozens of lines of
> complicated and now known-buggy code)

FWIW my plan was to not change anything in the TEXT mode,
but I wasn't aware it had this issue that you found when
\. is not on a line by itself.

> Alternatively, we could fix it so that \. at the end of a line draws
> "end-of-copy marker corrupt"
> which would at least make things consistent, but I'm not sure that has
> any great advantage. I surely don't want to document the current
> behavioral details as being the right thing that we're going to keep
> doing.

Agreed we don't want to document that, but also why doesn't \. in the
contents represent just a dot (as opposed to being an error),
just as \a represents a?

I mean if eofdata contains

foobar\a
foobaz\aother

then we get after import:
f1
--------------
foobara
foobazaother
(2 rows)

Reading the current doc on the text format, I can't see why
importing:

foobar\.
foobar\.other

is not supposed to produce
f1
--------------
foobar.
foobar.other
(2 rows)

I see these rules in [1] about backslash:

#1.
"End of data can be represented by a single line containing just
backslash-period (\.)."

foobar\. and foobar\.other do not match that so #1 does not describe
how they're interpreted.

#2.
"Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column
delimiters."

Dot is neither a row nor a column delimiter (it's forbidden as a
delimiter anyway), so #2 does not apply.

#3.
"In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself,
newline, carriage return, and the current delimiter character"

Dot is not in that list so #3 does not apply.

#4.
"The following special backslash sequences are recognized by COPY
FROM:" (followed by the table with \b, \f, ...)

Dot is not mentioned.

#5.
"Any other backslashed character that is not mentioned in the above
table will be taken to represent itself"

Here we say that backslash dot represents a dot (unless other
rules apply):

foobar\. => foobar.
foobar\.other => foobar.other

#6.
"However, beware of adding backslashes unnecessarily, since that
might accidentally produce a string matching the end-of-data marker
(\.) or the null string (\N by default)."

So we *recommend* not using \., but as I understand it, the warning
about the EOD marker is about accidentally creating a line that matches #1,
that is, \. alone on a line.

#7.
"These strings will be recognized before any other backslash
processing is done."

TBH I don't understand what #7 implies. The order in backslash
processing looks like an implementation detail that should not
matter in understanding the format?

Considering this, it seems to me that #5 says that
backslash-dot represents a dot unless #1 applies, and the
other rules (#2, #3, #4, #6, #7) do not state anything that would
contradict that.
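
To make that reading concrete, here is a minimal Python sketch of the
text-format rules as I interpret them from the doc. It is only a model of
the documented rules, not of what the server currently does. The names
unescape_field and read_rows are made up for illustration, and it
simplifies a lot: a single column, no delimiter handling, no \N null
string, and no octal or hex sequences.

```python
# Hypothetical model of the documented text-format rules (#1, #4, #5).
# Assumptions: single column, no \N handling, no \digits or \xdigits.

# Rule #4: special backslash sequences from the table in the doc.
SPECIAL = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}


def unescape_field(raw):
    """Decode backslash sequences in one field value."""
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Rule #5: any character not in the table represents itself,
            # so "\." becomes "." and "\a" becomes "a".
            out.append(SPECIAL.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def read_rows(lines):
    """Decode lines until a line consisting of just backslash-period."""
    rows = []
    for line in lines:
        # Rule #1: end of data only when \. is alone on the line.
        if line == "\\.":
            break
        rows.append(unescape_field(line))
    return rows


print(read_rows(["foobar\\a", "foobaz\\aother"]))
print(read_rows(["foobar\\.", "foobar\\.other", "\\.", "not imported"]))
```

Under these assumptions the first call yields foobara and foobazaother,
and the second yields foobar. and foobar.other, stopping at the line that
contains only \. as rule #1 describes.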

[1] https://www.postgresql.org/docs/current/sql-copy.html

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
