From: | "Daniel Verite" <daniel(at)manitou-mail(dot)org> |
---|---|
To: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "Robert Haas" <robertmhaas(at)gmail(dot)com>,pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Fixing backslash dot for COPY FROM...CSV |
Date: | 2024-04-06 22:00:25 |
Message-ID: | 1fba50b1-604c-44f9-b6a6-a3a81e8d0bb8@manitou-mail.org |
Lists: | pgsql-hackers |
Tom Lane wrote:
> This is sufficiently weird that I'm starting to come around to
> Daniel's original proposal that we just drop the server's recognition
> of \. altogether (which would allow removal of some dozens of lines of
> complicated and now known-buggy code)
FWIW my plan was to not change anything in the TEXT mode,
but I wasn't aware it had this issue that you found when
\. is not in a line by itself.
> Alternatively, we could fix it so that \. at the end of a line draws
> "end-of-copy marker corrupt"
> which would at least make things consistent, but I'm not sure that has
> any great advantage. I surely don't want to document the current
> behavioral details as being the right thing that we're going to keep
> doing.
Agreed we don't want to document that, but also why doesn't \. in the
contents represent just a dot (as opposed to being an error),
just like \a is a?
I mean if eofdata contains
foobar\a
foobaz\aother
then we get after import:
f1
--------------
foobara
foobazaother
(2 rows)
Reading the current doc on the text format, I can't see why
importing:
foobar\.
foobar\.other
is not supposed to produce
f1
--------------
foobar.
foobar.other
(2 rows)
I see these rules in [1] about backslash:
#1.
"End of data can be represented by a single line containing just
backslash-period (\.)."
foobar\. and foobar\.other do not match that so #1 does not describe
how they're interpreted.
#2.
"Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column
delimiters."
Dot is not a column delimiter (it's forbidden anyway), so #2 does
not apply.
#3.
"In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself,
newline, carriage return, and the current delimiter character"
Dot is not in that list so #3 does not apply.
#4.
"The following special backslash sequences are recognized by COPY
FROM:" (followed by the table with \b \f, ...,)
Dot is not mentioned.
#5.
"Any other backslashed character that is not mentioned in the above
table will be taken to represent itself"
Here we say that backslash dot represents a dot (unless other
rules apply):
foobar\. => foobar.
foobar\.other => foobar.other
#6.
"However, beware of adding backslashes unnecessarily, since that
might accidentally produce a string matching the end-of-data marker
(\.) or the null string (\N by default)."
So we *recommend* not to use \. but as I understand it, the warning
with the EOD marker is about accidentally creating a line that matches #1,
that is, \. alone on a line.
#7.
"These strings will be recognized before any other backslash
processing is done."
TBH I don't understand what #7 implies. The order in backslash
processing looks like an implementation detail that should not
matter in understanding the format?
Considering this, it seems to me that #5 says that
backslash-dot represents a dot unless #1 applies, and the
other rules (#2, #3, #4, #6, #7) do not state anything that would
contradict that.
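To make that reading concrete, here is a minimal Python sketch of how I'd expect the text format to behave under rules #1, #4 and #5. It is only an illustration of the interpretation argued above, not the server's actual parser: the names decode_field and read_copy_text are hypothetical, it assumes single-column input, and it leaves out the octal/hex sequences and the \N null string.

```python
# Hypothetical sketch of the reading argued above; NOT PostgreSQL's real parser.
# Assumptions: one column per line, no octal/hex escapes, no \N handling.

# Rule #4 (subset): special backslash sequences listed in the documentation.
SPECIAL = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}


def decode_field(raw):
    """Decode backslash sequences in one field."""
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Rule #5: any other backslashed character represents itself,
            # so "\." becomes "." and "\a" becomes "a".
            out.append(SPECIAL.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def read_copy_text(lines):
    """Return decoded rows, stopping at the end-of-data marker."""
    rows = []
    for line in lines:
        # Rule #1: only a line containing just backslash-period ends the data.
        if line == "\\.":
            break
        rows.append(decode_field(line))
    return rows


if __name__ == "__main__":
    print(read_copy_text(["foobar\\.", "foobar\\.other"]))
    print(read_copy_text(["foobar\\a", "foobaz\\aother"]))
```

With this reading, \. is treated as end-of-data only when it is alone on its line; anywhere else it goes through the same "represents itself" path as \a.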
[1] https://www.postgresql.org/docs/current/sql-copy.html
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite