From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Joel Jacobson <joel(at)compiler(dot)org>, Noah Misch <noah(at)leadboat(dot)com> |
Cc: | Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Should CSV parsing be stricter about mid-field quotes? |
Date: | 2024-10-09 12:00:53 |
Message-ID: | 67dc3a37-8853-46bd-883e-df8a8c934368@dunslane.net |
Lists: | pgsql-hackers |
On 2024-10-08 Tu 3:25 AM, Joel Jacobson wrote:
> On Sun, Oct 6, 2024, at 15:12, Andrew Dunstan wrote:
>> On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:
>>> 2. Avoid needing hacks like using E'\x01' as quoting char.
>>>
>>> Introduce QUOTE NONE and DELIMITER NONE,
>>> to allow raw lines to be imported "as is" into a single text column.
>> As I think I previously indicated, I'm perfectly happy about 2, because
>> it replaces a far from obvious hack, but I am at best dubious about 1.
> I've looked at how to implement this, and there is quite a lot of complexity
> having to do with quoting and escaping.
>
> Need guidance on what you think would be best to do:
>
> 2a) Should we aim to support all NONE combinations, at the cost of increasing the
> complexity at all code having to do with quoting, escaping and delimiters?
>
> 2b) Should we aim to only support the QUOTE NONE DELIMITER NONE ESCAPE NONE case,
> useful to the real-life scenario we've identified, that is, importing raw log
> lines into a single column, which could then be handled by a much simpler and
> probably faster version of CopyReadAttributesCSV(),
> e.g. named CopyReadAttributesUnquotedUnDelimited() or
> maybe CopyReadAttributesRaw()?
> (We also need to modify CopyReadLineText(), but seems we only need a
> quote_none bool, to skip over the quoting code there, so don't think a
> separate function is warranted there.)
>
> I think ESCAPE NONE should be implied from QUOTE NONE, since the default escape
> character is the same as the quote character, so if there isn't any
> quote character, then I think that would imply no escape character either.
>
> Can we think of any other valid, useful, realistic, and safe combinations of
> QUOTE NONE, DELIMITER NONE and ESCAPE NONE, that would be interesting
> to support?
>
> If not, then I think 2b looks more interesting, to reduce risk of accidental
> misuse, simpler implementation, and since it also should allow importing
> raw log files faster, thanks to the reduced complexity.
>
Offhand I can't think of a case other than 2b that would apply in the
real world, although others might like to chime in here. If we're going
to do that, let's find a shorter way to spell it. In fact, we should do
that even if we go with 2a.
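
For readers following the thread, the 2b case (QUOTE NONE, DELIMITER NONE,
ESCAPE NONE) can be illustrated with a small Python sketch. This is not
PostgreSQL code and not the proposed patch: read_raw_lines() is a hypothetical
stand-in for the "raw" behaviour under discussion, and Python's csv module is
used only as a generic CSV parser for contrast.

```python
import csv
import io


def read_raw_lines(text):
    """Hypothetical 'raw' mode: each input line becomes exactly one field.

    No quote, delimiter, or escape processing is applied at all.
    """
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline does not start a new row
    return [[line] for line in lines]


def read_csv_lines(text):
    """Ordinary CSV parsing, shown only to contrast with the raw mode."""
    return list(csv.reader(io.StringIO(text)))


# A log-like sample containing both a comma and double quotes.
sample = 'GET /a,b "quoted" path\nplain line\n'

raw_rows = read_raw_lines(sample)
csv_rows = read_csv_lines(sample)

print(raw_rows)
print(csv_rows)
```

In the raw variant the first line stays a single field, while the CSV parser
splits it at the comma, which is why log lines need either the quoting hack or
a dedicated mode to land in one text column.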
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com