Re: Should CSV parsing be stricter about mid-field quotes?

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Joel Jacobson <joel(at)compiler(dot)org>, Noah Misch <noah(at)leadboat(dot)com>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2024-10-09 12:45:14
Message-ID: 7007bd82-8960-401b-bb21-efdfa2518065@dunslane.net
Lists: pgsql-hackers


On 2024-10-09 We 8:00 AM, Andrew Dunstan wrote:
>
> On 2024-10-08 Tu 3:25 AM, Joel Jacobson wrote:
>> On Sun, Oct 6, 2024, at 15:12, Andrew Dunstan wrote:
>>> On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:
>>>> 2. Avoid needing hacks like using E'\x01' as quoting char.
>>>>
>>>> Introduce QUOTE NONE and DELIMITER NONE,
>>>> to allow raw lines to be imported "as is" into a single text column.
>>> As I think I previously indicated, I'm perfectly happy about 2, because
>>> it replaces a far from obvious hack, but I am at best dubious about 1.
>> I've looked at how to implement this, and there is quite a lot of
>> complexity having to do with quoting and escaping.
>>
>> Need guidance on what you think would be best to do:
>>
>> 2a) Should we aim to support all NONE combinations, at the cost of
>> increasing the complexity of all code having to do with quoting,
>> escaping and delimiters?
>>
>> 2b) Should we aim to only support the QUOTE NONE DELIMITER NONE
>> ESCAPE NONE case, useful to the real-life scenario we've identified,
>> that is, importing raw log lines into a single column, which could
>> then be handled by a much simpler and probably faster version of
>> CopyReadAttributesCSV(), e.g. named
>> CopyReadAttributesUnquotedUnDelimited() or maybe
>> CopyReadAttributesRaw()?
>> (We also need to modify CopyReadLineText(), but seems we only need a
>> quote_none bool, to skip over the quoting code there, so don't think a
>> separate function is warranted there.)
>>
>> I think ESCAPE NONE should be implied from QUOTE NONE, since the
>> default escape character is the same as the quote character, so if
>> there isn't any quote character, then I think that would imply no
>> escape character either.
>>
>> Can we think of any other valid, useful, realistic, and safe
>> combinations of QUOTE NONE, DELIMITER NONE and ESCAPE NONE, that
>> would be interesting to support?
>>
>> If not, then I think 2b looks more interesting, to reduce risk of
>> accidental misuse, simpler implementation, and since it also should
>> allow importing raw log files faster, thanks to the reduced complexity.
>>
>
>
> Offhand I can't think of a case other than 2b that would apply in the
> real world, although others might like to chime in here. If we're
> going to do that, let's find a shorter way to spell it. In fact, we
> should do that even if we go with 2a.
>
>
>

At the very least you should not need to say ESCAPE NONE, since the
default is to have ESCAPE the same as QUOTE, so QUOTE NONE should imply
ESCAPE NONE.
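
To make that defaulting rule concrete, here is a rough sketch of how the escape setting could be derived from the quote setting. This is illustration only: the CsvOpts struct and resolve_escape() are made-up names, not the actual COPY option-handling code, and treating an explicit ESCAPE as overriding QUOTE NONE is an assumption the thread has not settled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option struct; field names are illustrative only. */
typedef struct CsvOpts
{
    bool has_quote;     /* false stands for QUOTE NONE */
    char quote;         /* meaningful only when has_quote is true */
    bool escape_given;  /* user wrote an explicit ESCAPE 'c' */
    bool has_escape;    /* false stands for ESCAPE NONE */
    char escape;        /* meaningful only when has_escape is true */
} CsvOpts;

/*
 * Fill in the escape setting when the user did not spell it out.
 * The escape character defaults to the quote character, so with no
 * quote character there is no escape character either.
 */
static void
resolve_escape(CsvOpts *o)
{
    assert(o != NULL);

    if (o->escape_given)
    {
        /* Assumption: an explicit ESCAPE is kept as the user gave it. */
        o->has_escape = true;
        return;
    }

    o->has_escape = o->has_quote;
    if (o->has_quote)
        o->escape = o->quote;
}
```

With this shape, a caller that sets has_quote to false and leaves escape_given false ends up with has_escape false, so the user never has to type ESCAPE NONE.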

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
