Re: Should CSV parsing be stricter about mid-field quotes?

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Joel Jacobson <joel(at)compiler(dot)org>, Noah Misch <noah(at)leadboat(dot)com>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2024-10-09 12:00:53
Message-ID: 67dc3a37-8853-46bd-883e-df8a8c934368@dunslane.net
Lists: pgsql-hackers


On 2024-10-08 Tu 3:25 AM, Joel Jacobson wrote:
> On Sun, Oct 6, 2024, at 15:12, Andrew Dunstan wrote:
>> On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:
>>> 2. Avoid needing hacks like using E'\x01' as quoting char.
>>>
>>> Introduce QUOTE NONE and DELIMITER NONE,
>>> to allow raw lines to be imported "as is" into a single text column.
>> As I think I previously indicated, I'm perfectly happy about 2, because
>> it replaces a far from obvious hack, but I am at best dubious about 1.
> I've looked at how to implement this, and there is quite a lot of complexity
> having to do with quoting and escaping.
>
> Need guidance on what you think would be best to do:
>
> 2a) Should we aim to support all NONE combinations, at the cost of increasing the
> complexity of all code having to do with quoting, escaping and delimiters?
>
> 2b) Should we aim to only support the QUOTE NONE DELIMITER NONE ESCAPE NONE case,
> useful to the real-life scenario we've identified, that is, importing raw log
> lines into a single column, which could then be handled by a much simpler and
> probably faster version of CopyReadAttributesCSV(),
> e.g. named CopyReadAttributesUnquotedUnDelimited() or
> maybe CopyReadAttributesRaw()?
> (We also need to modify CopyReadLineText(), but seems we only need a
> quote_none bool, to skip over the quoting code there, so don't think a
> separate function is warranted there.)
>
> I think ESCAPE NONE should be implied from QUOTE NONE, since the default escape
> character is the same as the quote character, so if there isn't any
> quote character, then I think that would imply no escape character either.
>
> Can we think of any other valid, useful, realistic, and safe combinations of
> QUOTE NONE, DELIMITER NONE and ESCAPE NONE, that would be interesting
> to support?
>
> If not, then I think 2b looks more interesting, to reduce risk of accidental
> misuse, simpler implementation, and since it also should allow importing
> raw log files faster, thanks to the reduced complexity.
>

Offhand I can't think of a case other than 2b that would apply in the
real world, although others might like to chime in here. If we're going
to do that, let's find a shorter way to spell it. In fact, we should do
that even if we go with 2a.
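
For anyone following along, here is a rough illustration of what the 2b case
amounts to. This is a Python sketch, not PostgreSQL code: read_attributes_csv
and read_attributes_raw are hypothetical names that only loosely mirror
CopyReadAttributesCSV() and the proposed raw variant, and Python's csv module
does not follow exactly the same quoting rules as COPY.

```python
import csv
import io


def read_attributes_csv(line: str) -> list[str]:
    """Split one line using ordinary CSV rules (comma delimiter, '"' quoting)."""
    return next(csv.reader(io.StringIO(line)))


def read_attributes_raw(line: str) -> list[str]:
    """Hypothetical raw mode: no quote, delimiter or escape processing.

    The whole line, minus its line terminator, becomes a single field.
    """
    return [line.rstrip("\r\n")]


# A made-up log line containing both commas and double quotes.
log_line = '2024-10-09 12:00:53 UTC, user="joel", msg=said "hi", bye\n'

csv_fields = read_attributes_csv(log_line)
raw_fields = read_attributes_raw(log_line)

print(len(csv_fields), "fields with CSV rules")
print(len(raw_fields), "field in raw mode:", raw_fields[0])
```

With CSV rules the line is split at every comma, so it cannot land in a single
text column without picking quote and delimiter characters that never occur in
the data. The raw reader has no such characters to pick, which is why it can be
both simpler and harder to misuse.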

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
