Re: Should CSV parsing be stricter about mid-field quotes?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Noah Misch" <noah(at)leadboat(dot)com>
Cc: "Daniel Verite" <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2024-10-08 07:25:01
Message-ID: a1f15807-23ee-4d8b-9ab9-4ec6870cdb57@app.fastmail.com
Lists: pgsql-hackers

On Sun, Oct 6, 2024, at 15:12, Andrew Dunstan wrote:
> On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:
>> 2. Avoid needing hacks like using E'\x01' as quoting char.
>>
>> Introduce QUOTE NONE and DELIMITER NONE,
>> to allow raw lines to be imported "as is" into a single text column.
>
> As I think I previously indicated, I'm perfectly happy about 2, because
> it replaces a far from obvious hack, but I am at best dubious about 1.

I've looked at how to implement this, and there is quite a lot of complexity
having to do with quoting and escaping.

I need guidance on which approach you think would be best:

2a) Should we aim to support all NONE combinations, at the cost of increased
complexity in all code dealing with quoting, escaping and delimiters?

2b) Should we aim to only support the QUOTE NONE DELIMITER NONE ESCAPE NONE case,
useful to the real-life scenario we've identified, that is, importing raw log
lines into a single column, which could then be handled by a much simpler and
probably faster version of CopyReadAttributesCSV(),
e.g. named CopyReadAttributesUnquotedUnDelimited() or
maybe CopyReadAttributesRaw()?
(We would also need to modify CopyReadLineText(), but it seems we only need a
quote_none bool to skip over the quoting code there, so I don't think a
separate function is warranted there.)
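
To illustrate why the 2b path could be so much simpler, here is a minimal standalone sketch of a raw-mode attribute reader. It is not PostgreSQL code: RawField and read_attributes_raw() are hypothetical names, and the real function would presumably work on the COPY state rather than on a bare buffer. With no quote, delimiter, or escape to look for, the whole line is exactly one field and no per-byte scan is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a parsed field; not a PostgreSQL type. */
typedef struct RawField
{
    const char *data;   /* points into the caller's line buffer */
    size_t      len;    /* number of bytes in the field */
} RawField;

/*
 * Hypothetical raw-mode attribute reader: there is no quote, delimiter,
 * or escape handling, so the whole line becomes a single field.
 * Returns the number of fields found, which is always 1.
 */
static int
read_attributes_raw(const char *line, size_t line_len, RawField *field)
{
    assert(line != NULL && field != NULL);

    field->data = line;
    field->len = line_len;
    return 1;
}
```

Characters that CSV mode would treat specially (commas, double quotes, backslashes) pass through untouched, which is the behavior wanted for raw log lines.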

I think ESCAPE NONE should be implied from QUOTE NONE, since the default escape
character is the same as the quote character, so if there isn't any
quote character, then I think that would imply no escape character either.

Can we think of any other valid, useful, realistic, and safe combinations of
QUOTE NONE, DELIMITER NONE and ESCAPE NONE, that would be interesting
to support?

If not, then I think 2b looks more interesting: it reduces the risk of
accidental misuse, is simpler to implement, and should also allow importing
raw log files faster, thanks to the reduced complexity.

Best regards,

Joel
