Re: Should CSV parsing be stricter about mid-field quotes?

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2023-05-13 12:44:48
Message-ID: 9f1e32aa-1267-7d8e-0472-66a04b83d2ea@dunslane.net
Lists: pgsql-hackers


On 2023-05-13 Sa 04:20, Joel Jacobson wrote:
> On Fri, May 12, 2023, at 21:57, Andrew Dunstan wrote:
>>
>> Maybe this is unexpected by you, but it's not by me. What other sane
>> interpretation of that data could there be? And what CSV producer
>> outputs such horrible content? As you've noted, ours certainly does
>> not. Our rules are clear: quotes within quotes must be escaped
>> (default escape is by doubling the quote char). Allowing partial
>> fields to be quoted was a deliberate decision when CSV parsing was
>> implemented, because examples have been seen in the wild.
>>
>> So I don't think our behaviour is broken or needs fixing. As
>> mentioned by Greg, this is an example of the adage about being
>> liberal in what you accept.
>>
>
> I understand your position, and your points are indeed in line with the
> traditional "Robustness Principle" (aka "Postel's Law") [1] from 1980,
> which suggests "be conservative in what you send, be liberal in what you
> accept." However, I'd like to offer a different perspective that might be
> worth considering.
>
> A 2021 IETF draft, "The Harmful Consequences of the Robustness
> Principle" [2], argues that the flexibility advocated by Postel's Law can
> lead to problems such as unclear specifications and a multitude of
> varying implementations. Features that initially seem helpful can
> unexpectedly turn into bugs, resulting in unanticipated consequences and
> data integrity risks.
>
> Based on the feedback from you and others, I'd like to revise my earlier
> proposal. Rather than adding an option to preserve the existing
> behavior, I now think it's better to simply report an error in such
> cases. This approach offers several benefits: it simplifies the CSV
> parser, reduces the risk of misinterpreting data due to malformed input,
> and prevents the all-too-familiar situation where users blindly apply an
> error hint without understanding the consequences.
>
> Finally, I acknowledge that we can't foresee the number of CSV producers
> that produce mid-field quoting, and this change may cause compatibility
> issues for some users. However, I consider this an acceptable tradeoff.
> Users encountering the error would receive a clear message explaining
> that mid-field quoting is not allowed and that they should change their
> CSV producer's settings to escape quotes by doubling the quote
> character. Importantly, this change guarantees that previously parsed
> data won't be misinterpreted, as it only enforces stricter parsing rules.
>
> [1] https://datatracker.ietf.org/doc/html/rfc761#section-2.10
> [2] https://www.ietf.org/archive/id/draft-iab-protocol-maintenance-05.html
>
>

I'm pretty reluctant to change something that's been working as designed
for almost 20 years, and about which we have hitherto had zero
complaints that I recall.

I could see an argument for a STRICT mode which would disallow partially
quoted fields, although I'd like some evidence that we're dealing with a
real problem here. Is there really a CSV producer that produces output
like what you showed in your example? And if so, has anyone objected to
them about the insanity of that?
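
To make the distinction concrete, here is a minimal sketch of the two
behaviours under discussion. It is not PostgreSQL's parser: the function
name split_csv_line and its strict flag are hypothetical, it handles a
single line only, and it assumes the default rule that a quote inside a
quoted section is escaped by doubling it. With strict off, a quote
anywhere in a field toggles quoting, so partially quoted fields are
accepted; with strict on, a quote that opens or closes mid-field is an
error.

```python
def split_csv_line(line, delim=",", quote='"', strict=False):
    """Split one CSV line into fields (illustrative sketch only).

    strict=False: a quote anywhere in a field toggles quoting, so
                  partially quoted fields are accepted.
    strict=True:  a quote may only open at the start of a field and must
                  close immediately before a delimiter or end of line.
    """
    fields = []
    buf = []
    in_quote = False
    field_started = False  # True once the current field has consumed input
    i = 0
    n = len(line)
    while i < n:
        c = line[i]
        if in_quote:
            if c == quote:
                # A doubled quote inside a quoted section is a literal quote.
                if i + 1 < n and line[i + 1] == quote:
                    buf.append(quote)
                    i += 2
                    continue
                in_quote = False
                if strict and i + 1 < n and line[i + 1] != delim:
                    raise ValueError("data after closing quote at offset %d" % i)
            else:
                buf.append(c)
        elif c == delim:
            fields.append("".join(buf))
            buf = []
            field_started = False
        elif c == quote:
            if strict and field_started:
                raise ValueError("mid-field quote at offset %d" % i)
            in_quote = True
            field_started = True
        else:
            buf.append(c)
            field_started = True
        i += 1
    if in_quote:
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields
```

In this sketch the line a,b"c,d"e,f splits into three fields in the
lenient mode, the middle one being bc,de, and raises ValueError in the
strict mode. A properly escaped line such as a,"b ""c"" d",e parses the
same way in both.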

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
