Re: Should CSV parsing be stricter about mid-field quotes?

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Joel Jacobson" <joel(at)compiler(dot)org>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>,pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2023-05-18 16:48:39
Message-ID: 5094828d-8f56-47de-a4eb-7dfbbca783c8@manitou-mail.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Joel Jacobson wrote:

> I've been using that trick myself many times in the past, but thanks to this
> deep-dive into this topic, it looks to me like TEXT would be a better format
> fit when dealing with unquoted TSV files, or?
>
> OTOH, one would then need to inspect the TSV file doesn't contain \. on an
> empty line...

Note that this is the case for valid CSV contents, since backslash-dot
on a line by itself is both an end-of-data marker for COPY FROM and a
valid CSV line.
Having this line in the data results in either an error or having the
rest of the data silently discarded, depending on the context. There
is some previous discussion about this in [1].
Since the TEXT format doesn't have this kind of problem, one solution
is to filter the data through PROGRAM with an [untrusted CSV]->TEXT
filter. This is to be preferred over direct CSV loading when
strictness or robustness are more important than convenience.

[1]
https://www.postgresql.org/message-id/10e3eff6-eb04-4b3f-aeb4-b920192b977a@manitou-mail.org

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2023-05-18 17:55:49 Re: Order changes in PG16 since ICU introduction
Previous Message Tom Lane 2023-05-18 16:32:59 Re: Assert failure of the cross-check for nullingrels