Re: Should CSV parsing be stricter about mid-field quotes?

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Joel Jacobson <joel(at)compiler(dot)org>, Noah Misch <noah(at)leadboat(dot)com>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2024-10-06 13:12:59
Message-ID: 3e55b039-dcd0-45b1-8b62-f9f67d6695fa@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:
> On Sun, Jul 2, 2023, at 07:45, Noah Misch wrote:
>> On Sat, May 20, 2023 at 09:16:30AM +0200, Joel Jacobson wrote:
>>> On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:
>>>> COPY FROM file CSV somewhat differs as your example shows,
>>>> but it still mishandle \. when unquoted. For instance, consider this
>>>> file to load with COPY t FROM '/tmp/t.csv' WITH CSV
>>>> $ cat /tmp/t.csv
>>>> line 1
>>>> \.
>>>> line 3
>>>> line 4
>>>>
>>>> It results in having only "line 1" being imported.
>>> Hmm, this is a problem for one of the new use-cases I brought up that would be
>>> possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log files,
>>> where each raw line should be imported "as is" into a single text column.
>>>
>>> Is there a valid reason why \. is needed for COPY FROM filename?
>> No.
>>
>>> It seems to me it would only be necessary for the COPY FROM STDIN case,
>>> since files have a natural end-of-file and a known file size.
>> Right. Even for COPY FROM STDIN, it's not needed anymore since the
>> removal of
>> protocol v2. psql would still use it to find the end of inline COPY
>> data,
>> though. Here's another relevant thread:
>> https://postgr.es/m/flat/bfcd57e4-8f23-4c3e-a5db-2571d09208e2%40beta.fastmail.com
> I was very pleased to see commit 7702337:
>
> Do not treat \. as an EOF marker in CSV mode for COPY IN.
>
> Great job!
>
> Thanks to this fix, maybe there is now interest to resume the discussion on
> the ideas discussed in this thread?
>
> Recap of ideas:
>
> 1. Stricter parsing, reject mid-field quotes
>
> The idea is to prevent balanced mid-field quotes from being silently removed.
>
> Example:
>
> % cat example.csv
> id,post
> 1,<p>Hello there!</p>
> 2,<a href="http://example.com">Click me!</a>
>
> % psql
> # \copy posts from example.csv with csv header;
> COPY 2
> # SELECT * FROM posts;
> id | post
> ----+------------------------------------------
> 1 | <p>Hello there!</p>
> 2 | <a href=http://example.com>Click me!</a>
> (2 rows)
>
> Note how the quotes around the URL disappeared.
>
> 2. Avoid needing hacks like using E'\x01' as quoting char.
>
> Introduce QUOTE NONE and DELIMITER NONE,
> to allow raw lines to be imported "as is" into a single text column.
>

As I think I previously indicated, I'm perfectly happy about 2, because
it replaces a far from obvious hack, but I am at best dubious about 1.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Guillaume Lelarge 2024-10-06 13:32:02 Re: Add parallel columns for pg_stat_statements
Previous Message Nathan Bossart 2024-10-06 13:12:51 Re: Should rolpassword be toastable?