Re: Should CSV parsing be stricter about mid-field quotes?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Noah Misch" <noah(at)leadboat(dot)com>
Cc: "Daniel Verite" <daniel(at)manitou-mail(dot)org>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2024-10-04 16:19:31
Message-ID: 47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jul 2, 2023, at 07:45, Noah Misch wrote:
> On Sat, May 20, 2023 at 09:16:30AM +0200, Joel Jacobson wrote:
>> On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:
>> > COPY FROM file CSV somewhat differs as your example shows,
>> > but it still mishandle \. when unquoted. For instance, consider this
>> > file to load with COPY t FROM '/tmp/t.csv' WITH CSV
>> > $ cat /tmp/t.csv
>> > line 1
>> > \.
>> > line 3
>> > line 4
>> >
>> > It results in having only "line 1" being imported.
>>
>> Hmm, this is a problem for one of the new use-cases I brought up that would be
>> possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log files,
>> where each raw line should be imported "as is" into a single text column.
>>
>> Is there a valid reason why \. is needed for COPY FROM filename?
>
> No.
>
>> It seems to me it would only be necessary for the COPY FROM STDIN case,
>> since files have a natural end-of-file and a known file size.
>
> Right. Even for COPY FROM STDIN, it's not needed anymore since the
> removal of
> protocol v2. psql would still use it to find the end of inline COPY
> data,
> though. Here's another relevant thread:
> https://postgr.es/m/flat/bfcd57e4-8f23-4c3e-a5db-2571d09208e2%40beta.fastmail.com

I was very pleased to see commit 7702337:

Do not treat \. as an EOF marker in CSV mode for COPY IN.

Great job!

Thanks to this fix, maybe there is now interest to resume the discussion on
the ideas discussed in this thread?

Recap of ideas:

1. Stricter parsing, reject mid-field quotes

The idea is to prevent balanced mid-field quotes from being silently removed.

Example:

% cat example.csv
id,post
1,<p>Hello there!</p>
2,<a href="http://example.com">Click me!</a>

% psql
# \copy posts from example.csv with csv header;
COPY 2
# SELECT * FROM posts;
id | post
----+------------------------------------------
1 | <p>Hello there!</p>
2 | <a href=http://example.com>Click me!</a>
(2 rows)

Note how the quotes around the URL disappeared.

2. Avoid needing hacks like using E'\x01' as quoting char.

Introduce QUOTE NONE and DELIMITER NONE,
to allow raw lines to be imported "as is" into a single text column.

Best regards,

Joel

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2024-10-04 16:45:52 Re: POC, WIP: OR-clause support for indexes
Previous Message Greg Sabino Mullane 2024-10-04 16:07:27 Re: New PostgreSQL Contributors