From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | "Noah Misch" <noah(at)leadboat(dot)com> |
Cc: | "Daniel Verite" <daniel(at)manitou-mail(dot)org>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Should CSV parsing be stricter about mid-field quotes? |
Date: | 2024-10-04 16:19:31 |
Message-ID: | 47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f@app.fastmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sun, Jul 2, 2023, at 07:45, Noah Misch wrote:
> On Sat, May 20, 2023 at 09:16:30AM +0200, Joel Jacobson wrote:
>> On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:
>> > COPY FROM file CSV somewhat differs as your example shows,
>> > but it still mishandle \. when unquoted. For instance, consider this
>> > file to load with COPY t FROM '/tmp/t.csv' WITH CSV
>> > $ cat /tmp/t.csv
>> > line 1
>> > \.
>> > line 3
>> > line 4
>> >
>> > It results in having only "line 1" being imported.
>>
>> Hmm, this is a problem for one of the new use-cases I brought up that would be
>> possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log files,
>> where each raw line should be imported "as is" into a single text column.
>>
>> Is there a valid reason why \. is needed for COPY FROM filename?
>
> No.
>
>> It seems to me it would only be necessary for the COPY FROM STDIN case,
>> since files have a natural end-of-file and a known file size.
>
> Right. Even for COPY FROM STDIN, it's not needed anymore since the
> removal of
> protocol v2. psql would still use it to find the end of inline COPY
> data,
> though. Here's another relevant thread:
> https://postgr.es/m/flat/bfcd57e4-8f23-4c3e-a5db-2571d09208e2%40beta.fastmail.com
I was very pleased to see commit 7702337:
Do not treat \. as an EOF marker in CSV mode for COPY IN.
Great job!
Thanks to this fix, maybe there is now interest to resume the discussion on
the ideas discussed in this thread?
Recap of ideas:
1. Stricter parsing, reject mid-field quotes
The idea is to prevent balanced mid-field quotes from being silently removed.
Example:
% cat example.csv
id,post
1,<p>Hello there!</p>
2,<a href="http://example.com">Click me!</a>
% psql
# \copy posts from example.csv with csv header;
COPY 2
# SELECT * FROM posts;
id | post
----+------------------------------------------
1 | <p>Hello there!</p>
2 | <a href=http://example.com>Click me!</a>
(2 rows)
Note how the quotes around the URL disappeared.
2. Avoid needing hacks like using E'\x01' as quoting char.
Introduce QUOTE NONE and DELIMITER NONE,
to allow raw lines to be imported "as is" into a single text column.
Best regards,
Joel
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Geoghegan | 2024-10-04 16:45:52 | Re: POC, WIP: OR-clause support for indexes |
Previous Message | Greg Sabino Mullane | 2024-10-04 16:07:27 | Re: New PostgreSQL Contributors |