Re: Should CSV parsing be stricter about mid-field quotes?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Noah Misch" <noah(at)leadboat(dot)com>
Cc: "Daniel Verite" <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2024-10-09 15:48:57
Message-ID: f679056f-28be-490c-a2a6-a1d9881cd044@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 9, 2024, at 14:45, Andrew Dunstan wrote:
> On 2024-10-09 We 8:00 AM, Andrew Dunstan wrote:
>> On 2024-10-08 Tu 3:25 AM, Joel Jacobson wrote:
>>> 2b) Should we aim to only support the QUOTE NONE DELIMITER NONE
>>> ESCAPE NONE case,
>>> useful to the real-life scenario we've identified, that is, importing
>>> raw log
>>> lines into a single column,
...
>> Off hand I can't think of a case other than 2b that would apply in the
>> real world, although others might like to chime in here. If we're
>> going to do that, let's find a shorter way to spell it. In fact, we
>> should do that even if we go with 2a.

I agree a shorter way to spell it would be nice.

But yet another new option would risk option creep IMO,
especially since we already have quite a lot of options
that can only be used for some formats and/or not together
with other options.

If there would have been an easy way to just hack this into
the CSV mode, in a clean and understandable way, that
wouldn't risk confuse users who are actually importing real CSV files,
that would seem OK to me, but it seems difficult.

However, I feel it might be too much of shoehorning to add this to the CSV mode.

Also, since, if there is no delimiter, that by definition means there cannot be
any "separated" values that the "S" in "CSV" means.

I think it would be nicest to introduce a new "raw" FORMAT,
that clearly get things right already at the top-level,
instead of trying to customize any of the existing formats.

Arbitrary unstructured text files, such as some log files,
seems like a common big enough group of files,
that users are probably already importing to PostgreSQL,
using the various hacks we've discussed.

So providing such users with a new FORMAT that precisely matches their use-case,
seems like the simplest and most intuitive solution.

With a new format, we could clearly define its purpose in the docs:

Raw Format

The "raw" format is used for importing and exporting files containing
unstructured text, where each line is treated as a single field. This format
is ideal when dealing with data that doesn't conform to a structured,
tabular format and lacks delimiters.

Key Characteristics:

- No Field Delimiters:
Each line is considered a complete value without any field separation.

- Single Column Requirement:
The COPY command must specify exactly one column when using the raw format.
Specifying multiple columns will result in an error.

- Literal Data Interpretation:
All characters are taken literally.
There is no special handling for quotes, backslashes, or escape sequences.

- No NULL Distinction:
Empty lines are imported as empty strings, not as NULL values.

- No Headers or Metadata:
The format does not support header rows or end-of-data markers;
every line is treated as data.

Notes:

- Error Handling:
An error will occur if you use the raw format without specifying exactly
one column or if the table has multiple columns and no column list is provided.

- Data Preservation:
All characters, including whitespace and special characters, are preserved
exactly as they appear in the file.

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-10-09 15:58:45 Re: Should CSV parsing be stricter about mid-field quotes?
Previous Message Junwang Zhao 2024-10-09 15:46:19 Re: general purpose array_sort