Re: New "single" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Aleksander Alekseev" <aleksander(at)timescale(dot)com>
Cc: "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: New "single" COPY format
Date: 2024-11-08 12:31:30
Message-ID: 55dee2a2-31e9-4be4-a095-0a89031a9036@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 8, 2024, at 12:25, Aleksander Alekseev wrote:
> Sorry for being late for the discussion.

No worries, better late than never, thanks for chiming in.

> I disagree with the idea of adding a new format name for this. Mostly
> because it is *not* a new format and pretending that it is will be
> just a poor and/or confusing user interface.
>
> IMO it should be 'text' we already have with special options e.g.
> DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
> and column delimiters (and no NULLs designations, and what else I
> forgot) then your text file just contains one tuple per line.
>
> Personally I wouldn't mind a special syntax such as LINES AS IS or
> maybe COPY AS IS for convenience. Perhaps we should discuss it
> separately though as a syntax sugar for a long list of options we
> already support.

From an implementation perspective, I agree with you that this could
be handled by tweaking the existing code for the 'text' and 'csv' formats,
although with a performance penalty for the existing formats.

But from a user-perspective, the implementation is of course irrelevant,
then what I think is important, is that the format should have an intuitive name,
where the default behaviour should match a typical file in the format,
as closely as possible.

For this reason, the 'text' format is unfortunately a poor name,
since it gives the impression it's a generic format for text files,
which it's not, it's a PostgreSQL-specific format, where "\." on a
single line has special meaning, and other defaults such as \N
are also PostgreSQL-specific, and needs to be overriden, if dealing
with a non-PostgreSQL specific text file.
Users who fail to understand these details, risks being surprised.

In contrast, the 'csv' format, works quite as expected.

So for this reason, I think a new format, is a good idea, not only
because it makes it much clearer how to have a fast parsing path
in the implementation, but also because it will increase the chances
users will get things right, when dealing with non-PostgreSQL specific
text files, such as JSONL and log files.

Sure, from an implementation perspective, we could have separate
specialized functions, to allow for fast parsing paths, even if
just overloading the existing options, but that would be a bit awkward I think.

The "DELIMITER AS NULL ESCAPE AS NULL" idea was proposed in the old thread
"Should CSV parsing be stricter about mid-field quotes?" [1]

[1] https://postgr.es/m/8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com

However, some of us came to the conclusion that it would be
better to introduce a new format, for reasons explained below,
quoted from the old thread [1]:

On Wed, Oct 9, 2024, at 18:14, Andrew Dunstan wrote:
> On 2024-10-09 We 11:58 AM, Tom Lane wrote:
>> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
>>> I think it would be nicest to introduce a new "raw" FORMAT,
>>> that clearly get things right already at the top-level,
>>> instead of trying to customize any of the existing formats.
>>
>> FWIW, I like this idea. It makes it much clearer how to have a
>> fast parsing path.
>
> WFM, so something like FORMAT {SIMPLE, RAW, FAST, SINGLE}? We don't seem
> to have an existing keyword that we could sanely reuse here.

To add to that, I think there is value of a new format, from a user-friendiness
perspective, by keeping the existing formats and their options intact,
and instead just add a new format, with a clear name, with well-defined
semantics, explained in the docs under its own section, to avoid cluttering
the documentation further, where users would need to assemble various options,
and understand their intricate details, in order to get things right.

/Joel

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2024-11-08 13:08:10 Re: not null constraints, again
Previous Message Kirill Reshke 2024-11-08 12:21:46 Re: Page freezing, FSM, and WAL replay