Re: New "single" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: "Daniel Verite" <daniel(at)manitou-mail(dot)org>, "Aleksander Alekseev" <aleksander(at)timescale(dot)com>, "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: New "single" COPY format
Date: 2024-11-09 07:07:33
Message-ID: 2ed704e9-cee1-450c-aa84-216af23a3a86@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 8, 2024, at 22:47, David G. Johnston wrote:
> On Fri, Nov 8, 2024 at 2:20 PM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>>
>> 1. Text files containing \. in the middle of the file
>> % cat /tmp/test.txt
>> foo
>> \.
>> bar
>>
>> Or another option to turn off the special meaning of \.?
>
> This does seem like an orthogonal option worth considering.

I agree; if we want to integrate this into 'text', it's an option worth considering.

> I'm on board with a new named format that selects the desired defaults
> instead of requiring the user to know and change them all manually.
>
> This seems to me like a "list" format. Implying each row is a list
> entry. Since we have tables the concept of list would likewise
> reasonably imply a single column.
>
> Since newlines are special, i.e., record delimiters, "plain" thus would
> remain misleading. It could be used for a case where the entire file
> is loaded into a new row, single column.

I think 'list' is the best proposal I've heard so far.
New patch attached, only change since v18 is the renaming.

There is one remaining important issue though:

Fields that contain newline characters, cause an irreversibility problem.

It feels wrong to leave this as a potential pitfall for users.

Here's a draft of an idea I'm considering (not yet implemented):

- Fast path for newline-free types:
For the list of built-in types where we know the ::text representation cannot
contain newlines, we take the fast path in NextCopyFromRawFields(),
pointing cstate->raw_fields[0] directly to cstate->line_buf.data.

- Handling newlines for other types:
For any other types, we would need to scan the string for newline characters.
If a newline is encountered, it would, by default, result in an error when
using the list format, unless:

- Optional quoting mechanism:
If the QUOTE option is specified, we can allow newlines within fields by
quoting the entire line. Any quote characters within the field would be
handled by doubling them, similar to CSV escaping rules.

/Joel

Attachment Content-Type Size
v19-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch application/octet-stream 18.8 KB
v19-0002-Add-COPY-format-list.patch application/octet-stream 32.2 KB
v19-0003-Reorganize-option-validations.patch application/octet-stream 19.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2024-11-09 07:11:18 Re: Fix small typo, use InvalidRelFileNumber instead of InvalidOid
Previous Message Tatsuo Ishii 2024-11-09 06:25:59 Re: Fix for Extra Parenthesis in pgbench progress message