Re: Make COPY format extendable: Extract COPY TO format implementations

From: Sutou Kouhei <kou(at)clear-code(dot)com>
To: david(dot)g(dot)johnston(at)gmail(dot)com
Cc: sawada(dot)mshk(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, zhjwpku(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2025-04-04 06:42:48
Message-ID: 20250404.154248.1031254461649883039.kou@clear-code.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In <CAKFQuwbhSssKTJyeYo9rn20zffV3L7wdQSbEQ8zwRfC=uXLkVA(at)mail(dot)gmail(dot)com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 31 Mar 2025 10:05:34 -0700,
"David G. Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:

> The CopyFromInFunc API allows for each attribute to somehow
> have its I/O format individualized. But I don't see how that is practical
> or useful, and it adds burden on API users.

If an extension want to use I/O routines, it can use the
CopyFromInFunc API. Otherwise it can provide an empty
function.

For example,
https://github.com/MasahikoSawada/pg_copy_jsonlines/blob/master/copy_jsonlines.c
uses the CopyFromInFunc API but
https://github.com/kou/pg-copy-arrow/blob/main/copy_arrow.cc
uses an empty function for the CopyFromInFunc API.

The "it adds burden" means that "defining an empty function
is inconvenient", right? See also our past discussion for
this design:

https://www.postgresql.org/message-id/ZbijVn9_51mljMAG%40paquier.xyz

> Keeping empty options does not strike as a bad idea, because this
> forces extension developers to think about this code path rather than
> just ignore it.

> I suggest we remove both .CopyFromInFunc and .CopyFromStart/End and add a
> property to CopyFromRoutine (.ioMode?) with values of either Copy_IO_Text
> or Copy_IO_Binary and then just branch to either:
>
> CopyFromTextLikeInFunc & CopyFromTextLikeStart/End
> or
> CopyFromBinaryInFunc & CopyFromStart/End
>
> So, in effect, the only method an extension needs to write is converting
> to/from the 'serialized' form to the text/binary form (text being near
> unanimous).

I object this API. If we choose this API, we can create only
custom COPY formats that compatible with PostgreSQL's
text/binary form. For example, the above jsonlines format
and Apache Arrow format aren't implemented. It's meaningless
to introduce this custom COPY format mechanism with the
suggested API.

> It seems to me that CopyFromOneRow could simply produce a *string
> collection,
> one cell per attribute, and NextCopyFrom could do all of the above on a
> for-loop over *string

You suggest that we use a string collection instead of a
Datum collection in CopyFromOneRow() and convert a string
collection to a Datum collection in NextCopyFrom(), right?

I object this API. Because it has needless string <-> Datum
conversion overhead. For example,
https://github.com/MasahikoSawada/pg_copy_jsonlines/blob/master/copy_jsonlines.c
parses a JSON value to Datum. If we use this API, we need to
convert parsed Datum to string in an extension and
NextCopyFrom() re-converts the converted string to
Datum. It will slow down custom COPY format.

I want this custom COPY format feature for performance. So
APIs that require needless overhead for non text/csv/binary
formats isn't acceptable to me.

Thanks,
--
kou

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bertrand Drouvot 2025-04-04 06:50:07 Re: Draft for basic NUMA observability
Previous Message Amit Kapila 2025-04-04 06:20:18 Re: Conflict detection for multiple_unique_conflicts in logical replication