Re: Make COPY format extendable: Extract COPY TO format implementations

From: Sutou Kouhei <kou(at)clear-code(dot)com>
To: sawada(dot)mshk(at)gmail(dot)com
Cc: david(dot)g(dot)johnston(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, zhjwpku(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2025-03-27 03:28:40
Message-ID: 20250327.122840.991624455773269772.kou@clear-code.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In <CAD21AoAfWrjpTDJ0garVUoXY0WC3Ud4Cu51q+ccWiotm1uo_2A(at)mail(dot)gmail(dot)com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Sun, 23 Mar 2025 02:01:59 -0700,
Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:

> ---
> +/*
> + * Process the "format" option.
> + *
> + * This function checks whether the option value is a built-in format such as
> + * "text" and "csv" or not. If the option value isn't a built-in format, this
> + * function finds a COPY format handler that returns a CopyToRoutine (for
> + * is_from == false) or CopyFromRountine (for is_from == true). If no COPY
> + * format handler is found, this function reports an error.
> + */
>
> I think this comment needs to be updated as the part "If the option
> value isn't ..." is no longer true.
>
> I think we don't necessarily need to create a separate function
> ProcessCopyOptionFormat for processing the format option.

Hmm. I think that this separated function will increase
readability by reducing indentation. But I've removed the
separation as you suggested. So the comment is also removed
entirely.

0002 includes this.

> We need more regression tests for handling the given format name. For example,
>
> - more various input patterns.
> - a function with the specified format name exists but it returns an
> unexpected Node.
> - looking for a handler function in a different namespace.
> etc.

I've added the following tests:

* Wrong input type handler without namespace
* Wrong input type handler with namespace
* Wrong return type handler without namespace
* Wrong return type handler with namespace
* Wrong return value (Copy*Routine isn't returned) handler without namespace
* Wrong return value (Copy*Routine isn't returned) handler with namespace
* Nonexistent handler
* Invalid qualified name
* Valid handler without namespace and without search_path
* Valid handler without namespace and with search_path
* Valid handler with namespace

0002 also includes this.

> I think that we should accept qualified names too as the format name
> like tablesample does. That way, different extensions implementing the
> same format can be used.

Implemented. It's implemented after parsing SQL. Is it OK?
(It seems that tablesample does it in parsing SQL.)

Because "WITH (FORMAT XXX)" is processed as a generic option
in gram.y. All generic options are processed as strings. So
I keep this.

Syntax is "COPY ... WITH (FORMAT 'NAMESPACE.HANDLER_NAME')"
not "COPY ... WITH (FORMAT 'NAMESPACE'.'HANDLER_NAME')"
because of this choice.

0002 also includes this.

> ---
> + if (routine == NULL || !IsA(routine, CopyFromRoutine))
> + ereport(
> + ERROR,
> + (errcode(
> +
> ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("COPY handler function "
> + "%u did not return "
> + "CopyFromRoutine struct",
> + opts->handler)));
>
> It's not conventional to put a new line between 'ereport(' and 'ERROR'
> (similarly between 'errcode(' and 'ERRCODE_...'. Also, we don't need
> to split the error message into multiple lines as it's not long.

Oh, sorry. I can't remember why I used this... I think I
trusted pgindent...

> ---
> + descr => 'pseudo-type for the result of a copy to/from method function',
>
> s/method function/format function/

Good catch. I used "handler function" not "format function"
because we use "handler" in other places.

> ---
> + Oid handler; /* handler
> function for custom format routine */
>
> 'handler' is used also for built-in formats.

Updated in 0004.

> ---
> +static void
> +CopyFromInFunc(CopyFromState cstate, Oid atttypid,
> + FmgrInfo *finfo, Oid *typioparam)
> +{
> + ereport(NOTICE, (errmsg("CopyFromInFunc: atttypid=%d", atttypid)));
> +}
>
> OIDs could be changed across major versions even for built-in types. I
> think it's better to avoid using it for tests.

Oh, I didn't know it. I've changed to use type name instead
of OID. It'll be more stable than OID.

> ---
> +static void
> +CopyToOneRow(CopyToState cstate, TupleTableSlot *slot)
> +{
> + ereport(NOTICE, (errmsg("CopyToOneRow: tts_nvalid=%u",
> slot->tts_nvalid)));
> +}
>
> Similar to the above comment, the field name 'tts_nvalid' might also
> be changed in the future, let's use another name.

Hmm. If the field name is changed, we need to change this
code. So changing tests too isn't strange. Anyway, I used
more generic text.

> ---
> +static const CopyFromRoutine CopyFromRoutineTestCopyFormat = {
> + .type = T_CopyFromRoutine,
> + .CopyFromInFunc = CopyFromInFunc,
> + .CopyFromStart = CopyFromStart,
> + .CopyFromOneRow = CopyFromOneRow,
> + .CopyFromEnd = CopyFromEnd,
> +};
>
> I'd suggest not using the same function names as the fields.

OK. I've added "Test" prefix.

> ---
> +/*
> + * Export CopySendEndOfRow() for extensions. We want to keep
> + * CopySendEndOfRow() as a static function for
> + * optimization. CopySendEndOfRow() calls in this file may be optimized by a
> + * compiler.
> + */
> +void
> +CopyToStateFlush(CopyToState cstate)
> +{
> + CopySendEndOfRow(cstate);
> +}
>
> Is there any reason to use a different name for public functions?

In this patch set, I use "CopyFrom"/"CopyTo" prefixes for
public APIs for custom COPY FORMAT handler extensions. It
will help understanding related APIs. Is it strange in
PostgreSQL?

> ---
> + /* For custom format implementation */
> + void *opaque; /* private space */
>
> How about renaming 'private'?

We should not use "private" because it's a keyword in
C++. If we use "private" here, we can't include this file
from C++ code.

> ---
> I've not reviewed the documentation patch yet but I think the patch
> seems to miss the updates to the description of the FORMAT option in
> the COPY command section.

I defer this for now. We can revisit the last documentation
patch after we finalize our API. (Or could someone help us?)

> I think we can reorganize the patch set as follows:
>
> 1. Create copyto_internal.h and change COPY_XXX to COPY_SOURCE_XXX and
> COPY_DEST_XXX accordingly.
> 2. Support custom format for both COPY TO and COPY FROM.
> 3. Expose necessary helper functions such as CopySendEndOfRow().
> 4. Add CopyFromSkipErrorRow().
> 5. Documentation.

The attached v39 patch set uses the followings:

0001: Create copyto_internal.h and change COPY_XXX to
COPY_SOURCE_XXX and COPY_DEST_XXX accordingly.
(Same as 1. in your suggestion)
0002: Support custom format for both COPY TO and COPY FROM.
(Same as 2. in your suggestion)
0003: Expose necessary helper functions such as CopySendEndOfRow()
and add CopyFromSkipErrorRow().
(3. + 4. in your suggestion)
0004: Define handler functions for built-in formats.
(Not included in your suggestion)
0005: Documentation. (WIP)
(Same as 5. in your suggestion)

We can merge 0001 quickly, right?

Thanks,
--
kou

Attachment Content-Type Size
v39-0001-Export-CopyDest-as-private-data.patch text/x-patch 7.8 KB
v39-0002-Add-support-for-adding-custom-COPY-format.patch text/x-patch 37.8 KB
v39-0003-Add-support-for-implementing-custom-COPY-handler.patch text/x-patch 14.7 KB
v39-0004-Use-copy-handlers-for-built-in-formats.patch text/x-patch 17.7 KB
v39-0005-Add-document-how-to-write-a-COPY-handler.patch text/x-patch 14.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message wenhui qiu 2025-03-27 03:34:31 Re: GSoC 2025 - Looking for Beginner-Friendly PostgreSQL Project
Previous Message Tom Lane 2025-03-27 03:17:17 Re: Add Postgres module info