Re: Make COPY format extendable: Extract COPY TO format implementations

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Sutou Kouhei <kou(at)clear-code(dot)com>
Cc: david(dot)g(dot)johnston(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, zhjwpku(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2025-03-29 05:37:03
Message-ID: CAD21AoBKMNsO+b6wahb6847xwFci1JCfV+JykoMziVgiFxB6cw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 26, 2025 at 8:28 PM Sutou Kouhei <kou(at)clear-code(dot)com> wrote:
>
> > We need more regression tests for handling the given format name. For example,
> >
> > - more various input patterns.
> > - a function with the specified format name exists but it returns an
> > unexpected Node.
> > - looking for a handler function in a different namespace.
> > etc.
>
> I've added the following tests:
>
> * Wrong input type handler without namespace
> * Wrong input type handler with namespace
> * Wrong return type handler without namespace
> * Wrong return type handler with namespace
> * Wrong return value (Copy*Routine isn't returned) handler without namespace
> * Wrong return value (Copy*Routine isn't returned) handler with namespace
> * Nonexistent handler
> * Invalid qualified name
> * Valid handler without namespace and without search_path
> * Valid handler without namespace and with search_path
> * Valid handler with namespace

Probably we can merge these newly added four files into one .sql file?

Also we need to add some comments to describe what these queries test.
For example, it's not clear to me at a glance what queries in
no-schema.sql are going to test as there is no comment there.

>
> 0002 also includes this.
>
> > I think that we should accept qualified names too as the format name
> > like tablesample does. That way, different extensions implementing the
> > same format can be used.
>
> Implemented. It's implemented after parsing SQL. Is it OK?
> (It seems that tablesample does it in parsing SQL.)

I think it's okay.

One problem in the following chunk I can see is:

+ qualified_format = stringToQualifiedNameList(format, NULL);
+ DeconstructQualifiedName(qualified_format, &schema, &fmt);
+ if (!schema || strcmp(schema, "pg_catalog") == 0)
+ {
+ if (strcmp(fmt, "csv") == 0)
+ opts_out->csv_mode = true;
+ else if (strcmp(fmt, "binary") == 0)
+ opts_out->binary = true;
+ }

Non-qualified names depend on the search_path value so it's not
necessarily a built-in format. If the user specifies 'csv' with
seach_patch = 'myschema, pg_catalog', the COPY command unnecessarily
sets csv_mode true. I think we can instead check if the retrieved
handler function's OID matches the built-in formats' ones. Also, it's
weired to me that cstate has csv_mode and binary fields even though
the format should have already been known by the callback functions.

Regarding the documentation for the existing options, it says "...
only when not using XXX format." some places, where XXX can be
replaced with binary or CSV. Once we support custom formats, 'non-CSV
mode' would actually include custom formats as well, so we need to
update the description too.

>
> > ---
> > +static void
> > +CopyToOneRow(CopyToState cstate, TupleTableSlot *slot)
> > +{
> > + ereport(NOTICE, (errmsg("CopyToOneRow: tts_nvalid=%u",
> > slot->tts_nvalid)));
> > +}
> >
> > Similar to the above comment, the field name 'tts_nvalid' might also
> > be changed in the future, let's use another name.
>
> Hmm. If the field name is changed, we need to change this
> code.

Yes, but if we use independe name in the NOTICE message we would not
need to update the expected files.

>
> > ---
> > +/*
> > + * Export CopySendEndOfRow() for extensions. We want to keep
> > + * CopySendEndOfRow() as a static function for
> > + * optimization. CopySendEndOfRow() calls in this file may be optimized by a
> > + * compiler.
> > + */
> > +void
> > +CopyToStateFlush(CopyToState cstate)
> > +{
> > + CopySendEndOfRow(cstate);
> > +}
> >
> > Is there any reason to use a different name for public functions?
>
> In this patch set, I use "CopyFrom"/"CopyTo" prefixes for
> public APIs for custom COPY FORMAT handler extensions. It
> will help understanding related APIs. Is it strange in
> PostgreSQL?

I see your point. Probably we need to find a better name as the name
CopyToStateFlush doesn't sound well like this API should be called
only once at the end of a row (IOW user might try to call it multiple
times to 'flush' the state while processing a row). How about
CopyToEndOfRow()?

>
> > ---
> > + /* For custom format implementation */
> > + void *opaque; /* private space */
> >
> > How about renaming 'private'?
>
> We should not use "private" because it's a keyword in
> C++. If we use "private" here, we can't include this file
> from C++ code.

Understood.

>
> > ---
> > I've not reviewed the documentation patch yet but I think the patch
> > seems to miss the updates to the description of the FORMAT option in
> > the COPY command section.
>
> I defer this for now. We can revisit the last documentation
> patch after we finalize our API. (Or could someone help us?)
>
> > I think we can reorganize the patch set as follows:
> >
> > 1. Create copyto_internal.h and change COPY_XXX to COPY_SOURCE_XXX and
> > COPY_DEST_XXX accordingly.
> > 2. Support custom format for both COPY TO and COPY FROM.
> > 3. Expose necessary helper functions such as CopySendEndOfRow().
> > 4. Add CopyFromSkipErrorRow().
> > 5. Documentation.
>
> The attached v39 patch set uses the followings:
>
> 0001: Create copyto_internal.h and change COPY_XXX to
> COPY_SOURCE_XXX and COPY_DEST_XXX accordingly.
> (Same as 1. in your suggestion)
> 0002: Support custom format for both COPY TO and COPY FROM.
> (Same as 2. in your suggestion)
> 0003: Expose necessary helper functions such as CopySendEndOfRow()
> and add CopyFromSkipErrorRow().
> (3. + 4. in your suggestion)
> 0004: Define handler functions for built-in formats.
> (Not included in your suggestion)
> 0005: Documentation. (WIP)
> (Same as 5. in your suggestion)

Can we merge 0002 and 0004?

> We can merge 0001 quickly, right?

I don't think it makes sense to push only 0001 as it's a completely
preliminary patch for subsequent patches. It would be prudent to push
it once other patches are ready too.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Corey Huinker 2025-03-29 05:44:57 Re: Statistics Import and Export
Previous Message Jeff Davis 2025-03-29 05:29:16 Re: Statistics Import and Export