Re: Make COPY format extendable: Extract COPY TO format implementations

From: Sutou Kouhei <kou(at)clear-code(dot)com>
To: sawada(dot)mshk(at)gmail(dot)com
Cc: david(dot)g(dot)johnston(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, zhjwpku(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2025-03-29 08:57:53
Message-ID: 20250329.175753.98922209929527465.kou@clear-code.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In <CAD21AoBKMNsO+b6wahb6847xwFci1JCfV+JykoMziVgiFxB6cw(at)mail(dot)gmail(dot)com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 28 Mar 2025 22:37:03 -0700,
Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:

>> I've added the following tests:
>>
>> * Wrong input type handler without namespace
>> * Wrong input type handler with namespace
>> * Wrong return type handler without namespace
>> * Wrong return type handler with namespace
>> * Wrong return value (Copy*Routine isn't returned) handler without namespace
>> * Wrong return value (Copy*Routine isn't returned) handler with namespace
>> * Nonexistent handler
>> * Invalid qualified name
>> * Valid handler without namespace and without search_path
>> * Valid handler without namespace and with search_path
>> * Valid handler with namespace
>
> Probably we can merge these newly added four files into one .sql file?

I know that src/test/regress/sql/ uses this style (one .sql
file includes many test patterns in one large category). I
understand that the style is preferable in
src/test/regress/sql/ because src/test/regress/sql/ has
tests for many categories.

But do we need to follow the style in
src/test/modules/*/sql/ too? If we use the style in
src/test/modules/*/sql/, we need to have only one .sql in
src/test/modules/*/sql/ because src/test/modules/*/ are for
each category.

And the current .sql per sub-category style is easy to debug
(at least for me). For example, if we try qualified name
cases on debugger, we can use "\i sql/schema.sql" instead of
extracting target statements from .sql that includes many
unrelated statements. (Or we can use "\i sql/all.sql" and
many GDB "continue"s.)

BTW, it seems that src/test/modules/test_ddl_deparse/sql/
uses .sql per sub-category style. Should we use one .sql
file for sql/test/modules/test_copy_format/sql/? If it's
required for merging this patch set, I'll do it.

> Also we need to add some comments to describe what these queries test.
> For example, it's not clear to me at a glance what queries in
> no-schema.sql are going to test as there is no comment there.

Hmm. You refer no_schema.sql in 0002, right?

----
CREATE EXTENSION test_copy_format;
CREATE TABLE public.test (a smallint, b integer, c bigint);
INSERT INTO public.test VALUES (1, 2, 3), (12, 34, 56), (123, 456, 789);
COPY public.test FROM stdin WITH (FORMAT 'test_copy_format');
\.
COPY public.test TO stdout WITH (FORMAT 'test_copy_format');
DROP TABLE public.test;
DROP EXTENSION test_copy_format;
----

In general, COPY FORMAT tests focus on "COPY FROM WITH
(FORMAT ...)" and "COPY TO WITH (FORMAT ...)". And the file
name "no_schema" shows that it doesn't use qualified
name. Based on this, I feel that the above content is very
straightforward without any comment.

What should we add as comments? For example, do we need the
following comments?

----
-- This extension includes custom COPY handler: test_copy_format
CREATE EXTENSION test_copy_format;
-- Test data
CREATE TABLE public.test (a smallint, b integer, c bigint);
INSERT INTO public.test VALUES (1, 2, 3), (12, 34, 56), (123, 456, 789);
-- Use custom COPY handler, test_copy_format, without
-- schema for FROM.
COPY public.test FROM stdin WITH (FORMAT 'test_copy_format');
\.
-- Use custom COPY handler, test_copy_format, without
-- schema for TO.
COPY public.test TO stdout WITH (FORMAT 'test_copy_format');
-- Cleanup
DROP TABLE public.test;
DROP EXTENSION test_copy_format;
----

> One problem in the following chunk I can see is:
>
> + qualified_format = stringToQualifiedNameList(format, NULL);
> + DeconstructQualifiedName(qualified_format, &schema, &fmt);
> + if (!schema || strcmp(schema, "pg_catalog") == 0)
> + {
> + if (strcmp(fmt, "csv") == 0)
> + opts_out->csv_mode = true;
> + else if (strcmp(fmt, "binary") == 0)
> + opts_out->binary = true;
> + }
>
> Non-qualified names depend on the search_path value so it's not
> necessarily a built-in format. If the user specifies 'csv' with
> seach_patch = 'myschema, pg_catalog', the COPY command unnecessarily
> sets csv_mode true.

I think that we should always use built-in COPY handlers for
(not-qualified) "text", "csv" and "binary" for
compatibility. If we allow custom COPY handlers for
(not-qualified) "text", "csv" and "binary", pg_dump or
existing dump may be broken. Because we must use the same
COPY handler when we dump (COPY TO) and we restore (COPY
FROM).

BTW, the current implementation always uses
pg_catalog.{text,csv,binary} for (not-qualified) "text",
"csv" and "binary" even when there are
myschema.{text,csv,binary}. See
src/test/modules/test_copy_format/sql/builtin.sql. But I
haven't looked into it why...

> I think we can instead check if the retrieved
> handler function's OID matches the built-in formats' ones.

I agree that the approach is clear than the current
implementation. I'll use it when I create the next patch
set.

> Also, it's
> weired to me that cstate has csv_mode and binary fields even though
> the format should have already been known by the callback functions.

You refer CopyFomratOptions::{csv_mode,binary} not
Copy{To,From}StateData, right? And you suggest that we
should replace all opts.csv_mode and opts.binary with
opts.handler == F_CSV and opts.handler == F_BINARY, right?

We can do it but I suggest that we do it as a refactoring
(or cleanup) in a separated patch for easy to review.

> Regarding the documentation for the existing options, it says "...
> only when not using XXX format." some places, where XXX can be
> replaced with binary or CSV. Once we support custom formats, 'non-CSV
> mode' would actually include custom formats as well, so we need to
> update the description too.

I agree with you.

>> > ---
>> > +/*
>> > + * Export CopySendEndOfRow() for extensions. We want to keep
>> > + * CopySendEndOfRow() as a static function for
>> > + * optimization. CopySendEndOfRow() calls in this file may be optimized by a
>> > + * compiler.
>> > + */
>> > +void
>> > +CopyToStateFlush(CopyToState cstate)
>> > +{
>> > + CopySendEndOfRow(cstate);
>> > +}
>> >
>> > Is there any reason to use a different name for public functions?
>>
>> In this patch set, I use "CopyFrom"/"CopyTo" prefixes for
>> public APIs for custom COPY FORMAT handler extensions. It
>> will help understanding related APIs. Is it strange in
>> PostgreSQL?
>
> I see your point. Probably we need to find a better name as the name
> CopyToStateFlush doesn't sound well like this API should be called
> only once at the end of a row (IOW user might try to call it multiple
> times to 'flush' the state while processing a row). How about
> CopyToEndOfRow()?

CopyToStateFlush() can be called multiple times in a row. It
can also be called only once with multiple rows. Because it
just flushes the current buffer.

Existing CopySendEndOfRow() is called at the end of a
row. (Buffer is flushed at the end of row.) So I think that
the "EndOfRow" was chosen.

Some custom COPY handlers may not be row based. For example,
Apache Arrow COPY handler doesn't flush buffer for each row.
So, we should provide "flush" API not "end of row" API for
extensibility.

>> The attached v39 patch set uses the followings:
>>
>> 0001: Create copyto_internal.h and change COPY_XXX to
>> COPY_SOURCE_XXX and COPY_DEST_XXX accordingly.
>> (Same as 1. in your suggestion)
>> 0002: Support custom format for both COPY TO and COPY FROM.
>> (Same as 2. in your suggestion)
>> 0003: Expose necessary helper functions such as CopySendEndOfRow()
>> and add CopyFromSkipErrorRow().
>> (3. + 4. in your suggestion)
>> 0004: Define handler functions for built-in formats.
>> (Not included in your suggestion)
>> 0005: Documentation. (WIP)
>> (Same as 5. in your suggestion)
>
> Can we merge 0002 and 0004?

Can we do it when we merge this patch set if it's still
desirable at the time? Because:

* I think that separated 0002 and 0004 patches are easier to
review than squashed 0002 and 0004 patch.
* I still think that someone may don't like defining COPY
handlers for built-in formats. If we don't define COPY
handlers for built-in formats finally, we can just drop
0004.

>> We can merge 0001 quickly, right?
>
> I don't think it makes sense to push only 0001 as it's a completely
> preliminary patch for subsequent patches. It would be prudent to push
> it once other patches are ready too.

Hmm. I feel that 0001 is a refactoring category patch like
merged patches. In general, distinct enum value names are
easier to understand.

BTW, does the "other patches" include the documentation
patch...?

Thanks,
--
kou

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2025-03-29 11:03:07 Re: Replace IN VALUES with ANY in WHERE clauses during optimization
Previous Message jian he 2025-03-29 08:12:24 Re: support ALTER TABLE DROP EXPRESSION for virtual generated column