Re: Make COPY format extendable: Extract COPY TO format implementations

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Sutou Kouhei <kou(at)clear-code(dot)com>
Cc: david(dot)g(dot)johnston(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, zhjwpku(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2025-03-31 19:35:23
Message-ID: CAD21AoDOcYah-nREv09BB3ZoB-k+Yf1XUfJcDMoq=LLvV1v75w@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 29, 2025 at 1:57 AM Sutou Kouhei <kou(at)clear-code(dot)com> wrote:
>
> Hi,
>
> In <CAD21AoBKMNsO+b6wahb6847xwFci1JCfV+JykoMziVgiFxB6cw(at)mail(dot)gmail(dot)com>
> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 28 Mar 2025 22:37:03 -0700,
> Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> >> I've added the following tests:
> >>
> >> * Wrong input type handler without namespace
> >> * Wrong input type handler with namespace
> >> * Wrong return type handler without namespace
> >> * Wrong return type handler with namespace
> >> * Wrong return value (Copy*Routine isn't returned) handler without namespace
> >> * Wrong return value (Copy*Routine isn't returned) handler with namespace
> >> * Nonexistent handler
> >> * Invalid qualified name
> >> * Valid handler without namespace and without search_path
> >> * Valid handler without namespace and with search_path
> >> * Valid handler with namespace
> >
> > Probably we can merge these newly added four files into one .sql file?
>
> I know that src/test/regress/sql/ uses this style (one .sql
> file includes many test patterns in one large category). I
> understand that the style is preferable in
> src/test/regress/sql/ because src/test/regress/sql/ has
> tests for many categories.
>
> But do we need to follow the style in
> src/test/modules/*/sql/ too? If we use the style in
> src/test/modules/*/sql/, we need to have only one .sql in
> src/test/modules/*/sql/ because src/test/modules/*/ are for
> each category.
>
> And the current .sql per sub-category style is easy to debug
> (at least for me). For example, if we try qualified name
> cases on debugger, we can use "\i sql/schema.sql" instead of
> extracting target statements from .sql that includes many
> unrelated statements. (Or we can use "\i sql/all.sql" and
> many GDB "continue"s.)
>
> BTW, it seems that src/test/modules/test_ddl_deparse/sql/
> uses .sql per sub-category style. Should we use one .sql
> file for src/test/modules/test_copy_format/sql/? If it's
> required for merging this patch set, I'll do it.

I'm not sure the regression test queries are categorized in the same
way as in test_ddl_deparse. The former has separate .sql files for
different types of inputs (valid inputs, invalid inputs, etc.), which
seems finer-grained, whereas the latter has one .sql file per DDL
command.

Most of the queries under test_copy_format/sql verify the input
patterns of the FORMAT option. I think the regression tests in that
directory should probably focus on testing the new callback APIs,
while some of the regression tests for FORMAT option handling could be
moved into the normal regression test suite (e.g., into copy.sql or a
new file like copy_format.sql). IIUC, testing invalid input patterns
can be done even without creating artificial wrong handler functions.
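For example, a sketch of invalid-input queries that could live in the
core suite without any extension (the table and format names here are
made up, and the exact error messages depend on the patch):

```sql
CREATE TABLE copytest_fmt (a int);
-- Nonexistent handler: should fail to find a format handler function.
COPY copytest_fmt FROM stdin WITH (FORMAT 'no_such_format');
-- Invalid qualified name: should fail while parsing the format name.
COPY copytest_fmt FROM stdin WITH (FORMAT 'a.b.c.d');
DROP TABLE copytest_fmt;
```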

>
> > Also we need to add some comments to describe what these queries test.
> > For example, it's not clear to me at a glance what queries in
> > no-schema.sql are going to test as there is no comment there.
>
> Hmm. You refer no_schema.sql in 0002, right?
>
> ----
> CREATE EXTENSION test_copy_format;
> CREATE TABLE public.test (a smallint, b integer, c bigint);
> INSERT INTO public.test VALUES (1, 2, 3), (12, 34, 56), (123, 456, 789);
> COPY public.test FROM stdin WITH (FORMAT 'test_copy_format');
> \.
> COPY public.test TO stdout WITH (FORMAT 'test_copy_format');
> DROP TABLE public.test;
> DROP EXTENSION test_copy_format;
> ----
>
> In general, COPY FORMAT tests focus on "COPY FROM WITH
> (FORMAT ...)" and "COPY TO WITH (FORMAT ...)". And the file
> name "no_schema" shows that it doesn't use qualified
> name. Based on this, I feel that the above content is very
> straightforward without any comment.
>
> What should we add as comments? For example, do we need the
> following comments?
>
> ----
> -- This extension includes custom COPY handler: test_copy_format
> CREATE EXTENSION test_copy_format;
> -- Test data
> CREATE TABLE public.test (a smallint, b integer, c bigint);
> INSERT INTO public.test VALUES (1, 2, 3), (12, 34, 56), (123, 456, 789);
> -- Use custom COPY handler, test_copy_format, without
> -- schema for FROM.
> COPY public.test FROM stdin WITH (FORMAT 'test_copy_format');
> \.
> -- Use custom COPY handler, test_copy_format, without
> -- schema for TO.
> COPY public.test TO stdout WITH (FORMAT 'test_copy_format');
> -- Cleanup
> DROP TABLE public.test;
> DROP EXTENSION test_copy_format;
> ----

I'd like the comments to state what the tests expect. Taking the
following queries as an example,

COPY public.test FROM stdin WITH (FORMAT 'test_copy_format');
\.
COPY public.test TO stdout WITH (FORMAT 'test_copy_format');

it would help readers understand the test case better if we had a
comment like, for example:

-- Specify the custom format name without schema. We test if both
-- COPY TO and COPY FROM can find the correct handler function
-- in public schema.

>
> > One problem in the following chunk I can see is:
> >
> > + qualified_format = stringToQualifiedNameList(format, NULL);
> > + DeconstructQualifiedName(qualified_format, &schema, &fmt);
> > + if (!schema || strcmp(schema, "pg_catalog") == 0)
> > + {
> > + if (strcmp(fmt, "csv") == 0)
> > + opts_out->csv_mode = true;
> > + else if (strcmp(fmt, "binary") == 0)
> > + opts_out->binary = true;
> > + }
> >
> > Non-qualified names depend on the search_path value so it's not
> > necessarily a built-in format. If the user specifies 'csv' with
> > search_path = 'myschema, pg_catalog', the COPY command unnecessarily
> > sets csv_mode true.
>
> I think that we should always use built-in COPY handlers for
> (not-qualified) "text", "csv" and "binary" for
> compatibility. If we allow custom COPY handlers for
> (not-qualified) "text", "csv" and "binary", pg_dump or
> existing dump may be broken. Because we must use the same
> COPY handler when we dump (COPY TO) and we restore (COPY
> FROM).

I agreed.
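To illustrate the compatibility argument with a hedged example
(myschema.csv is a hypothetical custom handler, not something in the
patch):

```sql
-- Unqualified 'csv' always resolves to the built-in CSV format,
-- regardless of search_path, so existing dumps keep restoring
-- the same way:
COPY public.test TO stdout WITH (FORMAT 'csv');
-- A custom handler that reuses a built-in name must be
-- schema-qualified:
COPY public.test TO stdout WITH (FORMAT 'myschema.csv');
```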

>
> BTW, the current implementation always uses
> pg_catalog.{text,csv,binary} for (not-qualified) "text",
> "csv" and "binary" even when there are
> myschema.{text,csv,binary}. See
> src/test/modules/test_copy_format/sql/builtin.sql. But I
> haven't looked into it why...

Sorry, I don't follow that. IIUC the test_copy_format extension
doesn't create a handler function in the myschema schema, and the
queries in builtin.sql seem to work as expected (specifying a
non-qualified built-in format unconditionally uses the built-in
format).

>
> > Also, it's
> > weird to me that cstate has csv_mode and binary fields even though
> > the format should have already been known by the callback functions.
>
> You refer CopyFormatOptions::{csv_mode,binary} not
> Copy{To,From}StateData, right?

Yes. I referred to the wrong one.

> And you suggest that we
> should replace all opts.csv_mode and opts.binary with
> opts.handler == F_CSV and opts.handler == F_BINARY, right?
>
> We can do it, but I suggest that we do it as a refactoring
> (or cleanup) in a separate patch for ease of review.

I think that csv_mode and binary are used mostly in
ProcessCopyOptions(), so we can probably use local variables for that.
I find two other places where csv_mode is used:
NextCopyFromRawFields() and CopyToTextLikeStart(). We could simply
check the handler function's OID there, or define macros like
COPY_FORMAT_IS_TEXT/CSV/BINARY that check the OID and use them there.

>
> >> > ---
> >> > +/*
> >> > + * Export CopySendEndOfRow() for extensions. We want to keep
> >> > + * CopySendEndOfRow() as a static function for
> >> > + * optimization. CopySendEndOfRow() calls in this file may be optimized by a
> >> > + * compiler.
> >> > + */
> >> > +void
> >> > +CopyToStateFlush(CopyToState cstate)
> >> > +{
> >> > + CopySendEndOfRow(cstate);
> >> > +}
> >> >
> >> > Is there any reason to use a different name for public functions?
> >>
> >> In this patch set, I use "CopyFrom"/"CopyTo" prefixes for
> >> public APIs for custom COPY FORMAT handler extensions. It
> >> will help understanding related APIs. Is it strange in
> >> PostgreSQL?
> >
> > I see your point. Probably we need to find a better name as the name
> > CopyToStateFlush doesn't sound well like this API should be called
> > only once at the end of a row (IOW user might try to call it multiple
> > times to 'flush' the state while processing a row). How about
> > CopyToEndOfRow()?
>
> CopyToStateFlush() can be called multiple times in a row. It
> can also be called only once with multiple rows. Because it
> just flushes the current buffer.
>
> Existing CopySendEndOfRow() is called at the end of a
> row. (Buffer is flushed at the end of row.) So I think that
> the "EndOfRow" was chosen.
>
> Some custom COPY handlers may not be row based. For example,
> Apache Arrow COPY handler doesn't flush buffer for each row.
> So, we should provide "flush" API not "end of row" API for
> extensibility.

Okay, understood.

>
> >> We can merge 0001 quickly, right?
> >
> > I don't think it makes sense to push only 0001 as it's a completely
> > preliminary patch for subsequent patches. It would be prudent to push
> > it once other patches are ready too.
>
> Hmm. I feel that 0001 is a refactoring category patch like
> merged patches. In general, distinct enum value names are
> easier to understand.

Right, but the patches that have already been merged contributed to
speeding up COPY commands, whereas the 0001 patch also introduces
copyto_internal.h, which would not be used by anyone if the custom
COPY format patch were never merged. Without adding copyto_internal.h,
changing the enum value names makes less sense to me.

> BTW, does the "other patches" include the documentation
> patch...?

I think that when pushing the main custom COPY format patch, we need
to include the documentation changes in it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
