Re: Make COPY format extendable: Extract COPY TO format implementations

From: Sutou Kouhei <kou(at)clear-code(dot)com>
To: tomas(dot)vondra(at)enterprisedb(dot)com
Cc: andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, zhjwpku(at)gmail(dot)com, andrew(at)dunslane(dot)net, nathandbossart(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2024-07-24 08:30:59
Message-ID: 20240724.173059.909782980111496972.kou@clear-code.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In <9172d4eb-6de0-4c6d-beab-8210b7a2219b(at)enterprisedb(dot)com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 22 Jul 2024 14:36:40 +0200,
Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:

> Thanks for the summary/responses. I still think it'd be better to post a
> summary as a separate message, not as yet another post responding to
> someone else. If I was reading the thread, I would not have noticed this
> is meant to be a summary. I'd even consider putting a "THREAD SUMMARY"
> title on the first line, or something like that. Up to you, of course.

It makes sense. I'll do it as a separated e-mail.

> My suggestions would be to maintain this as a series of patches, making
> incremental changes, with the "more complex" or "more experimental"
> parts larger in the series. For example, I can imagine doing this:
>
> 0001 - minimal version of the patch (e.g. current v17)
> 0002 - switch existing formats to the new interface
> 0003 - extend the interface to add bits needed for columnar formats
> 0004 - add DML to create/alter/drop custom implementations
> 0005 - minimal patch with extension adding support for Arrow
>
> Or something like that. The idea is that we still have a coherent story
> of what we're trying to do, and can discuss the incremental changes
> (easier than looking at a large patch). It's even possible to commit
> earlier parts before the later parts are quite cleanup up for commit.
> And some changes changes may not be even meant for commit (e.g. the
> extension) but as guidance / validation for the earlier parts.

OK. I attach the v18 patch set:

0001: add a basic feature (Copy{From,To}Routine)
(same as the v17 but it's based on the current master)
0002: use Copy{From,To}Rountine for the existing formats
(this may not be committed because there is a
profiling related concern)
0003: add support for specifying custom format by "COPY
... WITH (format 'my-format')"
(this also has a test)
0004: export Copy{From,To}StateData
(but this isn't enough to implement custom COPY
FROM/TO handlers as an extension)
0005: add opaque member to Copy{From,To}StateData and export
some functions to read the next data and flush the buffer
(we can implement a PoC Apache Arrow COPY FROM/TO
handler as an extension with this)

https://github.com/kou/pg-copy-arrow is a PoC Apache Arrow
COPY FROM/TO handler as an extension.

Notes:

* 0002: We use "static inline" and "constant argument" for
optimization.
* 0002: This hides NextCopyFromRawFields() in a public
header because it's not used in PostgreSQL and we want to
use "static inline" for it. If it's a problem, we can keep
it and create an internal function for "static inline".
* 0003: We use "CREATE FUNCTION" to register a custom COPY
FROM/TO handler. It's the same approach as tablesample.
* 0004 and 0005: We can mix them but this patch set split
them for easy to review. 0004 just moves the existing
codes. It doesn't change the existing codes.
* PoC: I provide it as a separated repository instead of a
patch because an extension exists as a separated project
in general. If it's a problem, I can provide it as a patch
for contrib/.
* This patch set still has minimal Copy{From,To}Routine. For
example, custom COPY FROM/TO handlers can't process their
own options with this patch set. We may add more callbacks
to Copy{From,To}Routine later based on real world use-cases.

> Unfortunately, there's not much information about what exactly the tests
> did, context (hardware, ...). So I don't know, really. But if you share
> enough information on how to reproduce this, I'm willing to take a look
> and investigate.

Thanks. Here is related information based on the past
e-mails from Michael:

* Use -O2 for optimization build flag
("meson setup --buildtype=release" may be used)
* Use tmpfs for PGDATA
* Disable fsync
* Run on scissors (what is "scissors" in this context...?)
https://www.postgresql.org/message-id/flat/Zbr6piWuVHDtFFOl%40paquier.xyz#dbbec4d5c54ef2317be01a54abaf495c
* Unlogged table may be used
* Use a table that has 30 integer columns (*1)
* Use 5M rows (*2)
* Use '/dev/null' for COPY TO (*3)
* Use blackhole_am for COPY FROM (*4)
https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am
* perf is used but used options are unknown (sorry)

(*1) This SQL may be used to create the table:

CREATE OR REPLACE FUNCTION create_table_cols(tabname text, num_cols int)
RETURNS VOID AS
$func$
DECLARE
query text;
BEGIN
query := 'CREATE UNLOGGED TABLE ' || tabname || ' (';
FOR i IN 1..num_cols LOOP
query := query || 'a_' || i::text || ' int default 1';
IF i != num_cols THEN
query := query || ', ';
END IF;
END LOOP;
query := query || ')';
EXECUTE format(query);
END
$func$ LANGUAGE plpgsql;
SELECT create_table_cols ('to_tab_30', 30);
SELECT create_table_cols ('from_tab_30', 30);

(*2) This SQL may be used to insert 5M rows:

INSERT INTO to_tab_30 SELECT FROM generate_series(1, 5000000);

(*3) This SQL may be used for COPY TO:

COPY to_tab_30 TO '/dev/null' WITH (FORMAT text);

(*4) This SQL may be used for COPY FROM:

CREATE EXTENSION blackhole_am;
ALTER TABLE from_tab_30 SET ACCESS METHOD blackhole_am;
COPY to_tab_30 TO '/tmp/to_tab_30.txt' WITH (FORMAT text);
COPY from_tab_30 FROM '/tmp/to_tab_30.txt' WITH (FORMAT text);

If there is enough information, could you try?

Thanks,
--
kou

Attachment Content-Type Size
v18-0001-Add-CopyFromRoutine-CopyToRountine.patch text/x-patch 10.4 KB
v18-0002-Use-CopyFromRoutine-CopyToRountine-for-the-exist.patch text/x-patch 43.7 KB
v18-0003-Add-support-for-adding-custom-COPY-TO-format.patch text/x-patch 20.2 KB
v18-0004-Export-CopyToStateData-and-CopyFromStateData.patch text/x-patch 31.1 KB
v18-0005-Add-support-for-implementing-custom-COPY-TO-FROM.patch text/x-patch 7.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2024-07-24 08:47:15 Re: pgsql: Add more SQL/JSON constructor functions
Previous Message Alexander Lakhin 2024-07-24 08:00:00 Re: race condition when writing pg_control