Re: Make COPY format extendable: Extract COPY TO format implementations

From: Sutou Kouhei <kou(at)clear-code(dot)com>
To: tomas(dot)vondra(at)enterprisedb(dot)com
Cc: andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, zhjwpku(at)gmail(dot)com, andrew(at)dunslane(dot)net, nathandbossart(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2024-07-25 04:51:38
Message-ID: 20240725.135138.734175363436010682.kou@clear-code.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

THREAD SUMMARY:

Proposal:

How about making COPY format extendable?

Background:

Currently, COPY TO/FROM supports only "text", "csv" and
"binary" formats. There are some requests to support more
COPY formats. For example:

* 2023-11: JSON and JSON lines [1]
* 2022-04: Apache Arrow [2]
* 2018-02: Apache Avro, Apache Parquet and Apache ORC [3]

There were discussions how to add support for more formats. [3][4]
In these discussions, we got a consensus about making COPY
format extendable.

[1]: https://www.postgresql.org/message-id/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5
[2]: https://www.postgresql.org/message-id/flat/CAGrfaBVyfm0wPzXVqm0%3Dh5uArYh9N_ij%2BsVpUtDHqkB%3DVyB3jw%40mail.gmail.com
[3]: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com
[4]: https://www.postgresql.org/message-id/flat/3741749.1655952719%40sss.pgh.pa.us#2bb7af4a3d2c7669f9a49808d777a20d

Concerns:

* Performance: If we make COPY format extendable, it will
introduce some overheads. We don't want to loss our
optimization efforts for the current implementations by
this. [5]
* Extendability: We don't know which API set is enough for
custom COPY format implementations yet. We don't want to
provide too much APIs to reduce maintenance cost.

[5]: https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us

Implementation:

The v18 patch set is the latest patch set. [6]
It includes the following patches:

0001: This adds a basic feature (Copy{From,To}Routine)
(This isn't enough for extending COPY format.
This just extracts minimal procedure sets to be
extendable as callback sets.)
0002: This uses Copy{From,To}Rountine for the existing
formats (text, csv and binary)
(This may not be committed because there is a
profiling related concern. See the following section
for details)
0003: This adds support for specifying custom format by
"COPY ... WITH (format 'my-format')"
(This also adds a test for this feature.)
0004: This exports Copy{From,To}StateData
(But this isn't enough to implement custom COPY
FROM/TO handlers as an extension.)
0005: This adds opaque member to Copy{From,To}StateData and
export some functions to read the next data and flush
the buffer
(We can implement a PoC Apache Arrow COPY FROM/TO
handler as an extension with this. [7])

[6]: https://www.postgresql.org/message-id/flat/20240724.173059.909782980111496972.kou%40clear-code.com
[7]: https://github.com/kou/pg-copy-arrow

Implementation notes:

* 0002: We use "static inline" and "constant argument" for
optimization.
* 0002: This hides NextCopyFromRawFields() in a public
header because it's not used in PostgreSQL and we want to
use "static inline" for it. If it's a problem, we can keep
it and create an internal function for "static inline".
* 0003: We use "CREATE FUNCTION" to register a custom COPY
FROM/TO handler. It's the same approach as tablesample.
* 0004 and 0005: We can mix them but this patch set split
them for easy to review. 0004 just moves the existing
codes. It doesn't change the existing codes.
* PoC: I provide it as a separated repository instead of a
patch because an extension exists as a separated project
in general. If it's a problem, I can provide it as a patch
for contrib/.
* This patch set still has minimal Copy{From,To}Routine. For
example, custom COPY FROM/TO handlers can't process their
own options with this patch set. We may add more callbacks
to Copy{From,To}Routine later based on real world use-cases.

Performance concern:

We have a benchmark result and a profile for the change that
uses Copy{From,To}Routine for the existing formats. [8] They
are based on the v15 patch but there are no significant
difference between the v15 patch and v18 patch set.

These results show the followings:

* Runtime: The patched version is faster than HEAD.
* The patched version: 6232ms in average
* HEAD: 6550ms in average
* Profile: The patched version spends more percents than
HEAD in a core function.
* The patched version: 85.61% in CopyOneRowTo()
* HEAD: 80.35% in CopyOneRowTo()

[8]: https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz

Here are related information for this benchmark/profile:

* Use -O2 for optimization build flag
("meson setup --buildtype=release" may be used)
* Use tmpfs for PGDATA
* Disable fsync
* Run on scissors (what is "scissors" in this context...?) [9]
* Unlogged table may be used
* Use a table that has 30 integer columns (*1)
* Use 5M rows (*2)
* Use '/dev/null' for COPY TO (*3)
* Use blackhole_am for COPY FROM (*4)
https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am
* perf is used but used options are unknown (sorry)

(*1) This SQL may be used to create the table:

CREATE OR REPLACE FUNCTION create_table_cols(tabname text, num_cols int)
RETURNS VOID AS
$func$
DECLARE
query text;
BEGIN
query := 'CREATE UNLOGGED TABLE ' || tabname || ' (';
FOR i IN 1..num_cols LOOP
query := query || 'a_' || i::text || ' int default 1';
IF i != num_cols THEN
query := query || ', ';
END IF;
END LOOP;
query := query || ')';
EXECUTE format(query);
END
$func$ LANGUAGE plpgsql;
SELECT create_table_cols ('to_tab_30', 30);
SELECT create_table_cols ('from_tab_30', 30);

(*2) This SQL may be used to insert 5M rows:

INSERT INTO to_tab_30 SELECT FROM generate_series(1, 5000000);

(*3) This SQL may be used for COPY TO:

COPY to_tab_30 TO '/dev/null' WITH (FORMAT text);

(*4) This SQL may be used for COPY FROM:

CREATE EXTENSION blackhole_am;
ALTER TABLE from_tab_30 SET ACCESS METHOD blackhole_am;
COPY to_tab_30 TO '/tmp/to_tab_30.txt' WITH (FORMAT text);
COPY from_tab_30 FROM '/tmp/to_tab_30.txt' WITH (FORMAT text);

[9]: https://www.postgresql.org/message-id/flat/Zbr6piWuVHDtFFOl%40paquier.xyz#dbbec4d5c54ef2317be01a54abaf495c

Thanks,
--
kou

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-07-25 04:53:31 Re: long-standing data loss bug in initial sync of logical replication
Previous Message Alexander Lakhin 2024-07-25 04:00:00 Re: Recent 027_streaming_regress.pl hangs