From: | Sutou Kouhei <kou(at)clear-code(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Make COPY format extendable: Extract COPY TO format implementations |
Date: | 2023-12-04 06:35:48 |
Message-ID: | 20231204.153548.2126325458835528809.kou@clear-code.com |
Lists: | pgsql-hackers |
Hi,
I want to work on making COPY format extendable. I attach
the first patch for it. I'll send more patches after this is
merged.
Background:
Currently, COPY TO/FROM supports only "text", "csv" and
"binary" formats. There are some requests to support more
COPY formats. For example:
* 2023-11: JSON and JSON lines [1]
* 2022-04: Apache Arrow [2]
* 2018-02: Apache Avro, Apache Parquet and Apache ORC [3]
(FYI: I want to add support for Apache Arrow.)
There were discussions about how to add support for more
formats. [3][4] In these discussions, we reached a consensus
on making the COPY format extendable.
But it seems that nobody is working on this yet, so I want to
work on it. (If anyone wants to work on this together, I'd be
happy to.)
Summary:
The attached patch introduces a CopyToFormatOps struct. It is
similar to TupleTableSlotOps for TupleTableSlot, but
CopyToFormatOps is for COPY TO formats: it holds the
routines that implement one COPY TO format.
The attached patch doesn't change:
* the current behavior (all existing tests still pass
without modification)
* the existing "text", "csv" and "binary" format output
implementations, including local variable names (the
attached patch just moves them and adjusts indentation)
* performance (no significant loss of performance)
In other words, this is just a refactoring in preparation
for further changes to make COPY format extendable. If I took
the "complete the whole task and then request review"
approach, the result would be difficult to review because the
changes would be large. So I want to work on this step by
step. Is that acceptable?
TODOs that should be done in subsequent patches:
* Add some CopyToState readers such as CopyToStateGetDest(),
CopyToStateGetAttnums() and CopyToStateGetOpts()
(We will need to consider which APIs should be exported.)
(This is for implementing a COPY TO format in an extension.)
* Export CopySend*() in src/backend/commands/copyto.c
(This is for implementing a COPY TO format in an extension.)
* Add API to register a new COPY TO format implementation
* Add "CREATE XXX" to register a new COPY TO format (or COPY
TO/FROM format) implementation
("CREATE COPY HANDLER" was suggested in [5].)
* Same for COPY FROM
Performance:
We reached a consensus on making COPY format extendable, but
we should be careful about performance. [6]
> I think that step 1 ought to be to convert the existing
> formats into plug-ins, and demonstrate that there's no
> significant loss of performance.
So I measured COPY TO time with and without this change. In
the results below, every difference is within 5% in either
direction, so there is no significant loss of performance.
Data: random 32-bit integers:
CREATE TABLE data (int32 integer);
INSERT INTO data
SELECT random() * 10000
FROM generate_series(1, ${n_records});
The number of records: 100K, 1M and 10M
100K without this change:
format,elapsed time (ms)
text,22.527
csv,23.822
binary,24.806
100K with this change:
format,elapsed time (ms)
text,22.919
csv,24.643
binary,24.705
1M without this change:
format,elapsed time (ms)
text,223.457
csv,233.583
binary,242.687
1M with this change:
format,elapsed time (ms)
text,224.591
csv,233.964
binary,247.164
10M without this change:
format,elapsed time (ms)
text,2330.383
csv,2411.394
binary,2590.817
10M with this change:
format,elapsed time (ms)
text,2231.307
csv,2408.067
binary,2473.617
[1]: https://www.postgresql.org/message-id/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5
[2]: https://www.postgresql.org/message-id/flat/CAGrfaBVyfm0wPzXVqm0%3Dh5uArYh9N_ij%2BsVpUtDHqkB%3DVyB3jw%40mail.gmail.com
[3]: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com
[4]: https://www.postgresql.org/message-id/flat/3741749.1655952719%40sss.pgh.pa.us#2bb7af4a3d2c7669f9a49808d777a20d
[5]: https://www.postgresql.org/message-id/20180211211235.5x3jywe5z3lkgcsr%40alap3.anarazel.de
[6]: https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us
Thanks,
--
kou
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Extract-COPY-TO-format-implementations.patch | text/x-patch | 17.2 KB |