Re: Make COPY format extendable: Extract COPY TO format implementations

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Sutou Kouhei <kou(at)clear-code(dot)com>
Cc: andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, zhjwpku(at)gmail(dot)com, andrew(at)dunslane(dot)net, nathandbossart(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2024-07-29 12:17:08
Message-ID: 26541788-8853-4d93-86cd-5f701b13ae51@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/25/24 06:51, Sutou Kouhei wrote:
> Hi,
>
> ...
>
> Here are related information for this benchmark/profile:
>
> * Use -O2 for optimization build flag
> ("meson setup --buildtype=release" may be used)
> * Use tmpfs for PGDATA
> * Disable fsync
> * Run on scissors (what is "scissors" in this context...?) [9]
> * Unlogged table may be used
> * Use a table that has 30 integer columns (*1)
> * Use 5M rows (*2)
> * Use '/dev/null' for COPY TO (*3)
> * Use blackhole_am for COPY FROM (*4)
> https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am
> * perf is used but used options are unknown (sorry)
>
>
> (*1) This SQL may be used to create the table:
>
> CREATE OR REPLACE FUNCTION create_table_cols(tabname text, num_cols int)
> RETURNS VOID AS
> $func$
> DECLARE
> query text;
> BEGIN
> query := 'CREATE UNLOGGED TABLE ' || tabname || ' (';
> FOR i IN 1..num_cols LOOP
> query := query || 'a_' || i::text || ' int default 1';
> IF i != num_cols THEN
> query := query || ', ';
> END IF;
> END LOOP;
> query := query || ')';
> EXECUTE format(query);
> END
> $func$ LANGUAGE plpgsql;
> SELECT create_table_cols ('to_tab_30', 30);
> SELECT create_table_cols ('from_tab_30', 30);
>
>
> (*2) This SQL may be used to insert 5M rows:
>
> INSERT INTO to_tab_30 SELECT FROM generate_series(1, 5000000);
>
>
> (*3) This SQL may be used for COPY TO:
>
> COPY to_tab_30 TO '/dev/null' WITH (FORMAT text);
>
>
> (*4) This SQL may be used for COPY FROM:
>
> CREATE EXTENSION blackhole_am;
> ALTER TABLE from_tab_30 SET ACCESS METHOD blackhole_am;
> COPY to_tab_30 TO '/tmp/to_tab_30.txt' WITH (FORMAT text);
> COPY from_tab_30 FROM '/tmp/to_tab_30.txt' WITH (FORMAT text);
>

Thanks for the benchmark instructions and updated patches. Very helpful!

I wrote a simple script to automate the benchmark - it just runs these
tests with different parameters (number of columns and number of
imported/exported rows). See the run.sh attachment, along with two CSV
results from current master and with all patches applied.

The attached PDF has a simple summary, with a median duration for each
combination, and a comparison (patched/master). The results are from my
laptop, so it's probably noisy, and it would be good to test it on a
more realistic hardware (for perf-sensitive things).

- For COPY FROM there is no difference - the results are within 1% of
master, and there's no systemic difference.

- For COPY TO it's a different story, though. There's a pretty clear
regression, by ~5%. It's a bit interesting the correlation with the
number of columns is not stronger ...

I did do some basic profiling, and the perf diff looks like this:

# Event 'task-clock:upppH'
#
# Baseline Delta Abs Shared Object Symbol

# ........ ......... .............
.........................................
#
13.34% -12.94% postgres [.] CopyOneRowTo
+10.75% postgres [.] CopyToTextOneRow
4.31% +2.84% postgres [.] pg_ltoa
10.96% +1.15% postgres [.] CopySendChar
8.68% +0.78% postgres [.] AllocSetAlloc
10.89% -0.70% postgres [.] CopyAttributeOutText
5.01% -0.47% postgres [.] enlargeStringInfo
4.95% -0.42% postgres [.] OutputFunctionCall
5.29% -0.37% postgres [.] int4out
5.90% -0.31% postgres [.] appendBinaryStringInfo
+0.29% postgres [.] CopyToStateFlush
0.27% -0.27% postgres [.] memcpy(at)plt

Not particularly surprising that CopyToTextOneRow has +11%, but that's
because it's a new function. The perf difference is perhaps due to
pg_ltoa/CopySendChar, but not sure why.

I also did some flamegraph - attached is for master, patched and diff.

It's interesting the main change in the flamegraphs is CopyToStateFlush
pops up on the left side. Because, what is that about? That is a thing
introduced in the 0005 patch, so maybe the regression is not strictly
about the existing formats moving to the new API, but due to something
else in a later version of the patch?

It would be good do run the tests for each patch in the series, and then
see when does the regression actually appear.

FWIW None of this actually proves this is an issue in practice. No one
will be exporting into /dev/null or importing into blackhole, and I'd
bet the difference gets way smaller for more realistic cases.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
copy-benchmark.pdf application/pdf 73.1 KB
master.csv text/csv 32.8 KB
patched.csv text/csv 32.8 KB
diff.svg image/svg+xml 51.7 KB
master.svg image/svg+xml 51.5 KB
patched.svg image/svg+xml 50.0 KB
run.sh application/x-shellscript 1.1 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Previous Message Zhijie Hou (Fujitsu) 2024-07-29 12:14:48 RE: Conflict detection and logging in logical replication