Re: New "raw" COPY format

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Masahiko Sawada" <sawada(dot)mshk(at)gmail(dot)com>
Cc: "jian he" <jian(dot)universality(at)gmail(dot)com>, "Tatsuo Ishii" <ishii(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New "raw" COPY format
Date: 2024-10-30 08:14:41
Message-ID: 27e746b3-ed21-47e2-9e53-94aac4cf45ef@app.fastmail.com
Lists: pgsql-hackers

On Tue, Oct 29, 2024, at 17:48, Joel Jacobson wrote:
>> ---
>> +/*
>> + * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode
>> + */
>> +static bool
>> +CopyReadLineRawText(CopyFromState cstate)
>>
>> This function has a lot of duplication with CopyReadLineText(). I
>> think it's better to modify CopyReadLineText() to support 'raw'
>> format, rather than adding a separate function.
>
> Hmm, there is a bit of duplication, yes, but is also a hot-path,
> so I think we want to minimize branches and code size in the
> hot loop.
>
> Combining them into one function, would mean the total function
> size and branching increases for both cases.
>
> I haven't made any benchmarks on this though.

I ran some benchmarks.

Integrating 'raw' into CopyReadLineText() (v17 patch) seems to cause a noticeable slowdown for csv and text, about 3% on average, while raw itself stays within about 1%:

v16 = separate functions for csv/text vs raw
v17 = same function for csv/text/raw

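The variance among the measurements is small (stddev of roughly 24 to 33,
versus a gap of roughly 77 to 97 between the v16 and v17 averages for
csv and text), so the difference seems significant.

However, as Tomas Vondra discovered [1], binary layout matters, so the
observed differences could be a layout artifact. We would need to compile
with BOLT to increase confidence.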

Here is how I benchmarked:

$ cat /data/pg-dev-data/postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
max_wal_size = '10GB'
autovacuum = 'off'

$ for n in `seq 1 3` ; do dropdb "$USER" ; createdb && pg_ctl restart && psql -a -f bench.sql | grep -E '^copy log from' -A 2 | ./parse_logs.py $n "v17" >> bench.csv ; done

$ ./plot_bench.py

$ psql -f bench_result.sql

 format | version |   min    | min_change |   avg   | avg_change |   max    | max_change | stddev
--------+---------+----------+------------+---------+------------+----------+------------+--------
 csv    | v16     | 3138.921 |            | 3167.41 |            | 3238.590 |            |  28.07
 csv    | v17     | 3223.475 |      1.027 | 3264.23 |      1.031 | 3325.419 |      1.027 |  32.13
 raw    | v16     | 1989.118 |            | 2018.94 |            | 2092.347 |            |  28.66
 raw    | v17     | 1999.410 |      1.005 | 2037.40 |      1.009 | 2105.216 |      1.006 |  33.38
 text   | v16     | 2653.829 |            | 2688.66 |            | 2764.434 |            |  33.39
 text   | v17     | 2728.067 |      1.028 | 2765.92 |      1.029 | 2821.602 |      1.021 |  24.44
(6 rows)
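
For readers checking the numbers: the *_change columns are consistent with the v17 value divided by the v16 value for the same format (for example, 3264.23 / 3167.41 is about 1.031 for the csv average). Below is a minimal Python sketch of that kind of aggregation. It is not the attached bench_result.sql; the `summarize` function and the (format, version, elapsed) sample layout are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(samples, baseline="v16"):
    """Aggregate timings per (format, version) and compare against a baseline.

    `samples` is an iterable of (format, version, elapsed) tuples.
    This layout is hypothetical; the real bench.csv columns may differ.
    """
    groups = defaultdict(list)
    for fmt, version, elapsed in samples:
        groups[(fmt, version)].append(elapsed)

    rows = []
    for (fmt, version), times in sorted(groups.items()):
        row = {
            "format": fmt,
            "version": version,
            "min": min(times),
            "avg": mean(times),
            "max": max(times),
            # Sample standard deviation needs at least two measurements.
            "stddev": stdev(times) if len(times) > 1 else 0.0,
        }
        base = groups.get((fmt, baseline))
        if version != baseline and base:
            # Ratio > 1 means this version is slower than the baseline.
            row["min_change"] = round(row["min"] / min(base), 3)
            row["avg_change"] = round(row["avg"] / mean(base), 3)
            row["max_change"] = round(row["max"] / max(base), 3)
        rows.append(row)
    return rows
```

With ratios like these, a value of 1.031 reads as "about 3.1% slower than v16", which is how the table above is interpreted in this mail.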

/Joel

[1] https://vondra.me/posts/playing-with-bolt-and-postgres/

Attachment Content-Type Size
plot_bench.py text/x-python-script 864 bytes
bench.csv text/csv 4.1 KB
parse_logs.py text/x-python-script 3.8 KB
bench.sql application/octet-stream 2.8 KB
v17-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch application/octet-stream 18.8 KB
v17-0002-Add-raw-format-to-COPY-command.patch application/octet-stream 52.1 KB
v17-0003-Reorganize-option-validations.patch application/octet-stream 19.9 KB
bench_result.sql application/octet-stream 655 bytes