From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Perform COPY FROM encoding conversions in larger chunks |
Date: | 2021-01-28 11:36:04 |
Message-ID: | 06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi |
Lists: | pgsql-hackers |
On 28/01/2021 01:23, John Naylor wrote:
> Hi Heikki,
>
> 0001 through 0003 are straightforward, and I think they can be committed
> now if you like.
Thanks for the review!
I did some more rigorous microbenchmarking of patches 1 and 2. I used the
attached test script, which calls the convert_from() function to perform
UTF-8 verification on two large strings, about 60kb each. One of the
strings is pure ASCII, and the other is an HTML page that contains a mix
of ASCII and multibyte characters.
Compiled with "gcc -O2", gcc version 10.2.1 20210110 (Debian 10.2.1-6)
           | mixed | ascii
-----------+-------+-------
 master    |  1866 |  1250
 patch 1   |   959 |   507
 patch 1+2 |  1396 |   987
So, the first patch,
0001-Add-new-mbverifystr-function-for-each-encoding.patch, made a huge
difference, even with pure ASCII input. That's very surprising, because
there is already a fast-path for pure-ASCII input in pg_verify_mbstr_len().
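To make the "fast-path for pure-ASCII input" concrete, here is a minimal sketch of a UTF-8 verification loop that handles ASCII bytes before doing any multibyte work. This is not the code in PostgreSQL or in either patch; the names sketch_utf8_verify() and utf8_seq_len() are hypothetical, and treating a zero byte as invalid is an assumption made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a lead byte, 0 if invalid. */
static int
utf8_seq_len(unsigned char c)
{
    if (c < 0x80)
        return 1;
    if (c >= 0xC2 && c <= 0xDF)
        return 2;
    if (c >= 0xE0 && c <= 0xEF)
        return 3;
    if (c >= 0xF0 && c <= 0xF4)
        return 4;
    return 0;                   /* stray continuation byte or invalid lead byte */
}

/* Returns the number of leading bytes of s that form valid UTF-8. */
size_t
sketch_utf8_verify(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    assert(s != NULL || len == 0);
    while (i < len)
    {
        unsigned char c = s[i];
        int         n;
        int         j;

        /* Fast path: a non-zero ASCII byte needs no further checks. */
        if (c < 0x80)
        {
            if (c == 0)
                break;          /* assumption: embedded NUL is rejected */
            i++;
            continue;
        }

        /* Slow path: validate one multibyte sequence. */
        n = utf8_seq_len(c);
        if (n == 0 || (size_t) n > len - i)
            break;              /* invalid lead byte or truncated sequence */
        for (j = 1; j < n; j++)
        {
            if ((s[i + j] & 0xC0) != 0x80)
                break;          /* not a continuation byte */
        }
        if (j < n)
            break;

        /* Reject overlong forms, surrogates and code points above U+10FFFF. */
        if ((c == 0xE0 && s[i + 1] < 0xA0) ||
            (c == 0xED && s[i + 1] > 0x9F) ||
            (c == 0xF0 && s[i + 1] < 0x90) ||
            (c == 0xF4 && s[i + 1] > 0x8F))
            break;

        i += n;
    }
    return i;
}
```

The input is fully valid when the return value equals len. With a branch like this at the top of the loop, pure-ASCII input never reaches the multibyte checks, which is why a large change in the ascii column is unexpected.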
Even more surprising was that the second patch
(0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch)
actually made things worse again. I thought it would give a modest gain,
but nope.
It seems to me that GCC is not doing a good job at optimizing the loop in
pg_verify_mbstr(). The first patch fixes that, but the second patch
somehow trips up GCC again.
So I also tried this with "gcc -O3" and clang:
Compiled with "gcc -O3"
           | mixed | ascii
-----------+-------+-------
 master    |  1522 |  1225
 patch 1   |   753 |   507
 patch 1+2 |   868 |   507
Compiled with "clang -O2", Debian clang version 11.0.1-2
           | mixed | ascii
-----------+-------+-------
 master    |  1257 |   520
 patch 1   |   899 |   507
 patch 1+2 |   884 |   508
With gcc -O3, the results are a bit better, but still the second patch seems
harmful. With clang, I got the result I expected: Almost no difference
with pure-ASCII input, because there's already a fast-path for that, and
a nice speedup with multibyte characters. Still, I was surprised how big
the speedup from the first patch was, and how little additional gain the
second patch gives.
Based on these results, I'm going to commit the first patch, but not the
second one. There are much faster UTF-8 verification routines out there,
using SIMD instructions and whatnot, and we should consider adopting one
of those, but that's future work.
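As a rough illustration of why such routines can be faster, the sketch below checks eight bytes per iteration for any non-ASCII byte instead of one. It is plain portable C, not real SIMD, and it is not taken from any of the patches or from an existing library; the function name sketch_ascii_prefix_len() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the length of the leading run of ASCII bytes (< 0x80) in s. */
size_t
sketch_ascii_prefix_len(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    assert(s != NULL || len == 0);

    /* Examine 8 bytes at a time: any set high bit means a non-ASCII byte. */
    while (len - i >= 8)
    {
        uint64_t    chunk;

        memcpy(&chunk, s + i, 8);   /* avoids unaligned-access problems */
        if (chunk & UINT64_C(0x8080808080808080))
            break;                  /* let the byte loop find the exact spot */
        i += 8;
    }

    /* Finish the tail, or locate the first non-ASCII byte in the chunk. */
    while (i < len && s[i] < 0x80)
        i++;

    return i;
}
```

A verifier could call something like this first and fall back to a byte-at-a-time loop from the returned offset. Real SIMD implementations go further and validate multibyte sequences in wide registers as well.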
- Heikki
Attachment | Content-Type | Size |
---|---|---|
mbverifystr-speed.sql | application/sql | 942 bytes |