From: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Greg Stark <stark(at)mit(dot)edu>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: speed up verifying UTF-8 |
Date: | 2021-06-06 19:21:51 |
Message-ID: | CAFBsxsGSnBnHfJ7D6Vs5bzYK=syCXf75-e6zLOV93AQ7hTt9jg@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Jun 3, 2021 at 3:22 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 03/06/2021 22:16, Heikki Linnakangas wrote:
> > On 03/06/2021 22:10, John Naylor wrote:
> >> On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> >> > x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
> >> > x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
> >> >
> >> > /* then check that the high bit is set in each byte. */
> >> > x = (x1 | x2);
> >> > x &= UINT64CONST(0x8080808080808080);
> >> > if (x != UINT64CONST(0x8080808080808080))
> >> > return 0;
> If you replace (x1 | x2) with (x1 & x2) above, I think it's correct.

After looking at it again with fresh eyes, I agree this is correct. I
modified the regression tests to pad the input bytes with ASCII so that the
code path that works on 16 bytes at a time is tested. I use both UTF-8
input tables for some of the additional tests. There is a de facto
requirement that the descriptions be unique across both input tables.
That could be done more elegantly, but I wanted to keep things simple for
now.
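
To illustrate why (x1 & x2) is needed rather than (x1 | x2), here is a
minimal standalone sketch of the zero-byte check on a 16-byte chunk. It is
not the code from the patch: the function name chunk_is_nonzero_ascii, the
use_or flag, and the macro names are made up for this example, and it
assumes the high-bit (non-ASCII) check happens first, so that adding 0x7f
to a byte can never carry into its neighbor.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define HIGH_BITS UINT64_C(0x8080808080808080)
#define LOW7_BITS UINT64_C(0x7f7f7f7f7f7f7f7f)

/*
 * Hypothetical helper: returns true if all 16 bytes at s are in the range
 * 0x01..0x7f. The use_or flag selects the earlier (x1 | x2) variant, for
 * comparison only.
 */
static bool
chunk_is_nonzero_ascii(const unsigned char *s, bool use_or)
{
	uint64_t	half1;
	uint64_t	half2;
	uint64_t	x1;
	uint64_t	x2;
	uint64_t	x;

	memcpy(&half1, s, sizeof(half1));
	memcpy(&half2, s + sizeof(half1), sizeof(half2));

	/* Any byte with the high bit set is not ASCII. */
	if ((half1 | half2) & HIGH_BITS)
		return false;

	/*
	 * Every byte is now <= 0x7f, so adding 0x7f cannot overflow a byte.
	 * The high bit of each result byte is set iff the input byte was
	 * non-zero.
	 */
	x1 = half1 + LOW7_BITS;
	x2 = half2 + LOW7_BITS;

	/*
	 * AND requires the high bit in the corresponding byte of both halves.
	 * OR lets a non-zero byte in one half hide a zero byte in the other.
	 */
	x = use_or ? (x1 | x2) : (x1 & x2);

	return (x & HIGH_BITS) == HIGH_BITS;
}
```

With a chunk of sixteen 'a' bytes where one byte is replaced by zero, the
AND variant rejects the chunk, while the OR variant accepts it because the
byte at the same position in the other half is non-zero.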
v11-0001 is an improvement over v10:

clang 12.0.5 / MacOS:

master:
 chinese | mixed | ascii
---------+-------+-------
     975 |   686 |   369

v10-0001:
 chinese | mixed | ascii
---------+-------+-------
     930 |   549 |   109

v11-0001:
 chinese | mixed | ascii
---------+-------+-------
     687 |   440 |    64

gcc 4.8.5 / Linux (older machine):

master:
 chinese | mixed | ascii
---------+-------+-------
    2559 |  1495 |   825

v10-0001:
 chinese | mixed | ascii
---------+-------+-------
    2966 |  1034 |   156

v11-0001:
 chinese | mixed | ascii
---------+-------+-------
    2242 |   824 |   140

Previous testing on POWER8 and Arm64 leads me to expect similar results
there as well.

I also looked again at 0002 and decided I wasn't quite happy with the test
coverage. Previously, the code padded out a short input with ASCII so that
the 16-bytes-at-a-time code path was always exercised. However, that
required some finicky complexity and still wasn't adequate. For v11, I
ripped that out and put the responsibility on the regression tests to make
sure the various code paths are exercised.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch | application/octet-stream | 21.6 KB |
v11-0002-Use-SSE-instructions-for-pg_utf8_verifystr-where.patch | application/octet-stream | 49.0 KB |