From: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [POC] verifying UTF-8 using SIMD instructions |
Date: | 2021-02-13 01:31:33 |
Message-ID: | CAFBsxsEChYg6Bh4iS_Sc2Kt1N=c_92tgdkX-hABiu6SSpeEpnA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
> mixed | ascii
> -------+-------
> 447 | 46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar to your patch: you load the data 8 bytes at a time, check if
> they are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.
Attached is a patch that does roughly what the simdjson fallback does, except
that I use straight tests on the bytes and only calculate code points in
assertion builds. In the course of doing this, I found that my earlier
concerns about putting the ascii check in a static inline function were due
to my suboptimal loop implementation. I had assumed that if the chunked ascii
check failed, the loop had to check all the bytes in that chunk one at a
time. As it turns out, that's a waste of the branch predictor. In the v2
patch, we instead do the chunked ascii check every time we loop. With that, I
can also confirm the claim in the Lemire paper that it's better to do the
check on 16-byte chunks:
(MacOS, Clang 10)
master:
chinese | mixed | ascii
---------+-------+-------
1081 | 761 | 366
v2 patch, with 16-byte stride:
chinese | mixed | ascii
---------+-------+-------
806 | 474 | 83
v2 patch, but with 8-byte stride:
chinese | mixed | ascii
---------+-------+-------
792 | 490 | 105
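
For readers without the patch at hand, the loop shape described above can be
sketched roughly as follows. This is a minimal illustration and not the code
in the attached patch: the names sketch_chunk_is_ascii, sketch_verifychar and
sketch_verifystr are hypothetical, and the single-character verifier is a
simplified stand-in that uses straight byte tests. The point is that the
chunked ascii check is retried on every iteration, so a failed chunk costs
only one character's worth of byte-at-a-time work before the fast path gets
another chance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIDE 16				/* chunk size; 8 would use a single word */
#define HIGH_BITS UINT64_C(0x8080808080808080)
#define IS_CONT(b) (((b) & 0xC0) == 0x80)	/* UTF-8 continuation byte? */

/* Return 1 if the STRIDE bytes at s are all ASCII and none is \0. */
static inline int
sketch_chunk_is_ascii(const unsigned char *s)
{
	uint64_t	half1, half2;

	memcpy(&half1, s, 8);
	memcpy(&half2, s + 8, 8);

	/* Any high bit set means a non-ASCII byte is present. */
	if ((half1 | half2) & HIGH_BITS)
		return 0;

	/*
	 * All bytes are <= 0x7F here, so adding 0x7F to each byte cannot carry
	 * into its neighbor, and sets the high bit exactly when the byte is
	 * nonzero.
	 */
	half1 += UINT64_C(0x7F7F7F7F7F7F7F7F);
	half2 += UINT64_C(0x7F7F7F7F7F7F7F7F);
	return ((half1 & half2) & HIGH_BITS) == HIGH_BITS;
}

/* Return the length of the valid character at s, or -1 if invalid. */
static int
sketch_verifychar(const unsigned char *s, size_t len)
{
	unsigned char b1 = s[0];

	if (b1 == 0)
		return -1;				/* embedded \0 is not allowed */
	if (b1 < 0x80)
		return 1;
	if (b1 >= 0xC2 && b1 <= 0xDF)
		return (len >= 2 && IS_CONT(s[1])) ? 2 : -1;
	if (b1 >= 0xE0 && b1 <= 0xEF)
	{
		if (len < 3 || !IS_CONT(s[1]) || !IS_CONT(s[2]))
			return -1;
		if (b1 == 0xE0 && s[1] < 0xA0)
			return -1;			/* overlong encoding */
		if (b1 == 0xED && s[1] > 0x9F)
			return -1;			/* UTF-16 surrogate */
		return 3;
	}
	if (b1 >= 0xF0 && b1 <= 0xF4)
	{
		if (len < 4 || !IS_CONT(s[1]) || !IS_CONT(s[2]) || !IS_CONT(s[3]))
			return -1;
		if (b1 == 0xF0 && s[1] < 0x90)
			return -1;			/* overlong encoding */
		if (b1 == 0xF4 && s[1] > 0x8F)
			return -1;			/* beyond U+10FFFF */
		return 4;
	}
	return -1;					/* 0x80..0xC1 and 0xF5..0xFF never lead */
}

/* Return the number of leading bytes of s that are valid. */
static size_t
sketch_verifystr(const unsigned char *s, size_t len)
{
	const unsigned char *start = s;

	assert(s != NULL || len == 0);
	while (len > 0)
	{
		int			l;

		/* Try the fast path on every iteration, not just the first. */
		if (len >= STRIDE && sketch_chunk_is_ascii(s))
		{
			s += STRIDE;
			len -= STRIDE;
			continue;
		}
		/* Otherwise verify exactly one character, then loop again. */
		l = sketch_verifychar(s, len);
		if (l < 0)
			break;
		s += l;
		len -= (size_t) l;
	}
	return (size_t) (s - start);
}
```

Changing STRIDE to 8 (and dropping the second word) gives the 8-byte variant
measured below the 16-byte one; the loop itself stays the same.
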
I also added the ascii fast path to all other multibyte encodings, and that
also performs well. It regresses from master on pure multibyte input, but
that case is still faster than PG13, which I simulated by reverting
6c5576075b0f9 and b80e10638e3:
~PG13:
chinese | mixed | ascii
---------+-------+-------
1565 | 848 | 365
ascii fast-path plus pg_*_verifychar():
chinese | mixed | ascii
---------+-------+-------
1279 | 656 | 94
v2 also makes a rough start at supporting multiple implementations in
src/backend/port. Next steps are:
1. Add more tests for utf-8 coverage (in addition to the ones to be added
by the noError argument patch)
2. Add SSE4 validator -- it turns out the demo I referred to earlier
doesn't match the algorithm in the paper. I plan to only copy the lookup
tables from simdjson verbatim, but the code will basically be written from
scratch, using simdjson as a hint.
3. Adjust configure.ac
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v2-add-portability-stub-and-new-fallback.patch | application/octet-stream | 13.7 KB |