Re: [POC] verifying UTF-8 using SIMD instructions

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] verifying UTF-8 using SIMD instructions
Date: 2021-02-08 13:14:44
Message-ID: CAFBsxsHVXK_e5=yCdrB_vSUdV9waZf1tqh5SJpoPW65_Y_bm0A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's
> algorithm from the simdjson library [1], see attached patch. I
> microbenchmarked it using the the same test I used before [2].

I've been looking at various iterations of Lemire's utf8 code, and trying
it out was next on my list, so thanks for doing that!

> These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1
> 20210110"
>
> unpatched master:
>
> postgres=# \i mbverifystr-speed.sql
> CREATE FUNCTION
> mixed | ascii
> -------+-------
> 728 | 393
> (1 row)
>
> v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch:
>
> mixed | ascii
> -------+-------
> 759 | 98
> (1 row)

Hmm, the mixed case got worse -- I haven't seen that in any of my tests.

> simdjson-utf8-hack.patch:
>
> mixed | ascii
> -------+-------
> 53 | 31
> (1 row)
>
> So clearly that algorithm is fast. Not sure if it has a high startup
> cost, or large code size, or other tradeoffs that we don't want.

The simdjson lib uses everything up through AVX512 depending on what
hardware is available. I seem to remember reading that high start-up cost
is more relevant to floating point than to integer ops, but I could be
wrong. Just the utf8 portion is surely tiny also.

> At
> least it depends on SIMD instructions, so it requires more code for the
> architecture-specific implementations and autoconf logic and all that.

One of his earlier demos [1] (in simdutf8check.h) had a version that used
mostly SSE2 with just three intrinsics from SSSE3. That's widely available
by now. He measured that at 0.7 cycles per byte, which is still good
compared to AVX2 0.45 cycles per byte [2].

Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would
assume that if that check (and the corresponding runtime check) passes, we
can assume SSE2. That code has three licenses to choose from -- Apache 2,
Boost, and MIT. Something like that might be straightforward to start
from. I think the only obstacles to worry about are license and getting it
to fit into our codebase. Adding more than zero high-level comments with a
good description of how it works in detail is also a bit of a challenge.

> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
> mixed | ascii
> -------+-------
> 447 | 46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar your patch: you load the data 8 bytes at a time, check if
> there are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

Okay, I'll look into that.

> PS. Your patch as it stands isn't safe on systems with strict alignment,
> the string passed to the verify function isn't guaranteed to be 8 bytes
> aligned. Use memcpy to fetch the next 8-byte chunk to fix.

Will do.

[1] https://github.com/lemire/fastvalidate-utf-8/tree/master/include
[2]
https://lemire.me/blog/2018/10/19/validating-utf-8-bytes-using-only-0-45-cycles-per-byte-avx-edition/

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2021-02-08 14:03:50 Re: Parallel INSERT (INTO ... SELECT ...)
Previous Message Tang, Haiying 2021-02-08 12:12:35 RE: Support tab completion for upper character inputs in psql