From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-06-10 12:45:01
Message-ID: CAFBsxsH5=FDTHfS8j7Nfn6whzdiayXxqoykAXsNzegm=-iRe+g@mail.gmail.com
Lists: pgsql-hackers
On Wed, Jun 9, 2021 at 7:02 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> What is the worst case scenario for this algorithm? Something where the
> new fast ASCII check never helps, but is as fast as possible with the
> old code. For that, I added a repeating pattern of '123456789012345ä' to
> the test set (these results are from my Intel laptop, not the raspberry
> pi):
>
> Master:
>
> chinese | mixed | ascii | mixed2
> ---------+-------+-------+--------
> 1333 | 757 | 410 | 573
> (1 row)
>
> v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:
>
> chinese | mixed | ascii | mixed2
> ---------+-------+-------+--------
> 942 | 470 | 66 | 1249
> (1 row)
I get a much smaller regression on my laptop with clang 12:
master:
chinese | mixed | ascii | mixed2
---------+-------+-------+--------
978 | 685 | 370 | 452
v11-0001:
chinese | mixed | ascii | mixed2
---------+-------+-------+--------
686 | 438 | 64 | 595
> So there's a regression with that input. Maybe that's acceptable, this
> is the worst case, after all. Or you could tweak check_ascii for a
> different performance tradeoff, by checking the two 64-bit words
> separately and returning "8" if the failure happens in the second word.
For v12 (unformatted and without 0002 rebased) I tried the following:
--
highbits_set = half1 & UINT64CONST(0x8080808080808080);
if (highbits_set)
	return 0;

x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
x1 &= UINT64CONST(0x8080808080808080);
if (x1 != UINT64CONST(0x8080808080808080))
	return 0;

/*
 * Now we know we have at least 8 bytes of valid ascii, so if any of these
 * tests fails, return that.
 */
highbits_set = half2 & UINT64CONST(0x8080808080808080);
if (highbits_set)
	return sizeof(uint64);

x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
x2 &= UINT64CONST(0x8080808080808080);
if (x2 != UINT64CONST(0x8080808080808080))
	return sizeof(uint64);

return 2 * sizeof(uint64);
--
and got this:
chinese | mixed | ascii | mixed2
---------+-------+-------+--------
674 | 499 | 170 | 421
Pure ascii is significantly slower, but the regression is gone.
I used the string repeat('123456789012345ä', 3647) to match the ~62000
bytes in the other strings (62000 / 17 = 3647).
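
To read the fragment above in isolation, here's a self-contained rendering
of the same two-word check in plain C (a sketch only: the function name, the
memcpy loads, and the stdint types are my stand-ins, not what the patch
actually uses):
--
#include <stdint.h>
#include <string.h>

/*
 * Return how many of the next 16 bytes are known to be non-NUL ASCII:
 * 16 if both 8-byte halves pass, 8 if only the first does, 0 otherwise.
 */
static int
check_ascii16(const unsigned char *s)
{
	uint64_t	half1,
				half2,
				x;

	memcpy(&half1, s, sizeof(half1));
	memcpy(&half2, s + sizeof(half1), sizeof(half2));

	/* Any high bit in the first half means a non-ASCII byte. */
	if (half1 & UINT64_C(0x8080808080808080))
		return 0;

	/*
	 * Zero-byte check: every byte is <= 0x7f at this point, so adding 0x7f
	 * sets the high bit for every byte except 0x00.
	 */
	x = (half1 + UINT64_C(0x7f7f7f7f7f7f7f7f)) & UINT64_C(0x8080808080808080);
	if (x != UINT64_C(0x8080808080808080))
		return 0;

	/* Same tests on the second half; the first 8 bytes are already good. */
	if (half2 & UINT64_C(0x8080808080808080))
		return 8;
	x = (half2 + UINT64_C(0x7f7f7f7f7f7f7f7f)) & UINT64_C(0x8080808080808080);
	if (x != UINT64_C(0x8080808080808080))
		return 8;

	return 16;
}
--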
> And I haven't tried the SSE patch yet, maybe that compensates for this.
I would expect that this case is identical to all-multibyte. The worst case
for SSE might be alternating 16-byte chunks of ascii-only and chunks of
multibyte, since that's one of the few places it branches. In simdjson,
they check ascii on 64-byte blocks at a time ((c1 | c2) | (c3 | c4)) and
check only the previous block's "chunk 4" for incomplete sequences at the
end. It's a bit messier, so I haven't done it, but it's an option.
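As a rough illustration of that 64-byte ascii check (my own sketch of the
idea using SSE2 intrinsics; not simdjson's code and not part of the patch):
--
#include <emmintrin.h>			/* SSE2 */
#include <stdbool.h>

/*
 * Return true if the next 64 bytes are all ASCII: OR the four 16-byte
 * chunks together and look at the high bit of each byte of the result.
 */
static bool
is_ascii64(const unsigned char *s)
{
	__m128i		c1 = _mm_loadu_si128((const __m128i *) (s + 0));
	__m128i		c2 = _mm_loadu_si128((const __m128i *) (s + 16));
	__m128i		c3 = _mm_loadu_si128((const __m128i *) (s + 32));
	__m128i		c4 = _mm_loadu_si128((const __m128i *) (s + 48));
	__m128i		all = _mm_or_si128(_mm_or_si128(c1, c2),
								   _mm_or_si128(c3, c4));

	/* _mm_movemask_epi8 gathers the sign bit of each of the 16 bytes. */
	return _mm_movemask_epi8(all) == 0;
}
--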
Also, if SSE is accepted into the tree, then the C fallback is only
important on platforms like PowerPC64 and Arm64, so we can make
the tradeoff by testing those more carefully. I'll test on PowerPC soon.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment: v12-Rewrite-pg_utf8_verifystr-for-speed.patch (application/octet-stream, 20.9 KB)