Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: speed up verifying UTF-8
Date: 2021-12-13 15:39:37
Message-ID: CAFBsxsHG=g6W8Mie+_NO8dV6O0pO2stxrnS=me5ZmGqk--fd5g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 10, 2021 at 2:33 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> I had another look at this now. Looks good, just a few minor comments below:

Thanks for reviewing! I've attached v25 to address your points.

> This function assumes that the input len is a multiple of 8. There's an
> assertion for that, but it would be good to also mention it in the
> function comment. It took me a moment to realize that.

Done.

> Given that assumption, I wonder if "while (len >= 0)" would be marginally
> faster. Or compute "s_end = s + len" first, and check for "while (s <
> s_end)", so that you don't need to update 'len' in the loop.

With two chunks, gcc 4.8.5/11.2 and clang 12 will unroll the inner
loop, so it doesn't matter:

L51:
mov rdx, QWORD PTR [rdi]
mov rsi, QWORD PTR [rdi+8]
lea rax, [rdx+rbx]
lea rbp, [rsi+rbx]
and rax, rbp
and rax, r11
cmp rax, r11
jne .L66
or rdx, rsi
test rdx, r11
jne .L66
sub r8d, 16 ; refers to "len" in the caller pg_utf8_verifystr()
add rdi, 16
cmp r8d, 15
jg .L51

I *think* these are the same instructions as in your version from
some time ago that handled two integers explicitly -- I rewrote it
like this to test different chunk sizes.
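
For context, the C shape of that inner loop was roughly as follows (a
paraphrased sketch with illustrative names, reconstructed to match the
assembly above -- not the exact code in the patch):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Check 16 bytes per iteration as two 8-byte chunks. Adding 0x7f to
 * each byte sets its high bit unless the byte was zero, so masking
 * the high bits of the two sums detects embedded zeros; OR-ing the
 * raw chunks detects bytes that already had the high bit set. (If an
 * input byte has its high bit set, the sums can carry and the zero
 * check can misfire, but then the high-bit check fails the input
 * anyway, so the overall answer is still correct.)
 */
static bool
ascii_loop_sketch(const unsigned char *s, int len)
{
	while (len >= 16)
	{
		uint64_t	c1,
					c2;

		memcpy(&c1, s, sizeof(c1));
		memcpy(&c2, s + 8, sizeof(c2));

		if (((c1 + 0x7f7f7f7f7f7f7f7fULL) &
			 (c2 + 0x7f7f7f7f7f7f7f7fULL) &
			 0x8080808080808080ULL) != 0x8080808080808080ULL)
			return false;		/* embedded zero byte */
		if ((c1 | c2) & 0x8080808080808080ULL)
			return false;		/* high bit set: non-ASCII */

		s += 16;
		len -= 16;
	}
	return true;
}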

(Aside on 32-byte strides: four chunks was within the noise level of
two chunks on the platform I tested. A 32-byte stride also increases
the chance that mixed input contains non-ASCII bytes and defeats this
optimization, so it would need to be significantly faster to make up
for that. Along those lines, in the future we could consider SSE2
(unrolled 2 x 16 bytes) for this path; see the sketch below. Since
SSE2 is part of the x86-64 spec, we wouldn't need a runtime check --
just #ifdef it inline. And we could piggy-back on the CRC SSE4.2
configure test for intrinsic support, which would avoid adding a
bunch of complexity.)
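
To sketch what that SSE2 variant might look like (hypothetical code,
just to illustrate the idea -- the function name and structure are
mine, not from the patch):

#include <stdbool.h>
#include <emmintrin.h>		/* SSE2 is guaranteed on x86-64 */

/*
 * Hypothetical 2 x 16-byte ASCII check. _mm_movemask_epi8() collects
 * the high bit of each byte, so a nonzero mask means either a
 * non-ASCII byte or (via the comparison against zero, which yields
 * 0xff on a match) an embedded zero byte.
 */
static bool
is_valid_ascii_sse2(const unsigned char *s, int len)
{
	const __m128i zero = _mm_setzero_si128();

	while (len >= 32)
	{
		__m128i		c1 = _mm_loadu_si128((const __m128i *) s);
		__m128i		c2 = _mm_loadu_si128((const __m128i *) (s + 16));
		__m128i		highbits = _mm_or_si128(c1, c2);
		__m128i		zeroes = _mm_or_si128(_mm_cmpeq_epi8(c1, zero),
										  _mm_cmpeq_epi8(c2, zero));

		if (_mm_movemask_epi8(_mm_or_si128(highbits, zeroes)) != 0)
			return false;

		s += 32;
		len -= 32;
	}
	return true;				/* caller handles any remaining bytes */
}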

That said, I think your suggestions are better on code clarity
grounds. I'm on the fence about "while (s < s_end)", so I went with
"while (len > 0)" because it matches the style in wchar.c.

> Also would be good to mention what exactly the return value means. I.e
> "returns false if the input contains any bytes with the high-bit set, or
> zeros".

Done.

> > + /*
> > + * Check if any high bits in the zero accumulator got cleared.
> > + *
> > + * XXX: As noted above, the zero check is only valid if the chunk had no
> > + * high bits set. However, the compiler may perform these two checks in
> > + * any order. That's okay because if any high bits were set, we would
> > + * return false regardless, so invalid results from the zero check don't
> > + * matter.
> > + */
> > + if (zero_cum != UINT64CONST(0x8080808080808080))
> > + return false;

> I don't understand the "the compiler may perform these checks in any
> order" comment. We trust the compiler to do the right thing, and only
> reorder things when it's safe to do so. What is special here, why is it
> worth mentioning here?

Ah, that's a good question, and now that you mention it, the comment
is silly. When looking at the assembly output a while back, I was a
bit astonished that it didn't match my mental model of what was
happening, so I made this note. I've removed the whole XXX comment
here and expanded the first comment in the loop to:

/*
* Capture any zero bytes in this chunk.
*
* First, add 0x7f to each byte. This sets the high bit in each byte,
* unless it was a zero. If any resulting high bits are zero, the
* corresponding high bits in the zero accumulator will be cleared.
*
* If none of the bytes in the chunk had the high bit set, the max
* value each byte can have after the addition is 0x7f + 0x7f = 0xfe,
* and we don't need to worry about carrying over to the next byte. If
* any input bytes did have the high bit set, it doesn't matter
* because we check for those separately.
*/
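
To spell out the arithmetic for a single byte position (illustrative
assertions only):

Assert((0x00 + 0x7f) == 0x7f);	/* zero byte: high bit stays clear, so
								 * the accumulator bit gets cleared */
Assert((0x01 + 0x7f) == 0x80);	/* nonzero: high bit set, accumulator
								 * bit preserved */
Assert((0x7f + 0x7f) == 0xfe);	/* largest ASCII sum: no carry into
								 * the neighboring byte */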

> > @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
> > return s - start;
> > }
> >
> > -static int
> > +static pg_noinline int
> > pg_utf8_verifychar(const unsigned char *s, int len)
> > {
> > int l;
>
> Why force it to not be inlined?

Since the only direct caller now uses it for small inputs, I thought
about saving space, but it's not enough to matter, so I'll go ahead
and leave it out. While at it, I removed the unnecessary "inline"
declaration for utf8_advance(), since the compiler can do that
anyway.

> > + * In a shift-based DFA, the input byte is an index into an array of integers
> > + * whose bit pattern encodes the state transitions. To compute the current
> > + * state, we simply right-shift the integer by the current state and apply a
> > + * mask. In this scheme, the address of the transition only depends on the
> > + * input byte, so there is better pipelining.
>
> Should be "To compute the *next* state, ...", I think.

Fixed.

> The way the state transition table works is pretty inscrutable. That's
> understandable, because the values were found by an SMT solver, so I'm
> not sure if anything can be done about it.

Do you mean in general, or just the state values?

Like any state machine, the code is simple, and the complexity is
hidden in the data. Hopefully the first link I included in the comment
is helpful.

The SMT solver was only needed to allow 32-bit (instead of 64-bit)
entries in the transition table, so it's not strictly necessary. A
lookup table that fits in 1kB is nice from a cache perspective,
however.

With 64-bit, the state values are less weird-looking but they're still
just arbitrary numbers. As long as ERR = 0 and the largest is at most
9, it doesn't matter what they are, so I'm not sure it's much less
mysterious. You can see the difference between 32-bit and 64-bit in
[1].
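
For illustration, the per-byte step of a shift-based DFA looks
something like this (a paraphrased sketch with a 64-bit table and
illustrative bit widths, not the exact code or values from the
patch):

#include <stdint.h>

/*
 * Each of the 256 table entries packs every state's transition for
 * one input byte: the next state for current state S sits at bit
 * offset S. State values are pre-shifted bit offsets (0, 6, 12, ...),
 * so one load, one shift, and one mask advance the machine. With
 * ERR = 0, the shift amount stays zero after an error, so the error
 * state is sticky.
 */
static inline uint64_t
dfa_step(uint64_t state, unsigned char byte, const uint64_t table[256])
{
	return (table[byte] >> state) & 63;
}

The input is valid if, after consuming every byte starting from the
initial state, the machine ends up back in the initial state (and in
particular never reached ERR).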

--
In addition to Heikki's review points, I've made a couple of small
additional changes from v24. I rewrote this part, so we don't need
these macros anymore:

- if (!IS_HIGHBIT_SET(*s) ||
- IS_UTF8_2B_LEAD(*s) ||
- IS_UTF8_3B_LEAD(*s) ||
- IS_UTF8_4B_LEAD(*s))
+ if (!IS_HIGHBIT_SET(*s) || pg_utf_mblen(s) > 1)

And I moved is_valid_ascii() to pg_wchar.h so it can be used
elsewhere. I'm not sure there's a better place to put it. I tried
using this for text_position(), for which I'll start a new thread.

[1] https://www.postgresql.org/message-id/attachment/125672/v22-addendum-32-bit-transitions.txt

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v25-0001-Add-fast-path-for-validating-UTF-8-text.patch application/x-patch 23.2 KB
