From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Greg Stark <stark(at)mit(dot)edu> |
Cc: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: speed up verifying UTF-8 |
Date: | 2021-06-03 19:08:57 |
Message-ID: | c3200e58-bad2-4414-9289-62a8a3bb02b5@iki.fi |
Lists: | pgsql-hackers |
On 03/06/2021 17:33, Greg Stark wrote:
>> 3. It's probably cheaper to perform the HAS_ZERO check just once on (half1
>> | half2). We have to compute (half1 | half2) anyway.
>
> Wouldn't you have to check (half1 & half2) ?
Ah, you're right of course. But & is not quite right either: it can give
false positives, reporting a zero byte when neither half contains one.
That's ok from a correctness point of view here, because we then fall back
to checking byte by byte, but I don't think it's a good tradeoff.
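To make both failure modes concrete, here is a minimal sketch. It assumes HAS_ZERO is the usual subtract-and-mask zero-byte test; the helper name has_zero_byte and the demo_* constants are made up for illustration and are not part of the patch.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for HAS_ZERO: nonzero result if any byte of v is 0.
 * Assumes the common subtract-and-mask formulation.
 */
static inline int
has_zero_byte(uint64_t v)
{
    return ((v - UINT64_C(0x0101010101010101)) & ~v &
            UINT64_C(0x8080808080808080)) != 0;
}

/* Neither of these halves contains a zero byte, but their AND is all zeros. */
static const uint64_t demo_half1 = UINT64_C(0x0101010101010101);
static const uint64_t demo_half2 = UINT64_C(0x0202020202020202);

/* This half has a zero in its lowest byte; OR-ing with demo_half2 hides it. */
static const uint64_t demo_half3 = UINT64_C(0x0101010101010100);
```

With these values, testing (demo_half1 & demo_half2) reports a zero byte that is not there (a false positive), while testing (demo_half3 | demo_half2) misses the zero byte that is there (a false negative).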
I think this works, however:
/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
static inline int
check_ascii(const unsigned char *s, int len)
{
    uint64      half1,
                half2,
                highbits_set;
    uint64      x1,
                x2;
    uint64      x;

    if (len >= 2 * sizeof(uint64))
    {
        memcpy(&half1, s, sizeof(uint64));
        memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

        /* Check if any bytes in this chunk have the high bit set. */
        highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
        if (highbits_set)
            return 0;

        /*
         * Check if there are any zero bytes in this chunk.
         *
         * First, add 0x7f to each byte. This sets the high bit in each byte,
         * unless it was a zero. We already checked that none of the bytes had
         * the high bit set previously, so the max value each byte can have
         * after the addition is 0x7f + 0x7f = 0xfe, and we don't need to
         * worry about carrying over to the next byte.
         */
        x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
        x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);

        /*
         * Then check that the high bit is set in each byte of both halves.
         * This must be an AND: with an OR, a zero byte in one half would be
         * masked by a nonzero byte at the same position in the other half.
         */
        x = (x1 & x2);
        x &= UINT64CONST(0x8080808080808080);
        if (x != UINT64CONST(0x8080808080808080))
            return 0;

        return 2 * sizeof(uint64);
    }
    else
        return 0;
}
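
The add-0x7f step can also be tried in isolation. The following is only a sketch: the helper names are hypothetical, plain uint64_t and UINT64_C stand in for PostgreSQL's uint64 and UINT64CONST, and the caller is assumed to have already verified that no byte has its high bit set.

```c
#include <stdint.h>

#define LOW7_MASK  UINT64_C(0x7f7f7f7f7f7f7f7f)
#define HIGH_MASK  UINT64_C(0x8080808080808080)

/*
 * Hypothetical helper: returns 1 if every byte of v is nonzero.
 * Precondition (assumed): no byte of v has its high bit set, so adding
 * 0x7f to each byte cannot carry into the neighbouring byte.
 */
static inline int
all_bytes_nonzero_ascii(uint64_t v)
{
    uint64_t x = v + LOW7_MASK;

    return (x & HIGH_MASK) == HIGH_MASK;
}

/*
 * Hypothetical helper: the same test on two words at once. The sums are
 * combined with AND, because the high bit must be set in every byte of
 * both words.
 */
static inline int
both_halves_nonzero_ascii(uint64_t half1, uint64_t half2)
{
    uint64_t x1 = half1 + LOW7_MASK;
    uint64_t x2 = half2 + LOW7_MASK;

    return ((x1 & x2) & HIGH_MASK) == HIGH_MASK;
}
```

For example, 0x6162636465666768 ("abcdefgh" in some byte order) passes, while 0x6162630065666768 fails because the zero byte becomes 0x7f after the addition and its high bit stays clear.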
- Heikki