Re: speed up verifying UTF-8

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-06-03 19:08:57
Message-ID: c3200e58-bad2-4414-9289-62a8a3bb02b5@iki.fi
Lists: pgsql-hackers

On 03/06/2021 17:33, Greg Stark wrote:
>> 3. It's probably cheaper to perform the HAS_ZERO check just once on (half1
>> | half2). We have to compute (half1 | half2) anyway.
>
> Wouldn't you have to check (half1 & half2) ?

Ah, you're right of course. But & is not quite right either: it will
give false positives whenever two nonzero bytes at the same position
share no set bits (0x01 & 0x02 is 0, for example). That's ok from a
correctness point of view here, because we then fall back to checking
byte by byte, but I don't think it's a good tradeoff.
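
To make the two failure modes concrete, here is a minimal sketch. It assumes HAS_ZERO is the classic "has zero byte" bit trick (the macro itself is not shown in this message), and the helper names are hypothetical, made up for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Classic "has zero byte" trick; ASSUMED to match the thread's HAS_ZERO. */
static inline uint64_t
has_zero(uint64_t v)
{
	return (v - UINT64_C(0x0101010101010101)) & ~v &
		UINT64_C(0x8080808080808080);
}

/* Hypothetical helper: one zero check on the AND of both halves. */
static inline int
and_check_reports_zero(uint64_t half1, uint64_t half2)
{
	return has_zero(half1 & half2) != 0;
}

/* Hypothetical helper: one zero check on the OR of both halves. */
static inline int
or_check_reports_zero(uint64_t half1, uint64_t half2)
{
	return has_zero(half1 | half2) != 0;
}

static void
demo_combined_zero_checks(void)
{
	uint64_t	ones = UINT64_C(0x0101010101010101);
	uint64_t	twos = UINT64_C(0x0202020202020202);
	uint64_t	with_zero = UINT64_C(0x0101010100010101);

	/* False positive: neither half has a zero byte, but 0x01 & 0x02 == 0. */
	assert(and_check_reports_zero(ones, twos));

	/* False negative: OR-ing with 0x02 hides the zero byte in with_zero. */
	assert(!or_check_reports_zero(with_zero, twos));

	/* AND never misses a real zero byte, it only over-reports. */
	assert(and_check_reports_zero(with_zero, twos));
}
```

So OR can miss a zero byte outright, while AND only over-reports, which costs a trip through the slow path.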

I think this works, however:

/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
static inline int
check_ascii(const unsigned char *s, int len)
{
	uint64		half1,
				half2,
				highbits_set;
	uint64		x1,
				x2;
	uint64		x;

	if (len >= 2 * sizeof(uint64))
	{
		memcpy(&half1, s, sizeof(uint64));
		memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

		/* Check if any bytes in this chunk have the high bit set. */
		highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
		if (highbits_set)
			return 0;

		/*
		 * Check if there are any zero bytes in this chunk.
		 *
		 * First, add 0x7f to each byte. This sets the high bit in each byte,
		 * unless it was a zero. We already checked that none of the bytes had
		 * the high bit set previously, so the max value each byte can have
		 * after the addition is 0x7f + 0x7f = 0xfe, and we don't need to
		 * worry about carrying over to the next byte.
		 */
		x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
		x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);

		/*
		 * Then check that the high bit is set in each byte of both halves.
		 * The halves must be combined with AND: with OR, a zero byte in one
		 * half would be hidden by a nonzero byte at the same position in the
		 * other half.
		 */
		x = (x1 & x2);
		x &= UINT64CONST(0x8080808080808080);
		if (x != UINT64CONST(0x8080808080808080))
			return 0;

		return 2 * sizeof(uint64);
	}
	else
		return 0;
}
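
The add-0x7f trick is easiest to see on a single 64-bit word. The following is a small sketch only: word_is_nonzero_ascii() and load_word() are hypothetical helpers, not part of the patch, and standard uint64_t stands in for PostgreSQL's uint64:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HIGH_BITS	UINT64_C(0x8080808080808080)
#define ADD_7F		UINT64_C(0x7f7f7f7f7f7f7f7f)

/* Hypothetical helper: 1 if all eight bytes are in the range 0x01..0x7f. */
static inline int
word_is_nonzero_ascii(uint64_t w)
{
	/* Reject any byte with the high bit set. */
	if (w & HIGH_BITS)
		return 0;

	/*
	 * Every byte is now <= 0x7f, so adding 0x7f cannot carry into the next
	 * byte. The high bit ends up set in a byte exactly when it was nonzero.
	 */
	return ((w + ADD_7F) & HIGH_BITS) == HIGH_BITS;
}

/* Hypothetical helper: load eight bytes without alignment assumptions. */
static inline uint64_t
load_word(const unsigned char *s)
{
	uint64_t	w;

	memcpy(&w, s, sizeof(w));
	return w;
}
```

For instance, word_is_nonzero_ascii(load_word((const unsigned char *) "abcdefgh")) should return 1, while the same eight bytes with one of them replaced by '\0' or by 0x80 should return 0.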

- Heikki
