From: | Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> |
---|---|
To: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: speed up verifying UTF-8 |
Date: | 2021-07-15 05:09:48 |
Message-ID: | CAJ3gD9ejC+puY=Lgco2SGyD4tR46kye7qLZoskW0PXumtLcCpQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, 13 Jul 2021 at 01:15, John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > It seems like it would be easy to have pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of pg_utf8_verifychar with it.
>
> 0001: I went ahead and tried this for v15, and also attempted some clean-up:
>
> - Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
> - Have pg_utf8_verifychar_internal return -1 for invalid input to match other functions in the file. We could also do this for check_ascii, but it's not quite the same thing, because the string could still have valid bytes in it, just not enough to advance the pointer by the stride length.
> - Remove hard-coded numbers (not wedded to this).
>
> - Use a call to pg_utf8_verifychar in the slow path.
> - Reduce pg_utf8_verifychar to thin wrapper around pg_utf8_verifychar_internal.

- check_ascii() seems to be used only for 64-bit chunks. So why not
remove the len argument and the len <= sizeof(int64) checks inside the
function? We could rename it to check_ascii64() for clarity.

- I was thinking, why not have a pg_utf8_verify64() that processes
64-bit chunks (or a 32-bit version)? In check_ascii(), we extract a
64-bit chunk from the string anyway. We can use the same chunk to
extract the required bits of a two-byte or a four-byte char. This way
we avoid extracting separate bytes like b1 = *s; b2 = s[1] etc. More
importantly, we avoid the separate continuation-char checks for each
individual byte. Additionally, we can try to simplify the subsequent
overlong and surrogate char checks. Something like this:

int pg_utf8_verifychar_32(uint32 chunk)
{
    int len, l;

    /*
     * Each iteration consumes l bytes, so shift the chunk left by l * 8
     * bits. A shift by the full width of the type is undefined, hence the
     * special case for l == 4.
     */
    for (len = sizeof(chunk); len > 0;
         (len -= l), (chunk = (l < 4) ? chunk << (l * 8) : 0))
    {
        /* Is 2-byte lead (110x.xxxx) */
        if ((chunk & 0xE0000000) == 0xC0000000)
        {
            l = 2;
            /* ....... ....... */
        }
        /* Is 3-byte lead (1110.xxxx) */
        else if ((chunk & 0xF0000000) == 0xE0000000)
        {
            l = 3;
            if (len < l)
                break;

            /* b2 and b3 should be continuation bytes */
            if ((chunk & 0x00C0C000) != 0x00808000)
                return sizeof(chunk) - len;

            switch (chunk & 0xFF200000)
            {
                /*
                 * check 3-byte overlong: 1110.0000 100x.xxxx 10xx.xxxx
                 * i.e. (b1 == 0xE0 && b2 < 0xA0). We already know b2 is
                 * of the form 10xx.xxxx since it's a continuation char.
                 * The additional condition b2 <= 0x9F means it is of the
                 * form 100x.xxxx, i.e. either 1000.xxxx or 1001.xxxx. So
                 * just verify that it is xx0x.xxxx
                 */
                case 0xE0000000:
                    return sizeof(chunk) - len;

                /*
                 * check surrogate: 1110.1101 101x.xxxx 10xx.xxxx
                 * i.e. (b1 == 0xED && b2 > 0x9F). Here, > 0x9F means
                 * either 101x.xxxx or 11xx.xxxx. The latter is not
                 * possible because b2 is a continuation char, so it has
                 * to be 101x.xxxx. So just verify that it is xx1x.xxxx
                 */
                case 0xED200000:
                    return sizeof(chunk) - len;

                default:
                    ;
            }
        }
        /* Is 4-byte lead (1111.0xxx) */
        else if ((chunk & 0xF8000000) == 0xF0000000)
        {
            /* ......... */
            l = 4;
        }
        else
            return sizeof(chunk) - len;
    }
    return sizeof(chunk) - len;
}
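
To make the two suggestions above a bit more concrete, here is a minimal standalone sketch. It is not code from the patch: the helper names load_be32() and valid_3byte_lead() are hypothetical, it uses <stdint.h> types rather than PostgreSQL's, and this check_ascii64() only tests the high bits (it does not reject zero bytes). It also assumes the chunk is loaded with the first string byte in the most significant position, which is what the masks in the sketch above rely on.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Fixed-width variant: no len argument, always looks at exactly 8 bytes. */
static inline bool
check_ascii64(const unsigned char *s)
{
    uint64_t chunk;

    memcpy(&chunk, s, sizeof(chunk));
    /* The mask is the same in every byte, so byte order does not matter. */
    return (chunk & UINT64_C(0x8080808080808080)) == 0;
}

/* Hypothetical helper: load 4 bytes with s[0] as the most significant byte. */
static inline uint32_t
load_be32(const unsigned char *s)
{
    return ((uint32_t) s[0] << 24) | ((uint32_t) s[1] << 16) |
           ((uint32_t) s[2] << 8) | (uint32_t) s[3];
}

/* Hypothetical helper: the 3-byte checks from the sketch, done on one chunk. */
static inline bool
valid_3byte_lead(uint32_t chunk)
{
    /* b1 must be 1110.xxxx */
    if ((chunk & 0xF0000000) != 0xE0000000)
        return false;
    /* b2 and b3 must both be 10xx.xxxx, tested with a single mask */
    if ((chunk & 0x00C0C000) != 0x00808000)
        return false;
    switch (chunk & 0xFF200000)
    {
        case 0xE0000000:        /* overlong: b1 == 0xE0 && b2 < 0xA0 */
            return false;
        case 0xED200000:        /* surrogate: b1 == 0xED && b2 > 0x9F */
            return false;
        default:
            return true;
    }
}
```

For example, valid_3byte_lead(load_be32(s)) accepts E2 82 AC (the euro sign) and rejects E0 80 80 (overlong) and ED A0 80 (surrogate). The fourth byte of the chunk is ignored by these masks.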