Re: Pre-proposal: unicode normalized text

From: Nico Williams <nico(at)cryptonector(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-03 23:01:16
Message-ID: ZRydPDKCwiWnnk6X@ubby21
Lists: pgsql-hackers

On Tue, Oct 03, 2023 at 03:34:44PM -0700, Jeff Davis wrote:
> On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:
> > Ugh, My client is not displaying 'a' correctly
>
> Ugh. Is that an argument in favor of normalization or against?

Heheh, well, it's an argument in favor of more software getting this
right (darn it).

It's also an argument for building a time machine so HFS+ can just
always have used NFC. But the existence of UTF-16 is proof that time
machines don't exist (or that only bad actors have them).

> I've also noticed that some fonts render the same character a bit
> differently depending on the constituent code points. For instance, if
> the accent is its own code point, it seems to be more prominent than if
> a single code point represents both the base character and the accent.
> That seems to be a violation, but I can understand why that might be
> useful.

Yes, that happens. Did you know that the ASCII character set was
designed with overstrike in mind for typing accented Latin
characters? Unicode combining sequences are kinda like that, but more
complex.

Yes, the idea really was that you could write a<BS>' (or '<BS>a) to get á.
That's how people did it with typewriters anyways.

> > Almost every Latin input mode out there produces precomposed
> > characters and so they effectively produce NFC.
>
> The problem is not the normal case, the problem will be things like
> obscure input methods, some kind of software that's being too clever,
> or some kind of malicious user trying to confuse the database.

_HFS+ enters the chat_

> > That means that indices
> > need to normalize strings, but tables need to store unnormalized
> > strings.
>
> That's an interesting idea. Would the equality operator normalize
> first, or are you saying that the index would need to recheck the
> results?

You can optimize this to avoid having to normalize first. Most strings
are not equal, and they tend to differ early. And most strings will
likely be ASCII-mostly or in the same form anyways. So you can just
walk a cursor down each string looking at one byte from each: if they
are both ASCII then you move each cursor forward by one byte, and if
they are not both ASCII then you take a slow path where you normalize
one grapheme cluster at each cursor (if necessary) and compare that.
(ZFS does this.)
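A minimal C sketch of that fast path (names are mine, not ZFS's or
Postgres's; the slow path here is stubbed as a plain byte comparison,
where a real one would normalize one grapheme cluster at each cursor
before comparing):

```c
#include <stddef.h>

/* Slow-path placeholder: a real implementation would normalize one
 * grapheme cluster at each cursor before comparing.  This stub just
 * compares the remaining bytes, which is only correct when both
 * inputs are already in the same normalization form. */
static int slow_path_equal(const unsigned char *a, size_t alen,
                           const unsigned char *b, size_t blen)
{
    if (alen != blen)
        return 0;
    for (size_t i = 0; i < alen; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* Fast path: advance both cursors one byte at a time while both bytes
 * are ASCII (< 0x80); bail to the slow path on the first non-ASCII
 * byte on either side.  Most unequal strings differ early, so this
 * usually returns without normalizing anything. */
int norm_equal(const unsigned char *a, size_t alen,
               const unsigned char *b, size_t blen)
{
    size_t i = 0, j = 0;

    while (i < alen && j < blen) {
        unsigned char ca = a[i], cb = b[j];

        if (ca < 0x80 && cb < 0x80) {
            if (ca != cb)
                return 0;
            i++;
            j++;
        } else {
            return slow_path_equal(a + i, alen - i, b + j, blen - j);
        }
    }
    return i == alen && j == blen;
}
```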

You can also assume ASCII-mostly, load as many bits of each string
(padding as needed) as will fit in SIMD registers, compare and check
that they're all ASCII, and if not then jump to the slow path.
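The all-ASCII check can be sketched without intrinsics by using a
64-bit word as a stand-in for a SIMD register (SWAR): OR the loaded
words together and test the high bit of every byte at the end.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the buffer is pure ASCII.  Loads 8 bytes at a time
 * (a 64-bit word standing in for a wider SIMD register), ORs the
 * words into an accumulator, and finally tests the 0x80 bit of every
 * byte lane.  A non-ASCII buffer would send the caller to the
 * normalizing slow path. */
int all_ascii(const unsigned char *s, size_t len)
{
    uint64_t acc = 0;
    size_t i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);   /* unaligned-safe load */
        acc |= w;
    }
    for (; i < len; i++)        /* tail bytes */
        acc |= s[i];

    return (acc & UINT64_C(0x8080808080808080)) == 0;
}
```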

You can also normalize one grapheme cluster at a time when hashing
(e.g., for hash indices), thus avoiding a large allocation if the string
is large.
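The incremental-hash idea can be shown with any streaming hash state;
here is a sketch using FNV-1a (my choice for illustration, not
anything Postgres or ZFS prescribes). The caller would feed one
normalized grapheme cluster at a time into the state instead of
materializing a normalized copy of the whole string.

```c
#include <stdint.h>
#include <stddef.h>

/* Streaming FNV-1a state.  Feeding the string cluster-by-cluster
 * yields the same hash as feeding it all at once, so no large
 * normalized buffer is ever allocated. */
typedef struct {
    uint64_t h;
} norm_hash_state;

void norm_hash_init(norm_hash_state *st)
{
    st->h = UINT64_C(14695981039346656037);   /* FNV offset basis */
}

/* Fold one chunk (e.g., one normalized grapheme cluster) into the state. */
void norm_hash_update(norm_hash_state *st, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        st->h ^= buf[i];
        st->h *= UINT64_C(1099511628211);     /* FNV prime */
    }
}

uint64_t norm_hash_final(const norm_hash_state *st)
{
    return st->h;
}
```

Because the state folds bytes in order, hashing "abc" in one update
produces the same value as hashing "a" then "bc".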

Nico
--
