From: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us> |
Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, adam(at)labkey(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails |
Date: | 2024-11-21 14:35:50 |
Message-ID: | Zz9FRrwJRlyGBFPN@ip-10-97-1-34.eu-west-3.compute.internal |
Lists: | pgsql-bugs |
Hi,
On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 07:27:22AM +0000, Bertrand Drouvot wrote:
> > + /*
> > + * If the original name is too long and we see two consecutive bytes
> > + * with their high bits set at the truncation point, we might have
> > + * truncated in the middle of a multibyte character. In multibyte
> > + * encodings, every byte of a multibyte character has its high bit
> > + * set. So if IS_HIGHBIT_SET is true for both NAMEDATALEN-1 and
> > + * NAMEDATALEN-2, we know we're in the middle of a multibyte
> > + * character. We need to try truncating one more byte back to find the
> > + * start of the next character.
> > + */
> ...
> > + /*
> > + * If we've hit a byte with high bit clear (an ASCII byte), we
> > + * know we can't be in the middle of a multibyte character,
> > + * because all bytes of a multibyte character must have their
> > + * high bits set. Any following byte must therefore be the
> > + * start of a new character, so we can stop looking for
> > + * earlier truncation points.
> > + */
>
> I don't understand this logic. Why are two bytes important? If we knew
> it was UTF8 we could check for non-first bytes always starting with
> bits 10, but we can't know that.
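I think the reason is that this is a reliable way to detect whether the truncation
happened in the middle of a character, without needing to know the specifics of
the encoding.

My understanding is that the key insight is that in any multibyte encoding, all
bytes within a multibyte character have their high bits set. So two adjacent
bytes with their high bits set at the truncation point mean we might have cut a
character in two, while a byte with its high bit clear can never be part of a
multibyte character, so the byte after it must start a new character.

That's just my understanding from the code and Tom's previous explanations: I
might be wrong, as I'm not an expert in this area.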
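
To make that concrete, here is a minimal standalone sketch of how I read the
check. It is not the patch itself: NAME_LIMIT, HIGHBIT_SET, cut_may_split_char
and candidate_lengths are hypothetical names made up for this illustration
(NAME_LIMIT stands in for NAMEDATALEN, HIGHBIT_SET for IS_HIGHBIT_SET), and it
assumes the property above, that every byte of a multibyte character has its
high bit set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's NAMEDATALEN (64 by default). */
#define NAME_LIMIT 64

/* True if the byte's high bit is set, i.e. it is not a plain ASCII byte. */
#define HIGHBIT_SET(ch) (((unsigned char) (ch) & 0x80) != 0)

/*
 * Could keeping only the first "len" bytes of "name" split a multibyte
 * character?  Only if the last kept byte and the first dropped byte both
 * have their high bits set.  This is conservative: the two bytes could also
 * be the end of one character and the start of the next.
 */
static bool
cut_may_split_char(const char *name, size_t namelen, size_t len)
{
	if (len == 0 || len >= namelen)
		return false;
	return HIGHBIT_SET(name[len - 1]) && HIGHBIT_SET(name[len]);
}

/*
 * Store candidate truncation lengths in out[], longest first, and return
 * how many were stored.  We stop as soon as the last kept byte is ASCII,
 * because the cut then falls on a character boundary for sure.
 */
static size_t
candidate_lengths(const char *name, size_t *out, size_t max_out)
{
	size_t		namelen;
	size_t		len = NAME_LIMIT - 1;
	size_t		n = 0;

	assert(name != NULL && out != NULL);
	namelen = strlen(name);

	/* Name already fits: nothing to truncate. */
	if (namelen <= len)
	{
		if (max_out > 0)
			out[n++] = namelen;
		return n;
	}

	while (n < max_out && len > 0)
	{
		out[n++] = len;
		if (!cut_may_split_char(name, namelen, len))
			break;
		len--;				/* try truncating one more byte back */
	}
	return n;
}
```

For a name made of 62 ASCII bytes, then a two-byte character, then one more
ASCII byte, the cut at 63 bytes falls between the two high-bit bytes, so the
sketch yields the candidates 63 and 62 and then stops because byte 61 is ASCII.
A caller would presumably try each candidate in turn. Note that 63 stays in the
list: two adjacent high-bit bytes only tell us the cut might be mid-character.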
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com