Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails

From:	Nathan Bossart <nathandbossart(at)gmail(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, adam(at)labkey(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject:	Re: BUG #18711: Attempting a connection with a database name longer than 63 characters now fails
Date:	2024-11-21 15:14:23
Message-ID:	Zz9OTyWAvATeeHev@nathan
Lists:	pgsql-bugs

On Thu, Nov 21, 2024 at 09:47:56AM -0500, Bruce Momjian wrote:
> On Thu, Nov 21, 2024 at 02:35:50PM +0000, Bertrand Drouvot wrote:
>> On Thu, Nov 21, 2024 at 09:21:16AM -0500, Bruce Momjian wrote:
>> > I don't understand this logic. Why are two bytes important? If we knew
>> > it was UTF8 we could check for non-first bytes always starting with
>> > bits 10, but we can't know that.
>>
>> I think this is because this is a reliable way to detect if the truncation happened
>> in the middle of a character, without needing to know the specifics of the encoding.
>>
>> My understanding is that the key insight is that in any multibyte encoding, all
>> bytes within a multibyte character will have their high bits set.
>>
>> That's just my understanding from the code and Tom's previous explanations: I
>> might be wrong, as I am not an expert in this area.
>
> But the logic doesn't make sense. Why would two bytes be any different
> than one?

Tom provided a concise explanation upthread [0]. My understanding is the
same as Bertrand's, i.e., this is an easy way to rule out a bunch of cases
where we know that we couldn't possibly have truncated in the middle of a
multi-byte character. This allows us to avoid doing multiple pg_database
lookups.
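
To make the two-byte idea concrete, here is a minimal sketch in Python. It is not the patch itself: the function names and the 63-byte limit (NAMEDATALEN - 1 with the default build setting) are assumptions for illustration, and it relies on the premise stated above that every byte of a multibyte character has its high bit set. If either byte next to the cut is plain ASCII, the cut must fall on a character boundary, so only the case where both are high-bit bytes needs further work.

```python
NAMEDATALEN = 64  # assumed default; a name holds at most NAMEDATALEN - 1 bytes


def high_bit_set(byte: int) -> bool:
    """True if the byte cannot be a plain ASCII character."""
    return (byte & 0x80) != 0


def cut_may_split_character(name: bytes, limit: int = NAMEDATALEN - 1) -> bool:
    """Hypothetical helper: could truncating `name` to `limit` bytes
    land in the middle of a multibyte character?

    Only the last kept byte and the first dropped byte are inspected.
    If either one is ASCII, the cut is on a character boundary.
    """
    if len(name) <= limit:
        return False  # nothing is truncated at all
    return high_bit_set(name[limit - 1]) and high_bit_set(name[limit])


# All-ASCII name: the cut can never split a character.
ascii_name = b"a" * 70

# The two bytes of U+00E9 straddle the cut (offsets 62 and 63).
split_name = b"a" * 62 + "\u00e9".encode("utf-8") + b"x" * 5

# The same character ends exactly at offset 62; the first dropped byte is ASCII.
boundary_name = b"a" * 61 + "\u00e9".encode("utf-8") + b"x" * 5
```

A True result only means the cut might have split a character. Two complete multibyte characters sitting side by side at the cut give the same answer, so that case still needs an encoding-aware check or an extra lookup.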

> I assumed you would just remove all trailing high-bit bytes
> and stop at the first non-high-bit byte.

I think this risks truncating more than one multi-byte character, which
would cause the login path to truncate differently than the CREATE/ALTER
DATABASE path (which is encoding-aware).
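
A small sketch of that mismatch, assuming UTF-8 purely for illustration (the server cannot assume this at login time, which is the whole problem). Both helper names are hypothetical, and the encoding-aware clip here is only a Python stand-in for what the CREATE/ALTER DATABASE path does:

```python
LIMIT = 63  # assumed NAMEDATALEN - 1


def strip_trailing_high_bit(name: bytes, limit: int = LIMIT) -> bytes:
    """Clip to `limit` bytes, then drop every trailing high-bit byte."""
    clipped = name[:limit]
    end = len(clipped)
    while end > 0 and clipped[end - 1] & 0x80:
        end -= 1
    return clipped[:end]


def encoding_aware_clip(name: bytes, limit: int = LIMIT) -> bytes:
    """Clip to `limit` bytes, dropping only an incomplete trailing
    character. UTF-8 is assumed here for illustration."""
    return name[:limit].decode("utf-8", errors="ignore").encode("utf-8")


# 58 ASCII bytes followed by three 3-byte characters: 67 bytes in total.
name = b"a" * 58 + "\u65e5\u672c\u8a9e".encode("utf-8")

stripped = strip_trailing_high_bit(name)  # loses the complete first character too
aware = encoding_aware_clip(name)         # keeps the complete first character
```

Here the 63-byte cut falls inside the second 3-byte character. The encoding-aware clip keeps 61 bytes, while stripping all trailing high-bit bytes keeps only 58, so the two paths would arrive at different names.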

> Also, do we really expect
> there to be trailing multi-byte characters and then some ASCII before
> it? Isn't it likely it will be all ASCII or all multi-byte characters?
> I guess for Latin1, it would work fine, but I assume for Asian
> languages, it will be almost all multi-byte characters. I guess digits
> would be ASCII.

All of these seem within the realm of possibility to me.

> This all just seems very unfocused.

I see the following options:

* Try to do multibyte-aware truncation (the patch at hand).
* Only truncate for all-ASCII identifiers for historical purposes. Folks
  using non-ASCII characters in database names will need to specify the
  datname exactly during login.
* ERROR for long identifiers instead of automatically truncating (upthread
  this was considered a non-starter since this behavior has been around for
  so long).
* Revert the patch, leaving multibyte database names potentially broken
  (AFAIK Bertrand's initial report is the only one).
* Do nothing, so folks who previously relied on the truncation will now
  have to specify the datname exactly during login as of >= v17.

[0] https://postgr.es/m/158506.1732120196%40sss.pgh.pa.us

--
nathan
