Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

From: Noah Misch <noah(at)leadboat(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joe Conway <mail(at)joeconway(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, "Guo, Adam" <adamguo(at)amazon(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Jim Mlodgenski <jimmy76(at)gmail(dot)com>
Subject: Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation
Date: 2024-09-30 18:49:41
Message-ID: 20240930184941.da.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 25, 2024 at 03:46:27PM -0700, Masahiko Sawada wrote:
> On Wed, Sep 18, 2024 at 6:46 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Tue, Sep 17, 2024 at 09:43:41PM -0700, Masahiko Sawada wrote:
> > > On Mon, Sep 16, 2024 at 9:24 AM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > > On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
> > > > > On Tue, Sep 10, 2024 at 3:05 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > > > > On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
> > > > > > > Got it. So now I'm wondering if we need all the complexity of storing
> > > > > > > stuff in the GIN metapages. Could we simply read the (primary's)
> > > > > > > signedness out of pg_control and use that?

> > > Thanks. I like the simplicity of this approach. If we agree with this
> > > approach, I'd like to proceed with it.
> >
> > Works for me.
> >
> > > Regardless of what approach we take, I wanted to provide some
> > > regression tests for these changes, but I could not come up with a
> > > reasonable idea. It would be great if we could do something like
> > > 027_stream_regress.pl on cross-architecture replication. But just
> > > executing 027_stream_regress.pl on cross-architecture replication
> > > could not be sufficient since we would like to confirm query results
> > > as well. If we could have a reasonable tool or way, it would also help
> > > find other similar bugs related architectural differences.
> >
> > Perhaps add a regress.c function that changes the control file flag and
> > flushes the change to disk?
>
> I've tried this idea but found out that extensions can flush the
> controlfile on the disk after flipping the char signedness value, they
> cannot update the controlfile data on the shared memory. And, when the
> server shuts down, the on-disk controlfile is updated with the
> on-memory controlfile data, so the char signedness is reverted.
>
> We can add a function to set the char signedness of on-memory
> controlfile data or add a new option to pg_resetwal to change the char
> signedness on-disk controlfile data but they seem to be overkill to me
> and I'm concerned about misusing these option and function.

Before the project is over, pg_upgrade will have some way to set the control
file signedness to the source cluster signedness. I think pg_upgrade should
also take an option for overriding the source cluster signedness for source
clusters v17 and older. That way, a user knowing they copied the v17 source
cluster from x86 to ARM can pg_upgrade properly without the prerequisite of
acquiring an x86 VM. Once that all exists, perhaps it will be clearer how
best to change signedness for testing.

> While this change does not encourage the use of 'char' without
> explicitly specifying its signedness, it provides a hint to ensure
> consistent behavior for indexes (e.g., GIN and GiST) and tables that
> already store data sorted by the 'char' type on disk, especially in
> cross-platform replication scenarios.

Let's put that sort of information in the GetDefaultCharSignedness() header
comment. New code should use explicit "signed char" or "unsigned char" when
it matters for data file compatibility. GetDefaultCharSignedness() exists for
code that deals with pre-v18 data files, where we accidentally let C
implementation signedness affect persistent data.

> @@ -4256,6 +4257,18 @@ WriteControlFile(void)
>
> ControlFile->float8ByVal = FLOAT8PASSBYVAL;
>
> + /*
> + * The signedness of the char type is implementation-defined. For example
> + * on x86 architecture CPUs, the char data type is typically treated as
> + * signed by default whereas on aarch architecture CPUs it is typically
> + * treated as unsigned by default.
> + */
> +#if CHAR_MIN != 0
> + ControlFile->default_char_signedness = true;
> +#else
> + ControlFile->default_char_signedness = false;
> +#endif

This has initdb set the field to the current C implementation's signedness. I
wonder if, instead, initdb should set signedness=true unconditionally. Then
the only source of signedness=false will be pg_upgrade.

Advantage: signedness=false will get rarer over time. If we had known about
this problem during the last development cycle that forced initdb (v8.3), we
would have made all clusters signed or all clusters unsigned. Making
pg_upgrade the only source of signedness=false will cause the population of
database clusters to converge toward that retrospective ideal.

Disadvantage: testing signedness=false will require more planning.

What other merits should we consider as part of deciding?

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-09-30 19:09:51 Re: pg_basebackup and error messages dependent on the order of the arguments
Previous Message Michał Kłeczek 2024-09-30 18:36:26 Re: SET or STRICT modifiers on function affect planner row estimates