Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joe Conway <mail(at)joeconway(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, "Guo, Adam" <adamguo(at)amazon(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Jim Mlodgenski <jimmy76(at)gmail(dot)com>
Subject: Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation
Date: 2024-10-01 18:55:48
Message-ID: CAD21AoDYBNG-EsgDy-sMiT6cinmrvsKe6N5BVOhizUb1k_ta2g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 30, 2024 at 11:49 AM Noah Misch <noah(at)leadboat(dot)com> wrote:
>
> On Wed, Sep 25, 2024 at 03:46:27PM -0700, Masahiko Sawada wrote:
> > On Wed, Sep 18, 2024 at 6:46 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > On Tue, Sep 17, 2024 at 09:43:41PM -0700, Masahiko Sawada wrote:
> > > > On Mon, Sep 16, 2024 at 9:24 AM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > > > On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
> > > > > > On Tue, Sep 10, 2024 at 3:05 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > > > > > On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
> > > > > > > > Got it. So now I'm wondering if we need all the complexity of storing
> > > > > > > > stuff in the GIN metapages. Could we simply read the (primary's)
> > > > > > > > signedness out of pg_control and use that?
>
> > > > Thanks. I like the simplicity of this approach. If we agree with this
> > > > approach, I'd like to proceed with it.
> > >
> > > Works for me.
> > >
> > > > Regardless of what approach we take, I wanted to provide some
> > > > regression tests for these changes, but I could not come up with a
> > > > reasonable idea. It would be great if we could do something like
> > > > 027_stream_regress.pl on cross-architecture replication. But just
> > > > executing 027_stream_regress.pl on cross-architecture replication
> > > > could not be sufficient since we would like to confirm query results
> > > > as well. If we could have a reasonable tool or way, it would also help
> > > > find other similar bugs related architectural differences.
> > >
> > > Perhaps add a regress.c function that changes the control file flag and
> > > flushes the change to disk?
> >
> > I've tried this idea but found out that extensions can flush the
> > controlfile on the disk after flipping the char signedness value, they
> > cannot update the controlfile data on the shared memory. And, when the
> > server shuts down, the on-disk controlfile is updated with the
> > on-memory controlfile data, so the char signedness is reverted.
> >
> > We can add a function to set the char signedness of on-memory
> > controlfile data or add a new option to pg_resetwal to change the char
> > signedness on-disk controlfile data but they seem to be overkill to me
> > and I'm concerned about misusing these option and function.
>
> Before the project is over, pg_upgrade will have some way to set the control
> file signedness to the source cluster signedness. I think pg_upgrade should
> also take an option for overriding the source cluster signedness for source
> clusters v17 and older. That way, a user knowing they copied the v17 source
> cluster from x86 to ARM can pg_upgrade properly without the prerequisite of
> acquiring an x86 VM.

Good idea.

> Once that all exists, perhaps it will be clearer how
> best to change signedness for testing.

Agreed.

>
> > While this change does not encourage the use of 'char' without
> > explicitly specifying its signedness, it provides a hint to ensure
> > consistent behavior for indexes (e.g., GIN and GiST) and tables that
> > already store data sorted by the 'char' type on disk, especially in
> > cross-platform replication scenarios.
>
> Let's put that sort of information in the GetDefaultCharSignedness() header
> comment. New code should use explicit "signed char" or "unsigned char" when
> it matters for data file compatibility. GetDefaultCharSignedness() exists for
> code that deals with pre-v18 data files, where we accidentally let C
> implementation signedness affect persistent data.

Agreed, will fix.

>
> > @@ -4256,6 +4257,18 @@ WriteControlFile(void)
> >
> > ControlFile->float8ByVal = FLOAT8PASSBYVAL;
> >
> > + /*
> > + * The signedness of the char type is implementation-defined. For example
> > + * on x86 architecture CPUs, the char data type is typically treated as
> > + * signed by default whereas on aarch architecture CPUs it is typically
> > + * treated as unsigned by default.
> > + */
> > +#if CHAR_MIN != 0
> > + ControlFile->default_char_signedness = true;
> > +#else
> > + ControlFile->default_char_signedness = false;
> > +#endif
>
> This has initdb set the field to the current C implementation's signedness. I
> wonder if, instead, initdb should set signedness=true unconditionally. Then
> the only source of signedness=false will be pg_upgrade.
>
> Advantage: signedness=false will get rarer over time. If we had known about
> this problem during the last development cycle that forced initdb (v8.3), we
> would have made all clusters signed or all clusters unsigned. Making
> pg_upgrade the only source of signedness=false will cause the population of
> database clusters to converge toward that retrospective ideal.

I think it's a good idea. Being able to treat one case as a rarer one
rather than treating both cases equally may have various advantages in
the future, for example when developing or investigating a problem.

> Disadvantage: testing signedness=false will require more planning.

If testing signedness=false always requires pg_upgrade, there might be
some cumbersomeness. Inventing a testing-purpose-only tool (e.g., a
CLI program) that changes the signedness might make tests easier.

> What other merits should we consider as part of deciding?

Considering that the population of database cluster signedness will
converge to signedness=true in the future, we can consider using
-fsigned-char to prevent similar problems for the future. We need to
think about possible side-effects as well, though.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Lakhin 2024-10-01 19:00:00 Re: query_id, pg_stat_activity, extended query protocol
Previous Message Daniel Gustafsson 2024-10-01 18:55:34 Re: Changing the state of data checksums in a running cluster