pg_trgm comparison bug on cross-architecture replication due to different char implementation

From: "Guo, Adam" <adamguo(at)amazon(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: pg_trgm comparison bug on cross-architecture replication due to different char implementation
Date: 2024-04-23 14:45:20
Message-ID: CB11ADBC-0C3F-4FE0-A678-666EE80CBB07@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,

I would like to report an issue with the pg_trgm extension on cross-architecture replication scenarios. When an x86_64 standby server is replicating from an aarch64 primary server or vice versa, the gist_trgm_ops opclass returns different results on the primary and standby. Masahiko previously reported a similar issue affecting the pg_bigm extension [1].

To reproduce, execute the following on the x86_64 primary server:

CREATE EXTENSION pg_trgm;
CREATE TABLE tbl (c text);
CREATE INDEX ON tbl USING gist (c gist_trgm_ops);
INSERT INTO tbl VALUES ('Bóbr');

On the x86_64 primary server:

postgres=> select * from tbl where c like '%Bób%';
c
------
Bóbr
(1 row)

On the aarch64 replica server:

postgres=> select * from tbl where c like '%Bób%';
c
---
(0 rows)

The root cause looks the same as the pg_bigm issue that Masahiko reported. To compare trigrams, pg_trgm uses a numerical comparison of chars [2]. On x86_64 a char is signed by default, whereas on aarch64 it is unsigned by default. gist_trgm_ops expects the trigram list to be sorted, but due to the different signedness of chars, the sort order is broken when replicating the values across architectures.

The different sort behaviour can be demonstrated using show_trgm.

On the x86_64 primary server:

postgres=> SELECT show_trgm('Bóbr');
show_trgm
------------------------------------------
{0x89194c," b","br ",0x707c72,0x7f7849}
(1 row)

On the aarch64 replica server:

postgres=> SELECT show_trgm('Bóbr');
show_trgm
------------------------------------------
{" b","br ",0x707c72,0x7f7849,0x89194c}
(1 row)

One simple solution for this specific case is to declare the char signedness in the CMPPCHAR macro.

--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -42,7 +42,7 @@
typedef char trgm[3];
#define CMPCHAR(a,b) ( ((a)==(b)) ? 0 : ( ((a)<(b)) ? -1 : 1 ) )
-#define CMPPCHAR(a,b,i) CMPCHAR( *(((const char*)(a))+i), *(((const char*)(b))+i) )
+#define CMPPCHAR(a,b,i) CMPCHAR( *(((unsigned char*)(a))+i), *(((unsigned char*)(b))+i) )
#define CMPTRGM(a,b) ( CMPPCHAR(a,b,0) ? CMPPCHAR(a,b,0) : ( CMPPCHAR(a,b,1) ? CMPPCHAR(a,b,1) : CMPPCHAR(a,b,2) ) )
#define CPTRGM(a,b) do { \

Alternatively, Postgres can be compiled with -funsigned-char or -fsigned-char. I came across a code comment suggesting that this may not be a good idea in general [3].

Given that this has problem has come up before and seems likely to come up again, I'm curious what other broad solutions there might be to resolve it? Looking forward to any feedback, thanks!

Best,

Adam Guo
Amazon Web Services: https://aws.amazon.com

[1] https://osdn.net/projects/pgbigm/lists/archive/hackers/2024-February/000370.html
[2] https://github.com/postgres/postgres/blob/480bc6e3ed3a5719cdec076d4943b119890e8171/contrib/pg_trgm/trgm.h#L45
[3] https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/cash.c#L114-L123

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-04-23 14:57:36 Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation
Previous Message Tom Lane 2024-04-23 14:11:45 Re: Why does pgindent's README say to download typedefs.list from the buildfarm?