Re: InitControlFile misbehaving on graviton

From: Julian Andres Klode <jak(at)jak-linux(dot)org>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Christoph Berg <cb(at)df7cb(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bernd Helmle <bernd(at)oopsware(dot)de>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: InitControlFile misbehaving on graviton
Date: 2025-01-18 20:38:38
Message-ID: 6fxlmnyagkycru3bewa4ympknywnsswlqzvwfft3ifqqiioxlv@ax53pv7xdrc2
Lists: pgsql-hackers

On Mon, Jan 13, 2025 at 09:39:40PM +0100, Matthias van de Meent wrote:
> On Mon, 13 Jan 2025 at 20:04, Christoph Berg <cb(at)df7cb(dot)de> wrote:
> >
> > Bernd and I have been chasing a bug that happens when all of the
> > following conditions are fulfilled:
> >
> > * PG 15..18 (older PGs are ok)
> > * gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
> > * arm64 running on graviton (AWS EC2 c8g.2xlarge, ok on different arm64 host)
> > * -O2 (ok with -O0)
> > * --with-openssl (ok without openssl)
> > * using no -m flag, or using -march=armv8.4-a (using `-march=armv9-a` fixes it)
> >
> > The problem happens early during initdb:
> >
> > $ ./configure --with-openssl --enable-debug
> > ...
> > $ /usr/local/pgsql/bin/initdb -D broken --no-clean
> > ...
> > running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL: control file contains invalid database cluster state
> > child process exited with exit code 1
> > initdb: data directory "broken" not removed at user's request
>
> Yes, weird.
>
> > (gdb) disassemble
> > Dump of assembler code for function BootStrapXLOG:
> > 0x0000aaaaaac21708 <+0>: stp x29, x30, [sp, #-272]!
> > 0x0000aaaaaac2170c <+4>: mov w1, #0x0 // #0
> > 0x0000aaaaaac21710 <+8>: mov x29, sp
> > ...
> > => 0x0000aaaaaac219bc <+692>: add x19, sp, #0x90
> > 0x0000aaaaaac219c0 <+696>: mov x0, x19
> > 0x0000aaaaaac219c4 <+700>: mov x1, #0x20 // #32
> > 0x0000aaaaaac219c8 <+704>: str w2, [x21, #28]
> > 0x0000aaaaaac219cc <+708>: bl 0xaaaaab0ac824 <pg_strong_random>
>
> pg_strong_random pulls random values from OpenSSL's RAND_bytes
> (declared in openssl/rand.h) when PostgreSQL is compiled with OpenSSL
> support. If OpenSSL isn't enabled we instead read /dev/urandom (on
> unix-y systems), which means different code will be generated for
> pg_strong_random.
>
> > 0x0000aaaaaac219d0 <+712>: tbz w0, #0, 0xaaaaaac21b28 <BootStrapXLOG+1056>
> > 0x0000aaaaaac219d4 <+716>: ldr x3, [x22, #32]
> > 0x0000aaaaaac219d8 <+720>: mov x2, #0x128 // #296
> > 0x0000aaaaaac219dc <+724>: mov w1, #0x0 // #0
> > 0x0000aaaaaac219e0 <+728>: mov x0, x3
> > 0x0000aaaaaac219e4 <+732>: bl 0xaaaaaab7f3b0 <memset(at)plt>
>
> Given this code, it looks like register x3 contains ControlFile - it's
> being memset(..., 0, sizeof(ControlFileData));
>
> > 0x0000aaaaaac219e8 <+736>: mov x3, x0
> > 0x0000aaaaaac219ec <+740>: mov x1, #0x3e8 // #1000
> > 0x0000aaaaaac219f0 <+744>: ldr w9, [x21, #32]
> > 0x0000aaaaaac219f4 <+748>: adrp x7, 0xaaaaab3ce000 <fmgr_builtins+72112>
> > 0x0000aaaaaac219f8 <+752>: ldr x7, [x7, #3720]
> > 0x0000aaaaaac219fc <+756>: str x1, [x3, #128]
>
> ... Which would make this the assignment to unloggedLSN (which matches
> the FirstNormalUnloggedLSN=1000 stored just above)
>
> > 0x0000aaaaaac21a00 <+760>: ldr w1, [sp, #120]
> > 0x0000aaaaaac21a04 <+764>: add x0, x0, #0x28
> > 0x0000aaaaaac21a08 <+768>: str x23, [x3]
>
> And this would be the assignment of systemidentifier,
>
> > 0x0000aaaaaac21a0c <+772>: str w1, [x3, #252]
>
> ... data_checksum_version,
>
> > 0x0000aaaaaac21a10 <+776>: adrp x6, 0xaaaaab3cf000
> > 0x0000aaaaaac21a14 <+780>: ldr x6, [x6, #2392]
> > 0x0000aaaaaac21a18 <+784>: adrp x5, 0xaaaaab3cf000
> > 0x0000aaaaaac21a1c <+788>: ldr x5, [x5, #2960]
> > 0x0000aaaaaac21a20 <+792>: adrp x4, 0xaaaaab3cf000
> > 0x0000aaaaaac21a24 <+796>: ldr x4, [x4, #3352]
> > 0x0000aaaaaac21a28 <+800>: ldp q26, q25, [x19]
> > 0x0000aaaaaac21a2c <+804>: str s15, [x3, #16]
>
> ... and finally ControlFile->state.
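
To make the offset decoding above easier to follow, here is a minimal sketch of the same store sequence in C. SketchControlFile and sketch_init_control_file are hypothetical names, and the struct is not the real ControlFileData: it carries only the four fields named above, with padding chosen so that the offsets and the 296-byte size match the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for ControlFileData. Only the four fields discussed
 * in the thread are present; the pad arrays stand for everything else and
 * exist only to reproduce the offsets seen in the disassembly.
 */
typedef struct SketchControlFile
{
    uint64_t system_identifier;     /* str x23, [x3]      -> offset 0   */
    uint8_t  pad0[8];
    uint32_t state;                 /* str s15, [x3, #16] -> offset 16  */
    uint8_t  pad1[108];
    uint64_t unlogged_lsn;          /* str x1, [x3, #128] -> offset 128 */
    uint8_t  pad2[116];
    uint32_t data_checksum_version; /* str w1, [x3, #252] -> offset 252 */
    uint8_t  pad3[40];
} SketchControlFile;

/* The memset length in the disassembly is 0x128 = 296 bytes. */
_Static_assert(sizeof(SketchControlFile) == 296, "size must match memset length");
_Static_assert(offsetof(SketchControlFile, state) == 16, "state offset");
_Static_assert(offsetof(SketchControlFile, unlogged_lsn) == 128, "unloggedLSN offset");
_Static_assert(offsetof(SketchControlFile, data_checksum_version) == 252, "checksum offset");

/* Performs the stores in the same order as the disassembly. */
void
sketch_init_control_file(SketchControlFile *cf, uint64_t sysid,
                         uint32_t state, uint32_t checksum_version)
{
    assert(cf != NULL);
    memset(cf, 0, sizeof(*cf));                    /* bl memset          */
    cf->unlogged_lsn = 1000;                       /* mov x1, #0x3e8     */
    cf->system_identifier = sysid;                 /* value held in x23  */
    cf->data_checksum_version = checksum_version;  /* value held in w1   */
    cf->state = state;                             /* value held in s15  */
}
```

In the failing build the last store takes its value from s15, so anything that changes s15 between its initialization and this point ends up in the state field.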
>
> I don't see where s15 is initialized and/or written to first, but this
> is the only reference in this section of ASM. As such, I think the
> initialization (presumably, "mov s15, #1" or such) must have happened
> before the call to pg_strong_random/RAND_bytes.
>
> Looking around on the internet, it seems that in the ARM Procedure
> Call Standard register s15 does not need to be preserved, and thus
> could be clobbered when we're going into pg_strong_random and co. If the
> register was indeed clobbered by OpenSSL, that would be a good
> explanation for these issues. Can you check this?
>
> > The really weird thing is that the very same binaries work on a
> > different host (arm64 VM provided by Huawei) - the
> > postgresql_arm64.deb files compiled there and present on
> > apt.postgresql.org are fine, but when installed on that graviton VM,
> > they throw the above error.
>
> If I were you, I'd start looking into the differences in behaviour of
> OpenSSL between the two ARM-based systems you mention; particularly
> with a focus on register contents. It looks like gdb's `i r ...`
> command could help out with that - or so StackOverflow tells me.

This was all very helpful, and if I had paid more attention I'd have
seen it sooner, but here we go:

https://github.com/openssl/openssl/pull/26469

I believe this should fix your issue as well; I was debugging it
from the APT side for the past 14 hours or so.

The AES-CTR code is used by the default random number generator
to derive random numbers from the initial seed.
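
For context, a minimal sketch of the call path that leads there, modelled on the pg_strong_random behaviour described in the quoted text. sketch_strong_random and the SKETCH_USE_OPENSSL macro are hypothetical names, and this is not the PostgreSQL source. With the macro defined, the call goes through RAND_bytes, which is where the AES-CTR code comes into play; without it, /dev/urandom is read directly and no OpenSSL code runs. As far as I know, AAPCS64 requires a callee to preserve the low 64 bits of v8-v15, so a caller is entitled to keep a value in s15 across this call.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

#ifdef SKETCH_USE_OPENSSL       /* hypothetical build switch for this sketch */
#include <openssl/rand.h>
#endif

/* Fills buf with len random bytes; returns true on success. */
bool
sketch_strong_random(void *buf, size_t len)
{
    assert(buf != NULL || len == 0);

#ifdef SKETCH_USE_OPENSSL
    /* OpenSSL path: RAND_bytes returns 1 on success. */
    return RAND_bytes((unsigned char *) buf, (int) len) == 1;
#else
    /* Fallback path: read straight from the kernel, retrying on EINTR. */
    unsigned char *p = buf;
    int         fd = open("/dev/urandom", O_RDONLY);

    if (fd == -1)
        return false;

    while (len > 0)
    {
        ssize_t     res = read(fd, p, len);

        if (res <= 0)
        {
            if (res < 0 && errno == EINTR)
                continue;
            close(fd);
            return false;
        }
        p += res;
        len -= (size_t) res;
    }

    close(fd);
    return true;
#endif
}
```

That split would also explain why a build without --with-openssl does not show the problem: the fallback path never enters the AES-CTR code.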
--
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer i speak de, en
