Re: Weird failure with latches in curculio on v15

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <fujii(at)postgresql(dot)org>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Weird failure with latches in curculio on v15
Date: 2023-02-22 08:48:10
Message-ID: CA+hUKGJmM+3BWUrF=CXQ8p7gaS9JcUBB3VZ_R5CT6H6iY+t92g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 21, 2023 at 5:50 PM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
> On Tue, Feb 21, 2023 at 09:03:27AM +0900, Michael Paquier wrote:
> > Perhaps beginning a new thread with a patch and a summary would be
> > better at this stage? Another thing I am wondering is if it could be
> > possible to test that rather reliably. I have been playing with a few
> > scenarios like holding the system() call for a bit with hardcoded
> > sleep()s, without much success. I'll try harder on that part.. It's
> > been mentioned as well that we could just move away from system() in
> > the long-term.
>
> I'm happy to create a new thread if needed, but I can't tell if there is
> any interest in this stopgap/back-branch fix. Perhaps we should just jump
> straight to the long-term fix that Thomas is looking into.

Unfortunately the latch-friendly subprocess module proposal I was
talking about would be for 17. I may post a thread fairly soon with
design ideas + list of problems and decision points as I see them, and
hopefully some sketch code, but it won't be a proposal for [/me checks
calendar] next week's commitfest and probably wouldn't be appropriate
in a final commitfest anyway, and I also have some other existing
stuff to clear first. So please do continue with the stopgap ideas.

BTW Here's an idea (untested) about how to reproduce the problem. You
could copy the source from a system() implementation, call it
doomed_system(), and insert kill(-getppid(), SIGQUIT) in between
sigprocmask(SIG_SETMASK, &omask, NULL) and exec*(). Parent and self
will handle the signal and both reach the proc_exit().
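
A minimal sketch of what such a doomed_system() might look like, loosely
modeled on a traditional BSD-style system(). It is not copied from any
particular libc and is as untested against the real failure as the idea
above. The hook parameter, the doomed_hook type and signal_parent_group()
are hypothetical names added for illustration so that the injected kill()
can be switched off; the sketch also uses fork() rather than vfork(), and
the parent side is simplified.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical hook type: runs in the child just before exec. */
typedef void (*doomed_hook) (void);

/* The injection from the message: SIGQUIT the parent's process group. */
void
signal_parent_group(void)
{
	kill(-getppid(), SIGQUIT);
}

/* Like system(), but calls hook() in the window before exec. */
int
doomed_system(const char *command, doomed_hook hook)
{
	struct sigaction ign, oldint, oldquit;
	sigset_t	mask, omask;
	pid_t		pid;
	int			status = -1;

	if (command == NULL)
		return 1;

	/* Block SIGCHLD, SIGINT and SIGQUIT before forking. */
	sigemptyset(&mask);
	sigaddset(&mask, SIGCHLD);
	sigaddset(&mask, SIGINT);
	sigaddset(&mask, SIGQUIT);
	sigprocmask(SIG_BLOCK, &mask, &omask);

	pid = fork();				/* assumption: fork(), not vfork() */
	if (pid == -1)
	{
		sigprocmask(SIG_SETMASK, &omask, NULL);
		return -1;
	}
	if (pid == 0)
	{
		/* Child: inherited handlers can run again once unblocked. */
		sigprocmask(SIG_SETMASK, &omask, NULL);
		if (hook != NULL)
			hook();				/* signal arrives before exec */
		execl("/bin/sh", "sh", "-c", command, (char *) NULL);
		_exit(127);
	}

	/* Parent: ignore SIGINT and SIGQUIT while waiting (simplified). */
	memset(&ign, 0, sizeof(ign));
	ign.sa_handler = SIG_IGN;
	sigemptyset(&ign.sa_mask);
	sigaction(SIGINT, &ign, &oldint);
	sigaction(SIGQUIT, &ign, &oldquit);

	while (waitpid(pid, &status, 0) == -1)
	{
		if (errno != EINTR)
		{
			status = -1;
			break;
		}
	}

	/* Restore the caller's handlers and signal mask. */
	sigaction(SIGINT, &oldint, NULL);
	sigaction(SIGQUIT, &oldquit, NULL);
	sigprocmask(SIG_SETMASK, &omask, NULL);
	return status;
}
```

With a NULL hook this should behave like an ordinary system(). Passing
signal_parent_group only makes sense from a process that leads its own
process group and has a SIGQUIT handler installed, as a PostgreSQL child
process would; anywhere else it will signal whatever group the caller's
PID happens to name.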

The systems that failed are running code like this:

https://github.com/openbsd/src/blob/master/lib/libc/stdlib/system.c
https://github.com/DragonFlyBSD/DragonFlyBSD/blob/master/lib/libc/stdlib/system.c

I'm pretty sure these other implementations could fail in just the
same way (they restore the handler before unblocking, so can run it
just before exec() replaces the image):

https://github.com/freebsd/freebsd-src/blob/main/lib/libc/stdlib/system.c
https://github.com/lattera/glibc/blob/master/sysdeps/posix/system.c

The glibc one is a bit busier and, huh, has a lock (I guess maybe
deadlockable if proc_exit() also calls system(), but hopefully it
doesn't), and uses fork() instead of vfork() but I don't think that's
a material difference here (with fork(), parent and child run
concurrently, while with vfork() the parent is suspended until the
child exits or execs, and then processes its pending signals).
