Re: conchuela timeouts since 2021-10-09 system upgrade

From: Noah Misch <noah(at)leadboat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: conchuela timeouts since 2021-10-09 system upgrade
Date: 2021-10-29 11:57:25
Message-ID: 20211029115725.GA309057@rfd.leadboat.com
Lists: pgsql-bugs

On Fri, Oct 29, 2021 at 04:42:31PM +1300, Thomas Munro wrote:
> On Fri, Oct 29, 2021 at 4:20 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > It indeed is looking like 7f580aa made the problem go away on conchuela,
> > but do we understand why?

I don't.

> > The only theory I can think of is "kernel bug",
> > but while that's plausible for prairiedog it seems hard to credit for a
> > late-model BSD kernel.

DragonFly BSD is a niche OS, so I'm more willing than usual to conclude that.
Could be a bug in IPC::Run or in the port of Perl to DragonFly, but those feel
less likely than the kernel. The upgrade from DragonFly v4.4.3 to DragonFly
v6.0.0, which introduced this form of PostgreSQL test breakage, also updated
Perl from v5.20.3 to v5.32.1.

> I have yet to even log into a DBSD system (my attempt to install the
> 6.0.1 ISO on bhyve failed for lack of a driver, or something), but I
> do intend to get it working at some point. But I can offer a poorly
> researched wildly speculative hypothesis: DBSD forked from FBSD in
> 2003. macOS 10.3 took FBSD's kqueue code in... 2003. So maybe a bug
> was fixed later that they both inherited? Or perhaps that makes no
> sense, I dunno. It'd be nice to try to write a repro and send them a
> report, if we can.

The conchuela bug and the prairiedog bug both present with a timeout in
IPC::Run::finish, but the similarity ends there. On prairiedog, the
postmaster was stuck when it should have been reading a query from pgbench.
On conchuela, pgbench ran to completion and became a zombie, and IPC::Run got
stuck when it should have been reaping that zombie. Good thought, however.
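To make the conchuela failure point concrete, here is a minimal Python sketch of the reaping step. It is not IPC::Run's implementation (IPC::Run is Perl), and the helper names reap_child and spawn_and_reap are hypothetical. A forked child that exits immediately stands in for pgbench; it remains a zombie until the parent collects it with waitpid(), which this sketch polls with WNOHANG under a deadline.

```python
import os
import time


def reap_child(pid, timeout=5.0, poll_interval=0.05):
    """Hypothetical helper: poll waitpid(WNOHANG) until `pid` is reaped.

    Returns the child's exit code, or None if the deadline passes while the
    child is still unreaped (the analogue of IPC::Run::finish timing out).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        reaped_pid, status = os.waitpid(pid, os.WNOHANG)
        if reaped_pid == pid:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return None  # child was killed by a signal rather than exiting
        # reaped_pid == 0: child still running, not yet reapable
        time.sleep(poll_interval)
    return None


def spawn_and_reap(exit_code=3):
    """Hypothetical demo: fork a child that exits at once, then reap it."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running any parent cleanup.
        os._exit(exit_code)
    # Parent: the exited child is a zombie until waitpid() collects it.
    return reap_child(pid)


if __name__ == "__main__":
    print(spawn_and_reap())
```

On a healthy system the loop sees the exit almost immediately. In the description above, conchuela hung at the equivalent of this step: the child had already finished, yet the parent never completed the reap.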
