Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Date: 2023-09-09 09:00:00
Message-ID: ee0e1ae4-ff12-7d56-72a8-a70e492d6287@gmail.com
Lists: pgsql-hackers

Hi Thomas,

08.09.2023 22:39, Thomas Munro wrote:
>> With debugging logging added I see (on 7389aad63~1) that one process
>> really sends SIGURG to another, and the latter reaches poll(), but it
>> just got no signal, its signal handler is not called and poll() just waits...
> Thanks for working so hard on this, Alexander. That is a surprising
> discovery! So changes to the signal handler arrangements in the
> *postmaster* before the child was forked affected this?

Yes, I think we are dealing with something like that. I can try to find the
minimal change that affects reproduction of the issue, but maybe that's not so
important. Perhaps we should now think about escalating the problem to the
FreeBSD developers? I wonder what kind of reproducer they would find
acceptable: only a standalone C program, or would a script that
compiles/installs postgres and runs our test do too?

>> So it looks like the ARM weak memory model is not the root cause of the
>> issue. But as far as I can see, it's still specific to FreeBSD (but not
>> specific to a compiler — I used gcc and clang with the same success).
> Idea: FreeBSD 13 introduced a new mechanism called sigfastblock[1],
> which lets system libraries control signal blocking with atomic memory
> tricks in a word of user space memory. I have no particular theory
> for why it would be going wrong here (I don't expect us to be using
> any of the stuff that would use it, though I don't understand it in
> detail so that doesn't say much), but it occurred to me that all
> reports so far have been on 13.x or 14. I wonder... If you have a
> good fast recipe for reproducing this, could you also try it on
> FreeBSD 12.4?

It was a lucky guess! I checked the reproduction on
FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212
and got the same results as on FreeBSD 14:
REL_12_STABLE - failed on iteration 3
REL_15_STABLE - failed on iteration 1
REL_16_STABLE - 10 iterations with no failure

But on FreeBSD 12.4-RELEASE r372781:
REL_12_STABLE - 20 iterations with no failure
REL_15_STABLE - 20 iterations with no failure

BTW, I also retested 7389aad63 on FreeBSD 14 and got no failures in 100
iterations.

Best regards,
Alexander
