From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) |
Date: | 2023-09-09 09:00:00 |
Message-ID: | ee0e1ae4-ff12-7d56-72a8-a70e492d6287@gmail.com |
Lists: | pgsql-hackers |
Hi Thomas,
08.09.2023 22:39, Thomas Munro wrote:
>> With debugging logging added I see (on 7389aad63~1) that one process
>> really sends SIGURG to another, and the latter reaches poll(), but it
>> never gets the signal: its signal handler is not called and poll() just waits...
> Thanks for working so hard on this Alexander. That is a surprising
> discovery! So changes to the signal handler arrangements in the
> *postmaster* before the child was forked affected this?
Yes, I think we are dealing with something like that. I can try to find the
minimal change that affects reproduction of the issue, but maybe that's not so
important. Perhaps we should now think about escalating the problem to the
FreeBSD developers? I wonder what kind of reproducer they would find
acceptable: only a standalone C program, or would a script that
compiles/installs postgres and runs our test do too?
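To make the question more concrete, a standalone reproducer might take roughly the following shape. This is only a sketch of the pattern described above (one process sends SIGURG to another that waits in poll() and is woken through a self-pipe written by its signal handler). It is not PostgreSQL's latch code, it is written in Python rather than C for brevity, and the function name sigurg_wakes_poll and the readiness pipe are assumptions made for illustration. Whether something this small would actually trigger the lockup on FreeBSD 13/14 is untested.

```python
import os
import select
import signal


def sigurg_wakes_poll(timeout_ms=2000):
    """Return True if a forked child waiting in poll() is woken by SIGURG.

    Hypothetical helper for illustration; not taken from PostgreSQL.
    """
    wake_r, wake_w = os.pipe()    # self-pipe written by the signal handler
    ready_r, ready_w = os.pipe()  # child reports that its handler is installed

    pid = os.fork()
    if pid == 0:
        status = 2  # stays 2 only if something unexpected goes wrong
        try:
            os.set_blocking(wake_w, False)
            # The handler makes the pipe readable, so poll() sees the wakeup
            # even if the signal arrives just before poll() is entered.
            signal.signal(signal.SIGURG,
                          lambda signum, frame: os.write(wake_w, b"\0"))
            os.write(ready_w, b"\0")
            poller = select.poll()
            poller.register(wake_r, select.POLLIN)
            # 0 = woken by the signal, 1 = timed out without a wakeup.
            status = 0 if poller.poll(timeout_ms) else 1
        finally:
            os._exit(status)

    os.close(ready_w)  # so read() returns EOF if the child dies early
    try:
        os.read(ready_r, 1)          # wait until the child can catch SIGURG
        os.kill(pid, signal.SIGURG)
        _, wait_status = os.waitpid(pid, 0)
    finally:
        for fd in (wake_r, wake_w, ready_r):
            os.close(fd)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


if __name__ == "__main__":
    print("child woken by SIGURG:", sigurg_wakes_poll())
```

A False result would correspond to the symptom seen here: the signal is sent, but the waiting process stays in poll() until the timeout.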
>> So it looks like the ARM weak memory model is not the root cause of the
>> issue. But as far as I can see, it's still specific to FreeBSD (but not
>> specific to a compiler — I used gcc and clang with the same success).
> Idea: FreeBSD 13 introduced a new mechanism called sigfastblock[1],
> which lets system libraries control signal blocking with atomic memory
> tricks in a word of user space memory. I have no particular theory
> for why it would be going wrong here (I don't expect us to be using
> any of the stuff that would use it, though I don't understand it in
> detail so that doesn't say much), but it occurred to me that all
> reports so far have been on 13.x or 14. I wonder... If you have a
> good fast recipe for reproducing this, could you also try it on
> FreeBSD 12.4?
That was a lucky guess! I ran the reproduction on
FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212
and got the same results as on FreeBSD 14:
REL_12_STABLE - failed on iteration 3
REL_15_STABLE - failed on iteration 1
REL_16_STABLE - 10 iterations with no failure
But on FreeBSD 12.4-RELEASE r372781:
REL_12_STABLE - 20 iterations with no failure
REL_15_STABLE - 20 iterations with no failure
BTW, I also retested 7389aad63 on FreeBSD 14 and got no failures in 100
iterations.
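For reference, the "failed on iteration N" bookkeeping can be sketched as a small loop that repeats a test command and treats a hang as a failure. This is a hypothetical harness, not the script actually used for the numbers above; the function name first_failing_iteration, the timeout value and the placeholder command are all assumptions.

```python
import subprocess
import sys


def first_failing_iteration(cmd, iterations, timeout_s):
    """Run cmd up to `iterations` times.

    Returns the 1-based number of the first run that exits non-zero or
    exceeds timeout_s seconds, or None if every run succeeds.
    """
    for i in range(1, iterations + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return i  # a hang (the lockup case) counts as a failure
        if result.returncode != 0:
            return i
    return None


if __name__ == "__main__":
    # Placeholder command; substitute the real test invocation here.
    cmd = [sys.executable, "-c", "pass"]
    failed_at = first_failing_iteration(cmd, iterations=10, timeout_s=60)
    if failed_at is None:
        print("10 iterations with no failure")
    else:
        print(f"failed on iteration {failed_at}")
```

The timeout matters for a lockup: without it, the loop itself would hang on the first failing iteration instead of reporting it.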
Best regards,
Alexander