From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: subscriptionCheck failures on nightjar |
Date: | 2019-09-20 21:49:27 |
Message-ID: | 2636.1569016167@sss.pgh.pa.us |
Lists: | pgsql-hackers |
Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2019-09-20 16:25:21 -0400, Tom Lane wrote:
>> I recreated my freebsd-9-under-qemu setup and I can still reproduce
>> the problem, though not with high reliability (order of 1 time in 10).
>> Anything particular you want logged?
> A DEBUG2 log would help a fair bit, because it'd log some information
> about what changes the "horizons" determining when data may be removed.
Actually, what I did was as attached [1], and I am getting traces like
[2]. The problem seems to occur only when there are two or three
processes concurrently creating the same snapshot file. It's not
obvious from the debug trace, but the snapshot file *does* exist
after the music stops.
It is very hard to look at this trace and conclude anything other
than "rename(2) is broken, it's not atomic". Nothing in our code
has deleted the file: no checkpoint has started, nor do we see
the DEBUG1 output that CheckPointSnapBuild ought to produce.
But fsync_fname momentarily can't see it (and then later another
process does see it).
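For anyone following along, the sequence under suspicion can be sketched
roughly as below. This is not the actual snapshot-writing code: the
function names (fsync_path_sketch, write_file_atomically_sketch) and the
paths are made up for illustration, and the error handling is reduced to
return codes. The point is only the ordering: write a temp file, fsync
it, rename(2) it into place, then reopen the final name to fsync it. If
rename(2) is atomic, that reopen cannot fail with ENOENT unless something
unlinked the file in between.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: fsync an existing file or directory by path. */
int
fsync_path_sketch(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;          /* ENOENT here is the symptom in the trace */
    if (fsync(fd) != 0)
    {
        int save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    return close(fd);
}

/* Hypothetical helper: write data to tmp_path, then rename to final_path. */
int
write_file_atomically_sketch(const char *dir, const char *final_path,
                             const char *tmp_path,
                             const void *data, size_t len)
{
    const char *p = data;
    size_t      left = len;
    int         fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    while (left > 0)
    {
        ssize_t n = write(fd, p, left);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t) n;
    }
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Concurrent writers of the same file all funnel through this call. */
    if (rename(tmp_path, final_path) != 0)
        return -1;

    /* The step that momentarily cannot see the file on the broken OSes. */
    if (fsync_path_sketch(final_path) != 0)
        return -1;

    /* Some platforms refuse to fsync a directory; tolerate that here. */
    if (fsync_path_sketch(dir) != 0 && errno != EBADF && errno != EINVAL)
        return -1;
    return 0;
}
```

With two or three processes running this against the same final_path,
each rename(2) replaces an existing entry with an identical file, so on
a correct filesystem the name should be visible continuously.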
It is now apparent why we're only seeing this on specific ancient
platforms. I looked around for info about rename(2) not being
atomic, and I found this info about FreeBSD:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=94849
The reported symptom there isn't quite the same, so probably there
is another issue, but there is plenty of reason to be suspicious
that UFS rename(2) is buggy in this release. As for dromedary's
ancient version of macOS, Apple is exceedingly untransparent about
their bugs, but I found
http://www.weirdnet.nl/apple/rename.html
In short, what we have here is OS bugs that were probably
resolved years ago.
The question is what to do next. Should we just retire these
specific buildfarm critters, or do we want to push ahead with
getting rid of the PANIC here?
regards, tom lane