From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Vadim Mikheev <vmikheev(at)sectorbase(dot)com> |
Cc: | pgsql-hackers(at)postgreSQL(dot)org |
Subject: | Assuming that TAS() will succeed the first time is verboten |
Date: | 2000-12-28 20:54:50 |
Message-ID: | 5926.978036890@sss.pgh.pa.us |
Lists: | pgsql-hackers |

I have been digging into the observed failure

    FATAL: Checkpoint lock is busy while data base is shutting down

on some Alpha machines. It apparently doesn't happen on all Alphas,
but it's quite reproducible on some of them.

The bottom line turns out to be that on the Alpha hardware, it is
possible for TAS() to fail even when the lock is initially zero,
because that hardware's locking protocol will fail to acquire the
lock if the ldq_l/stq_c sequence is interrupted. TAS() *must* be
called in a retry loop on Alphas. Thus, the coding presently in
xlog.c,

    while (TAS(&(XLogCtl->chkp_lck)))
    {
        struct timeval delay = {2, 0};

        if (shutdown)
            elog(STOP, "Checkpoint lock is busy while data base is shutting down");
        (void) select(0, NULL, NULL, NULL, &delay);
    }

is no good: in the shutdown case it treats the first failed TAS() as
proof that the lock is held, and so does not allow for multiple retries.

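To illustrate the failure mode, here is a minimal single-threaded sketch in C. Everything in it is hypothetical and is not PostgreSQL code: `sim_lock`, `sim_tas()` and `tas_with_retry()` are invented names, and the `spurious_left` counter merely simulates an interrupted ldq_l/stq_c sequence. It shows that one failed attempt says nothing about whether the lock is actually held, while a short retry loop distinguishes the two cases.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simulated lock; NOT the real XLogCtl->chkp_lck. */
typedef struct
{
    int locked;         /* 0 = free, 1 = held */
    int spurious_left;  /* simulated spurious failures still to come */
} sim_lock;

/*
 * Simulated TAS(): returns 0 on success, nonzero on failure.
 * While spurious_left > 0 it fails even if the lock is free, standing in
 * for an interrupted load-locked/store-conditional sequence.
 */
static int
sim_tas(sim_lock *l)
{
    if (l->spurious_left > 0)
    {
        l->spurious_left--;
        return 1;               /* spurious failure; lock state unchanged */
    }
    if (l->locked)
        return 1;               /* genuine failure; lock is held */
    l->locked = 1;
    return 0;
}

/*
 * Try TAS up to max_tries times before concluding that the lock is busy.
 * Returns true if the lock was acquired.
 */
static bool
tas_with_retry(sim_lock *l, int max_tries)
{
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++)
    {
        if (sim_tas(l) == 0)
            return true;
    }
    return false;
}
```

With a free lock and three simulated spurious failures, a single `sim_tas()` call reports failure, whereas `tas_with_retry(&lock, 10)` acquires it. A lock that is genuinely held still fails after all ten attempts.
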
Offhand I see no good reason why the above-quoted code isn't just

    S_LOCK(&(XLogCtl->chkp_lck));

and propose to fix this problem by reducing it to that. If the lock
is held when it shouldn't be, we'll fail with a stuck-spinlock error.

It also bothers me that xlog.c contains several places where there is a
potentially infinite wait for a lock. It seems to me that these should
time out with stuck-spinlock messages. Do you object to such a change?
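
As a rough sketch of what a bounded wait could look like, the following C fragment uses C11's `atomic_flag` as the test-and-set primitive. It is an assumption-laden illustration, not PostgreSQL's actual S_LOCK implementation: `spin_acquire_bounded()`, `spin_release()` and the `max_spins` limit are hypothetical names, and a real implementation would likely sleep between attempts instead of spinning tightly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical bounded acquire: retry test-and-set at most max_spins times.
 * Returns true once the lock is acquired, or false if every attempt failed,
 * in which case the caller would report a stuck spinlock instead of
 * waiting forever.
 */
static bool
spin_acquire_bounded(atomic_flag *lock, long max_spins)
{
    assert(max_spins > 0);
    for (long i = 0; i < max_spins; i++)
    {
        /*
         * atomic_flag_test_and_set returns the previous value: false means
         * the flag was clear and this caller now owns the lock.
         */
        if (!atomic_flag_test_and_set(lock))
            return true;
    }
    return false;
}

/* Release the lock by clearing the flag. */
static void
spin_release(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}
```

A caller would treat a `false` return as the stuck-spinlock condition and raise an error at that point, so a lock that is never released produces a diagnosable failure in place of an indefinite hang.
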

			regards, tom lane