From: | Peter Geoghegan <pg(at)heroku(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: infinite loop in _bt_getstackbuf |
Date: | 2015-01-15 22:46:19 |
Message-ID: | CAM3SWZT7dCes=uOA3NAHYBA1kth=b4pXkszNLMPVtNAAYUp_wg@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Oct 30, 2014 at 10:46 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> (9.3.5 problem report)
I think I saw a similar issue on a 9.3.5 instance that was affected
by the "in pg_upgrade, remove pg_multixact files left behind by
initdb" issue (I ran the remediation recommended in the 9.3.5 release
notes). Multiple anti-wraparound vacuums were stuck following a PITR.
I resolved this (as far as I can tell) by killing the autovacuum
workers, and manually running VACUUM FREEZE. I have yet to do any root
cause analysis, but I think I could reproduce the problem.
> The fundamental structure of that function is an infinite loop. We
> break out of that loop when BTEntrySame(item, &stack->bts_btentry) or
> P_RIGHTMOST(opaque) and I'm sure that it's correct to think that, in
> theory, one of those things will eventually happen.
Not in theory - only in practice. L&Y specifically state:
"We wish to point out here that our algorithms do not prevent the
possibility of livelock (where one process runs indefinitely). This
can happen if a process never terminates because it keeps having to
follow link pointers created by other processes. This might happen in
the case of a process being run on a (relatively) very slow processor
in a multiprocessor system".
> But the index
> could be corrupted, most obviously by having a page where
> opaque->btpo_next points back to the current block number. If that
> happens, you need an immediate shutdown (or some clever gdb hackery)
> to terminate the VACUUM. That's unfortunate and unnecessary.
Merlin reported a bug that looked exactly like this. Hardware failure
may now explain the problem.
> It also looks like something we can fix, at a minimum by adding a
> CHECK_FOR_INTERRUPTS() at the top of that loop, or in some function
> that it calls, like _bt_getbuf(), so that if it goes into an infinite
> loop, it can at least be killed.
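
To make the suggestion above concrete, here is a minimal standalone sketch of a right-link walk that checks for a pending cancel request at the top of each iteration. It is not PostgreSQL code: SketchPage, walk_right(), sketch_interrupt_pending and sketch_cancel_after are all hypothetical names, the flag only stands in for what CHECK_FOR_INTERRUPTS() consults in a real backend, and the countdown exists purely to simulate a cancel arriving mid-loop.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

#define NO_BLOCK ((unsigned) -1)    /* hypothetical "no right sibling" marker */

/* Hypothetical, heavily simplified page: just a right-link and one key. */
typedef struct SketchPage
{
    unsigned next;                  /* plays the role of opaque->btpo_next */
    int      key;                   /* stand-in for the item being searched for */
} SketchPage;

typedef enum { WALK_FOUND, WALK_RIGHTMOST, WALK_CANCELLED } WalkResult;

/* Stand-in for a pending-cancel flag that a signal handler would set. */
static volatile sig_atomic_t sketch_interrupt_pending = 0;

/* Simulation hook only: raise the flag after this many iterations (-1 = never). */
static long sketch_cancel_after = -1;

WalkResult
walk_right(const SketchPage *pages, size_t npages, unsigned start, int key)
{
    unsigned blkno = start;

    for (;;)
    {
        /* Simulate a cancel request arriving asynchronously. */
        if (sketch_cancel_after >= 0 && sketch_cancel_after-- == 0)
            sketch_interrupt_pending = 1;

        /* Analogue of CHECK_FOR_INTERRUPTS() at the top of the loop. */
        if (sketch_interrupt_pending)
        {
            sketch_interrupt_pending = 0;
            return WALK_CANCELLED;
        }

        assert(blkno < npages);

        if (pages[blkno].key == key)
            return WALK_FOUND;          /* like BTEntrySame() matching */
        if (pages[blkno].next == NO_BLOCK)
            return WALK_RIGHTMOST;      /* like P_RIGHTMOST() */

        blkno = pages[blkno].next;      /* loops forever if the links are circular */
    }
}
```

With a page whose right-link points back at itself, neither normal exit condition is ever met, so the flag check is the only way out of the loop. The check does not diagnose the corruption; it only makes the loop killable.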
I think that it might be a good idea to detect circular _bt_moveright()
moves (the direct offender in Merlin's case, which has very similar
logic to your _bt_getstackbuf() problem case). I'm pretty
sure that it's exceptional for there to be more than 2 or 3 retries in
_bt_moveright(). It would probably be fine to consider the possibility
that we'll never finish once we get past 5 retries or something like
that. We'd then start keeping track of blocks visited, and raise an
error when a page was visited a second time.
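
A rough standalone sketch of that idea follows, assuming a toy page model rather than real nbtree structures. SketchPage, move_right_checked(), RETRY_LIMIT and the MoveResult codes are hypothetical names; the threshold of 5 is just the figure floated above, and a real patch would raise an error through the usual reporting path instead of returning a code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define NO_BLOCK    ((unsigned) -1) /* hypothetical "no right sibling" marker */
#define RETRY_LIMIT 5               /* assumed threshold before tracking starts */

/* Hypothetical, heavily simplified page: a right-link and a high key. */
typedef struct SketchPage
{
    unsigned next;
    int      high_key;
} SketchPage;

typedef enum { MOVE_FOUND, MOVE_RIGHTMOST, MOVE_CYCLE, MOVE_NOMEM } MoveResult;

MoveResult
move_right_checked(const SketchPage *pages, size_t npages,
                   unsigned start, int key)
{
    unsigned   blkno = start;
    int        retries = 0;
    bool      *seen = NULL;         /* allocated only once we get suspicious */
    MoveResult result;

    for (;;)
    {
        assert(blkno < npages);

        if (pages[blkno].next == NO_BLOCK)
        {
            result = MOVE_RIGHTMOST;    /* nowhere further right to go */
            break;
        }
        if (key <= pages[blkno].high_key)
        {
            result = MOVE_FOUND;        /* key belongs on this page */
            break;
        }

        /* The common case of a few moves pays no tracking cost at all. */
        if (++retries > RETRY_LIMIT)
        {
            if (seen == NULL && (seen = calloc(npages, sizeof(bool))) == NULL)
            {
                result = MOVE_NOMEM;
                break;
            }
            if (seen[blkno])
            {
                result = MOVE_CYCLE;    /* page visited a second time */
                break;
            }
            seen[blkno] = true;
        }

        blkno = pages[blkno].next;
    }

    free(seen);
    return result;
}
```

Starting the bookkeeping late still catches any cycle, because a circular chain keeps revisiting its pages after tracking begins. A long but legitimate chain of right-links simply runs to completion with each block recorded once.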
--
Peter Geoghegan