Re: database vacuum from cron hanging

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: alvherre(at)alvh(dot)no-ip(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: database vacuum from cron hanging
Date: 2005-10-12 14:26:26
Message-ID: 1046.1129127186@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>> (gdb) p BufferDescriptors[781]
>> $1 = {tag = {rnode = {spcNode = 1663, dbNode = 16385, relNode = 2666}, blockNum = 1}, flags = 70, usage_count = 5, refcount = 4294967294,
>> wait_backend_pid = 748, buf_hdr_lock = 0 '\0', buf_id = 781, freeNext = -2, io_in_progress_lock = 1615, content_lock = 1616}

> Whoa. refcount -2?
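
For anyone checking the arithmetic: gdb prints refcount as an unsigned value, and 4294967294 is what an unsigned 32-bit counter holds after two decrements past zero. A minimal sketch of that reading follows; the helper names drop_pins and signed_view are made up for illustration and are not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Drop n pins from an unsigned 32-bit counter.
 * Unsigned arithmetic wraps modulo 2^32 instead of going negative. */
static uint32_t drop_pins(uint32_t refcount, uint32_t n)
{
    return refcount - n;
}

/* View the same 32 bits as a signed integer.
 * int32_t is two's complement, so copying the bits is well defined. */
static int32_t signed_view(uint32_t raw)
{
    int32_t out;

    memcpy(&out, &raw, sizeof out);
    return out;
}
```

Here drop_pins(0, 2) yields 4294967294, and signed_view(4294967294u) yields -2, matching the value in the dump.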

After meditating overnight, I have a theory. There seem to be two basic
categories of possible explanations for the above state:

1. Some path of control decrements refcount more times than it increments it.
2. Occasionally, an intended increment gets lost.

Yesterday I was thinking in terms of #1, but it really doesn't seem to
fit the observed facts very well. I don't see a reason why such a bug
would preferentially affect pg_constraint_contypid_index; also it seems
like it would be fairly easily repeatable by many people. The pin
tracking logic is all internal to individual backends and doesn't look
very vulnerable to, say, timing-related glitches.

On the other hand, it's not hard to concoct a plausible explanation
using #2: suppose that two backends wanting to pin the same buffer at
about the same time pick up the same original value of refcount, add
one, store back. This is not supposed to happen of course, but maybe
the compiler is optimizing some code in a way that gives this effect
(i.e., by reading refcount before the buffer header spinlock has been
acquired). Now we can account for pg_constraint_contypid_index being
hit: we know you use domains a lot, and that uncached catalog search in
GetDomainConstraints would result in a whole lot of concurrent accesses
to that particular index, so it would be a likely place for such a bug
to manifest. And we can account for you being the only one seeing it:
this theory makes it compiler- and platform-dependent.
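
The lost-increment scenario can be replayed deterministically in a single thread by writing the interleaving out by hand. This is only an illustrative sketch: FakeBufferDesc and the function names are hypothetical, no real spinlock is involved, and it is not the bufmgr code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a shared buffer header (not the real BufferDesc). */
typedef struct
{
    uint32_t refcount;
} FakeBufferDesc;

/* Both "backends" read refcount before either one stores,
 * so the second store overwrites the first and one increment is lost. */
static void pin_twice_interleaved(FakeBufferDesc *buf)
{
    uint32_t seen_by_a = buf->refcount;   /* backend A reads */
    uint32_t seen_by_b = buf->refcount;   /* backend B reads the same value */

    buf->refcount = seen_by_a + 1;        /* A stores old + 1 */
    buf->refcount = seen_by_b + 1;        /* B stores old + 1 again */
}

/* Each read-modify-write finishes before the next one starts,
 * which is what holding the spinlock is meant to guarantee. */
static void pin_twice_serialized(FakeBufferDesc *buf)
{
    buf->refcount = buf->refcount + 1;
    buf->refcount = buf->refcount + 1;
}

/* Releasing a pin is a plain decrement; the unsigned counter wraps below zero. */
static void unpin(FakeBufferDesc *buf)
{
    buf->refcount = buf->refcount - 1;
}
```

Starting from refcount 0, pin_twice_interleaved leaves 1 where 2 was intended. Once both backends release their pins the counter wraps to 4294967295, which reads as -1. A counter showing -2 would correspond to two such lost increments.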

Accordingly: what's the platform exactly? (CPU type, and OS just in
case.) What compiler was used? (If gcc, show "gcc -v" output.)
Also please show the output of "pg_config".

regards, tom lane
