From: | Merlin Moncure <mmoncure(at)gmail(dot)com> |
---|---|
To: | Peter Geoghegan <pg(at)heroku(dot)com> |
Cc: | Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: hung backends stuck in spinlock heavy endless loop |
Date: | 2015-01-22 21:50:03 |
Message-ID: | CAHyXU0x7MPmW1v1kqB5Trb_z0no5w5QpK7_qFo0CYvNngyYsbA@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jan 16, 2015 at 5:20 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> ISTM the next step is to bisect the problem down over the weekend in
>> order to narrow the search. If that doesn't turn up anything
>> productive I'll look into taking other steps.
>
> That might be the quickest way to do it, provided you can isolate the
> bug fairly reliably. It might be a bit tricky to write a shell script
> that assumes a certain amount of time having passed without the bug
> tripping indicates that it doesn't exist, and have that work
> consistently. I'm slightly concerned that you'll hit other bugs that
> have since been fixed, given the large number of possible symptoms
> here.
Quick update: not done yet, but I'm making consistent progress, with
several false starts. (For example, I had a .conf problem with the
new dynamic shared memory setting and git merrily bisected down to the
introduction of the feature.)
I have to triple check everything :(. The problem is generally
reproducible but I get false negatives that throw off the bisection.
I estimate that early next week I'll have it narrowed down
significantly if not to the exact offending revision.
So far, the 'nasty' damage seems to generally, if not always, follow a
checksum failure, and in every checksum failure the calculated and
expected values are numerically close. For example:
[cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING: page
verification failed, calculated checksum 9465 but expected 9477 at
character 20
[cds2 21202 2015-01-22 13:10:18.172 CST 3196]WARNING: page
verification failed, calculated checksum 61889 but expected 61903 at
character 20
[cds2 29153 2015-01-22 14:49:04.831 CST 4803]WARNING: page
verification failed, calculated checksum 27311 but expected 27316
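To make the gap between the two values explicit, here is a minimal sketch that pulls the calculated and expected checksums out of the warnings quoted above and prints their difference. The regular expression and the helper name checksum_deltas are assumptions made for illustration; they are not part of PostgreSQL or of the tooling used in this thread.

```python
import re

# Matches the warning text quoted above. \s+ also matches newlines, so the
# pattern still works when the mail client has wrapped the log line.
CHECKSUM_RE = re.compile(
    r"page\s+verification\s+failed,\s+calculated\s+checksum\s+(\d+)"
    r"\s+but\s+expected\s+(\d+)"
)


def checksum_deltas(log_text):
    """Return (calculated, expected, expected - calculated) per warning."""
    return [
        (int(calc), int(exp), int(exp) - int(calc))
        for calc, exp in CHECKSUM_RE.findall(log_text)
    ]


# The three warnings from this message, copied verbatim (including wrapping).
SAMPLE = """\
[cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING: page
verification failed, calculated checksum 9465 but expected 9477 at
character 20
[cds2 21202 2015-01-22 13:10:18.172 CST 3196]WARNING: page
verification failed, calculated checksum 61889 but expected 61903 at
character 20
[cds2 29153 2015-01-22 14:49:04.831 CST 4803]WARNING: page
verification failed, calculated checksum 27311 but expected 27316
"""

for calc, exp, delta in checksum_deltas(SAMPLE):
    print(f"calculated={calc} expected={exp} delta={delta}")
```

For the three warnings above the deltas are 12, 14 and 5.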
I'm not up on the intricacies of our checksum algorithm but this is
making me suspicious that we are looking at an improperly flipped
visibility bit via some obscure problem -- almost certainly with
vacuum playing a role. This fits the profile of catastrophic damage
that masquerades as numerous other problems. Or, perhaps, something
is flipping what it thinks is a visibility bit but on the wrong page.
I still haven't categorically ruled out pl/sh yet; that's something to
keep in mind.
In the 'plus' category, aside from flushing out this issue, I've had
zero runtime problems so far apart from the main problem; bisection
(at least on the 'bad' side) has been reliably judged by simply
counting the number of warnings/errors/etc. in the log. That's really
impressive.
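The log-counting check described above could be sketched roughly as follows, mapped onto the exit codes that git bisect run understands (0 = good, 1 = bad, 125 = skip this revision). The severity list, the function names and the ran_to_completion flag are assumptions made for illustration; this is not the actual script used for the bisection.

```python
import re

# Severity tags as they appear in the server log, e.g. "]WARNING: page ...".
# Which severities count as a failure is an assumption of this sketch.
PROBLEM_RE = re.compile(r"\b(WARNING|ERROR|FATAL|PANIC):")


def count_problems(log_text):
    """Count log entries whose severity is WARNING or worse."""
    return len(PROBLEM_RE.findall(log_text))


def bisect_exit_code(log_text, ran_to_completion=True):
    """Map one test run's log to a `git bisect run` exit code.

    0   -> good (no problems logged during the run)
    1   -> bad  (at least one warning/error logged)
    125 -> skip (the build or test run itself did not complete)
    """
    if not ran_to_completion:
        return 125
    return 1 if count_problems(log_text) > 0 else 0


demo_log = (
    "[cds2 29153 2015-01-22 14:49:04.831 CST 4803]WARNING: page\n"
    "verification failed, calculated checksum 27311 but expected 27316\n"
)
print(bisect_exit_code(demo_log))
```

Only the 'bad' verdict is trustworthy here: a clean log within a fixed time window does not prove the revision is good, which is where the false negatives come from. Anything unrelated that logs a warning (such as a configuration problem) will also mark a revision bad, so each verdict still needs a manual check.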
merlin