Re: BUG #10542: infinite loop in index.c when trying to reindex system tables (probably corrupted db state)

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: "hannes(dot)janetzek(at)gmail(dot)com" <hannes(dot)janetzek(at)googlemail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #10542: infinite loop in index.c when trying to reindex system tables (probably corrupted db state)
Date: 2014-06-09 14:33:39
Message-ID: 20140609143339.GA8406@alap3.anarazel.de
Lists: pgsql-bugs

Hi,

On 2014-06-07 18:11:01 +0200, hannes(dot)janetzek(at)gmail(dot)com wrote:
> On Fri, Jun 6, 2014 at 6:34 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
> wrote:
> > On 2014-06-05 23:00:56 +0000, hannes(dot)janetzek(at)gmail(dot)com wrote:
> > > While trying to get our database working again after a forced shutdown the
> > > reindexing of the system tables in single user mode went into an infinite
> > > loop.
> >
> > what happened in that infinite loop? The log excerpt below doesn't show
> > one? Is it constantly echoing a message?

> there were no messages from postgres while looping. I attached gdb and
> stepped through the lines and found the range around L2260-L2385 repeating.
> perf also showed only activity below IndexBuildHeapScan while tracing for a
> few minutes. From looking at the source my guess was that a tuple that is
> being indexed has a stale 'about-to-be-deleted-state'. The *very* well
> documented source states that 'we wait for the deleting transaction to
> finish and check again'. I wonder if a deleting transaction can be in
> progress in single-user mode while a reindex is running. Though I really
> don't have any clue about pg internals :)

It probably wasn't actually in progress. But when you're using
fsync=off and experience an OS-level crash, the data directory can get
into an inconsistent state. Since transaction IDs that are thought to
not yet be assigned are treated as being in progress, an issue like the
one you describe certainly is possible.
There's only a limited amount of defense one can build in against
basically arbitrary corruption.
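
To make the mechanism concrete, here is a toy Python model of it. This is
only a sketch and not PostgreSQL source: ToyXactState, is_in_progress,
wait_and_recheck and max_rounds are names made up for illustration, xid
wraparound is ignored, and the round limit exists only so the example
terminates.

```python
from dataclasses import dataclass, field


@dataclass
class ToyXactState:
    """Hypothetical, heavily simplified view of transaction bookkeeping."""
    latest_completed_xid: int                  # highest xid known to have finished
    running: set = field(default_factory=set)  # xids of transactions still open

    def is_in_progress(self, xid: int) -> bool:
        # An xid beyond anything the system believes it has handed out is
        # assumed to belong to a transaction that is still running.
        if xid > self.latest_completed_xid:
            return True
        return xid in self.running


def wait_and_recheck(state: ToyXactState, tuple_xid: int, max_rounds: int = 5):
    """Model of 'wait for the deleting transaction, then check again'.

    Returns the number of rechecks needed, or None if the tuple still looks
    in progress after max_rounds (standing in for an endless loop).
    """
    for round_no in range(max_rounds):
        if not state.is_in_progress(tuple_xid):
            return round_no
        # With no other session around to finish the transaction, the
        # "wait" returns immediately and nothing has changed.
    return None


state = ToyXactState(latest_completed_xid=1000)
print(wait_and_recheck(state, 900))   # xid from the past: resolved at once
print(wait_and_recheck(state, 1500))  # xid "from the future": never resolves
```

In this model a tuple stamped with an xid above latest_completed_xid, as
can be left behind when the counters and the heap disagree after a crash,
never stops looking in progress, so the recheck keeps repeating.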

> > could you explain how you got into the bad state? Are you using
> > fsync=off?

> Yes, the instance was running without fsync. We use it for rendering
> openstreetmap map tiles so content is not that critical.

I'd suggest writing off any database whose machine crashed while it was
being written to with fsync=off. The chance of hard-to-diagnose or
undetected corruption is just too high.

Greetings,

Andres Freund
