Re: Vacuum/visibility is busted

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Vacuum/visibility is busted
Date: 2013-02-07 17:32:23
Message-ID: CAMkU=1zqb0VTxbfRQqDNy4Zr5X8m0nuTa-CC6EDAO9yitpXUpw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 7, 2013 at 12:55 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> On Thu, Feb 7, 2013 at 11:09 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> While stress testing Pavan's 2nd pass vacuum visibility patch, I realized
>> that vacuum/visibility was busted. But it wasn't his patch that busted it.
>> As far as I can tell, the bad commit was in the range
>> 692079e5dcb331..168d3157032879
>>
>> Since a run takes 12 to 24 hours, it will take a while to refine that
>> interval.
>>
>> I was testing using the framework explained here:
>>
>> http://www.postgresql.org/message-id/CAMkU=1xoA6Fdyoj_4fMLqpicZR1V9GP7cLnXJdHU+iGgqb6WUw@mail.gmail.com
>>
>> Except that I increased JJ_torn_page to 8000, so that autovacuum has a
>> chance to run to completion before each crash; and I turned off archive_mode
>> as it was not relevant and caused annoying noise. As far as I know,
>> crashing is entirely irrelevant to the current problem, but I just used and
>> adapted the framework I had at hand.
>>
>> A tarball of the data directory is available below, for those who would
>> like to do a forensic inspection. The table jjanes.public.foo is clearly in
>> violation of its unique index.
>
> The xmins of all the duplicate tuples look dangerously close to 2^31.
> I wonder if XID wrap around has anything to do with it.
>
> Index scans do not return any duplicates and you need to force a seq
> scan to see them. Assuming that the page level VM bit might be
> corrupted, I tried to REINDEX the table to see if it complains of
> unique key violations, but that crashes the server with
>
> TRAP: FailedAssertion("!(((bool) ((root_offsets[offnum - 1] !=
> ((OffsetNumber) 0)) && (root_offsets[offnum - 1] <= ((OffsetNumber)
> (8192 / sizeof(ItemIdData)))))))", File: "index.c", Line: 2482)
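
For readers not used to expanded macros: the assertion quoted above appears to be an offset-validity check. It requires root_offsets[offnum - 1] to be nonzero and no larger than 8192 / sizeof(ItemIdData). Below is a minimal standalone sketch of that condition, assuming an 8192-byte block and a 4-byte line pointer. The names ItemIdDataSketch, MAX_OFFSET_SKETCH and offset_is_valid are illustrative stand-ins, not the PostgreSQL definitions.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t OffsetNumber;

/* Assumption: a stand-in for a 4-byte line pointer. */
typedef struct
{
    uint32_t bits;
} ItemIdDataSketch;

/* Assumption: 8192-byte block, as in the quoted assertion text. */
#define BLOCK_SIZE_SKETCH 8192
#define MAX_OFFSET_SKETCH \
    ((OffsetNumber) (BLOCK_SIZE_SKETCH / sizeof(ItemIdDataSketch)))

/*
 * Mirrors the quoted condition: the offset must be nonzero and must not
 * exceed the number of line pointers that could fit in one block.
 */
static bool
offset_is_valid(OffsetNumber off)
{
    return off != (OffsetNumber) 0 && off <= MAX_OFFSET_SKETCH;
}
```

Under these assumptions the upper bound is 2048, so a zero entry in root_offsets (no root recorded for that offset) is the case most likely to trip the check.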

I don't see the assertion failure myself. If I do REINDEX INDEX, it
fails with a duplicate key violation, but REINDEX TABLE and REINDEX
DATABASE both report success.

This is using either current head (ab0f7b6) or 168d315 as binaries to
start up the cluster.
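
On the XID wraparound question in the quoted text: 32-bit transaction IDs are compared circularly, so an xmin close to 2^31 away from the current XID sits near the point where "in the past" flips to "in the future". This does not establish that wraparound is involved here. It is only a minimal sketch of the modulo-2^32 comparison, with hypothetical names (XidSketch, xid_precedes), and it ignores the special handling of reserved XIDs.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t XidSketch;

/*
 * Returns true if a is logically older than b under modulo-2^32
 * arithmetic. The unsigned subtraction wraps, and the result is
 * reinterpreted as signed (assumes two's complement, as on all
 * common platforms).
 */
static bool
xid_precedes(XidSketch a, XidSketch b)
{
    int32_t diff = (int32_t) (a - b);

    return diff < 0;
}
```

With this comparison, an XID just under 2^31 behind another still counts as older, while one just over 2^31 behind is treated as newer, which is why xmins that far from the current XID are worth a second look.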

Cheers,

Jeff
