Re: Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM
Date: 2017-10-30 22:34:05
Message-ID: 11584.1509402845@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> Hmm. The index offnum being complained of is one past the end of the
> lp array. I think I see what about that commit changed the behavior:
> the old code for PageIndexDeleteNoCompact never changed the length
> of the lp array, except in the corner case where the page is becoming
> completely empty. The new code will shorten the lp array (decrease
> phdr->pd_lower) if it's told to remove the last item. So if you make
> two successive calls specifying the same offnum, and it's the last one
> on the page, the second one will fail with the symptoms we see here.
> However, so far as I can see, a sequence like that would have failed
> before too, just with a different error message, because once the
> first call had marked the item unused, the second call would not see
> it as a candidate to match. So I'm not quite sure how that's related
> ... but it seems like it must be.

I'm still confused about why it didn't fail before, but after adding
some additional code to log each call of PageIndexTupleDeleteNoCompact,
I think I've got a smoking gun:

2017-10-30 18:18:44.321 EDT [10932] LOG: deleting tuple 292 (of 292) in rel brin_test_c_idx page 2
2017-10-30 18:18:44.321 EDT [10932] STATEMENT: vacuum brin_test
2017-10-30 18:18:44.393 EDT [10932] LOG: deleting tuple 292 (of 292) in rel brin_test_d_idx page 2
2017-10-30 18:18:44.393 EDT [10932] STATEMENT: vacuum brin_test
2017-10-30 18:18:53.428 EDT [10932] LOG: deleting tuple 186 (of 186) in rel brin_test_e_idx page 3
2017-10-30 18:18:53.428 EDT [10932] STATEMENT: vacuum brin_test
2017-10-30 18:19:13.794 EDT [10938] LOG: deleting tuple 186 (of 186) in rel brin_test_e_idx page 4
2017-10-30 18:19:13.794 EDT [10938] STATEMENT: insert into brin_test select
mod(i,100000)/25,
mod(i,100000)/25,
mod(i,100000)/25,
mod(i,100000)/25,
md5((mod(i,100000)/25)::text)::uuid
from generate_series(1,100000) s(i)
2017-10-30 18:19:13.795 EDT [10932] LOG: deleting tuple 186 (of 185) in rel brin_test_e_idx page 4
2017-10-30 18:19:13.795 EDT [10932] STATEMENT: vacuum brin_test
2017-10-30 18:19:13.795 EDT [10932] PANIC: invalid index offnum: 186
2017-10-30 18:19:13.795 EDT [10932] STATEMENT: vacuum brin_test

So what happened here is that the inserting process decided to
summarize concurrently with the VACUUM process, and the inserting
process deleted (or maybe just updated/moved?) the placeholder tuple
that VACUUM was expecting to delete, and then we get the PANIC because
the tuple we're expecting to delete is already gone.

So: I put the blame on the fact that summarize_range() thinks that
the tuple offset it has for the placeholder tuple is guaranteed to
hold good, even across possibly-long intervals where it's holding
no lock on the containing buffer.

Fixing this without creating new problems is beyond my familiarity
with the BRIN code. It looks like it might be nontrivial :-(

regards, tom lane

Attachment Content-Type Size
hack-to-log-PageIndexTupleDeleteNoCompact.patch text/x-diff 5.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Badrul Chowdhury 2017-10-31 01:19:48 Re: Re: protocol version negotiation (Re: Libpq PGRES_COPY_BOTH - version compatibility)
Previous Message Tom Lane 2017-10-30 21:43:54 Re: Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM