BUG #16582: Logical index corruption leading to apparent index scan infinite loop

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: jlucasdba(at)gmail(dot)com
Subject: BUG #16582: Logical index corruption leading to apparent index scan infinite loop
Date: 2020-08-14 20:50:49
Message-ID: 16582-4945b19d95e41c46@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16582
Logged by: James Lucas
Email address: jlucasdba(at)gmail(dot)com
PostgreSQL version: 11.5
Operating system: Centos 7
Description:

I have an 11.5 database with a declaratively partitioned table. The table
has around 40 partitions. We began observing a number of long running
sessions (over a week), all executing the same select query against this
table. Attempting to kill the sessions with pg_terminate_backend has no
effect.

I pulled several stack traces against the backend processes executing this
query and traced the issue back to the _bt_moveright function in
src/backend/access/nbtree/nbtsearch.c. This function appears to be
implemented as an infinite loop, which breaks when it finds a page that
matches what it's looking for. So my working theory is that I have a
corrupt index where that condition is never properly satisfied, resulting in
an endless loop.

I found a thread in the mailing list archive about a similar issue in one of
the other functions in this file
(https://www.postgresql.org/message-id/flat/CAM3SWZT7dCes%3DuOA3NAHYBA1kth%3Db4pXkszNLMPVtNAAYUp_wg%40mail.gmail.com#e9bf163da8f50f0f2845b59250da6ab5)
The _bt_moveright function does not have CHECK_FOR_INTERRUPTS in its loop,
which would explain why the sessions don't respond to
pg_terminate_backend.

The table has two indexes, so I decided to scan both indexes on all
partitions with the bt_index_check function from the amcheck extension. I
identified one partition where both indexes throw the following result:
ERROR: cross page item order invariant violated for index "xxxxx"
DETAIL: Last item on page tid(xx,xx) page lsn=xxxxxxxxxx

Since both indexes appear to have the same issue, it suggests to me
something about this table/data may be triggering a bug. I have not yet
tried reindexing. I did pg_dump/pg_restore the entire table (all
partitions) to another system (also 11.5), and the indexes built there test
okay.

Other possibly relevant information: The table is range partitioned on an
insert timestamp (timestamp(0) without timezone). The two indexes involved
are both composite indexes, over columns of types (numeric, numeric,
numeric) and (numeric, timestamp(6) without time zone). The partition key
is not covered by either index. There is also no primary key.

I realize it may not be possible to remote diagnose the root cause of the
index issues, but figured it was worth reporting. My next steps would
probably be to try reindexing the affected partition. If that doesn't work,
it seems like either a vacuum full or pg_dump/pg_restore would be the next
step. Any other suggestions appreciated.

Thanks,
James

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Peter Geoghegan 2020-08-14 21:32:57 Re: BUG #16582: Logical index corruption leading to apparent index scan infinite loop
Previous Message Noah Misch 2020-08-14 07:51:11 Re: BUG #16508: using multi-host connection string when the first host is starting fails