Re: BUG #17245: Index corruption involving deduplicated entries

From: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Kamigishi Rei <iijima(dot)yun(at)koumakan(dot)jp>, David Rowley <dgrowley(at)gmail(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17245: Index corruption involving deduplicated entries
Date: 2021-11-12 07:47:01
Message-ID: CAFh8B=k8WtQwRGVADGCXjU_ftwbM8reTfEAdsu_1BHRH_a1tpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi Peter,

On Fri, 12 Nov 2021 at 03:15, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

You have said elsewhere that you're sure that this isn't the parallel
> VACUUM bug, since you know that you didn't run a manual VACUUM, even
> once. So I wonder what the issue might be. Since you deleted duplicate
> rows from a unique index, there probably weren't very many affected
> rows in total.

It is a sharded setup, 16 shards in total. The table structure is similar
between shards.
Also, more or less the same tables were affected across all shards.
Depending on the table the number of affected rows was from one to a few
hundreds.
Picture looked similar across all shards, although, the set of tables was a
bit varying from shard to shard.
The first duplicates I resolved manually, but after rules became clear I
simply wrote queries that remove either everything but the first or the
last version of the row based on the timestamp.

> It sounds like a pretty subtle issue to me
> (particularly compared to the parallel VACUUM bug, which wasn't all
> that subtle when it hit at all).
>
> If I had to guess, I'd guess that it has something to do with the
> snapshot scalability work. Specifically, a recently reported issue
> involving confusion about the structure of HOT chains during pruning:
>
>
> https://www.postgresql.org/message-id/flat/20211110192010.ckvfzz352hsba5xf%40alap3.anarazel.de#4c3d9c9988164f5ea3c15999bcf50ce7
>
> Please join in on the other thread if you have anything more to add.
>

Unfortunately, not really. I only ran bt_index_check(index => r.oid,
heapallindexed => true), because bt_index_parent_check takes SharedLock on
the index and blocks DML, which we have a lot.
All errors looked similarly:
ERROR: heap tuple (314222,250) from table ""${table_name}"" lacks matching
index tuple within index ""${table_name}_pkey""
CONTEXT:
DETAIL: SQL statement ""SELECT bt_index_check(index => r.oid,
heapallindexed => true)""
HINT: Retrying verification using the function bt_index_parent_check()
might provide a more specific error.

>
> I could easily be wrong about that, though. You upgraded using
> pg_upgrade, right? That's certainly a confounding factor here.
>

Correct, pg_upgrade --link, from 9.6 to 14. Any other upgrade method is
resource and time consuming.

--
Alexander Kukushkin

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Sandeep Thakkar 2021-11-12 12:16:46 Re: Tenable Report Issue even after upgrading to correct Postgres version
Previous Message Kyotaro Horiguchi 2021-11-12 07:25:46 Re: BUG #17280: global-buffer-overflow on select from pg_stat_slru