Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Ben Chobot <bench(at)silentmedia(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17401: REINDEX TABLE CONCURRENTLY creates a race condition on a streaming replica
Date: 2022-04-20 07:14:54
Message-ID: Yl+y7kvCDR6x4V+E@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Mar 08, 2022 at 06:44:31PM +0300, Andrey Borodin wrote:
> I tried to dig into this again. And once again I simply cannot
> understand how it is supposed to work... I'd appreciate if someone
> explained it to me.
> When I do SELECT bt_index_check('idx',true) in fact I'm doing following:
> 1. regclassin() lookups 'idx', converts it into oid without taking any lock
> 2. bt_index_check() gets this oid and lookups index again.
> Do we rely on catalog snapshot here? Or how do we know that this oid
> is still of the index named 'idx' on standby? When backend will
> understand that this oid is no more of the interest and get an error
> somehow? I think it must happen during index_open().

Yeah, I can see that, and I think that it is a second issue, rather
independent of the cache lookup errors that we are seeing on a
standby. As far as I have tested, it is possible to see this error as
well with a sequence of CREATE/DROP INDEX CONCURRENTLY run repeatedly
on a primary with the standby doing a scan of pg_index, like that for
example in a tight loop:
SELECT bt_index_check(indexrelid) FROM pg_index WHERE indisvalid AND
indexrelid > 16000;

It is a matter of seconds to see bt_index_check() complain that one of
the so-said index is not valid, but the scan of pg_index was told to
not select such an index, so we are missing something with the
relation cache.

> BTW is anyone working on this bug?

It took me some time to come back to this thread, but I do now that we
are done with the commit fest business.

Speaking of the cache errors when working directly on the relations
rebuilt, I got a patch that seems to work properly, with a mix of
standby snapshots and AELs logged. I have been running a set of
REINDEX INDEX/TABLE CONCURRENTLY on the primary through more than 100k
relfilenodes without the standby complaining, and I should be able to
get this one fixed before the next minor release.
--
Michael

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Julien Rouhaud 2022-04-20 09:18:47 Re: Help Me! ^^ "must be superuser to make network requests"
Previous Message 김인철 2022-04-20 04:52:59 Help Me! ^^ "must be superuser to make network requests"