From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgreSQL(dot)org |
Subject: | Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS |
Date: | 2014-12-15 15:49:29 |
Message-ID: | 20141215154929.GI5023@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2014-12-15 10:15:30 -0500, Tom Lane wrote:
> The CLOBBER_CACHE_ALWAYS buildfarm members occasionally fail in
> contrib/test_decoding with
>
> TRAP: FailedAssertion("!(((bool)((relation)->rd_refcnt == 0)))", File: "relcache.c", Line: 1981)
>
> for example here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-14%2005%3A50%3A09
> and here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-13%2021%3A26%3A05
Yes, I've been trying to debug that. I've finally managed to reproduce
locally. Unfortunately it's not directly reproducible on my laptop...
A fair amount of tinkering later I've found out that it's not actually
CLOBBER_CACHE_ALWAYS itself that triggers the problem but catchup
interrupts. It triggers with CLOBBER_CACHE_ALWAYS because that happens
to make parts of the code slow enough to not reach a ordinary
AcceptInvalidationMessages() in time.
The problem is that in the added regression test there can be a
situation where there's a relcache entry that's *not* currently visible
but still referenced. Consider:
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
CREATE TABLE somechange(id serial primary key);
INSERT INTO changeresult
SELECT data FROM pg_logical_slot_get_changes(...);
While get_changes stores it data into a tuplestore there can be a moment
where 'changeresult' has a refcnt > 0 due to the INSERT, but is
invisible because we're using a historic snapshot from when changeresult
doesn't exist.
Without catchup invalidations and/or an outside reference to a relation
that's normally not a problem because it won't get reloaded from the
caches at an inappropriate time while invisible. Until a few weeks ago
there was no regression test covering that case which is why these
crashes are only there now.
It triggers via:
RelationCacheInvalidate() ->
RelationClearRelation(relation, true) ->
/* Build temporary entry, but don't link it into hashtable */
newrel = RelationBuildDesc(save_relid, false);
if (newrel == NULL)
{
/* Should only get here if relation was deleted */
RelationCacheDelete(relation);
RelationDestroyRelation(relation, false);
elog(ERROR, "relation %u deleted while still in use", save_relid);
}
That path is only hit while refcnt > 0
RelationDestroyRelation() has
Assert(RelationHasReferenceCountZero(relation));
So that path doesn't really work... Without assertions we'd "just" get a
transient ERROR. I think that path should generally loose the
RelationDestroyRelation() - if it's referenced that's surely not the
right thing to do. I'm not actually convinced logical decoding is the
only way that assert can be reached.
Since logical decoding happens inside a subtransaction (when called via
SQL) and invalidate caches one relatively straightforward way to fix
this would be to add something roughly akin to:
Assert(!relation->rd_isvalid)
/*
* This should only happen when we're doing logical decoding where
* it can happen when [above].
*/
if (HistoricSnapshotActive())
return;
There's no chance that such a entry could survive too long in an invalid
state (minus preexisting bugs independent of decoding) since we hit the
same paths that subtransaction abort hits.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Bruce Momjian | 2014-12-15 15:50:12 | Re: Fractions in GUC variables |
Previous Message | Tom Lane | 2014-12-15 15:45:55 | Re: Join push-down support for foreign tables |