From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Subject: | Re: failures in t/031_recovery_conflict.pl on CI |
Date: | 2022-05-03 18:20:25 |
Message-ID: | 20220503182025.wvbebs2ojk6vpi5f@alap3.anarazel.de |
Lists: | pgsql-hackers |
Hi,
On 2022-05-03 01:16:46 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
> >> I can poke into that tomorrow, but are you sure that that isn't an
> >> expectable result?
>
> > It's not expected. But I think I might see what the problem is:
> > We wait for the FETCH (and thus the buffer pin to be acquired). But that
> > doesn't guarantee that the lock has been acquired. We can't check that with
> > pump_until() afaics, because there'll not be any output. But a query_until()
> > checking pg_locks should do the trick?
>
> Irritatingly, it doesn't reproduce (at least not easily) in a manual
> build on the same box.
Odd, given how readily it seems to reproduce on the bf. I assume you built with
> Uses -fsanitize=alignment -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS
> So it's almost surely a timing issue, and your theory here seems plausible.
Unfortunately I don't think my theory holds, because I actually had added a
defense against this into the test that I forgot about momentarily...
    # just to make sure we're waiting for lock already
    ok( $node_standby->poll_query_until(
            'postgres', qq[
    SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
    ], 'waiting'),
        "$sect: lock acquisition is waiting");
and on longfin that step completes successfully.
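For readers unfamiliar with poll_query_until, its behavior can be pictured roughly as a retry loop: run the query, compare the output with the expected string, and try again until it matches or a timeout expires. The following is only a minimal sketch in Python, not the Perl implementation in the test framework. The names poll_until and fake_pg_locks_query are hypothetical, the fake query merely stands in for running the pg_locks query on the standby, and the timeout and interval values are arbitrary.

```python
import time
from typing import Callable


def poll_until(run_query: Callable[[], str], expected: str,
               timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Re-run run_query until its stripped output equals expected.

    Returns True on a match, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if run_query().strip() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical stand-in for the pg_locks query on the standby:
# it reports no ungranted lock twice, then reports 'waiting'.
_outputs = iter(["", "", "waiting"])


def fake_pg_locks_query() -> str:
    return next(_outputs, "waiting")


lock_is_waiting = poll_until(fake_pg_locks_query, "waiting",
                             timeout=2.0, interval=0.01)
print(lock_is_waiting)
```

A successful poll of this kind only shows that the lock wait had started by the time the check ran. It says nothing about which kind of conflict recovery reports first afterwards.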
I think what happens is that we get a buffer pin conflict, because these days
we can actually process buffer pin conflicts while waiting for a lock. The
easiest way to get around that is to increase the replay timeout for that
test, I think?
I think we need a restart, not a reload, because reloads aren't guaranteed to
be processed at any certain point in time :/.
Testing a fix in a variety of timing circumstances now...
Greetings,
Andres Freund