From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <tmunro(at)postgresql(dot)org>, pgsql-committers <pgsql-committers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: pgsql: Test replay of regression tests, attempt II. |
Date: | 2022-01-20 04:23:30 |
Message-ID: | CA+hUKG+nHX+NNjm-ig0zWLxeMiivH8omey5Onfhnxzh6g524Cg@mail.gmail.com |
Lists: | pgsql-committers |
On Wed, Jan 19, 2022 at 12:08 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2022-01-18 17:19:06 -0500, Tom Lane wrote:
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > That's an extremely small shared_buffers for running the regression tests, it'd not
> > > be surprising if that provoked problems we don't otherwise see. Perhaps VACUUM
> > > ends up skipping over a page because of page contention?
> >
> > Hmm, good thought. I tried running the test with even smaller
> > shared_buffers, but could not make the reloptions test fall over for
> > me. But this theory implies a strong timing dependency, so it might
> > still only happen on particular machines. (If anyone else tries it:
> > below about 400kB, other tests start failing with "no free unpinned
> > buffers" and the like.)
>
> I ran the test in a loop for 200+ times now, without reproducing the
> problem. Rorqual runs on a shared machine though, so it's quite possible that
> IO will be slower, and thus triggering the issue.
>
> I was wondering whether we could use VACUUM VERBOSE for that specific VACUUM -
> that'd show information about the number of pages with tuples etc. But I don't
> currently see a way of that causing the regression tests to fail.
>
> Even if I set client_min_messages=error, the messages still get sent to the
> client, because elevel == INFO is special cased in
> should_output_to_client(). And I don't see a way of redirecting the output of
> common.c:NoticeProcessor() in psql either.
I hacked a branch thusly:
@@ -327,6 +327,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
     verbose = (params->options & VACOPT_VERBOSE) != 0;
     instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
                               params->log_min_duration >= 0));
+    instrument = true;
     if (instrument)
     {
         pg_rusage_init(&ru0);
Having failed to reproduce this locally, I clicked on "re-run tests"
all afternoon on CI until eventually I captured a failure log[1]
there, with the smoking gun:
pages: 0 removed, 1 remain, 1 skipped due to pins, 0 skipped frozen
There are three places that skip and bump that counter, but two of
them were disabled when I added DISABLE_PAGE_SKIPPING, leaving this
one:
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
    {
        UnlockReleaseBuffer(buf);
        vacrel->scanned_pages++;
        vacrel->pinskipped_pages++;
        if (hastup)
            vacrel->nonempty_pages = blkno + 1;
        continue;
    }
Since this page doesn't require wraparound vacuuming, if we fail to
conditionally acquire the cleanup lock, this block skips the page.
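
To make that decision easier to follow, here is a minimal standalone sketch of it. This is a toy model, not the real vacuumlazy.c code: the VacCounters struct, the visit_page() function and the cleaned_pages counter are hypothetical names invented for illustration, and the two booleans merely stand in for the result of the conditional cleanup-lock attempt and of lazy_check_needs_freeze().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the counters kept in the real vacrel struct. */
typedef struct
{
    int scanned_pages;
    int pinskipped_pages;
    int cleaned_pages;      /* hypothetical: pages that were fully processed */
} VacCounters;

/*
 * Toy model of visiting one page.
 *
 * got_cleanup_lock: did the conditional cleanup-lock attempt succeed?
 * needs_freeze:     stand-in for what lazy_check_needs_freeze() reports.
 *
 * Returns true if the page was skipped because of a pin.
 */
static bool
visit_page(VacCounters *c, bool got_cleanup_lock, bool needs_freeze)
{
    if (!got_cleanup_lock && !needs_freeze)
    {
        /* Page is pinned by someone else and nothing forces a wait: skip. */
        c->scanned_pages++;
        c->pinskipped_pages++;
        return true;
    }

    /* Lock obtained, or we would have to wait for it (waiting not modeled). */
    c->scanned_pages++;
    c->cleaned_pages++;
    return false;
}
```

In this model, a one-page table whose only page is pinned at the wrong moment ends up with one page counted as skipped due to pins and none processed, which is the shape of the log line above.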