Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Date: 2024-07-23 22:39:49
Message-ID: CAD21AoBtupSb6xjJ-4E7YD+vmuc2iTd91BFtUFyLYCoEjCDFkg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 23, 2024 at 5:43 AM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> On Mon, Jul 22, 2024 at 10:54 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Jul 22, 2024 at 6:26 PM Melanie Plageman
> > <melanieplageman(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > > >
> > > > Melanie Plageman <melanieplageman(at)gmail(dot)com> writes:
> > > > > We've only run tests with this commit on some of the back branches for
> > > > > some of these animals. Of those, I don't see any failures so far. So,
> > > > > it seems the test instability is just related to trying to get
> > > > > multiple passes of index vacuuming reliably with TIDStore.
> > > >
> > > > > AFAICT, all the 32bit machine failures are timeouts waiting for the
> > > > > standby to catch up (mamba, gull, merswine). Unfortunately, the
> > > > > failures on copperhead (a 64 bit machine) are because we don't
> > > > > actually succeed in triggering a second vacuum pass. This would not be
> > > > > fixed by a longer timeout.
> > > >
> > > > Ouch. This seems to me to raise the importance of getting a better
> > > > way to test multiple-index-vacuum-passes. Peter argued upthread
> > > > that we don't need a better way, but I don't see how that argument
> > > > holds water if copperhead was not reaching it despite being 64-bit.
> > > > (Did you figure out exactly why it doesn't reach the code?)
> > >
> > > I wasn't able to reproduce the failure (failing to do > 1 index vacuum
> > > pass) on my local machine (which is 64 bit) without decreasing the
> > > number of tuples inserted. The copperhead failure confuses me because
> > > the speed of the machine should *not* affect how much space the dead
> > > item TIDStore takes up. I would have bet money that the same number
> > > and offsets of dead tuples per page in a relation would take up the
> > > same amount of space in a TIDStore on any 64-bit system -- regardless
> > > of how slowly it runs vacuum.
> >
> > Looking at copperhead's failure logs, I could not find that "VACUUM
> > (VERBOSE, FREEZE) vac_horizon_floor_table;" wrote the number of index
> > scans in logs. Is there any clue that made you think the test failed
> > to do multiple index vacuum passes?
>
> The vacuum doesn't actually finish because I have a cursor that keeps
> it from finishing and then I query pg_stat_progress_vacuum after the
> first index vacuuming round should have happened and it did not do the
> index vacuum:
>
> [20:39:34.645](351.522s) # poll_query_until timed out executing this query:
> #
> # SELECT index_vacuum_count > 0
> # FROM pg_stat_progress_vacuum
> # WHERE datname='test_db' AND relid::regclass =
> 'vac_horizon_floor_table'::regclass;
> #
> # expecting this output:
> # t
> # last actual query output:
> # f
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-07-22%2015%3A00%3A11
>
> I suppose it is possible that it did in fact time out and the index
> vacuum was still in progress. But most of the other "too slow"
> failures were when the standby was trying to catch up. Usually the
> pg_stat_progress_vacuum test fails because we didn't actually do that
> index vacuuming round yet.

Thank you for your explanation! I understood the test cases.

I figured out why two-round index vacuum was not triggered on
copperhead although it's a 64-bit system. In short, this test case
depends on MEMORY_CONTEXT_CHECK (or USE_ASSERT_CHECKING) being on.

In this test case, every BlocktableEntry size would be 16 bytes; the
header is 8 bytes and offset bitmap is 8 bytes (covering up to offset
63). We calculate the memory size (required_size in BumpAlloc()) to
allocate in a bump memory context as follows:

#ifdef MEMORY_CONTEXT_CHECKING
/* ensure there's always space for the sentinel byte */
chunk_size = MAXALIGN(size + 1);
#else
chunk_size = MAXALIGN(size);
#endif

(snip)

required_size = chunk_size + Bump_CHUNKHDRSZ;

Without MEMORY_CONTEXT_CHECK, if size is 16 bytes, required_size is
also 16 bytes as it's already 8-byte aligned and Bump_CHUNKHDRSZ is 0.
On the other hand with MEMORY_CONTEXT_CHECK, the requied_size is
bumped to 40 bytes as chunk_size is 24 bytes and Bump_CHUNKHDRSZ is 16
bytes. Therefore, with MEMORY_CONTEXT_CHECK, we allocate more memory
and use more Bump memory blocks, resulting in filling up TidStore in
the test cases. We can easily reproduce this test failure with
PostgreSQL server built without --enable-cassert. It seems that
copperhead is the sole BF animal that doesn't use --enable-cassert but
runs recovery-check.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2024-07-23 22:59:28 Re: xid_wraparound tests intermittent failure.
Previous Message Corey Huinker 2024-07-23 21:50:17 Re: Statistics Import and Export