Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Date: 2024-07-22 13:25:31
Message-ID: CAAKRu_Y8x4vo3JMfdepC=+jwPk=2HPUQUcPcfkoXQ2_d9-BGqQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jul 21, 2024 at 12:51 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Melanie Plageman <melanieplageman(at)gmail(dot)com> writes:
> > When I run it on my machine with some added logging, the space taken
> > by dead items is about 330 kB more than maintenance_work_mem (which is
> > set to 1 MB). I could roughly double the excess by increasing the
> > number of inserted tuples from 400000 to 600000. I'll do this.
>
> So, after about two days in the buildfarm, we have failure reports
> from this test on gull, mamba, mereswine, and copperhead. mamba
> is mine, and I was able to reproduce the failure in a manual run.
> The problem seems to be that the test simply takes too long and
> we hit the default 180-second timeout on one step or another.
> I was able to make it pass by dint of
>
> $ export PG_TEST_TIMEOUT_DEFAULT=1800
>
> However, the test then took 908 seconds:

Thanks for taking the time to do this. If the test failures can be
fixed by increasing the timeout, that at least means multiple rounds
of index vacuuming are reliably triggered with that number of rows.
Obviously we can't have a super slow, flaky test, but I was worried
the test might fail on some platforms because the row count was
insufficient to cause multiple index vacuums there (the size of the
adaptive radix tree depends on many factors).

> $ time make installcheck PROVE_TESTS=t/043_vacuum_horizon_floor.pl
> ...
> # +++ tap install-check in src/test/recovery +++
> t/043_vacuum_horizon_floor.pl .. ok
> All tests successful.
> Files=1, Tests=3, 908 wallclock secs ( 0.17 usr 0.01 sys + 21.42 cusr 35.03 csys = 56.63 CPU)
> Result: PASS
> 909.26 real 22.10 user 35.21 sys
>
> This is even slower than the 027_stream_regress.pl test, which
> currently takes around 847 seconds on that machine.
>
> mamba, gull, and mereswine are 32-bit machines, which aside from
> being old and slow suffer an immediate 2x size-of-test penalty:
>
> >> # The TIDStore vacuum uses to store dead items is optimized for its target
> >> # system. On a 32-bit system, our example requires twice as many pages with
> >> # the same number of dead items per page to fill the TIDStore and trigger a
> >> # second round of index vacuuming.
> >> my $is_64bit = $node_primary->safe_psql($test_db,
> >> qq[SELECT typbyval FROM pg_type WHERE typname = 'int8';]);
> >>
> >> my $nrows = $is_64bit eq 't' ? 400000 : 800000;
>
> copperhead is 64-bit but is nonetheless even slower than the
> other three, so the fact that it's also timing out isn't
> that surprising.
>
> I do not think the answer to this is to nag the respective animal
> owners to raise PG_TEST_TIMEOUT_DEFAULT. IMV this test is simply
> not worth the cycles it takes, at least not for these machines.
> I'm not sure whether to propose reverting it entirely or just
> disabling it on 32-bit hardware. I don't think we'd lose anything
> meaningful in test coverage if we did the latter; but that won't be
> enough to make copperhead happy. I am also suspicious that we'll
> get bad news from other very slow animals such as dikkop.

I am happy to do what Peter suggests and move it to PG_TEST_EXTRA, to
disable it on 32-bit, or to revert it.
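
For illustration only, here is a rough sketch of what opt-in gating
through PG_TEST_EXTRA amounts to, written in shell rather than in the
test's own Perl. The token name vacuum_horizon_floor and the helper
pg_test_extra_has are hypothetical, and the sketch assumes the tokens
in PG_TEST_EXTRA are separated by single spaces:

```shell
# Hypothetical token name; not an existing PG_TEST_EXTRA value.
token=vacuum_horizon_floor

# Succeeds only when $1 appears as a whole space-separated word in
# PG_TEST_EXTRA. An unset or empty variable never matches.
pg_test_extra_has() {
    case " ${PG_TEST_EXTRA:-} " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

if pg_test_extra_has "$token"; then
    echo "run"
else
    echo "skip"
fi
```

With gating like this, slow animals would skip the test by default,
and anyone who wants the coverage would opt in, for example with
PG_TEST_EXTRA=vacuum_horizon_floor in front of the make installcheck
command quoted above.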

> I wonder if there is a less expensive way to trigger the test
> situation than brute-forcing things with a large index.
> Maybe the injection point infrastructure could help?

The issue with an injection point is that pausing the vacuuming
backend at a specific point is not enough; we also need to force a
refresh of GlobalVisState at that point. Even if the horizon moves
backward on the primary, this backend won't notice unless it has to
update its GlobalVisState. That often happens when taking a new
snapshot, but it also happens explicitly at the end of index
vacuuming.

- Melanie
