| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie> |
| Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations |
| Date: | 2022-02-20 00:12:31 |
| Message-ID: | 87697D0B-3A41-4A5B-B9CF-85D989361E27@anarazel.de |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
(On phone, so crappy formatting and no source access)
On February 19, 2022 3:08:41 PM PST, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> Another testing strategy occurs to me: we could stress-test the
>> implementation by simulating an environment where the no-cleanup-lock
>> path is hit an unusually large number of times, possibly a fixed
>> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
>> ConditionalLockBufferForCleanup() call return false randomly. Now that
>> we have lazy_scan_noprune for the no-cleanup-lock path (which is as
>> similar to the regular lazy_scan_prune path as possible), I wouldn't
>> expect this ConditionalLockBufferForCleanup() testing gizmo to be too
>> disruptive.
>
>I tried this out, using the attached patch. It was quite interesting,
>even when run against HEAD. I think that I might have found a bug on
>HEAD, though I'm not really sure.
>
>If you modify the patch to simulate conditions under which
>ConditionalLockBufferForCleanup() fails about 2% of the time, you get
>much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze,
>without it being so aggressive as to make "make check-world" fail --
>which is exactly what I expected. If you are much more aggressive
>about it, and make it 50% instead (which you can get just by using the
>patch as written), then some tests will fail, mostly for reasons that
>aren't surprising or interesting (e.g. plan changes). This is also
>what I'd have guessed would happen.
>
>However, it gets more interesting. One thing that I did not expect to
>happen at all also happened (with the current 50% rate of simulated
>ConditionalLockBufferForCleanup() failure from the patch): if I run
>"make check" from the pg_surgery directory, then the Postgres backend
>gets stuck in an infinite loop inside lazy_scan_prune, which has been
>a symptom of several tricky bugs in the past year (not every time, but
>usually). Specifically, the VACUUM statement launched by the SQL
>command "vacuum freeze htab2;" from the file
>contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this
>misbehavior.
>This is a temp table, which is a choice made by the tests specifically
>because they need to "use a temp table so that vacuum behavior doesn't
>depend on global xmin". This is convenient way of avoiding spurious
>regression tests failures (e.g. from autoanalyze), and relies on the
>GlobalVisTempRels behavior established by Andres' 2020 bugfix commit
>94bc27b5.
We don't have a blocking path for cleanup locks of temporary buffers IIRC (normally not reachable). So I wouldn't be surprised if a cleanup lock failing would cause some odd behavior.
>It's quite possible that this is nothing more than a bug in my
>adversarial gizmo patch -- since I don't think that
>ConditionalLockBufferForCleanup() can ever fail with a temp buffer
>(though even that's not completely clear right now). Even if the
>behavior that I saw does not indicate a bug on HEAD, it still seems
>informative. At the very least, it wouldn't hurt to Assert() that the
>target table isn't a temp table inside lazy_scan_noprune, documenting
>our assumptions around temp tables and
>ConditionalLockBufferForCleanup().
Definitely worth looking into more.
This reminds me of a recent thing I noticed in the aio patch. Spgist can end up busy looping when buffers are locked, instead of blocking. Not actually related, of course.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Joseph Koshakow | 2022-02-20 00:16:54 | Re: Fix overflow in DecodeInterval |
| Previous Message | Justin Pryzby | 2022-02-19 23:53:09 | Re: set TESTDIR from perl rather than Makefile |