From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Peter Geoghegan <pg(at)bowt(dot)ie>, David Rowley <dgrowley(at)gmail(dot)com>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Trigger more frequent autovacuums of heavy insert tables
Date: 2025-02-25 16:06:59
Message-ID: Z73qo9Pqr1qBbq0E@nathan
Lists: pgsql-hackers

On Tue, Feb 25, 2025 at 10:38:48AM -0500, Robert Haas wrote:
> On Mon, Feb 24, 2025 at 6:35 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
>> I'm on the fence about adding a PANIC. We do PANIC in other places
>> where we notice corruption (like PageAddItemExtended()). But in most
>> of those cases, it seems like we are PANICing because there isn't a
>> reasonable way to accomplish the intended task. In this case, we
>> probably can't trust the visibility map counts for that page, but the
>> pg_class columns are just estimates, so just capping relallfrozen
>> might be good enough.
>
> +1 for not unnecessarily inflating the log level. I agree with the
> principle you articulate here: a PANIC is reasonable when there's no
> sane way to continue, but for other corruption events, a WARNING is
> better. Note that one really bad consequence of using ERROR or any
> higher level in VACUUM is that VACUUM doesn't ever complete. We just
> keep retrying and dying. Even if it's only an ERROR, now we're no
> longer controlling bloat or preventing wraparound. That's extremely
> painful for users, so we don't want to end up there if we have a
> reasonable alternative.
>
> Also, I think that it's usually a bad idea to use an Assert to test
> for data corruption scenarios. The problem is that programmers are
> more or less entitled to assume that an assertion will always hold
> true, barring bugs, and just ignore completely what the code might do
> if the assertion turns out to be false. But data does get corrupted on
> disk, so you SHOULD think about what the code is going to do in such
> scenarios. Depending on the details, you might want to WARNING or
> ERROR or PANIC or try to repair it or try to just tolerate it being
> wrong or something else -- but you should be thinking about what the
> code is actually going to do, not just going "eh, well, that can't
> happen".
In this case, it sounds to me like we want production builds to tolerate
visibility map corruption (at least, to some extent), but we don't really
expect to see it regularly. If that's true, then my inclination would be
to add assertions to make sure nobody has fundamentally broken the code,
but to at most either warn or restrict the values for non-assert builds.
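
If we did go the warn-and-restrict route, I would expect it to look
roughly like this (a rough sketch with made-up variable names, not the
actual patch):

    /*
     * The all-frozen count can never legitimately exceed the
     * all-visible count.  Let assert-enabled builds fail loudly so we
     * notice fundamentally broken code, but have production builds at
     * most warn and cap the value, since these pg_class columns are
     * only estimates anyway.
     */
    Assert(relallfrozen <= relallvisible);
    if (relallfrozen > relallvisible)
    {
        ereport(WARNING,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("all-frozen count (%u) exceeds all-visible count (%u) in relation \"%s\"",
                        relallfrozen, relallvisible,
                        RelationGetRelationName(rel))));
        relallfrozen = relallvisible;
    }
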
TBH my gut feeling is that, outside of assert-enabled builds, we should
just let the values be bogus if they are indeed just estimates that can be
wildly inaccurate anyway. I see little value in trying to gently corral
the value in certain cases, lest we mask bugs and give credence to bad
assumptions about the possible values.

I think the argument for PANICing in this case is that visibility map
corruption seems likely to be the tip of the iceberg. If we see that one
file is corrupted, I would expect a DBA to be extremely wary of trusting
the other files on the system, too. (I will freely admit that I might be
missing situations where race conditions or the like cause us to retrieve
seemingly invalid values for relpages and relallvisible, so please pardon
my ignorance as I dive deeper into this code.)
--
nathan