Re: recovering from "found xmin ... from before relfrozenxid ..."

From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: recovering from "found xmin ... from before relfrozenxid ..."
Date: 2020-08-18 11:29:05
Message-ID: CAN-RpxD8k_T4B8tgCWuu=0GnUEnN6NaepiUx8WMUPwUgHEgeuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 14, 2020 at 12:28 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Mon, Jul 13, 2020 at 2:12 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > 1. There's nothing to identify the tuple that has the problem, and no
> > way to know how many more of them there might be. Back-patching
> > b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> > part of this.
>
> I am in favor of backpatching such changes in cases where senior
> community members feel that it could help with hypothetical
> undiscovered data corruption issues -- if they're willing to take
> responsibility for the change. It certainly wouldn't be the first
> time. A "defense in depth" mindset seems like the right one when it
> comes to data corruption bugs. Early detection is really important.
>
> > Moreover, not everyone is as
> > interested in an extended debugging exercise as they are in getting
> > the system working again, and VACUUM failing repeatedly is a pretty
> > serious problem.
>
> That's absolutely consistent with my experience. Most users want to
> get back to business as usual now, while letting somebody else do the
> hard work of debugging.
>

Also even if you do trace the problem you still have to recover.

And sometimes I have found latent corruption from times when dbs were
running on older versions and older servers, making debugging largely a
futile exercise.

>
> > Therefore, one of my colleagues has - at my request - created a couple
> > of functions called heap_force_kill() and heap_force_freeze() which
> > take an array of TIDs.
>
> > So I have these questions:
> >
> > - Do people think it would me smart/good/useful to include something
> > like this in PostgreSQL?
>
> I'm in favor of it.
>

+1

Would be worth extending it with some functions to grab rows that have
various TOAST oids too.

>
> > - If so, how? I would propose a new contrib module that we back-patch
> > all the way, because the VACUUM errors were back-patched all the way,
> > and there seems to be no advantage in making people wait 5 years for a
> > new version that has some kind of tooling in this area.
>
> I'm in favor of it being *possible* to backpatch tooling that is
> clearly related to correctness in a fundamental way. Obviously this
> would mean that we'd be revising our general position on backpatching
> to allow some limited exceptions around corruption. I'm not sure that
> this meets that standard, though. It's hardly something that we can
> expect all that many users to be able to use effectively.
>
> I may be biased, but I'd be inclined to permit it in the case of
> something like amcheck, or pg_visibility, on the grounds that they're
> more or less the same as the new VACUUM errcontext instrumentation you
> mentioned. The same cannot be said of something like this new
> heap_force_kill() stuff.
>
> > - Any ideas for additional things we should include, or improvements
> > on the sketch above?
>
> Clearly you should work out a way of making it very hard to
> accidentally (mis)use. For example, maybe you make the functions check
> for the presence of a sentinel file in the data directory.
>

Agreed.

>
>
> --
> Peter Geoghegan
>
>
>

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Nancarrow 2020-08-18 11:32:00 Re: Libpq support to connect to standby server as priority
Previous Message Ashutosh Sharma 2020-08-18 11:21:38 Re: recovering from "found xmin ... from before relfrozenxid ..."