Re: BUG #17741: vacuum process hangs after pg_surgery manipulations

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: a(dot)kozhemyakin(at)postgrespro(dot)ru, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17741: vacuum process hangs after pg_surgery manipulations
Date: 2023-01-18 01:40:39
Message-ID: CAD21AoBYvTfc9E+3p6ecN2n=UsftggWaQiZo1xtYnObQ-uTiQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Jan 17, 2023 at 12:37 AM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> On 2023-Jan-09, PG Bug reporting form wrote:
>
> > On the REL_15_STABLE, you can hang vacuum freeze. Maybe this is not
> > desired?
> > https://www.postgresql.org/docs/current/pgsurgery.html
> >
> > reproduce script:
> > create extension pg_surgery;
>
> Using pg_surgery is the equivalent of introducing corruption in your
> data. It has, of course, completely valid uses, but if you break the
> system while using it, it's on you to fix it.
>
> The pg_surgery documentation you cite states:
>
> : These functions are unsafe by design and using them may corrupt (or
> : further corrupt) your database.
>
> So, you've been warned.

While this is completely true and I agree, can we improve this
situation somewhat so that it ends up with an error instead of getting
hanged?

In this case, the tuple with a = 1, the root of the HOT chain, was
killed, and the tuple with a = 2 was heap-only tuple and HOT-updated.
In heap_page_prune(), we normally can prune the tuple with a = 2 as
part of pruning its chain, but since the root tuple was already killed
we could not prune this tuple. Then, we ended up retrying
heap_page_prune() since we saw as if the tuple became dead since
heap_page_prune() looked. Normally retrying heap_page_prune() works
but in this case since we didn't have the root tuple it misses again,
and gets hanged after all. I think that we didn't have this hang
before 8523492d4e3 even in the same corruption case. One idea is to
improve this situation is that we have a sanity check that we have
retired due to the same tuple.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Sam.Mesh 2023-01-18 01:43:37 index not used for bigint without explicit cast
Previous Message Andres Freund 2023-01-17 20:04:32 Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1