Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Bowen Shi <zxwsbg12138(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Date: 2024-05-14 23:33:18
Message-ID: CAAKRu_bioPMfwpA2zgK=eNq402YsvTCrxfQ_o0PvCBejFsTu=A@mail.gmail.com
Lists: pgsql-bugs

On Mon, May 13, 2024 at 11:42 PM Bowen Shi <zxwsbg12138(at)gmail(dot)com> wrote:
>
> On Mon, May 13, 2024 at 10:42 PM Melanie Plageman <melanieplageman(at)gmail(dot)com> wrote:
>>
>> On Sun, May 12, 2024 at 11:19 PM Bowen Shi <zxwsbg12138(at)gmail(dot)com> wrote:
>> >
>> > Hi,
>> >>
>> >> Obviously we should actually fix this on back branches, but could we
>> >> at least make the retry loop interruptible in some way so people could
>> >> use pg_cancel/terminate_backend() on a stuck autovacuum worker or
>> >> vacuum process?
>> >
>> >
>> > If the problem happens in versions <= PG 16, we don't have a good solution (the vacuum process holds the exclusive lock, which causes checkpoints to hang).
>> >
>> > Maybe we can make the retry loop interruptible first. However, since we are using START_CRIT_SECTION, we cannot simply use CHECK_FOR_INTERRUPTS to handle it.
>>
>> As far as I can tell, in 14 and 15, the versions where the issue
>> reported here is present, there is not a critical section in the
>> section of code looped through in the retry loop in lazy_scan_prune().
>
>
> You are correct. I tried again to add CHECK_FOR_INTERRUPTS() in the retry loop, but when attempting to interrupt it with pg_terminate_backend(), the value of InterruptHoldoffCount is 1, so the vacuum still does not end.

Yes, great point. Actually, Andres and I discussed this today
off-list, and he reminded me that because vacuum is holding a content
lock on the page here, InterruptHoldoffCount will be at least 1. We
could RESUME_INTERRUPTS(), but we probably don't want to process
interrupts while holding the page lock here if we don't do it in other
places. And it's hard to say under what conditions we would want to
drop the page lock here.
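
For anyone following along who is less familiar with this machinery, here is a
minimal standalone C sketch. To be clear, this is a toy model I put together,
not PostgreSQL source: the real HOLD_INTERRUPTS()/RESUME_INTERRUPTS() macros and
CHECK_FOR_INTERRUPTS() live in src/include/miscadmin.h, and the holdoff in
question comes from LWLockAcquire() on the buffer's content lock. It just
illustrates why CHECK_FOR_INTERRUPTS() is a no-op while the holdoff count is
nonzero:

#include <stdbool.h>
#include <stdio.h>

static volatile int  InterruptHoldoffCount = 0;
static volatile bool InterruptPending = false;

#define HOLD_INTERRUPTS()   (InterruptHoldoffCount++)
#define RESUME_INTERRUPTS() (InterruptHoldoffCount--)

static void
CHECK_FOR_INTERRUPTS(void)
{
    /* The real ProcessInterrupts() bails out unless the holdoff count is 0. */
    if (InterruptPending && InterruptHoldoffCount == 0)
    {
        printf("interrupt processed: backend would error out here\n");
        InterruptPending = false;
    }
}

int
main(void)
{
    InterruptPending = true;    /* e.g. pg_terminate_backend() has signalled us */

    HOLD_INTERRUPTS();          /* stand-in for taking the page's content lock */
    CHECK_FOR_INTERRUPTS();     /* no-op: holdoff count is 1 */
    printf("still in the retry loop, InterruptHoldoffCount = %d\n",
           InterruptHoldoffCount);

    RESUME_INTERRUPTS();        /* only safe once the content lock is released */
    CHECK_FOR_INTERRUPTS();     /* now the pending interrupt would be serviced */
    return 0;
}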

Are you reproducing the hang locally with my repro? Or do you have
your own repro? How are you testing pg_terminate_backend() and seeing
that InterruptHoldoffCount is 1?

>> We can actually fix the particular issue I reproduced with the
>> attached patch. However, I think it is still worth making the retry
>> loop interruptible in case there are other ways to end up infinitely
>> looping in the retry loop in lazy_scan_prune().
>
>
> I attempted to apply the patch to the REL_15_STABLE branch, but encountered some conflicts. May I ask which branch you are using?

Sorry, I should have mentioned that patch was against REL_14_STABLE.
Attached patch has the same functionality but should apply cleanly
against REL_15_STABLE.

- Melanie

Attachment Content-Type Size
fix_hang_15.patch text/x-patch 4.9 KB
