Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date: 2021-11-01 15:15:27
Message-ID: CAEze2Wj7O5tnM_U151Baxr5ObTJafwH=71_JEmgJV+6eBgjL7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, 29 Oct 2021 at 20:17, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> > I can propose the debugging patch to reproduce the issue that replaces
> > the hang with the assert and modifies a pair of crash-causing test
> > scripts to simplify the reproducing. (Sorry, I have no time now to prune
> > down the scripts further as I have to leave for a week.)
>
> This bug is similar to the one fixed in commit d9d8aa9b. And so I
> wonder if code like GlobalVisTestFor() is missing something that it
> needs for partitioned tables.

Without `autovacuum = off; fsync = off` I could not replicate the
issue in the configured 10m time window; with those options I did get
the reported trace in minutes.

I think that I also have found the culprit, which is something we
talked about in [0]: GlobalVisState->maybe_needed was not guaranteed
to never move backwards when recalculated, and because vacuum can
update its snapshot bounds (heap_prune_satisfies_vacuum ->
GlobalVisTestIsRemovableFullXid -> GlobalVisUpdate) this maybe_needed
could move backwards, resulting in the observed behaviour.

It was my understanding based on the mail conversation that Andres
would fix this observed issue too while fixing [0] (whose fix was
included with beta 2), but apparently I was wrong; I can't find the
code for 'maybe_needed'-won't-move-backwards-in-a-backend.

I (again) propose the attached patch, which ensures that this
maybe_needed field will not move backwards for a backend. It is
based on 14, but should be applied on head as well, because it's
lacking there as well.

Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).

Note: After fixing the issue with retreating maybe_needed I also hit
your segfault, and I'm still trying to find out what the source of
that issue might be. I do think it is an issue seperate from stuck
vacuum, though.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/20210609184506.rqm5rikoikm47csf%40alap3.anarazel.de#e9d55b5cfff34238a24dc85c8c75a46f

Attachment Content-Type Size
0001-Fix-stuck-vacuum-due-to-retreating-GlobalVisState-ma.patch application/octet-stream 2.0 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2021-11-01 21:09:00 BUG #17261: FK ON UPDATE CASCADE can break referential integrity with columns of different types
Previous Message Robert Haas 2021-11-01 14:17:46 Re: CREATE INDEX CONCURRENTLY does not index prepared xact's data