Proposal for Resumable Vacuum (again ...)

From: Jay <jsudrikoss(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Proposal for Resumable Vacuum (again ...)
Date: 2024-03-24 04:27:15
Message-ID: CAPdcCKpvZiRCoDxQoo9mXxXAK8w=bX5NQdTTgzvHV2sUXp0ihA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I am aware of few previous attempts and discussions on this topic
(eventually shelved or didn't materialize):

- https://www.postgresql.org/message-id/45E2A6AE.1080805@oss.ntt.co.jp
-
https://www.postgresql.org/message-id/CA%2BTgmoZgapzekbTqdBrcH8O8Yifi10_nB7uWLB8ajAhGL21M6A%40mail.gmail.com

-
https://www.postgresql.org/message-id/flat/CAD21AoBqfMVWdk7Odh4A4OpF-m5GytRjXME5E8cEGXvhSJb8zw(at)mail(dot)gmail(dot)com

And still I want to revise this topic for the obvious benefits.

I do not have any patch or code changes ready. The changes could be tricky
and might need efforts, possibly some re-factoring. Hence, before starting
the effort, I would like to get the proposal reviewed and consensus built
to avoid redundancy of efforts.

*Why do we need it? *

Since more and more large businesses are on-boarding PostgreSQL, it is only
fair that we make the essential utilities like vacuum more manageable and
scalable. The data sizes are definitely going to increase and maintenance
windows will reduce with businesses operating across the time zones and
24x7. Making the database more manageable with the least overhead is going
to be definitely a pressing need.

To avoid the repetition and duplicate efforts, I have picked up the snippet
below from the previous email conversation on the community (Ref:
https://www.postgresql.org/message-id/45E2A6AE.1080805@oss.ntt.co.jp)

<Quote>
*For a large table, although it can be vacuumed by enabling vacuum
cost-based delay, the processing may last for several days (maybe hours).
It definitely has a negative effect on system performance. So if systems
which have maintenance time, it is preferred to vacuum in the maintenance
window. Vacuum tasks can be split into small subtasks, and they can be
scheduled into maintenance window time slots. This can reduce the impact of
vacuum to system service.*

*But currently vacuum tasks can not be split: if an interrupt or error
occurs during vacuum processing, vacuum totally forgets what it has done
and terminates itself. Following vacuum on the same table has to scan from
the beginning of the heap block. This proposal enable vacuum has capability
to stop and resume.*
</Quote>

*External Interface*

This feature is especially useful when the size of table/s is quite large
and their bloat is quite high and it is expected vacuum runs will take long
time.

Ref: https://www.postgresql.org/docs/current/sql-vacuum.html
vacuum [ ( *option* [, ...], *[{ for time = hh:mm}| {resume [for time =
hh:mm]}] *) ] [ *table_and_columns* [, ...] ]

The additional options give flexibility to run the vacuum for a limited
time and stop or resume the vacuum from the last time when it was stopped
for a given time.

When vacuum is invoked with ‘for time ...’ option it will store the
intermediate state of the dead tuples accumulated periodically on the disk
as it progresses. It will run for a specified time and stop after that
duration.

When vacuum is invoked with ‘for time ...’ option and is stopped
automatically after the specified time or interrupted manually and if it is
invoked next time with ‘resume’ option, it will try to check the stored
state of the last run and try to start as closely as possible from where it
left last time and avoid repetition of work.

When resumed, it can either run for a specified time again (if the duration
is specified) or run till completion if the duration is not specified.

When vacuum is invoked with ‘resume for’ option when there was no earlier
incomplete run or an earlier run with ‘for time’ option, the ‘for resume’
option will be ignored with a message in the errorlog.

When vacuum is invoked without ‘for time’ or ‘resume for’ options after
preceding incomplete runs with those options , then the persisted data from
the previous runs is discarded and deleted. This is important because
successive runs with ‘for time’ or ‘resume for’ assume the persisted data
is valid and there’s no run in between to invalidate it and the state of
heap pages in terms of vacuum is the same for the given saved vacuum
horizon.

In further discussion in the rest of this proposal, we will refer to vacuum
invoked with ‘for time’ or ‘resume for’ option as listed above as
‘resumable vacuum’.

Internal Changes (High level)For each table, vacuum progresses in the
following steps or phases (taken from the documentation)
https://www.postgresql.org/docs/current/progress-reporting.html#VACUUM-PHASES
<https://www.postgresql.org/docs/current/progress-reporting.html> :

1. Initializing - VACUUM is preparing to begin scanning the heap. This
phase is expected to be very brief.

2. Scanning heap - VACUUM is currently scanning the heap. It will prune and
defragment each page if required, and possibly perform freezing activity.
The heap_blks_scanned column can be used to monitor the progress of the
scan.

3. Vacuuming Indexes - VACUUM is currently vacuuming the indexes. If a
table has any indexes, this will happen at least once per vacuum, after the
heap has been completely scanned. It may happen multiple times per vacuum
if maintenance_work_mem
<https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-MAINTENANCE-WORK-MEM>
(or,
in the case of autovacuum, autovacuum_work_mem
<https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-AUTOVACUUM-WORK-MEM>
if
set) is insufficient to store the number of dead tuples found.

4. Vacuuming Heap - VACUUM is currently vacuuming the heap. Vacuuming the
heap is distinct from scanning the heap, and occurs after each instance of
vacuuming indexes. If heap_blks_scanned is less than heap_blks_total, the
system will return to scanning the heap after this phase is completed;
otherwise, it will begin cleaning up indexes after this phase is completed.

5. Cleaning up indexes - VACUUM is currently vacuuming the heap. Vacuuming
the heap is distinct from scanning the heap, and occurs after each instance
of vacuuming indexes. If heap_blks_scanned is less than heap_blks_total,
the system will return to scanning the heap after this phase is completed;
otherwise, it will begin cleaning up indexes after this phase is completed.

6. Truncating heap - VACUUM is currently truncating the heap so as to
return empty pages at the end of the relation to the operating system. This
occurs after cleaning up indexes.

7. Performing final cleanup - VACUUM is performing final cleanup. During
this phase, VACUUM will vacuum the free space map, update statistics in
pg_class, and report statistics to the cumulative statistics system. When
this phase is completed, VACUUM will end.

The resumable vacuum will store the following information during its run on
the disk:

1. Database Name/Id, Table Name/Oid
2. Phase of its run (from the ones listed in the table above)
3. Vacuum horizon (created during ‘initializing’ phase)
4. Array of dead tuple-ids accumulated in the ‘heap scan’ phase.
5. Number of times the dead tuple ids array was spilled over due to
memory limitation.

Out of the above information, #2 and #3 will be updated as it progresses.

When it is resumed, vacuum will first check the persisted information from
the last run. It will

1. Retrieve all the stored information on the disk from the last run
(listed above)
2. Check what phase the vacuum was during the last run when it stopped
3. Based on the phase, it will adapt the further actions based on that.
1. If the previous run was stopped in the initialization phase, then
the new run will start from scratch.
2. If the previous run was stopped during the heap scan phase, it
will use the persisted array of dead tids and it will start the scan from
the last accumulated dead tuple.
3. If the previous run was stopped during the ‘vacuuming index’
phase, it will start this phase all over, but from the beginning of the
stored dead tuple ids.
4. If the previous run was stopped during the ’vacuuming heap’ phase,
and without previous spillovers, it will continue vacuuming heap for the
rest of the dead tuple array and proceed for further phases. If
it was with
previous spillovers, then it will continue vacuuming the heap
for the rest
of the dead tuple array and go again for phase 2 if the heap
scan is still
incomplete and follow the loop which is already there.
5. If the previous run was stopped during any of the remaining
phases, it will just complete the remaining work and exit.

Indexes can change across the runs. Please note that in this proposal any
re-run above does not depend on the last state of the indices. Any actions
in this whole proposal does not depend on the last state of indices nor
does it store it.

This approach does not add any overhead in the DML code path. The changes
are limited only to vacuum operation and just enough to make it resumable.
There are no drastic changes to the regular flow.

This approach doesn’t change the core functions apart from conditionally
persisting the vacuum progress information to the disk. Thus, any future
enhancements to the core functions can be easily accommodated.

Please let me know or comment on this so that we can conclude if this does
look like a reasonable enhancement.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-03-24 05:10:19 Re: Introduce XID age and inactive timeout based replication slot invalidation
Previous Message Bharath Rupireddy 2024-03-24 02:39:19 Re: [HACKERS] make async slave to wait for lsn to be replayed