Re: Track the amount of time waiting due to cost_delay

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Track the amount of time waiting due to cost_delay
Date: 2024-06-13 11:56:26
Message-ID: ZmreasV0BKZ5BHIM@ip-10-97-1-34.eu-west-3.compute.internal
Lists: pgsql-hackers

Hi,

On Tue, Jun 11, 2024 at 08:26:23AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:
>
> > Thank you for the proposal and the patch. I understand the motivation
> > of this patch.
>
> Thanks for looking at it!
>
> > Beside the point Nathan mentioned, I'm slightly worried
> > that massive parallel messages could be sent to the leader process
> > when the cost_limit value is low.
>
> I see, I can/will do some testing in this area and share the numbers.

Here is the result of the test. It has been launched several times and produced
the same (surprising) result each time.

====================== Context ================================================

The testing has been done with this relation (large_tbl) and its indexes:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
FROM pg_buffercache b JOIN pg_class c
ON b.relfilenode = pg_relation_filenode(c.oid) AND
b.reldatabase IN (0, (SELECT oid FROM pg_database
WHERE datname = current_database()))
JOIN pg_namespace n ON n.oid = c.relnamespace
GROUP BY n.nspname, c.relname
ORDER BY 3 DESC
LIMIT 22;

nspname | relname | buffers
---------+--------------------+---------
public | large_tbl | 222280
public | large_tbl_filler13 | 125000
public | large_tbl_filler6 | 125000
public | large_tbl_filler5 | 125000
public | large_tbl_filler3 | 125000
public | large_tbl_filler15 | 125000
public | large_tbl_filler4 | 125000
public | large_tbl_filler20 | 125000
public | large_tbl_filler18 | 125000
public | large_tbl_filler14 | 125000
public | large_tbl_filler8 | 125000
public | large_tbl_filler11 | 125000
public | large_tbl_filler19 | 125000
public | large_tbl_filler7 | 125000
public | large_tbl_filler1 | 125000
public | large_tbl_filler12 | 125000
public | large_tbl_filler9 | 125000
public | large_tbl_filler17 | 125000
public | large_tbl_filler16 | 125000
public | large_tbl_filler10 | 125000
public | large_tbl_filler2 | 125000
public | large_tbl_pkey | 5486
(22 rows)

All of them completely fit in memory (to avoid I/O read latency during the vacuum).

The configuration, outside of the defaults, is:

max_wal_size = 4GB
shared_buffers = 30GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 8
max_parallel_workers = 10
vacuum_cost_limit = 10
autovacuum_vacuum_cost_limit = 10

My system is not overloaded, has enough resources to run this test and only this
test is running.

====================== Results ================================================

========== With v2 (attached) applied on master

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1146873.016 ms (19:06.873)

The duration breaks down this way:

Vacuum phase: cumulative time (cumulative time delayed)
=======================================================
scanning heap: 00:08:16.414628 (time_delayed: 444370)
vacuuming indexes: 00:14:55.314699 (time_delayed: 2545293)
vacuuming heap: 00:19:06.814617 (time_delayed: 2767540)
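
These per-phase figures were obtained by polling pg_stat_progress_vacuum while
the VACUUM was running. A minimal sketch of how that can be done, assuming
samples are collected once per second (e.g. with \watch 1 in psql) into a
vacuum_progress_samples table; that table, and the time_delayed column exposed
by the v2 patch, are assumptions of this sketch:

```sql
-- Create an empty samples table matching the progress view (plus a timestamp);
-- time_delayed is the column added by the v2 patch (an assumption here).
CREATE TABLE vacuum_progress_samples AS
SELECT now() AS sample_time, p.pid, p.phase, p.time_delayed
FROM pg_stat_progress_vacuum p
WITH NO DATA;

-- Run once per second while the VACUUM is in progress (e.g. via \watch 1):
INSERT INTO vacuum_progress_samples
SELECT now(), p.pid, p.phase, p.time_delayed
FROM pg_stat_progress_vacuum p;

-- Per-phase cumulative time (relative to the first sample) and delayed time:
SELECT phase,
       max(sample_time) - (SELECT min(sample_time)
                           FROM vacuum_progress_samples) AS cumulative_time,
       max(time_delayed) AS time_delayed
FROM vacuum_progress_samples
GROUP BY phase
ORDER BY 2;
```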

I sampled active sessions from pg_stat_activity (one-second interval); here is
the summary during the vacuuming indexes phase (ordered by count):

leader_pid | pid | wait_event | count
------------+--------+----------------+-------
452996 | 453225 | VacuumDelay | 366
452996 | 453223 | VacuumDelay | 363
452996 | 453226 | VacuumDelay | 362
452996 | 453224 | VacuumDelay | 361
452996 | 453222 | VacuumDelay | 359
452996 | 453221 | VacuumDelay | 359
| 452996 | VacuumDelay | 331
| 452996 | CPU | 30
452996 | 453224 | WALWriteLock | 23
452996 | 453222 | WALWriteLock | 20
452996 | 453226 | WALWriteLock | 20
452996 | 453221 | WALWriteLock | 19
| 452996 | WalSync | 18
452996 | 453225 | WALWriteLock | 18
452996 | 453223 | WALWriteLock | 16
| 452996 | WALWriteLock | 15
452996 | 453221 | CPU | 14
452996 | 453222 | CPU | 14
452996 | 453223 | CPU | 12
452996 | 453224 | CPU | 10
452996 | 453226 | CPU | 10
452996 | 453225 | CPU | 8
452996 | 453223 | WalSync | 4
452996 | 453221 | WalSync | 2
452996 | 453226 | WalWrite | 2
452996 | 453221 | WalWrite | 1
| 452996 | ParallelFinish | 1
452996 | 453224 | WalSync | 1
452996 | 453225 | WalSync | 1
452996 | 453222 | WalWrite | 1
452996 | 453225 | WalWrite | 1
452996 | 453222 | WalSync | 1
452996 | 453226 | WalSync | 1
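
The sampling itself can be done with a simple query against pg_stat_activity
run once per second (for example via \watch 1 in psql). A sketch, where the
vacuum_samples table and the mapping of a NULL wait_event to 'CPU' are
assumptions of this sketch:

```sql
-- Collect one sample per backend involved in the VACUUM; a NULL wait_event
-- means the backend is on CPU, reported here as 'CPU' (an assumption).
INSERT INTO vacuum_samples (sample_time, leader_pid, pid, wait_event)
SELECT now(), leader_pid, pid, coalesce(wait_event, 'CPU')
FROM pg_stat_activity
WHERE backend_type IN ('client backend', 'parallel worker')
  AND query ILIKE 'VACUUM%';

-- Summary ordered by count, as shown above:
SELECT leader_pid, pid, wait_event, count(*)
FROM vacuum_samples
GROUP BY leader_pid, pid, wait_event
ORDER BY count(*) DESC;
```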

========== On master (v2 not applied)

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1322598.087 ms (22:02.598)

Surprisingly, it took about 3 minutes longer on master.

Let's see how the time breaks down:

Vacuum phase: cumulative time
=============================
scanning heap: 00:08:07.061196
vacuuming indexes: 00:17:50.961228
vacuuming heap: 00:22:02.561199

I sampled active sessions from pg_stat_activity (one-second interval); here is
the summary during the vacuuming indexes phase (ordered by count):

leader_pid | pid | wait_event | count
------------+--------+-------------------+-------
468682 | 468858 | VacuumDelay | 548
468682 | 468862 | VacuumDelay | 547
468682 | 468859 | VacuumDelay | 547
468682 | 468860 | VacuumDelay | 545
468682 | 468857 | VacuumDelay | 543
468682 | 468861 | VacuumDelay | 542
| 468682 | VacuumDelay | 378
| 468682 | ParallelFinish | 182
468682 | 468861 | WALWriteLock | 19
468682 | 468857 | WALWriteLock | 19
468682 | 468859 | WALWriteLock | 18
468682 | 468858 | WALWriteLock | 16
468682 | 468860 | WALWriteLock | 15
468682 | 468862 | WALWriteLock | 15
468682 | 468862 | CPU | 12
468682 | 468857 | CPU | 10
468682 | 468859 | CPU | 10
468682 | 468861 | CPU | 10
| 468682 | CPU | 9
468682 | 468860 | CPU | 9
468682 | 468860 | WalSync | 8
| 468682 | WALWriteLock | 7
468682 | 468858 | WalSync | 6
468682 | 468858 | CPU | 6
468682 | 468862 | WalSync | 3
468682 | 468857 | WalSync | 3
468682 | 468861 | WalSync | 3
468682 | 468859 | WalSync | 2
468682 | 468861 | WalWrite | 2
468682 | 468857 | WalWrite | 1
468682 | 468858 | WalWrite | 1
468682 | 468861 | WALBufMappingLock | 1
468682 | 468857 | WALBufMappingLock | 1
| 468682 | WALBufMappingLock | 1

====================== Observations ===========================================

As compared to v2:

1. scanning heap time is about the same
2. vacuuming indexes time is about 3 minutes longer on master
3. vacuuming heap time is about the same

One difference visible in the sampling is that on master the leader has been
sampled about 182 times in "ParallelFinish" (which could well account for _the_
3 minutes of interest).

On master the vacuuming indexes phase ran between 2024-06-13 10:11:34 and
2024-06-13 10:21:15. If I extract the exact minutes and the counts for the
"ParallelFinish" wait event, I get:

minute | wait_event | count
--------+----------------+-------
18 | ParallelFinish | 48
19 | ParallelFinish | 60
20 | ParallelFinish | 60
21 | ParallelFinish | 14

So it's likely that the leader waited on ParallelFinish for about 3 minutes at
the end of the vacuuming indexes phase (as this wait event appeared in
consecutive samples).
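
For reference, the per-minute extraction above can be done from the collected
samples with something like the following (the vacuum_samples table is an
assumption of this sketch):

```sql
-- Per-minute counts of the ParallelFinish wait event for the leader;
-- vacuum_samples is the hypothetical table fed by the sampling query.
SELECT extract(minute FROM sample_time) AS minute,
       wait_event,
       count(*)
FROM vacuum_samples
WHERE wait_event = 'ParallelFinish'
  AND pid = 468682  -- the leader pid
GROUP BY 1, 2
ORDER BY 1;
```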

====================== Conclusion =============================================

1. During the scanning heap and vacuuming heap phases, no noticeable performance
degradation has been observed with v2 applied (as compared to master) (cc'ing
Robert as it's also related to his question about a noticeable hit when everything
is in memory in [1]).

2. During the vacuuming indexes phase, v2 has been faster than master. The
reason is that on master the leader waited about 3 minutes on "ParallelFinish"
at the end.

====================== Remarks ================================================

Since v2 is attached, here is a summary of the current state of this thread:

1. v2 implements delay_time as the actual wait time (and not the intended wait
time as proposed in v1).

2. some measurements have been done to check the impact of this new
instrumentation (see this email and [2]): no noticeable performance degradation
has been observed (and, surprisingly, the opposite, as mentioned above).

3. there is an ongoing discussion about exposing the number of waits [2].

4. there is an ongoing discussion about exposing the effective cost limit [3].

5. it could be interesting to take a closer look at why the leader waits about
3 minutes on "ParallelFinish" on master but not with v2 applied (though that's
probably out of scope for this thread).

[1]: https://www.postgresql.org/message-id/CA%2BTgmoZiC%3DzeCDYuMpB%2BGb2yK%3DrTQCGMu0VoxehocKyHxr9Erg%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/ZmmOOPwMFIltkdsN%40ip-10-97-1-34.eu-west-3.compute.internal
[3]: https://www.postgresql.org/message-id/Zml9%2Bu37iS7DFkJL%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v2-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patch text/x-diff 5.6 KB
