Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: alvherre(at)alvh(dot)no-ip(dot)org, jeff(dot)janes(at)gmail(dot)com, mk(at)071(dot)ovh, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size
Date: 2021-07-29 07:20:38
Message-ID: 20210729.162038.534808353849568395.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

At Wed, 28 Jul 2021 11:38:28 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> This test is still unstable :-(
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-07-20%2012%3A46%3A11
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-20%2015%3A05%3A39
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-28%2014%3A33%3A01
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-28%2014%3A33%3A01
>
> These all look like
>
> # Failed test 'check that segments have been removed'
> # at t/019_replslot_limit.pl line 226.
> # got: '000000010000000000000020'
> # expected: '000000010000000000000024'
> # Looks like you failed 1 test of 16.
>
> with varying values mentioned. It looks to me like WAL file cleanup
> is not as synchronous with slot creation as the test thinks.
> Maybe it needs to loop until the oldest WAL file matches what it
> expects?

Sorry for the kludge.

Mmm. In the failure cases, directory scan(@16:52:22.036) runs before
the targetted checkpoint completes(@16:52:22.144).

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-07-28%2014%3A33%3A01
16:52:17.328 LOG: checkpoint starting: wal
16:52:19.140 LOG: invalidating slot "rep1" because its restart_lsn 0/1D00000 exceeds max_slot_wal_keep_size
16:52:19.316 019_replslot_limit.pl LOG: statement: SELECT pg_walfile_name(lsn) FROM pg_create_physical_replication_slot('s2', true)
16:52:22.036 019_replslot_limit.pl LOG: statement: SELECT pg_ls_dir AS f FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}$' ORDER BY 1 LIMIT
16:52:22.077 019_replslot_limit.pl LOG: statement: SELECT pg_drop_replication_slot('s2')
16:52:22.144 LOG: checkpoint complete: wrote 18 buffers (14.1%); 0 WAL file(s) added, 4 removed, 3 recycled; write=1.806 s, sync=0.001 s, total=4.817 s; sync files=0, longest=0.000 s, average=0.000 s; distance=3072 kB, estimate=3072 kB

The reason is the previous checkpoint completed after starting to
advance segments

> my $logstart = get_log_size($node_primary);
> advance_wal($node_primary, 7); !!!! another checkpoint runs/ends
...
16:52:19.140 # check for "invalidate slots" in log.
# check for the "lost" state in pg_replication_slots.
!! checkfor "checkpint complete" in log.
16:52:22.077 # read redo segment and oldest wal.
16:52:22.144 !! The target checkpoint ends.

So it seems to me we need to explicitly prevent unexpected checkpoints
from happening maybe by enlarging max_wal_size temporily.

I'll going that way.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Kämpf 2021-07-29 07:35:59 AW: AW: BUG #16858: clang10-devel packages are missing in SLES15 SP2 in general
Previous Message David Rowley 2021-07-29 05:44:07 Re: BUG #17127: drop column cann't delete from pg_attribute, so it will up to 1600 limits soon