From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
Cc: | jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com |
Subject: | Re: [HACKERS] Restricting maximum keep segments by repslots |
Date: | 2020-04-27 22:33:42 |
Message-ID: | 20200427223342.GA23152@alvherre.pgsql |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2020-Apr-08, Kyotaro Horiguchi wrote:
> At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
>
> Just avoiding starting replication when restart_lsn is invalid is
> sufficient (the attached, which is equivalent to a part of what the
> invalidated flag did). I thing that the error message needs a Hint but
> it looks on the subscriber side as:
>
> [22086] 2020-04-08 10:35:04.188 JST ERROR: could not receive data from WAL stream: ERROR: replication slot "s1" is invalidated
> HINT: The slot exceeds the limit by max_slot_wal_keep_size.
>
> I don't think it is not clean.. Perhaps the subscriber should remove
> the trailing line of the message from the publisher?
Thanks for the fix! I propose two changes:
1. reword the error like this:
ERROR: replication slot "regression_slot3" cannot be advanced
DETAIL: This slot has never previously reserved WAL, or has been invalidated
2. use the same error in one other place, to wit
pg_logical_slot_get_changes() and pg_replication_slot_advance(). I
made the DETAIL part the same in all places, but the ERROR line is
adjusted to what each callsite is doing.
I do think that this change in test_decoding is a bit unpleasant:
-ERROR: cannot use physical replication slot for logical decoding
+ERROR: cannot get changes from replication slot "repl"
The test is
-- check that we're detecting a streaming rep slot used for logical decoding
SELECT 'init' FROM pg_create_physical_replication_slot('repl');
SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> > On the other hand, physical replication doesn't break by invlidation.
> > [...]
>
> If we don't mind that standby can reconnect after a walsender
> termination due to the invalidation, we don't need to do something for
> this. Restricting max_slot_wal_keep_size to be larger than a certain
> threshold would reduce the chance we see that behavior.
Yeah, I think you're referring to the fact that StartReplication()
doesn't verify the restart_lsn of the slot; and if we do add a check, a
few tests that rely on physical replication start to fail. This patch
only adds a comment in that spot. But I don't (yet) know what the
consequences of this are, or whether it can be fixed by setting a valid
restart_lsn ahead of time. This test in pg_basebackup fails, for
example:
# Running: pg_basebackup -D /home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl -X stream -S slot1
pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR: cannot read from replication slot "slot1"
DETAIL: This slot has never previously reserved WAL, or has been invalidated
pg_basebackup: error: child process exited with exit code 1
pg_basebackup: removing data directory "/home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl"
not ok 95 - pg_basebackup -X stream with replication slot runs
# Failed test 'pg_basebackup -X stream with replication slot runs'
# at t/010_pg_basebackup.pl line 461.
Anyway I think the current patch can be applied as is -- and if we want
physical replication to have some other behavior, we can patch for that
afterwards.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-check-slot-restart_lsn-in-a-couple-of-places.patch | text/x-diff | 5.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2020-04-27 23:01:38 | Re: [BUG] non archived WAL removed during production crash recovery |
Previous Message | Darafei Komяpa Praliaskouski | 2020-04-27 18:48:47 | Re: Parallel GiST build on Cube |