From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | alvherre(at)2ndquadrant(dot)com |
Cc: | jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com |
Subject: | Re: [HACKERS] Restricting maximum keep segments by repslots |
Date: | 2020-04-07 07:30:43 |
Message-ID: | 20200407.163043.2050717072576572791.horikyota.ntt@gmail.com |
Lists: | pgsql-hackers |
At Tue, 07 Apr 2020 12:09:05 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > it seems to me that it suffices to check restart_lsn for being invalid
> > in the couple of places where the slot's owner advances (which is the
> > two auxiliary functions for ProcessStandbyReplyMessage). I have done so
> > in the attached. There are other places where the restart_lsn is set,
> > but those seem to be used only when the slot is created. I don't think
> > we need to cover for those, but I'm not 100% sure about that.
>
> StartLogicalReplication does
> "XLogBeginRead(,MyReplicationSlot->data.restart_lsn)". If the
> restart_lsn is invalid, following call to XLogReadRecord runs into
> assertion failure. Walsender (or StartLogicalReplication) should
> correctly reject reconnection from the subscriber if restart_lsn is
> invalid.
>
> > However, the change in PhysicalConfirmReceivedLocation() breaks
> > the way slots work for pg_basebackup: apparently the slot is created
> > with a restart_lsn of Invalid and we only advance it the first time we
> > process a feedback message from pg_basebackup. I have a vague feeling
> > that that's bogus, but I'll have to look at the involved code a little
> > bit more closely to be sure about this.
>
> Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot?
I did that in the attached. "invalidated" is a shared-but-not-saved
member of a slot (it lives in shared memory and is not written to
disk). It is initialized to false and irreversibly changed to true
when the slot loses a required segment.
It is checked by the new function CheckReplicationSlotInvalidated()
when a slot is acquired and when a slot is updated by a standby reply
message. This change stops the walsender without explicitly killing
it, but I didn't remove the code that kills it.
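To illustrate the idea only, here is a minimal standalone sketch of a one-way "invalidated" flag that is consulted on acquire and on advance. It is not the code from the attached patch: ToySlot and all toy_* functions are hypothetical names invented for this example, and LSNs are modeled as plain 64-bit integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a replication slot. */
typedef struct ToySlot
{
    const char *name;
    uint64_t    restart_lsn;    /* oldest WAL position the slot still needs */
    bool        invalidated;    /* in-memory only; never reset once set */
} ToySlot;

/*
 * Called when WAL older than oldest_kept_lsn has been removed.
 * The transition is one-way, and restart_lsn itself is left untouched.
 */
static void
toy_mark_if_segment_lost(ToySlot *slot, uint64_t oldest_kept_lsn)
{
    if (slot->restart_lsn < oldest_kept_lsn)
        slot->invalidated = true;
}

/* Common check; writes an error message and returns false if unusable. */
static bool
toy_check_slot_usable(const ToySlot *slot, char *errbuf, size_t errlen)
{
    if (slot->invalidated)
    {
        snprintf(errbuf, errlen,
                 "replication slot \"%s\" is invalidated", slot->name);
        return false;
    }
    return true;
}

/* Acquiring a slot performs the check first. */
static bool
toy_acquire(const ToySlot *slot, char *errbuf, size_t errlen)
{
    return toy_check_slot_usable(slot, errbuf, errlen);
}

/* Advancing the slot (as on a standby reply) performs the same check. */
static bool
toy_advance(ToySlot *slot, uint64_t lsn, char *errbuf, size_t errlen)
{
    if (!toy_check_slot_usable(slot, errbuf, errlen))
        return false;
    if (lsn > slot->restart_lsn)
        slot->restart_lsn = lsn;
    return true;
}
```

In this model, once toy_mark_if_segment_lost() has set the flag, both toy_acquire() and toy_advance() fail with the same message, and restart_lsn keeps whatever value it had before.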
When a logical slot loses a required segment, the publisher complains
as follows:
[backend ] LOG: slot "s1" is invalidated at 0/370001C0 due to exceeding max_slot_wal_keep_size
[walsender] FATAL: terminating connection due to administrator command
The subscriber tries to reconnect and that fails as follows:
[19350] ERROR: replication slot "s1" is invalidated
[19352] ERROR: replication slot "s1" is invalidated
...
If the publisher is restarted, that message is no longer seen and the
following appears instead:
[19372] ERROR: requested WAL segment 000000010000000000000037 has already been removed
The check is done in ReplicationSlotAcquire, so some slot-related SQL
functions are affected as well.
=# select pg_replication_slot_advance('s1', '0/37000000');
ERROR: replication slot "s1" is invalidated
After restarting the publisher, the message changes in the same way as
for the walsender.
=# select pg_replication_slot_advance('s1', '0/380001C0');
ERROR: requested WAL segment pg_wal/000000010000000000000037 has already been removed
Since I didn't touch restart_lsn at all, there is no fear of
inadvertently changing other behavior.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment | Content-Type | Size |
---|---|---|
0001-further-change-type-2.patch | text/x-patch | 6.5 KB |