From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | alvherre(at)2ndquadrant(dot)com |
Cc: | jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com |
Subject: | Re: [HACKERS] Restricting maximum keep segments by repslots |
Date: | 2020-04-08 05:19:56 |
Message-ID: | 20200408.141956.891237856186513376.horikyota.ntt@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > I pushed version 26, with a few further adjustments.
> >
> > I think what we have now is sufficient, but if you want to attempt this
> > "invalidated" flag on top of what I pushed, be my guest.
>
> I don't think the invalidation flag is essential but it can prevent
> unanticipated behavior, in other words, it makes us feel at ease:p
>
> After the current master/HEAD, the following steps causes assertion
> failure in xlogreader.c.
..
> I will look at it.
Just avoiding starting replication when restart_lsn is invalid is
sufficient (the attached, which is equivalent to a part of what the
invalidated flag did). I thing that the error message needs a Hint but
it looks on the subscriber side as:
[22086] 2020-04-08 10:35:04.188 JST ERROR: could not receive data from WAL stream: ERROR: replication slot "s1" is invalidated
HINT: The slot exceeds the limit by max_slot_wal_keep_size.
I don't think it is not clean.. Perhaps the subscriber should remove
the trailing line of the message from the publisher?
> On the other hand, physical replication doesn't break by invlidation.
>
> Primary: postgres.conf
> max_slot_wal_keep_size=0
> Standby: postgres.conf
> primary_conninfo='connect to master'
> primary_slot_name='x1'
>
> (start the primary)
> P=> select pg_create_physical_replication_slot('x1');
> (start the standby)
> S=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint;
If we don't mind that standby can reconnect after a walsender
termination due to the invalidation, we don't need to do something for
this. Restricting max_slot_wal_keep_size to be larger than a certain
threshold would reduce the chance we see that behavior.
I saw another issue, the following sequence on the primary freezes
when invalidation happens.
=# create table tt(); drop table tt; select pg_switch_wal();create table tt(); drop table tt; select pg_switch_wal();create table tt(); drop table tt; select pg_switch_wal(); checkpoint;
The last checkpoint command is waiting for CV on
CheckpointerShmem->start_cv in RequestCheckpoint(), while Checkpointer
is waiting for the next latch at the end of
CheckpointerMain. new_started doesn't move but it is the same value
with old_started.
That freeze didn't happen when I removed
ConditionVariableSleep(&s->active_cv) in
InvalidateObsoleteReplicationSlots.
I continue investigating it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment | Content-Type | Size |
---|---|---|
0001-walsender-crash-fix.patch | text/x-patch | 1.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2020-04-08 05:43:57 | Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch) |
Previous Message | Fujii Masao | 2020-04-08 05:15:38 | Re: backup manifests |