Re: Conflict detection for update_deleted in logical replication

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Conflict detection for update_deleted in logical replication
Date: 2025-01-10 00:42:39
Message-ID: CAD21AoDUSd4YnyqCYhF9rrdcnMiqMmPjOhXK_hZF=c4WjP5xXQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 8, 2025 at 7:26 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Thursday, January 9, 2025 9:48 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> Hi,
>
> >
> > On Wed, Jan 8, 2025 at 3:00 AM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> > wrote:
> > >
> > > On Wednesday, January 8, 2025 6:33 PM Masahiko Sawada
> > <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > Hi,
> > >
> > > > On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > > > wrote:
> > > > > On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada
> > > > <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond
> > > > <nisha(dot)moond412(at)gmail(dot)com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > [3] Test with pgbench run on both publisher and subscriber.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Test setup:
> > > > > > >
> > > > > > > - Tests performed on pgHead + v16 patches
> > > > > > >
> > > > > > > - Created a pub-sub replication system.
> > > > > > >
> > > > > > > - Parameters for both instances were:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > shared_buffers = 30GB
> > > > > > >
> > > > > > > min_wal_size = 10GB
> > > > > > >
> > > > > > > max_wal_size = 20GB
> > > > > > >
> > > > > > > autovacuum = false
> > > > > >
> > > > > > Since you disabled autovacuum on the subscriber, dead tuples
> > > > > > created by non-hot updates accumulate anyway regardless of the
> > > > > > detect_update_deleted setting, is that right?
> > > > > >
> > > > >
> > > > > I think hot-pruning mechanism during the update operation will
> > > > > remove dead tuples even when autovacuum is disabled.
> > > >
> > > > True, but why was autovacuum disabled? It seems that
> > > > case1-2_setup.sh doesn't specify fillfactor, which makes hot updates
> > > > less likely to happen.
> > >
> > > IIUC, we disable autovacuum as a general practice in read-write tests
> > > for stable TPS numbers.
> >
> > Okay. TBH I'm not sure what we can say with these results. At a glance, in a
> > typical bi-directional-like setup, we can interpret these results as showing
> > that if users turn retain_conflict_info on, the TPS goes down by 50%. But I'm
> > not sure this 50% dip is the worst case users could face. It could be better
> > in practice thanks to autovacuum, or it could go even worse due to further
> > bloat if we run the test longer.
>
> I think it shouldn't get worse, because ideally the amount of bloat would not
> increase beyond what we see here due to this patch unless there is some
> misconfiguration that leads to one of the nodes not working properly (say it
> is down). However, my colleague is running longer tests and we will share the
> results soon.
>
> > Suppose that users see a 50% performance dip due to dead tuple retention for
> > update_deleted detection, is there any way for them to improve the situation?
> > For example, trying to advance slot.xmin more frequently might help to reduce
> > dead tuple accumulation. I think it would be good if we could have a way to
> > balance the publisher performance against the subscriber performance.
>
> AFAICS, most of the time in each xid advancement is spent waiting for the
> target remote_lsn to be applied and flushed, so increasing the frequency would
> not help. Test case 4 shared by Nisha[1] supports this: in that test, we do
> not request a remote_lsn but simply wait for the commit_ts of the incoming
> transaction to exceed the candidate_xid_time, and the regression is still the
> same.

True, but I think what matters is not only asking the publisher for its
status more frequently, but also having the apply worker frequently try
to advance RetainConflictInfoPhase and the launcher frequently try to
advance the slot.xmin.

> I think it indicates that we indeed need to wait for this amount of time
> before applying all the transactions that have an earlier commit timestamp.
> IOW, the performance impact on the subscriber side is reasonable behavior if
> we want to detect the update_deleted conflict reliably.

It's reasonable behavior for this approach, but it might not be a
reasonable outcome for users if they can be affected by such a
performance dip with no way to avoid it.

To look closely at what is happening in the apply worker and the
launcher, I did a quick test with the same setup, running pgbench with
30 clients against each of the publisher and subscriber (on different
pgbench tables so conflicts don't happen on the subscriber), and I
recorded how often the worker and the launcher tried to update the
worker's xmin and the slot's xmin, respectively. During the 120-second
test I observed that the apply worker advanced its
oldest_nonremovable_xid 10 times out of 43 attempts and the launcher
advanced the slot's xmin 5 times out of 20 attempts, which seems rather
infrequent. And there seems to be no way for users to increase these
frequencies. Actually, these XID advancements happened only early in
the test, and in the later part there was almost no attempt to advance
XIDs (I describe the reasons below). Therefore, after the 120-second
test, the slot's xmin was 2366291 XIDs behind (TPS on the publisher and
subscriber were 15728 and 18052, respectively).

I think there are 3 things we need to deal with:

1. The launcher could still be sleeping even after the worker updates
its oldest_nonremovable_xid. We compute the launcher's sleep time by
doubling it, with a 3 min maximum. When I started the test, the
launcher had already entered the 3 min sleep, and it took a long time
to advance the slot.xmin for the first time. I think we can improve
this situation by having the worker send a signal to the launcher after
updating its oldest_nonremovable_xid, so that the launcher can quickly
update the slot.xmin, along the lines of the sketch below.
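
A minimal sketch of that idea (not the patch's actual code): it assumes
oldest_nonremovable_xid lives in the worker's shared LogicalRepWorker
entry, and that launcher.c exposes a wakeup helper, here called
ApplyLauncherWakeup() purely for illustration:

static void
worker_advance_nonremovable_xid(TransactionId new_xid)
{
    /*
     * Publish the new value in the worker's shared entry (reusing the
     * existing relmutex spinlock only to keep this sketch short).
     */
    SpinLockAcquire(&MyLogicalRepWorker->relmutex);
    MyLogicalRepWorker->oldest_nonremovable_xid = new_xid;
    SpinLockRelease(&MyLogicalRepWorker->relmutex);

    /*
     * Wake the launcher right away; otherwise it may sleep for up to
     * 3 min before it recomputes the slot.xmin from the workers' values.
     * ApplyLauncherWakeup() stands in for whatever helper we'd export
     * from launcher.c to signal the launcher process.
     */
    ApplyLauncherWakeup();
}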

2. The apply worker doesn't advance RetainConflictInfoPhase from the
RCI_WAIT_FOR_LOCAL_FLUSH phase when it's busy. Regarding the phase
transition from RCI_WAIT_FOR_LOCAL_FLUSH to RCI_GET_CANDIDATE_XID, we
rely on calling maybe_advance_nonremovable_xid() (1) right after
transitioning to the RCI_WAIT_FOR_LOCAL_FLUSH phase, (2) after
receiving a 'k' message, and (3) when there is no available incoming
data. If we miss opportunity (1) (because we still need to wait for the
local flush), we effectively need to consume all available data before
maybe_advance_nonremovable_xid() is called again (note that the
publisher doesn't need to send 'k' (keepalive) messages if it sends
data frequently). In the test, since I ran pgbench with 30 clients on
the publisher and there were therefore some apply delays, the apply
worker took 25 min to get out of the inner apply loop in
LogicalRepApplyLoop() and advance its oldest_nonremovable_xid. I think
we need to consider having more opportunities to check the local flush
LSN, for example along the lines of the sketch below.
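
One rough option (a sketch only, with the inner loop simplified; the
maybe_advance_nonremovable_xid() call, the rci_data variable, and the
RCI_CHECK_EVERY_N_MSGS threshold are schematic, not real names from the
patch) would be to re-check periodically inside the inner receive loop
of LogicalRepApplyLoop() instead of only when the stream goes idle or a
keepalive arrives:

/* Illustrative value only: how often to re-check while busy. */
#define RCI_CHECK_EVERY_N_MSGS 1000

int         nmsgs = 0;

while ((len = walrcv_receive(LogRepWorkerWalRcvConn, &buf, &fd)) > 0)
{
    /* ... existing message handling / apply_dispatch() ... */

    /*
     * Give the RCI state machine a chance even while data keeps
     * arriving, so a busy worker can notice that the local flush
     * position has passed the candidate LSN without first draining
     * all pending data.
     */
    if (++nmsgs % RCI_CHECK_EVERY_N_MSGS == 0)
        maybe_advance_nonremovable_xid(&rci_data, false);
}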

3. If the apply worker cannot catch up, it could enter a bad loop: the
publisher sends a huge amount of data -> the apply worker cannot catch
up -> it needs to wait longer to advance its oldest_nonremovable_xid ->
more garbage accumulates, which makes the apply even slower ->
(looping). I'm not sure how to deal with this point, TBH. We might be
able to avoid entering this bad loop once we resolve the other two
points.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
