RE: Conflict detection for update_deleted in logical replication

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Masahiko Sawada' <sawada(dot)mshk(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Conflict detection for update_deleted in logical replication
Date: 2025-01-10 11:42:54
Message-ID: OSCPR01MB149664F835452B3130BFDE3D4F51C2@OSCPR01MB14966.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear Sawada-san,

Thanks for the comments. I've created top-up patches to address them.

> 1. The launcher could still be sleeping even after the worker updates
> its oldest_nonremovable_xid. We compute the launcher's sleep time by
> doubling the sleep time with 3min maximum time. When I started the
> test, the launcher already entered 3min sleep, and it took a long time
> to advance the slot.xmin for the first time. I think we can improve
> this situation by having the worker send a signal to the launcher
> after updating the worker's oldest_nonremovable_xid so that it can
> quickly update the slot.xmin.

Done in 0006. The worker now sends a signal to the launcher whenever its
oldest_nonremovable_xid is updated. Also, for testing purposes, the maximum nap
time is shortened to 10s when retain_conflict_info is enabled. This value can be
tuned based on the results.
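
To illustrate the nap-time behaviour discussed above, here is a minimal sketch in plain C. It is not code from 0006: the names next_nap_ms, MIN_NAP_MS and MAX_NAP_MS are hypothetical, the 1s minimum is an assumption, and resetting the nap to the minimum when a worker signals is also an assumption about how the wake-up might be used.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants for illustration; not taken from the patch. */
#define MIN_NAP_MS  1000L   /* assumed shortest nap between checks */
#define MAX_NAP_MS  10000L  /* the 10s cap mentioned above for testing */

/*
 * Compute the launcher's next nap time.
 *
 * If a worker signalled that its oldest_nonremovable_xid changed (or the
 * slot's xmin advanced in this cycle), restart from the minimum nap.
 * Otherwise double the previous nap, clamped to the maximum.
 */
long
next_nap_ms(long current_nap_ms, bool progress_or_signal)
{
    assert(current_nap_ms >= MIN_NAP_MS && current_nap_ms <= MAX_NAP_MS);

    if (progress_or_signal)
        return MIN_NAP_MS;

    /* Compare against half the cap so the doubling never overshoots it. */
    if (current_nap_ms > MAX_NAP_MS / 2)
        return MAX_NAP_MS;

    return current_nap_ms * 2;
}
```

With a 3min cap and no wake-up, a launcher that has already reached the cap would react to a worker's update only after the full nap, which matches the delay described in the quoted comment.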

> 2. The apply worker doesn't advance RetainConflictInfoPhase from the
> RCI_WAIT_FOR_LOCAL_FLUSH phase when it's busy. Regarding the phase
> transition from RCI_WAIT_FOR_LOCAL_FLUSH to RCI_GET_CANDIDATE_XID, we
> rely on calling maybe_advance_nonremovable_xid() (1) right after
> transitioning to RCI_WAIT_FOR_LOCAL_FLUSH phase, (2) after receiving
> 'k' message, and (3) there is no available incoming data. If we miss
> (1) opportunity (because we still need to wait for the local flush),
> we effectively need to consume all available data to call
> maybe_advance_nonremovable_xid() (note that the publisher doesn't need
> to send 'k' (keepalive) message if it sends data frequently). In the
> test, since I ran pgbench with 30 clients on the publisher and
> therefore there were some apply delays, the apply worker took 25 min
> to get out the inner apply loop in LogicalRepApplyLoop() and advance
> its oldest_nonremovable_xid. I think we need to consider having more
> opportunities to check the local flush LSN.

Done in 0007. The worker can now advance its state machine even when it is busy.
The latest flush position is now updated in wait_for_local_flush() as well.
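
The difference between checking only when idle and checking while busy can be shown with a toy model. This is a sketch under stated assumptions, not the code in 0007: only the two phases named in this thread are modelled, try_advance is a simplified stand-in for maybe_advance_nonremovable_xid(), and WorkerState, Lsn and apply_busy_stream are hypothetical names. It also assumes every applied change is flushed locally at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for XLogRecPtr */

/* Only the two phases named in this thread are modelled. */
typedef enum
{
    RCI_GET_CANDIDATE_XID,
    RCI_WAIT_FOR_LOCAL_FLUSH
} Phase;

typedef struct
{
    Phase   phase;
    Lsn     remote_lsn;         /* position that must be flushed locally */
    Lsn     flushed_lsn;        /* last known local flush position */
} WorkerState;

/* Simplified stand-in for maybe_advance_nonremovable_xid(). */
void
try_advance(WorkerState *w, Lsn latest_flush)
{
    /* Refresh the flush position every time the check runs. */
    w->flushed_lsn = latest_flush;

    if (w->phase == RCI_WAIT_FOR_LOCAL_FLUSH &&
        w->flushed_lsn >= w->remote_lsn)
        w->phase = RCI_GET_CANDIDATE_XID;
}

/*
 * Apply nchanges changes without ever going idle.  Returns the number of
 * changes applied when the phase left RCI_WAIT_FOR_LOCAL_FLUSH, or -1 if
 * it never did.
 */
int
apply_busy_stream(WorkerState *w, int nchanges, bool check_while_busy)
{
    int     advanced_at = -1;

    for (int i = 1; i <= nchanges; i++)
    {
        /* Assumption: change i is flushed locally as soon as it is applied. */
        if (check_while_busy)
        {
            try_advance(w, (Lsn) i);
            if (advanced_at < 0 && w->phase == RCI_GET_CANDIDATE_XID)
                advanced_at = i;
        }
    }

    if (!check_while_busy)
    {
        /* Old behaviour: the check happens only after the input drains. */
        try_advance(w, (Lsn) nchanges);
        if (w->phase == RCI_GET_CANDIDATE_XID)
            advanced_at = nchanges;
    }

    return advanced_at;
}
```

In this model, a worker waiting for position 10 advances after 10 changes when it checks while busy, but only after the whole stream when it checks on idle.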

> 3. If the apply worker cannot catch up, it could enter to a bad loop;
> the publisher sends huge amount of data -> the apply worker cannot
> catch up -> it needs to wait for a longer time to advance its
> oldest_nonremovable_xid -> more garbage are accumulated and then have
> the apply more slow -> (looping). I'm not sure how to deal with this
> point TBH. We might be able to avoid entering this bad loop once we
> resolve the other two points.

I hope this issue is now fixed, because the worker can wait for the local flush
even while it is busy.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment Content-Type Size
v21-0001-Maintain-the-oldest-non-removeable-tranasction-I.patch application/octet-stream 40.5 KB
v21-0002-Maintain-the-replication-slot-in-logical-launche.patch application/octet-stream 20.0 KB
v21-0003-Add-a-retain_conflict_info-option-to-subscriptio.patch application/octet-stream 79.8 KB
v21-0004-Add-a-tap-test-to-verify-the-management-of-the-n.patch application/octet-stream 6.6 KB
v21-0005-Support-the-conflict-detection-for-update_delete.patch application/octet-stream 25.7 KB
v21-0006-Make-launcher-wake-up-more-frequently.patch application/octet-stream 3.7 KB
v21-0007-Update-flush-location-more-frequently.patch application/octet-stream 2.5 KB
