Re: speed up a logical replica setup

From: "Euler Taveira" <euler(at)eulerto(dot)com>
To: "Alexander Lakhin" <exclusion(at)gmail(dot)com>, "Amit Kapila" <amit(dot)kapila16(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: speed up a logical replica setup
Date: 2024-07-11 23:24:36
Message-ID: b1f0f8c7-8f01-4950-af77-339df3dc4684@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 11, 2024, at 2:00 PM, Alexander Lakhin wrote:
> Hello Amit and Hou-San,
>
> 11.07.2024 13:21, Amit Kapila wrote:
> >
> > We don't wait for the xmin to catch up corresponding to this insert
> > and I don't know if there is a way to do that. So, we should move this
> > Insert to after the call to pg_sync_replication_slots(). It won't
> > impact the general test of pg_createsubscriber.
> >
> > Thanks to Hou-San for helping me in the analysis of this BF failure.
>
> Thank you for investigating that issue!
>
> May I ask you to look at another failure of the test occurred today [1]?
>

Thanks for the report!

You are observing the same issue that Amit explained in [1]. The
pg_create_logical_replication_slot returns the EndRecPtr (see
slot->data.confirmed_flush in DecodingContextFindStartpoint()). EndRecPtr points
to the next record and it is a future position for an idle server. That's why
the recovery takes some time to finish because it is waiting for an activity to
increase the LSN position. Since you modified LOG_SNAPSHOT_INTERVAL_MS to create
additional WAL records soon, the EndRecPtr position is reached rapidly and the
recovery ends quickly.

Hayato proposes a patch [2] to create an additional WAL record that has the same
effect from you little hack: increase the LSN position to allow the recovery
finishes soon. I don't like the solution although it seems simple to implement.
As Amit said if we know the ReadRecPtr, we could use it as consistent LSN. The
problem is that it is used by logical decoding but it is not exposed. [reading
the code...] When the logical replication slot is created, restart_lsn points to
the lastReplayedEndRecPtr (see ReplicationSlotReserveWal()) that is the last
record replayed. Since the replication slots aren't in use, we could use the
restart_lsn from the last replication slot as a consistent LSN.

I'm attaching a patch that implements it.It runs in 6s instead of 26s.

[1] https://www.postgresql.org/message-id/CAA4eK1%2Bp%2B7Ag6nqdFRdqowK1EmJ6bG-MtZQ_54dnFBi%3D_OO5RQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/OSBPR01MB25521B15BF950D2523BBE143F5D32%40OSBPR01MB2552.jpnprd01.prod.outlook.com

--
Euler Taveira
EDB https://www.enterprisedb.com/

Attachment Content-Type Size
0001-pg_createsubscriber-fix-slow-recovery.patch text/x-patch 5.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2024-07-11 23:52:09 Re: Use read streams in CREATE DATABASE command when the strategy is wal_log
Previous Message Paul George 2024-07-11 21:50:35 Re: Eager aggregation, take 3