RE: Logical replication timeout

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Shlok Kyal' <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, RECHTÉ Marc <marc(dot)rechte(at)meteo(dot)fr>
Subject: RE: Logical replication timeout
Date: 2024-12-23 12:29:28
Message-ID: OSCPR01MB14966B646506E0C9B81B3A4CFF5022@OSCPR01MB14966.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Shlok,

>
> Thanks for sharing the analysis.
>
> I tested the patch on my machine as well and it has worse performance
> for me as well.
> I came up with an alternate approach. In this approach we keep track
> of wal segment the transaction is part of. This helps to iterate
> through only required files during clean up.
>
> On my machine, I am running the testcase provided by you in [1]. It is
> generating ~1.9 million spill files. For me the transaction completed
> in 56sec.
> Cleanup (deletion of spill files) took around following time:
> With HEAD : ~ 5min
> With latest patch (attached here) : ~2min
>
> Can you test if this improves performance for you?

I'm also not sure the performance, but I can post my comments.
I'm not sure your patch can properly handle the list operations.

```
+ oldcontext = MemoryContextSwitchTo(rb->context);
+ txn->walsgmts = lappend(txn->walsgmts, curOpenSegNo);
+ MemoryContextSwitchTo(oldcontext);
+
```

IIUC lappend() accepts a point of a Datum, but here a normal value is passed.
Should we define a new struct which represents a node of list and append it
after it is palloc()'d?
Or your code is enough for some reasons?

```
/* iterate over all possible filenames, and delete them */
- for (cur = first; cur <= last; cur++)
+ foreach(cell, txn->walsgmts)
{
+ XLogSegNo curr_segno = (XLogSegNo) lfirst(cell);
char path[MAXPGPATH];

- ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, cur);
+ ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, curr_segno);
if (unlink(path) != 0 && errno != ENOENT)
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+
+ if(txn->walsgmts != NIL)
+ {
+ pfree(txn->walsgmts);
+ txn->walsgmts = NIL;
+ }
```

If above comment is accepted, I feel you can use foreach_delete_current().

=======

Also, even when we optimize the truncation of files, there is a possibility that
replication is timed out. Can you also create a patch which implements [1]?

[1]: https://www.postgresql.org/message-id/CAExHW5s2_T9mULDQRKsdV72wpnA%2BNLT63cX51b51QQVEV4sG5g%40mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Lakhin 2024-12-23 13:00:00 Re: Make tuple deformation faster
Previous Message Vladlen Popolitov 2024-12-23 12:16:50 Re: Exporting float_to_shortest_decimal_buf(n) with Postgres 17 on Windows