From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>, Andres Freund <andres(at)anarazel(dot)de>, Dmitry Koterov <dmitry(dot)koterov(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hash_search_with_hash_value is high in "perf top" on a replica
Date: 2025-02-10 08:45:05
Message-ID: CAKZiRmwxSKOkvfh4CYU6JDjfkhV62Uu+5KtW60yEcKT2+HU+3Q@mail.gmail.com
Lists: pgsql-hackers
Hi Thomas!
On Tue, Feb 4, 2025 at 10:22 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Sun, Feb 2, 2025 at 3:44 AM Ants Aasma <ants(dot)aasma(at)cybertec(dot)at> wrote:
> > The other direction is to split off WAL decoding, buffer lookup and maybe even pinning to a separate process from the main redo loop.
>
> Hi Ants,
>
[..]
> An assumption I just made, in remembering all that: OP didn't
> mention it but I guess that this COPY replay is probably repeatedly
> hammering the same pages from separate records here, because otherwise
> the multi-insert stuff would already avoid a lot of mapping table
> lookups?
A basic COPY (he mentioned a plain CREATE SUBSCRIPTION table copy, so I
assume the table is freshly loaded) ends up emitting records like this
(note that "blk" keeps increasing):
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn: 0/040069A0, prev 0/04004950, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16393 blk 5
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn: 0/040077F8, prev 0/040069A0, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16393 blk 6
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 759, lsn: 0/04008668, prev 0/040077F8, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16393 blk 7
rmgr: Heap2 len (rec/tot): 3122/ 3122, tx: 759, lsn: 0/040094C0, prev 0/04008668, desc: MULTI_INSERT+INIT ntuples: 192, flags: 0x02, blkref #0: rel 1663/5/16393 blk 8
Now, if the table has a PK, we end up with a *massive* number of
INSERT_LEAF records, because there is no batched btree WAL emission
like in the heap2 case:
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 763, lsn: 0/05000028, prev 0/041C8198, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16396 blk 0
rmgr: Heap2 len (rec/tot): 3666/ 3666, tx: 763, lsn: 0/05000E80, prev 0/05000028, desc: MULTI_INSERT+INIT ntuples: 226, flags: 0x00, blkref #0: rel 1663/5/16396 blk 1
[..]
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004050, prev 0/05003FD8, desc: INSERT_LEAF off: 1, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004090, prev 0/05004050, desc: INSERT_LEAF off: 2, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/050040D0, prev 0/05004090, desc: INSERT_LEAF off: 3, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004110, prev 0/050040D0, desc: INSERT_LEAF off: 4, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004150, prev 0/05004110, desc: INSERT_LEAF off: 5, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004190, prev 0/05004150, desc: INSERT_LEAF off: 6, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/050041D0, prev 0/05004190, desc: INSERT_LEAF off: 7, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004210, prev 0/050041D0, desc: INSERT_LEAF off: 8, blkref #0: rel 1663/5/16398 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 763, lsn: 0/05004250, prev 0/05004210, desc: INSERT_LEAF off: 9, blkref #0: rel 1663/5/16398 blk 1
[..]
So for the same 226 tuples, a single ~3.6 kB MULTI_INSERT record on the
heap side turns into roughly one ~64-byte INSERT_LEAF record per tuple
on the index side, each of which needs its own buffer-mapping lookup
during replay. I don't know exactly which situation Dmitry has hit, but
the second scenario would be much easier to optimize. This of course
reminds me of your earlier work on the recent_buffer optimization, and
Andres mentioned some form of LRU cache to take pressure off the
buffer-mapping hash table lookups. I was also under the impression that
Bharath's work [0] could help here by lowering the number of WAL
records emitted.
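To make that recent-buffer idea concrete, here is a tiny self-contained
toy in C (not PostgreSQL code; all names and types are made up for
illustration): remember the buffer the previous lookup resolved to and
re-check its tag before falling back to the expensive buffer-mapping
hash probe. In an INSERT_LEAF stream like the one above, consecutive
records keep hitting the same block, so the fast path would hit almost
every time:

#include <stdint.h>
#include <stdio.h>

typedef struct BufferTag { uint32_t relfilenode; uint32_t blocknum; } BufferTag;

#define NBUFFERS 128
static BufferTag buffer_tags[NBUFFERS];   /* stand-in for buffer descriptors */

/* Stand-in for the shared buffer-mapping hash table lookup. */
static int expensive_hash_lookup(BufferTag tag)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (buffer_tags[i].relfilenode == tag.relfilenode &&
            buffer_tags[i].blocknum == tag.blocknum)
            return i;
    return -1;
}

/* One-slot cache kept by the caller (e.g. the redo loop). */
static int recent_buf = -1;

static int lookup_with_recent_buffer(BufferTag tag)
{
    /* Fast path: the previously returned buffer still holds the same page,
     * so skip the hash lookup entirely.  Real code would also have to pin
     * the buffer and revalidate the tag under the proper locks; this toy
     * ignores all of that. */
    if (recent_buf >= 0 &&
        buffer_tags[recent_buf].relfilenode == tag.relfilenode &&
        buffer_tags[recent_buf].blocknum == tag.blocknum)
        return recent_buf;

    recent_buf = expensive_hash_lookup(tag);
    return recent_buf;
}

int main(void)
{
    /* Fake buffer pool: relfilenode 16398, blocks 0..127. */
    for (int i = 0; i < NBUFFERS; i++)
        buffer_tags[i] = (BufferTag){ .relfilenode = 16398, .blocknum = (uint32_t) i };

    /* Replay-like pattern: consecutive INSERT_LEAF records all hitting blk 1,
     * so only the first call pays for the hash probe. */
    BufferTag tag = { .relfilenode = 16398, .blocknum = 1 };
    for (int i = 0; i < 9; i++)
        printf("record %d -> buffer %d\n", i + 1, lookup_with_recent_buffer(tag));
    return 0;
}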
@Dmitry: so if you're reading this, it has been a known problem for a
while (a single-process replay performance ceiling, see [1]), and there
are a myriad of possible long-term solutions in the code. Outside of
hacking on PG itself, about the only thing you can do is split the big
table on the publisher into a number of smaller partitions (but not too
many), and then let COPY run on those smaller tables with some delays
in between so that replication lag doesn't grow too much. The crudest
alternative is simply getting the fastest possible cores, since this is
a single-threaded bottleneck. As Andres noted, you could also try
recompiling with a better -march flag on that ARM machine and see how
much that helps.
-J.
[0] - https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR=gdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ(at)mail(dot)gmail(dot)com
[1] - https://www.postgresql.org/message-id/VI1PR0701MB69608CBCE44D80857E59572EF6CA0%40VI1PR0701MB6960.eurprd07.prod.outlook.com