RE: Use simplehash.h instead of dynahash in SMgr

From: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: RE: Use simplehash.h instead of dynahash in SMgr
Date: 2021-05-05 14:05:43
Message-ID: VI1PR0701MB696001353B6E41166E43BEC8F6599@VI1PR0701MB6960.eurprd07.prod.outlook.com
Lists: pgsql-hackers

Hey David,

> I think you'd have to batch by filenode and transaction in that case. Each batch might be pretty small on a typical OLTP workload, so it might not help much there, or it might hinder.

True, it is very workload dependent. I was chasing mainly multi-value INSERTs and INSERT-SELECT, which often hit the same $block, so certainly not OLTP. I would even say that INSERT-as-SELECT is more suited to DWH-like processing.

> But having said that, I don't think any of those possibilities should stop us speeding up smgropen().

Of course! I've tried a couple of much smaller ideas, but without any big gains. I was able to squeeze out something like 300-400k function calls per second (WAL records/s); that was the point where I think smgropen() got abused.

> > Another potential option that we've discussed is that the redo generation
> > itself is likely a brake of efficient recovery performance today (e.g.
> > INSERT-SELECT on table with indexes, generates interleaved WAL records that
> > touch often limited set of blocks that usually put Smgr into spotlight).
>
> I'm not quite sure if I understand what you mean here. Is this queuing up
> WAL records up during transactions and flush them out to WAL every so
> often after rearranging them into an order that's more optimal for replay?

Why not both? 😉 We were very much concentrated on the standby side, but on the primary side one could also change how WAL records are generated:

1) Minimization of records aimed at the same repeated $block, e.g. the Heap2 table_multi_insert() API already does this, and it matters for generating a more optimal stream for replay:

postgres(at)test=# create table t (id bigint primary key);
postgres(at)test=# insert into t select generate_series(1, 10);

results in many calls, due to interleaving heap with btree records for the same blocks, from the SMgr perspective (this is especially visible on heavily indexed tables) =>
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243284, lsn: 4/E7000108, prev 4/E70000A0, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap len (rec/tot): 63/ 63, tx: 17243284, lsn: 4/E7000148, prev 4/E7000108, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243284, lsn: 4/E7000188, prev 4/E7000148, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap len (rec/tot): 63/ 63, tx: 17243284, lsn: 4/E70001C8, prev 4/E7000188, desc: INSERT off 3 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243284, lsn: 4/E7000208, prev 4/E70001C8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap len (rec/tot): 63/ 63, tx: 17243284, lsn: 4/E7000248, prev 4/E7000208, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243284, lsn: 4/E7000288, prev 4/E7000248, desc: INSERT_LEAF off 4, blkref #0: rel 1663/16384/32794 blk 1
rmgr: Heap len (rec/tot): 63/ 63, tx: 17243284, lsn: 4/E70002C8, prev 4/E7000288, desc: INSERT off 5 flags 0x00, blkref #0: rel 1663/16384/32791 blk 0
[..]
Similar stuff happens for UPDATE. It basically prevents the recent-buffer optimization that avoids repeated calls to smgropen().
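
To make that concrete, here is a minimal, self-contained C sketch. All names in it are made up for illustration; this is not PostgreSQL code, and as far as I understand the real recent-buffer work passes buffer hints inside the WAL records rather than caching lookups like this. The toy one-entry cache only shows the access pattern: the interleaved stream above misses on every record, while a batched stream like the MULTI_INSERT one below mostly hits.

#include <stdio.h>

typedef struct RelFileNode { unsigned spc, db, rel; } RelFileNode;

static unsigned long slow_lookups = 0;  /* stands in for real smgropen() hash probes */
static RelFileNode last_rnode;
static int have_last = 0;

static void
open_with_one_entry_cache(RelFileNode rnode)
{
    if (have_last &&
        last_rnode.spc == rnode.spc &&
        last_rnode.db == rnode.db &&
        last_rnode.rel == rnode.rel)
        return;                         /* hit: same rel as the previous record */

    slow_lookups++;                     /* miss: would go to the SMgr hash table */
    last_rnode = rnode;
    have_last = 1;
}

int
main(void)
{
    RelFileNode heap  = {1663, 16384, 32791};
    RelFileNode btree = {1663, 16384, 32794};
    int i;

    /* interleaved stream, as in the INSERT ... generate_series() dump above */
    for (i = 0; i < 10; i++)
    {
        open_with_one_entry_cache(heap);
        open_with_one_entry_cache(btree);
    }
    printf("interleaved: %lu slow lookups for 20 records\n", slow_lookups);

    /* batched stream, as in the MULTI_INSERT dump below */
    slow_lookups = 0;
    have_last = 0;
    open_with_one_entry_cache(heap);        /* one MULTI_INSERT record */
    for (i = 0; i < 10; i++)
        open_with_one_entry_cache(btree);   /* localized INSERT_LEAF records */
    printf("batched:     %lu slow lookups for 11 records\n", slow_lookups);

    return 0;
}

Compiled and run, the interleaved loop pays a slow lookup for every single record, while the batched variant pays only two.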

And here's a sample of the already existing table_multi_insert v2 API (Heap2), with the obvious elimination of unnecessary individual calls to smgropen() via one big MULTI_INSERT instead (for CTAS/COPY/REFRESH MV):
postgres(at)test=# create table t (id bigint primary key);
postgres(at)test=# copy (select generate_series (1, 10)) to '/tmp/t';
postgres(at)test=# copy t from '/tmp/t';
=>
rmgr: Heap2 len (rec/tot): 210/ 210, tx: 17243290, lsn: 4/E9000028, prev 4/E8004410, desc: MULTI_INSERT+INIT 10 tuples flags 0x02, blkref #0: rel 1663/16384/32801 blk 0
rmgr: Btree len (rec/tot): 102/ 102, tx: 17243290, lsn: 4/E9000100, prev 4/E9000028, desc: NEWROOT lev 0, blkref #0: rel 1663/16384/32804 blk 1, blkref #2: rel 1663/16384/32804 blk 0
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32804 blk 1
[..]
Here the Btree stream is very localized (at least when concurrent sessions are not generating WAL) and it enables Thomas's recent-buffer optimization to kick in.

DELETE is much simpler (thanks to not churning out those Btree records) and, also thanks to Thomas's recent-buffer, should theoretically put much less stress on smgropen() already:
rmgr: Heap len (rec/tot): 54/ 54, tx: 17243296, lsn: 4/ED000028, prev 4/EC002800, desc: DELETE off 1 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap len (rec/tot): 54/ 54, tx: 17243296, lsn: 4/ED000060, prev 4/ED000028, desc: DELETE off 2 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap len (rec/tot): 54/ 54, tx: 17243296, lsn: 4/ED000098, prev 4/ED000060, desc: DELETE off 3 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap len (rec/tot): 54/ 54, tx: 17243296, lsn: 4/ED0000D0, prev 4/ED000098, desc: DELETE off 4 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
[..]

2) So what's missing - I may be wrong on this one - is something like an "index_multi_insert" Btree2 API, to avoid repeatedly overwhelming smgropen() on the recovery side for the same index's $buffer. Not sure it is worth the effort though, especially as recent-buffer already fixes cases like this:
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/32804 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 17243290, lsn: 4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/32804 blk 1
right?

3) Concurrent DML sessions mixing WAL records: buffering on the backend's side of things (a private "thread" of WAL, in private memory, that would simply be "copied" into the logwriter's main WAL buffer on commit or when the buffer fills up). It would seem like a very interesting idea to limit the interleaving of concurrent sessions' WAL records with each other and exploit the recent-buffer enhancement to avoid repeating the same calls to SMgr, wouldn't it? (I'm just mentioning it as I saw you were benchmarking it here and called out this idea.)
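
Just to illustrate the shape of that idea, a rough C sketch of backend-private WAL buffering could look like the following. This is purely hypothetical; none of these names exist in PostgreSQL, and the real insertion path (XLogInsert(), WAL insert locks, LSN assignment) is far more involved.

#include <stdio.h>
#include <string.h>

#define PRIVATE_WAL_BUF_SIZE  (64 * 1024)

typedef struct PrivateWalBuffer
{
    char   data[PRIVATE_WAL_BUF_SIZE];
    size_t used;
} PrivateWalBuffer;

static size_t published_bytes = 0;

/* stand-in for copying a finished run into the shared WAL buffers */
static void
publish_to_shared_wal(const char *data, size_t len)
{
    (void) data;
    published_bytes += len;
}

static void
private_wal_flush(PrivateWalBuffer *buf)
{
    if (buf->used > 0)
    {
        publish_to_shared_wal(buf->data, buf->used);
        buf->used = 0;
    }
}

/* append one record to the backend-private buffer (assumes a record always fits) */
static void
private_wal_append(PrivateWalBuffer *buf, const void *rec, size_t len)
{
    if (buf->used + len > sizeof(buf->data))
        private_wal_flush(buf);             /* buffer full: publish what we have */
    memcpy(buf->data + buf->used, rec, len);
    buf->used += len;
}

/* at COMMIT the whole private run is published in one go */
static void
private_wal_commit(PrivateWalBuffer *buf)
{
    private_wal_flush(buf);
}

int
main(void)
{
    PrivateWalBuffer buf = { {0}, 0 };
    char record[63];                        /* stand-in for one heap INSERT record */
    int  i;

    memset(record, 0, sizeof(record));
    for (i = 0; i < 10; i++)
        private_wal_append(&buf, record, sizeof(record));

    private_wal_commit(&buf);               /* the whole run goes out contiguously */
    printf("published %zu bytes as one contiguous run\n", published_bytes);
    return 0;
}

The point is only that one session's records would reach the shared stream as a contiguous run, which is exactly the pattern that lets the recent-buffer enhancement keep hitting during replay.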

I could be wrong with many of those simplifications; in any case please consult with Thomas, as he knows this much better and is a much more trusted source than me 😉

-J.
