From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie> |
Subject: | Re: BTScanOpaqueData size slows down tests |
Date: | 2025-04-02 15:57:18 |
Message-ID: | jtjpscdpj5dxugvulavmjekmblvuroxi3tvkeyhbhp6ye5blqj@jrjrsrlr76gj |
Lists: | pgsql-hackers |
Hi,
On 2025-04-02 11:36:33 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > Looking at the size of BTScanOpaqueData I am less surprised:
> > /* size: 27352, cachelines: 428, members: 17 */
> > allocating, zeroing and freeing 28kB of memory for every syscache miss, yea,
> > that's gonna hurt.
>
> Ouch! I had no idea it had gotten that big. Yeah, we ought to
> do something about that.
It got a bit bigger a few years back, in
commit 0d861bbb702
Author: Peter Geoghegan <pg(at)bowt(dot)ie>
Date: 2020-02-26 13:05:30 -0800
Add deduplication to nbtree.
Because the posting list is a lot more dense, more items can be stored on each
page.
Not that it was small before either:
BTScanPosData currPos __attribute__((__aligned__(8))); /* 88 4128 */
/* --- cacheline 65 boundary (4160 bytes) was 56 bytes ago --- */
BTScanPosData markPos __attribute__((__aligned__(8))); /* 4216 4128 */
/* size: 8344, cachelines: 131, members: 16 */
/* sum members: 8334, holes: 3, sum holes: 10 */
/* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
/* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));
But obviously ~3.2x can qualitatively change something.
> > And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
> > only when mark/restore are used?
>
> That'd be an easy way of removing about half of the problem, but
> 14kB is still too much. How badly do we need this items array?
> Couldn't we just reference the on-page items?
I think that'd require acquiring the buffer lock and/or pin more frequently.
But I know very little about nbtree.
I'd assume it's extremely rare for there to be this many items on a page. I'd
guess that something like the following would work: have BTScanPosData->items
point to an in-line BTScanPosItem items_inline[N], with N somewhere in 4-16,
and dynamically allocate a full-length BTScanPosItem[MaxTIDsPerBTreePage] just
in the cases where it's needed.
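To make that idea concrete, here is a rough standalone sketch of the "small
in-line array, full-size array only on demand" pattern. It is not nbtree code:
ScanItem, ScanPos, INLINE_ITEMS, MAX_ITEMS and the scanpos_* helpers are all
hypothetical names, MAX_ITEMS is only a placeholder for MaxTIDsPerBTreePage,
and plain malloc/free stand in for whatever memory context a real patch would
use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_ITEMS 8     /* hypothetical N; the mail suggests 4-16 */
#define MAX_ITEMS    1358  /* placeholder for MaxTIDsPerBTreePage */

/* Hypothetical stand-in for BTScanPosItem. */
typedef struct ScanItem
{
    unsigned       block;
    unsigned short offset;
    unsigned short tuple_offset;
} ScanItem;

/* Hypothetical stand-in for BTScanPosData. */
typedef struct ScanPos
{
    int       nitems;
    int       capacity;
    ScanItem *items;                      /* inline or heap array */
    ScanItem  items_inline[INLINE_ITEMS]; /* common, small case */
} ScanPos;

static void
scanpos_init(ScanPos *pos)
{
    pos->nitems = 0;
    pos->capacity = INLINE_ITEMS;
    pos->items = pos->items_inline;
}

/* Append one item; returns 1 on success, 0 if allocation failed. */
static int
scanpos_append(ScanPos *pos, ScanItem item)
{
    assert(pos->nitems < MAX_ITEMS);

    if (pos->nitems == pos->capacity)
    {
        /* Rare case: switch to a full-length array in one step. */
        ScanItem *full = malloc(sizeof(ScanItem) * MAX_ITEMS);

        if (full == NULL)
            return 0;
        memcpy(full, pos->items, sizeof(ScanItem) * pos->nitems);
        pos->items = full;
        pos->capacity = MAX_ITEMS;
    }
    pos->items[pos->nitems++] = item;
    return 1;
}

/* Free the full-length array, if any, and go back to the inline one. */
static void
scanpos_release(ScanPos *pos)
{
    if (pos->items != pos->items_inline)
        free(pos->items);
    scanpos_init(pos);
}
```

One caveat with this layout: items points into the struct itself, so copying
a ScanPos wholesale (by assignment or memcpy) leaves the copy's pointer aimed
at the source's inline array. Anything that copies a scan position would have
to re-point items afterwards.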
I'm a bit confused by the "MUST BE LAST" comment:
BTScanPosItem items[MaxTIDsPerBTreePage]; /* MUST BE LAST */
Not clear why? Seems to be from rather long back:
commit 09cb5c0e7d6
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: 2006-05-07 01:21:30 +0000
Rewrite btree index scans to work a page at a time in all cases (both
Greetings,
Andres Freund