Re: BTScanOpaqueData size slows down tests

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: BTScanOpaqueData size slows down tests
Date: 2025-04-02 15:57:18
Message-ID: jtjpscdpj5dxugvulavmjekmblvuroxi3tvkeyhbhp6ye5blqj@jrjrsrlr76gj
Lists: pgsql-hackers

Hi,

On 2025-04-02 11:36:33 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > Looking at the size of BTScanOpaqueData I am less surprised:
> > /* size: 27352, cachelines: 428, members: 17 */
> > allocating, zeroing and freeing 28kB of memory for every syscache miss, yea,
> > that's gonna hurt.
>
> Ouch! I had no idea it had gotten that big. Yeah, we ought to
> do something about that.

It got a bit bigger a few years back, in

commit 0d861bbb702
Author: Peter Geoghegan <pg(at)bowt(dot)ie>
Date: 2020-02-26 13:05:30 -0800

Add deduplication to nbtree.

Because the posting list is a lot more dense, more items can be stored on each
page.

Not that it was small before either:

BTScanPosData currPos __attribute__((__aligned__(8))); /* 88 4128 */
/* --- cacheline 65 boundary (4160 bytes) was 56 bytes ago --- */
BTScanPosData markPos __attribute__((__aligned__(8))); /* 4216 4128 */

/* size: 8344, cachelines: 131, members: 16 */
/* sum members: 8334, holes: 3, sum holes: 10 */
/* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
/* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));

But obviously growing by ~3.3x (8344 to 27352 bytes) can qualitatively change
things.

> > And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
> > only when mark/restore are used?
>
> That'd be an easy way of removing about half of the problem, but
> 14kB is still too much. How badly do we need this items array?
> Couldn't we just reference the on-page items?

I think that'd require acquiring the buffer lock and/or pin more frequently.
But I know very little about nbtree.
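
To make the markPos idea quoted above concrete, here is a rough sketch of
allocating the mark position only on first use. Everything in it is a
stand-in: SketchPos, SketchScanOpaque and the sketch_* functions are
hypothetical names, malloc/free stand in for palloc/pfree, and the payload
size is arbitrary. It is not how nbtree is actually structured.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for BTScanPosData; the payload size is arbitrary. */
typedef struct SketchPos
{
    int  nitems;
    char payload[4096];
} SketchPos;

/* Hypothetical stand-in for BTScanOpaqueData. */
typedef struct SketchScanOpaque
{
    SketchPos  currPos;   /* always needed, kept in-line */
    SketchPos *markPos;   /* NULL until the scan is first marked */
} SketchScanOpaque;

static void
sketch_scan_init(SketchScanOpaque *so)
{
    memset(&so->currPos, 0, sizeof(so->currPos));
    so->markPos = NULL;   /* nothing allocated for scans that never mark */
}

/* Remember the current position; returns 0 on success, -1 on OOM. */
static int
sketch_mark(SketchScanOpaque *so)
{
    if (so->markPos == NULL)
    {
        so->markPos = malloc(sizeof(SketchPos));
        if (so->markPos == NULL)
            return -1;
    }
    *so->markPos = so->currPos;
    return 0;
}

/* Go back to the marked position; returns -1 if nothing was marked. */
static int
sketch_restore(SketchScanOpaque *so)
{
    if (so->markPos == NULL)
        return -1;
    so->currPos = *so->markPos;
    return 0;
}

static void
sketch_scan_end(SketchScanOpaque *so)
{
    free(so->markPos);    /* free(NULL) is a no-op */
    so->markPos = NULL;
}
```

With this layout a scan that never calls sketch_mark() pays for one position
struct instead of two, which is the "about half of the problem" part of the
quoted exchange.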

I'd assume it's extremely rare for there to be this many items on a page. I'd
guess that something like having BTScanPosData->items point to an in-line
BTScanPosItem items_inline[N], with N somewhere around 4-16, and dynamically
allocating a full-length BTScanPosItem[MaxTIDsPerBTreePage] only in the cases
where it's needed would work.
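
A minimal sketch of that inline-plus-overflow layout, under loud
assumptions: SketchItem, SketchScanPos and the sketch_pos_* functions are
hypothetical names, SKETCH_INLINE_ITEMS = 8 is just one pick from the 4-16
range, SKETCH_MAX_ITEMS is a placeholder for MaxTIDsPerBTreePage, and
malloc/free stand in for palloc/pfree.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_INLINE_ITEMS 8      /* assumed N, somewhere in the 4-16 range */
#define SKETCH_MAX_ITEMS    1358   /* placeholder for MaxTIDsPerBTreePage */

/* Hypothetical stand-in for BTScanPosItem. */
typedef struct SketchItem
{
    uint8_t  heap_tid[6];
    uint16_t index_offset;
    uint16_t tuple_offset;
} SketchItem;

/* Hypothetical stand-in for the items-related part of BTScanPosData. */
typedef struct SketchScanPos
{
    int         nitems;     /* valid entries in items[] */
    int         capacity;   /* entries items[] can currently hold */
    SketchItem *items;      /* items_inline, or a full-length heap array */
    SketchItem  items_inline[SKETCH_INLINE_ITEMS];
} SketchScanPos;

static void
sketch_pos_init(SketchScanPos *pos)
{
    pos->nitems = 0;
    pos->capacity = SKETCH_INLINE_ITEMS;
    pos->items = pos->items_inline;
}

/* Append one item; returns 0 on success, -1 if full or out of memory. */
static int
sketch_pos_append(SketchScanPos *pos, const SketchItem *item)
{
    if (pos->nitems >= SKETCH_MAX_ITEMS)
        return -1;
    if (pos->nitems == pos->capacity)
    {
        /* Inline array is full: switch once to a full-length array. */
        SketchItem *full = malloc(sizeof(SketchItem) * SKETCH_MAX_ITEMS);

        if (full == NULL)
            return -1;
        memcpy(full, pos->items_inline, sizeof(SketchItem) * pos->nitems);
        pos->items = full;
        pos->capacity = SKETCH_MAX_ITEMS;
    }
    assert(pos->nitems < pos->capacity);
    pos->items[pos->nitems++] = *item;
    return 0;
}

/* Free any overflow array and fall back to the inline storage. */
static void
sketch_pos_release(SketchScanPos *pos)
{
    if (pos->items != pos->items_inline)
        free(pos->items);
    sketch_pos_init(pos);
}
```

One caveat with this shape: a plain struct copy of SketchScanPos would leave
items pointing into the source struct (or sharing its heap array), so any
code that copies a position would need to fix up the pointer afterwards.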

I'm a bit confused by the "MUST BE LAST" comment:
BTScanPosItem items[MaxTIDsPerBTreePage]; /* MUST BE LAST */

Not clear why? Seems to be from rather long back:

commit 09cb5c0e7d6
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: 2006-05-07 01:21:30 +0000

Rewrite btree index scans to work a page at a time in all cases (both

Greetings,

Andres Freund
