Re: BTScanOpaqueData size slows down tests

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: BTScanOpaqueData size slows down tests
Date: 2025-04-02 17:07:29
Message-ID: 4434c0a9-4e04-4130-bd88-23619873a48d@vondra.me
Lists: pgsql-hackers

On 4/2/25 17:45, Peter Geoghegan wrote:
> On Wed, Apr 2, 2025 at 11:36 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Ouch! I had no idea it had gotten that big. Yeah, we ought to
>> do something about that.
>
> Tomas Vondra talked about this recently, in the context of his work on
> prefetching.
>

I might have mentioned this in the context of index prefetching (which
naturally has to touch this code), but I actually ran into it while
working on fast-path locking [1].

[1]
https://www.postgresql.org/message-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com

One of the tests I did involved partitions, with index scans on tiny
partitions, and it got pretty awful simply because of the malloc()
calls. The struct exceeds ALLOCSET_SEPARATE_THRESHOLD, so the memory
context can't cache it, and even if it could, we would not cache it
across scans anyway.

>>> And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
>>> only when mark/restore are used?
>>
>> That'd be an easy way of removing about half of the problem, but
>> 14kB is still too much. How badly do we need this items array?
>> Couldn't we just reference the on-page items?
>
> I'm not sure what you mean by that. The whole design of _bt_readpage
> is based on the idea that we read a whole page, in one go. It has to
> batch up the items that are to be returned from the page somewhere.
> The worst case is that there are about 1350 TIDs to return from any
> single page (assuming default BLCKSZ). It's very pessimistic to start
> from the assumption that that worst case will be hit, but I don't see
> a way around doing it at least some of the time.
>
> The first thing I'd try is some kind of simple dynamic allocation
> scheme, with a small built-in array that avoided any allocation
> penalty in the common case where there weren't too many tuples to
> return from the page.
>
> The way that we allocate BLCKSZ twice for index-only scans (one for
> so->currTuples, the other for so->markTuples) is also pretty
> inefficient. Especially because any kind of use of mark and restore is
> exceedingly rare.
>

Yeah, something like this (allocating smaller arrays unless more is
actually needed) would help many common cases.

Another thing that helped was setting the MALLOC_TOP_PAD_ environment
variable (or doing the same thing via mallopt), so that glibc keeps a
"buffer" of free memory for future allocations instead of returning it
to the kernel.

regards

--
Tomas Vondra
