Re: Use heap scan routines directly in vac_update_datfrozenxid()

From: Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Use heap scan routines directly in vac_update_datfrozenxid()
Date: 2024-10-14 00:31:00
Message-ID: CAE-ML+8A7eLq8ScBXFV=MuHiJtYJn1XchMeNtLR3oZptxfGZ3w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 7, 2024 at 1:58 AM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> On 2024-Oct-06, Tom Lane wrote:
>
> > Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com> writes:
> > > Attached is a simple patch to directly use heap scan routines in
> > > vac_update_datfrozenxid(), avoiding the multilayer overhead from the
> > > sysscan infrastructure.
>
> Though if there's anybody with a Postgres fork using catalog tables that
> aren't heapam, then they aren't going to be happy with this change. (I
> remember Tom commenting that Salesforce used to do that, I wonder if
> they still do.)

I see. However, I do see many uses of table_beginscan_catalog() followed
by heap_getnext(), so we are already assuming that catalog == heap at
those call sites. I guess they aren't ideal and are meant to be changed
eventually?
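
For illustration, the pattern at those call sites looks roughly like this
(a generic sketch assuming rel is an already-opened catalog relation, not
lifted from any particular caller):

TableScanDesc scan = table_beginscan_catalog(rel, 0, NULL);
HeapTuple     tuple;

/*
 * heap_getnext() downcasts the AM-agnostic TableScanDesc to a heap
 * scan descriptor, so this only works if the catalog uses the heap AM.
 */
while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
{
    /* ... inspect tuple ... */
}

table_endscan(scan);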

> > I would think the overhead of that is minuscule. If it isn't,
> > we should try to make it so, not randomly hack up some of the callers
> > to avoid it. The intention certainly was that it wouldn't cost
> > anything compared to what happens within the actual table access.
>
> I suspect the problem is not the systable layer per se, but the fact
> that it has to go through the table AM interface. So by replacing
> systable with the heap routines, it's actually _two_ layers of
> indirection that are being skipped over. systable seems indeed quite
> lean, or at least it was up to postgres 11 ... but it's not clear to me
> that it continues to be.

> The table AM API is heavily centered on slots, and I think having to
> build heap tuples from slots might be slow. I wonder if it would be
> better to add "systable_getnext_slot" which returns a slot and omit the
> conversion to heaptuple. Callers for which it's significant could skip
> that conversion by dealing with a slot instead. Converting just one or
> two critical spots (such as vac_update_datfrozenxid, maybe
> pg_publication.c) should be easy enough.
>

We aren't building heap tuples AFAICS. Attaching a debugger, I saw
that we are calling:

tts_buffer_heap_store_tuple execTuples.c:981
ExecStoreBufferHeapTuple execTuples.c:1493
heap_getnextslot heapam.c:1316
table_scan_getnextslot tableam.h:1071
systable_getnext genam.c:538
vac_update_datfrozenxid vacuum.c:1644

where we store the HeapTuple obtained from heapgettup() into the slot:
bslot->base.tuple = tuple;

And then subsequently, we obtain the tuple directly via:

tts_buffer_heap_get_heap_tuple execTuples.c:909
ExecFetchSlotHeapTuple execTuples.c
systable_getnext genam.c:542
vac_update_datfrozenxid vacuum.c:1644

We don't go through a slot->tts_ops->copy_heap_tuple() call, so the heap
tuple is never copied; we simply return bslot->base.tuple.
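
For reference, the getter is essentially just the following (paraphrased
from memory of execTuples.c, so details may differ slightly):

static HeapTuple
tts_buffer_heap_get_heap_tuple(TupleTableSlot *slot)
{
    BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;

    Assert(!TTS_EMPTY(slot));

    /* not reached in this path -- the slot already holds the tuple */
    if (!bslot->base.tuple)
        tts_buffer_heap_materialize(slot);

    return bslot->base.tuple;
}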

At any rate, there is some (albeit small) overhead due to slots being in the
picture. This is an excerpt from perf on the main branch with my workload:

main (with CFLAGS='-O0 -fno-omit-frame-pointer -ggdb', without asserts):

Children Self Command Symbol
...
- 81.59% 2.65% postgres [.] systable_getnext
- 78.93% systable_getnext
- 76.61% table_scan_getnextslot
- 73.16% heap_getnextslot
+ 57.95% heapgettup_pagemode
- 9.51% ExecStoreBufferHeapTuple
- 7.65% tts_buffer_heap_store_tuple
+ 2.47% IncrBufferRefCount
+ 2.04% ReleaseBuffer
- 1.83% ExecFetchSlotHeapTuple
0.83% tts_buffer_heap_get_heap_tuple
+ 2.48% _start

We can see some overhead in ExecStoreBufferHeapTuple() and
ExecFetchSlotHeapTuple(), both of which are avoided in my branch.

I am attaching a perf diff (with --dsos=postgres) between the -O3 branch
and -O3 main (with the branch as the baseline), as well as the individual
profiles (--report with the --stdio flag and --dsos=postgres; profiles
captured with --call-graph=dwarf). I ran the same workload as in my
previous email.

To your point about having a slot-based API, where the caller would
manipulate the slot directly: that unfortunately won't help us avoid the
overhead of storing the tuple into the slot and then extracting it back
out.
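
For concreteness, here is what such a function might look like (a
hypothetical sketch covering only the sequential-scan case; the name and
shape are assumptions, not actual PostgreSQL code):

TupleTableSlot *
systable_getnext_slot(SysScanDesc sysscan)
{
    Assert(sysscan->irel == NULL);  /* sequential-scan case only */

    /*
     * heap_getnextslot() still stores the tuple into the slot via
     * ExecStoreBufferHeapTuple(), so the store-side overhead in the
     * profile above remains; only the ExecFetchSlotHeapTuple() half
     * goes away.
     */
    if (table_scan_getnextslot(sysscan->scan, ForwardScanDirection,
                               sysscan->slot))
        return sysscan->slot;

    return NULL;
}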

To Tom's point about fixing systable_getnext() itself to remove the
overhead, there is the option of substituting its internal
table_scan_getnextslot() call with heap_getnext(). However, I suppose
that isn't fair to non-heap catalogs.
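
Concretely, that substitution would look something like this in
systable_getnext()'s sequential-scan branch (a sketch, not a proposed
patch):

HeapTuple
systable_getnext(SysScanDesc sysscan)
{
    HeapTuple   htup = NULL;

    if (sysscan->irel)
    {
        /* ... index-scan path unchanged ... */
    }
    else
    {
        /*
         * Bypasses the slot store/fetch entirely, but hard-codes the
         * heap AM for every catalog -- hence the unfairness to
         * non-heap catalogs mentioned above.
         */
        htup = heap_getnext(sysscan->scan, ForwardScanDirection);
    }

    return htup;
}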

Regards,
Soumyadeep (Broadcom)

Attachment               Content-Type              Size
branchO3_vs_mainO3.diff  text/x-patch              2.9 KB
perf.main.O3.data.out    application/octet-stream  56.4 KB
perf.branch.O3.data.out  application/octet-stream  38.1 KB
perf.main.O0.data.out    application/octet-stream  753.2 KB
