From: Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Use heap scan routines directly in vac_update_datfrozenxid()
Date: 2024-10-14 00:31:00
Message-ID: CAE-ML+8A7eLq8ScBXFV=MuHiJtYJn1XchMeNtLR3oZptxfGZ3w@mail.gmail.com
Lists: pgsql-hackers
On Mon, Oct 7, 2024 at 1:58 AM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> On 2024-Oct-06, Tom Lane wrote:
>
> > Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com> writes:
> > > Attached is a simple patch to directly use heap scan routines in
> > > vac_update_datfrozenxid(), avoiding the multilayer overhead from the
> > > sysscan infrastructure.
>
> Though if there's anybody with a Postgres fork using catalog tables that
> aren't heapam, then they aren't going to be happy with this change. (I
> remember Tom commenting that Salesforce used to do that, I wonder if
> they still do.)
I see. However, I do see many existing uses of table_beginscan_catalog()
followed by heap_getnext(), so we are already assuming catalog = heap at
those call sites (a minimal sketch of the pattern follows). I guess they
aren't ideal and are meant to be changed eventually?
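For reference, here is a minimal sketch of that existing pattern (the
catalog and struct names are just illustrative):

    Relation      rel = table_open(RelationRelationId, AccessShareLock);
    TableScanDesc scan = table_beginscan_catalog(rel, 0, NULL);
    HeapTuple     tuple;

    /*
     * heap_getnext() errors out if the scanned relation isn't heapam,
     * so this loop silently bakes in the catalog = heap assumption.
     */
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);

        /* ... inspect classForm ... */
    }

    table_endscan(scan);
    table_close(rel, AccessShareLock);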
> > I would think the overhead of that is minuscule. If it isn't,
> > we should try to make it so, not randomly hack up some of the callers
> > to avoid it. The intention certainly was that it wouldn't cost
> > anything compared to what happens within the actual table access.
>
> I suspect the problem is not the systable layer per se, but the fact
> that it has to go through the table AM interface. So by replacing
> systable with the heap routines, it's actually _two_ layers of
> indirection that are being skipped over. systable seems indeed quite
> lean, or at least it was up to postgres 11 ... but it's not clear to me
> that it continues to be.
> The table AM API is heavily centered on slots, and I think having to
> build heap tuples from slots might be slow. I wonder if it would be
> better to add "systable_getnext_slot" which returns a slot and omit the
> conversion to heaptuple. Callers for which it's significant could skip
> that conversion by dealing with a slot instead. Converting just one or
> two critical spots (such as vac_update_datfrozenxid, maybe
> pg_publication.c) should be easy enough.
>
We aren't building heap tuples AFAICS. Attaching a debugger, I saw
that we are calling:

    tts_buffer_heap_store_tuple   execTuples.c:981
    ExecStoreBufferHeapTuple      execTuples.c:1493
    heap_getnextslot              heapam.c:1316
    table_scan_getnextslot        tableam.h:1071
    systable_getnext              genam.c:538
    vac_update_datfrozenxid       vacuum.c:1644
where we store the HeapTuple obtained from heapgettup() into the slot:

    bslot->base.tuple = tuple;

And then subsequently, we obtain the tuple directly via:

    tts_buffer_heap_get_heap_tuple   execTuples.c:909
    ExecFetchSlotHeapTuple           genam.c:542
    systable_getnext                 genam.c:542
    vac_update_datfrozenxid          vacuum.c:1644
We don't go through a slot->tts_ops->copy_heap_tuple() call, so we don't
copy the heap tuple; we just return bslot->base.tuple (see the callback
paraphrased below).
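Paraphrased from execTuples.c, the callback is roughly:

    static HeapTuple
    tts_buffer_heap_get_heap_tuple(TupleTableSlot *slot)
    {
        BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;

        Assert(!TTS_EMPTY(slot));

        /*
         * Only materializes if the slot doesn't already hold a tuple;
         * in our code path it does, so this is just a pointer return.
         */
        if (!bslot->base.tuple)
            tts_buffer_heap_materialize(slot);

        return bslot->base.tuple;
    }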
At any rate, there is some (albeit small) overhead due to slots being in the
picture. This is an excerpt from perf on the main branch with my workload:
main (with CFLAGS='-O0 -fno-omit-frame-pointer -ggdb', without asserts):
Children Self Command Symbol
...
- 81.59% 2.65% postgres [.] systable_getnext
- 78.93% systable_getnext
- 76.61% table_scan_getnextslot
- 73.16% heap_getnextslot
+ 57.95% heapgettup_pagemode
- 9.51% ExecStoreBufferHeapTuple
- 7.65% tts_buffer_heap_store_tuple
+ 2.47% IncrBufferRefCount
+ 2.04% ReleaseBuffer
- 1.83% ExecFetchSlotHeapTuple
0.83% tts_buffer_heap_get_heap_tuple
+ 2.48% _start
We can see some overhead in ExecStoreBufferHeapTuple() and
ExecFetchSlotHeapTuple(), both of which are avoided in my branch.
I am attaching a perf diff (with --dsos=postgres) between the -O3 branch
and -O3 main (with the branch as the baseline), as well as the individual
profiles (perf report with the --stdio flag and --dsos=postgres; profiles
captured with --call-graph=dwarf). I ran the same workload as in my
previous email.
To your point about having a slot-based API, where the caller would
manipulate the slot directly: that unfortunately won't help us avoid the
overhead of storing the tuple in the slot and then extracting it back out.
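For concreteness, a hypothetical systable_getnext_slot() might look like
the following sketch (the name and shape are my assumptions; nothing like
this exists in the tree today):

    /*
     * Hypothetical sketch: a slot-returning variant of systable_getnext().
     * It skips the ExecFetchSlotHeapTuple() call, but the
     * ExecStoreBufferHeapTuple() inside heap_getnextslot() still runs,
     * and callers that need the catalog struct must fetch the tuple out
     * of the slot anyway.
     */
    TupleTableSlot *
    systable_getnext_slot(SysScanDesc sysscan)
    {
        if (sysscan->irel)
        {
            if (index_getnext_slot(sysscan->iscan, ForwardScanDirection,
                                   sysscan->slot))
                return sysscan->slot;
        }
        else
        {
            if (table_scan_getnextslot(sysscan->scan, ForwardScanDirection,
                                       sysscan->slot))
                return sysscan->slot;
        }
        return NULL;
    }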
To Tom's point about fixing systable_getnext() itself to remove the
overhead, there is the option of substituting its internal call to
table_scan_getnextslot() with heap_getnext(), as sketched below. However,
that isn't fair to non-heap catalogs, I suppose.
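A sketch of that substitution, to be concrete (sequential-scan path only;
this would hardwire heapam into genam.c, which is exactly the unfairness
to other catalog AMs):

    /*
     * Sketch: systable_getnext() with the slot bypassed on the
     * sequential-scan path. Assumes the underlying catalog is heapam;
     * heap_getnext() would error out otherwise.
     */
    HeapTuple
    systable_getnext(SysScanDesc sysscan)
    {
        HeapTuple   htup = NULL;

        if (sysscan->irel)
        {
            if (index_getnext_slot(sysscan->iscan, ForwardScanDirection,
                                   sysscan->slot))
            {
                bool    shouldFree;

                /* index path unchanged: still stores into and fetches
                 * from the slot */
                htup = ExecFetchSlotHeapTuple(sysscan->slot, false,
                                              &shouldFree);
                Assert(!shouldFree);
            }
        }
        else
        {
            /* bypass table_scan_getnextslot() + ExecFetchSlotHeapTuple() */
            htup = heap_getnext(sysscan->scan, ForwardScanDirection);
        }

        return htup;
    }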
Regards,
Soumyadeep (Broadcom)
Attachments:
  branchO3_vs_mainO3.diff    text/x-patch              2.9 KB
  perf.main.O3.data.out      application/octet-stream  56.4 KB
  perf.branch.O3.data.out    application/octet-stream  38.1 KB
  perf.main.O0.data.out      application/octet-stream  753.2 KB