Re: Use virtual tuple slot for Unique node

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Денис Смирнов <darthunix(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use virtual tuple slot for Unique node
Date: 2023-10-23 22:59:52
Message-ID: CAApHDvpXz8wgXj7S13aE0MnkZeQjMAMw8q_Q3i4CrR=3gOOuKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> I ran my experiments again. It seems on my machine the execution times
> do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took
> average of execution times. I did this three times. For each run the
> standard deviation was within 2%. Here are the numbers
> master: 13548.33, 13878.88, 14572.52
> master + 0001: 13734.58, 14193.83, 14574.73
>
> So for me, I would say, this particular query performs the same with
> or without patch.

I'm not really sure which of the 7 queries you're referring to here.
The times you're quoting seem to align best to Q7 from your previous
results, so I'll assume you mean Q7.

I'm not really concerned with Q7 as both patched and unpatched use
TTSOpsMinimalTuple.

I also think you need to shrink the size of your benchmark down. With
1 million tuples, you're more likely to be also measuring the time it
takes to get cache lines from memory into the CPU. A smaller scale
test will make this less likely. Also, you'd be better timing SELECT
rather than the time it takes to EXPLAIN ANALYZE. They're not the same
thing. EXPLAIN ANALYZE has additional timing going on and we may end
up not de-toasting toasted Datums.

> On Thu, Oct 19, 2023 at 4:26 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> > We can see that Q2 and Q3 become a bit slower. This makes sense as
> > tts_virtual_materialize() is quite a bit more complex than
> > heap_copy_minimal_tuple() which is a simple palloc/memcpy.
> >
>
> If the source slot is a materialized virtual slot,
> tts_virtual_copyslot() could perform a memcpy of the materialized data
> itself rather than materialising from datums. That might be more
> efficient.

I think you're talking about just performing a memcpy() of the
VirtualTupleTableSlot->data field. Unfortunately, you'd not be able
to do just that as you'd also need to repoint the non-byval Datums in
tts_values at the newly memcpy'd memory. If you skipped that part,
those would remain pointing to the original memory. If that memory
goes away, then bad things will happen. I think you'd still need to do
the 2nd loop in tts_virtual_materialize()

> > We'd likely see Q2 and Q3 do better with the patched version if there
> > were more duplicates as there'd be less tuple deforming going on
> > because of the virtual slots.
> >
> > Overall, the patched version is 5.55% faster than master. However,
> > it's pretty hard to say if we should do this or not. Q3 has a mix of
> > varlena and byval types and that came out slower with the patched
> > version.
>
> Theoretically using the same slot type is supposed to be faster. We
> use same slot types for input and output in other places where as
> well.

Which theory?

> May be we should fix the above said inefficiency in
> tt_virtual_copyslot()?

I don't have any bright ideas on how to make tts_virtual_materialize()
itself faster. If there were some way to remember !attbyval
attributes for the 2nd loop, that might be good, but creating
somewhere to store that might result in further overheads.

tts_virtual_copyslot() perhaps could be sped up a little by doing a
memcpy of the values/isnull arrays when the src and dst descriptors
have the same number of attributes. aka, something like:

if (srcdesc->natts == dstslot->tts_tupleDescriptor->natts)
memcpy(dstslot->tts_values, srcslot->tts_values,
MAXALIGN(srcdesc->natts * sizeof(Datum)) +
MAXALIGN(srcdesc->natts * sizeof(bool)));
else
{
for (int natt = 0; natt < srcdesc->natts; natt++)
{
dstslot->tts_values[natt] = srcslot->tts_values[natt];
dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
}
}

I imagine we'd only start to notice gains by doing that for larger
natts values. Each of the 7 queries here, I imagine, wouldn't have
enough columns for it to make much of a difference.

That seems valid enough to do based on how MakeTupleTableSlot()
allocates those arrays. ExecSetSlotDescriptor() does not seem on board
with the single allocation method, however. (Those pfree's in
ExecSetSlotDescriptor() do look a bit fragile.
https://coverage.postgresql.org/src/backend/executor/execTuples.c.gcov.html
says they're never called)

David

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Wong 2023-10-23 23:05:44 Re: LLVM 16 (opaque pointers)
Previous Message David E. Wheeler 2023-10-23 22:58:18 Re: Patch: Improve Boolean Predicate JSON Path Docs