Re: Make tuple deformation faster

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Make tuple deformation faster
Date: 2025-03-10 03:24:14
Message-ID: CAApHDvrRiBF2YnACgcV_=k_bHjUrLczFfU4hQPJO2VRPFczDbA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 6 Mar 2025 at 10:17, Andres Freund <andres(at)anarazel(dot)de> wrote:
> FWIW, I am fairly certain that I looked at this at an earlier state of the
> patch, and at least for me the issue wasn't that it was inherently slower to
> use the bitmask, but that it was hard to convince the compiler not generate
> worse code.
>
> IIRC the compiler generated more complicated address gathering instructions
> which are slower on some older microarchitectures, but this is a vague memory.

I've been reading GCC's assembly output with -fverbose-asm. I find it
quite hard to follow as the changes between the 16-byte and 8-byte
CompactAttribute versions are vast (see attached).

A few interesting things jump out. e.g, in master:

# execTuples.c:1080: thisatt->attcacheoff = *offp;
.loc 1 1080 26 is_stmt 0 view .LVU1468
movl %ebp, (%rax) # off, MEM[(int *)_22]

whereas with the 8-byte version, I see:

# execTuples.c:1080: thisatt->attcacheoff = *offp;
.loc 1 1080 26 is_stmt 0 view .LVU1484
movl %ebp, 24(%rax) # off, MEM[(int *)_358 + 24B]

You can see the MOVL in the 8-byte version should amount to an
additional micro op to add 24 to RAX before the dereference.

One interesting thing to note about having CompactAttribute in its
8-byte form is that the compiler is tempted into sharing a register
with the tts_values array before Datum is also 8-bytes. Note the
difference in [1] between the two left compiler outputs and the
right-hand one. You can see RCX is dedicated for addressing
CompactAttribute in the right window, but RAX is used for both arrays
in the left two. I don't 100% know for sure that's the reason for the
slowness with the full version but it does seem from the fragment I
posted just above that RAX does need 24 bytes added in the 8 bytes
version but not in the 16 byte version, so RAX is certainly not
dedicated and ready pointing to attcacheoff at that point.

Jeff, I'm not sure if I understand this well enough to write a
meaningful comment to explain why we don't use bitflags. With my
current knowledge level on this, it's a bit hand-wavy at best. Are you
content with this, or do you want to see something written into the
header comment for CompactAttribute in the code?

David

[1] https://godbolt.org/z/7hWvqdW6E

Attachment Content-Type Size
slot_deform_heap_tuple.zip application/x-zip-compressed 19.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2025-03-10 03:33:54 Re: Add an option to skip loading missing publication to avoid logical replication failure
Previous Message Erik Wienhold 2025-03-10 02:46:05 Re: CREATE OR REPLACE MATERIALIZED VIEW