Re: Make tuple deformation faster

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Make tuple deformation faster
Date: 2024-10-09 13:59:45
Message-ID: CAApHDvpwd76-goJ3J-g_VQEzhqqb7F-3Kd70LXNrS23UHYSLBg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 4 Oct 2024 at 10:23, Andres Freund <andres(at)anarazel(dot)de> wrote:
> A few weeks ago David and I discussed this patch. We were curious *why* the
> flags approach was slower. It turns out that, at least on my machine, this is
> just a compiler optimization issue. Putting a pg_compiler_barrier() just
> after:
>
> > for (; attnum < natts; attnum++)
> > {
> > - Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum);
> > + CompactAttribute *thisatt = TupleDescCompactAttr(tupleDesc, attnum);
>
> addressed the issue. The problem basically is that instead of computing the
> address of thisatt once, gcc for some reason instead uses complex addressing
> lea instructions, which are slower on some machines.
>
> Not sure what a good way to deal with that is. I haven't tried, but it could
> be that just advancing thisatt using++ would do the trick?

I'm not seeing the same thing when I test with GCC. Subsequent calls
to TupleDescCompactAttr() are just adding 8 to the pointer value in
the register. Here's perf annotation output using gcc11.4.0:

patches 0001-0004 applied:

│ for (; attnum < natts; attnum++)
1.02 │ f7: add $0x1,%r12d
0.02 │ add $0x8,%r15
2.81 │ add $0x1,%r14
11.02 │ add $0x8,%rbp

and with 0005 applied:

│ for (; attnum < natts; attnum++)
4.45 │ e9: add $0x1,%r12d
0.09 │ add $0x8,%r15
0.09 │ add $0x1,%r13
9.09 │ add $0x10,%rbx

You can see the final add switches to adding 16.

With clang, it's a different story. I'm unsure what happened to the
debug symbols, but I think it's this part:

0.30 │ ↓ jne 220
1.10 │1e9: mov 0x30(%rsp),%rcx
0.62 │ mov %rax,(%rcx,%r15,8)

and with 0005:

0.70 │ movzbl %bpl,%ebp
1.34 │ cmovle %r8d,%ebp
1.95 │ d1: add $0x1,%r15
1.20 │ add $0x10,%r13

With clang, the performance does improve quite a bit after applying
0005. 152 tps with 0001-0004 and 167 tps with 0005. See attached
bench_v4.txt. I tried pg_compiler_barrer() after the
TupleDescCompactAttr() in slot_deform_heap_tuple() and it does not
switch using "add".

> I think this generally looks quite good and the performance wins are quite
> nice! I'm a bit concerned about the impact on various extensions - I haven't
> looked, but I wouldn't be surprised if this requires a bunch of of changes.
> Perhaps we could reduce that a bit?
>
> Could it make sense to use bitfields instead of flag values, to reduce the
> impact?

Yeah. That's a good idea. Using bitflags allows me to get rid of the
macros I had which were extracting the bool values from attflags.
That allows me to leave the fetchatt() macro along so it still uses
the attbyval field. With bitflags both Form_pg_attribute and
CompactAttribute have that field, so the macro works for both. I think
this is likely the easiest way to reduce the number of changes and not
break existing users of that macro.

> > +#else
> > +/* Accessor for the i'th CompactAttribute of tupdesc */
> > +#define TupleDescCompactAttr(tupdesc, i) (&(tupdesc)->compact_attrs[(i)])
> > +#endif
>
> Personally I'd have the ifdef inside the static inline function, rather than
> changing between a static inline and a macro depending on
> USE_ASSERT_CHECKING..

Yeah, that's a good idea. I must have been trying to keep it a macro
the same as TupleDescAttr().

> > -#define fetchatt(A,T) fetch_att(T, (A)->attbyval, (A)->attlen)
> > +#define fetchatt(A, T) fetch_att(T, CompactAttrByVal(A), (A)->attlen)
>
> Stuff like this seems like it might catch some extensions unaware. I think it
> might make sense to change macros like this to be a static inline, so that you
> get proper type mismatch errors, rather than errors about invalid casts or
> nonexistant fields.

That's no longer required after switching to bitflags since the
CompactAttribute has an attbyval field again so I've left this macro
untouched in the latest patch.

> > From c75d072b8bc43ba4f9b7fbe2f99b65edc7421a15 Mon Sep 17 00:00:00 2001
> > From: David Rowley <dgrowley(at)gmail(dot)com>
> > Date: Wed, 29 May 2024 12:19:03 +1200
> > Subject: [PATCH v3 3/5] Optimize alignment calculations in tuple form/deform
> >
> > This converts CompactAttribute.attalign from a char which is directly
> > derived from pg_attribute.attalign into a uint8 which specifies the
> > number of bytes to align the column by. Also, rename the field to
> > attalignby to make the distinction more clear in code.
> >
> > This removes the complexity of checking each char value and transforming
> > that into the appropriate alignment call. This can just be a simple
> > TYPEALIGN passing in the number of bytes.
>
> I like this a lot.

Great.

> > diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
> > index 08772de39f..b66eb178b9 100644
> > --- a/contrib/amcheck/verify_heapam.c
> > +++ b/contrib/amcheck/verify_heapam.c
> > @@ -1592,7 +1592,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
> > /* Skip non-varlena values, but update offset first */
> > if (thisatt->attlen != -1)
> > {
> > - ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign);
> > + ctx->offset = att_nominal_alignby(ctx->offset, thisatt->attalignby);
> > ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen,
> > tp + ctx->offset);
> > if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
>
> A bit confused about the change in naming policy here...

The idea here is that after making CompactAttribute's attalign a uint8
of bytes instead of the char attalign, I renamed it "attalignby" so
that any code using a field named attalign wouldn't compile. That
gives me a bit more confidence when changing FormData_pg_attributes
into CompactAttributes that nothing is looking at the uint8 and
treating it like the attalign char. The reason I swapped the order of
the words around is because att_alignby_nominal() didn't really sound
correct. I was wondering if I'd keep that macro at all and not just
use TYPEALIGN() directly...

> > From d1ec19a46a480b0c75f9df468b2765ad4e51dce2 Mon Sep 17 00:00:00 2001
> > From: David Rowley <dgrowley(at)gmail(dot)com>
> > Date: Tue, 3 Sep 2024 14:05:30 +1200
> > Subject: [PATCH v3 5/5] Try a larger CompactAttribute struct without flags
> >
> > Benchmarks have shown that making the CompactAttribute struct larger and
> > getting rid of the flags to reduce the bitwise-ANDing requirements makes
> > things go faster.
>
> I think we have some way of not needing this. I don't like my compiler barrier
> hack, but I'm sure we can hold the hands of the compiler sufficiently to
> generate useful code.

I'm not sure what to do about this part. I did quite a bit of staring
at the attached benchmark results. Using gcc13.2 on my Zen2 machine,
the 0005 patch is quite a bit faster (138 tps vs 127 tps) than with
just 0004 and both versions use an "add" instruction to bump to the
next CompactAttribute element. So, with that CPU, maybe it's just
faster from getting rid of the bitwise-AND code to extract the
booleans.

> It'd sure be nice if we could condense some of these fields in pg_attribute
> too. It obviously shouldn't be a requirement for this patch, don't get me
> wrong!
>
> Production systems have often very large pg_attribute tables, which makes this
> a fairly worthwhile optimization target.
>
> I wonder if we could teach the bootstrap and deform logic about bool:1 or such
> and have it generate the right code for that.

It would be good to do something about this. I think the reduction in
the sizeof(FormData_pg_attribute) would be good for using less memory
in RelationData.rd_att.

David

Attachment Content-Type Size
bench_v4.txt text/plain 7.0 KB
v4-0001-Move-TupleDesc.attrs-out-of-line.patch application/octet-stream 6.4 KB
v4-0002-Introduce-CompactAttribute-array-in-TupleDesc.patch application/octet-stream 53.2 KB
v4-0003-Optimize-alignment-calculations-in-tuple-form-def.patch application/octet-stream 18.4 KB
v4-0004-Remove-pg_attribute.attcacheoff-column.patch application/octet-stream 11.5 KB
v4-0005-Try-a-larger-CompactAttribute-struct-without-bitf.patch application/octet-stream 1.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2024-10-09 14:05:16 Re: Allow default \watch interval in psql to be configured
Previous Message Heikki Linnakangas 2024-10-09 13:59:35 Re: \watch 0 or \watch 0.00001 doesn't do what I want