Re: WIP: Generic functions for Node types using generated metadata

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: Generic functions for Node types using generated metadata
Date:	2019-10-02 20:46:30
Message-ID:	20191002204630.6scxtimq5xqzk64k@alap3.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2019-10-02 14:47:22 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Wed, Oct 2, 2019 at 12:03 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> I'm afraid that's going to be a deal-breaker for lots of people.
> >> It's fine for prototyping the idea but we'll need to find another
> >> implementation before we can move to commit.
>
> > Why do you think it will be a deal-breaker for lots of people? I mean,
> > if we get this to a point where it just requires installing some
> > reasonably modern version of LLVM, I don't see why that's worse than
> > having to do the same thing for say, Perl if you want to build
> > --with-perl, or Python if you want to build --with-python, or bison or
> > lex if you want to change the lexer and parser. One more build-time
> > dependency shouldn't be a big deal, as long as we don't need a really
> > specific version. Or am I missing something?
>
> Think it's available/trivially installable on e.g. Windows? I'm not
> convinced. In any case, our list of build requirements is depressingly
> long already.

As I wrote nearby, it's just a download of an installer away.

> The existing expectation is that we make our build tools in Perl.
> I'm sure Andres doesn't want to write a C parser in Perl, but
> poking around suggests that there are multiple options already
> available in CPAN. I'd much rather tell people "oh, there's YA
> module you need to get from CPAN" than "figure out how to install
> version XXX of LLVM".

As far as I can tell they're all at least one of
1) written in C, so also have build requirements (obviously a shorter
build time)
2) not very good (including plenty of unsupported C, not to speak of
various optional extensions we use, not having preprocessor support,
...)
3) unmaintained for many years.

Did you find any convincing ones?

Whereas libclang / llvm seem very unlikely to be unmaintained anytime
soon, given the still increasing adoption. It's also much more complete
than any such Perl module will realistically be.

> The other direction we could plausibly go in is to give up the
> assumption that parsenodes.h and friends are the authoritative
> source of info, and instead declare all these structs in a little
> language based on JSON or what-have-you, from which we generate
> parsenodes.h along with the backend/nodes/ files.

I think this should really work for more than just parsenodes (if you
mean primnodes etc with "friends"), and even more than just node types
(if you mean all the common node types with "friends"). For other Node
types we already have to have pretty complete out/readfuncs support
(e.g. to ship plans to parallel workers). And there are plenty of other
cases where we can use that information, e.g. as done in the prototype
attached upthread:

On 2019-09-19 22:18:57 -0700, Andres Freund wrote:
> Using that metadata one can do stuff that wasn't feasible before. As an
> example, the last patch in the series implements a version of
> copyObject() (badly named copyObjectRo()) that copies an object into a
> single allocation. That's quite worthwhile memory-usage wise:
>
> PREPARE foo AS SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, false AS relhasoids, c.relispartition, pg_catalog.array_to_string(c.reloptions || array(select 'toast.' || x from pg_catalog.unnest(tc.reloptions) x), ', '), c.reltablespace, CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, c.relpersistence, c.relreplident, am.amname FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_class tc ON (c.reltoastrelid = tc.oid) LEFT JOIN pg_catalog.pg_am am ON (c.relam = am.oid) WHERE c.oid = '1259';
> EXECUTE foo ;
>
> With single-allocation:
> CachedPlan: 24504 total in 2 blocks; 664 free (0 chunks); 23840 used
> Grand total: 24504 bytes in 2 blocks; 664 free (0 chunks); 23840 used
>
> Default:
> CachedPlan: 65536 total in 7 blocks; 16016 free (0 chunks); 49520 used
> Grand total: 65536 bytes in 7 blocks; 16016 free (0 chunks); 49520 used
>
> And with a bit more elbow grease we could expand that logic so that
> copyObject from such a "block allocated" node tree would already know
> how much memory to allocate, memcpy() the source over to the target, and
> just adjust the pointer offsets.
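
To illustrate the idea quoted above, here is a minimal sketch, not the code from the patch series. It assumes a toy node type (DemoSort), a fixed 8-byte alignment (DEMO_MAXALIGN), and plain malloc() instead of memory contexts; all of those names are hypothetical. The first function copies a node plus its out-of-line array into one allocation; the second copies such a block with a single memcpy() and then rebases the one internal pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical alignment for out-of-line data; an assumption of this sketch. */
#define DEMO_MAXALIGN 8

/* Toy stand-in for a Node with one out-of-line array (hypothetical type). */
typedef struct DemoSort
{
	int		numCols;		/* number of entries in sortColIdx */
	int	   *sortColIdx;		/* points into the same block after copying */
} DemoSort;

static size_t
demo_align_up(size_t n)
{
	return (n + DEMO_MAXALIGN - 1) / DEMO_MAXALIGN * DEMO_MAXALIGN;
}

/* Copy the node and its array into a single malloc'd block. */
static DemoSort *
demo_copy_single_alloc(const DemoSort *src, size_t *size)
{
	size_t		arr_off = demo_align_up(sizeof(DemoSort));
	size_t		arr_len = (size_t) src->numCols * sizeof(int);
	size_t		total = arr_off + arr_len;
	char	   *block = malloc(total);
	DemoSort   *dst = (DemoSort *) block;

	if (block == NULL)
		return NULL;
	dst->numCols = src->numCols;
	dst->sortColIdx = (int *) (block + arr_off);
	if (arr_len > 0)
		memcpy(dst->sortColIdx, src->sortColIdx, arr_len);
	*size = total;
	return dst;
}

/*
 * Copy an already block-allocated node: one allocation of known size,
 * one memcpy(), then adjust the internal pointer by its offset.
 */
static DemoSort *
demo_copy_block(const DemoSort *src, size_t size)
{
	char	   *block = malloc(size);
	DemoSort   *dst = (DemoSort *) block;
	ptrdiff_t	off;

	if (block == NULL)
		return NULL;
	memcpy(block, src, size);
	/* offset of the array within the source block */
	off = (const char *) src->sortColIdx - (const char *) src;
	dst->sortColIdx = (int *) (block + off);
	return dst;
}
```

A real implementation would need the generated metadata to find every pointer field; here the single pointer is hard-coded.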

And I'm currently prototyping implementing the
serialization/deserialization of UNDO records into a compressed format
using very similar information, to resolve the impasse that one side
(among others, Robert) wants efficient and meaningful compression of
undo records, while not believing a general compression library can
provide that, and the other side (most prominently Heikki) doesn't want
to limit the format of undo records that much.

> I kind of suspect that we'll be forced into that eventually anyway,
> because one thing you are not going to get from LLVM or a pre-existing
> Perl C parser is anything but the lowest-common-denominator version of
> what's in the structs. I find it really hard to believe that we won't
> need some semantic annotations in addition to the bare C struct
> declarations. As an example: in some cases, pointer values in a Node
> struct point to arrays of length determined by a different field in
> the struct. How do we tie those together without magic?

I did solve that in the patchset posted here by replacing such "bare"
arrays with an array type that includes both the length, and the members
(without losing the type, using some macro magic). I think that
approach has some promise, not just for this - I'd greatly appreciate
thoughts on the part of the messages upthread (copied at the bottom, for
convenience).

But also:

> I think there has to be an annotation marking the connection, and
> we're not going to find that out from LLVM.

libclang does allow access to macro "expansions" and also parsing of
comments. So I don't think it'd be a problem to just recognize such
connections if we added a few macros for that purpose.
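
As a rough sketch of what such an annotation macro might look like: the macro name PG_ARRAY_LEN, the annotation string format, and the DemoSort struct below are all hypothetical, not part of the posted patchset. This variant expands to clang's annotate attribute, which libclang exposes on the field's cursor; a generator could equally look at the macro expansion itself. On other compilers the macro expands to nothing, so the struct layout is unaffected.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical annotation: ties a pointer field to the field holding its
 * length.  Only clang sees the attribute; elsewhere it expands to nothing.
 */
#if defined(__clang__)
#define PG_ARRAY_LEN(lenfield) \
	__attribute__((annotate("array_len:" #lenfield)))
#else
#define PG_ARRAY_LEN(lenfield)
#endif

/* Toy stand-in for a plan node with length-dependent arrays. */
typedef struct DemoSort
{
	int			numCols;							/* number of sort-key columns */
	short	   *sortColIdx PG_ARRAY_LEN(numCols);	/* numCols entries */
	unsigned   *sortOperators PG_ARRAY_LEN(numCols);	/* numCols entries */
} DemoSort;

/* Ordinary code keeps using the fields exactly as before. */
static long
demo_sum_cols(const DemoSort *node)
{
	long		sum = 0;

	for (int i = 0; i < node->numCols; i++)
		sum += node->sortColIdx[i];
	return sum;
}
```

One drawback mentioned upthread still applies to this approach: the annotation is only metadata, so it cannot by itself drive automated asserts at runtime.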

On 2019-09-19 22:18:57 -0700, Andres Freund wrote:
> > The one set of fields this currently can not deal with is the various
> > arrays that we directly reference from nodes. For e.g.
> >
> > typedef struct Sort
> > {
> > Plan plan;
> > int numCols; /* number of sort-key columns */
> > AttrNumber *sortColIdx; /* their indexes in the target list */
> > Oid *sortOperators; /* OIDs of operators to sort them by */
> > Oid *collations; /* OIDs of collations */
> > bool *nullsFirst; /* NULLS FIRST/LAST directions */
> > } Sort;
> >
> > the generic code has no way of knowing that sortColIdx, sortOperators,
> > collations, nullsFirst are all numCols long arrays.
> >
> > I can see various ways of dealing with that:
> >
> > 1) We could convert them all to lists, now that we have fast arbitrary
> > access. But that'd add a lot of indirection / separate allocations.
> >
> > 2) We could add annotations to the source code, to declare that
> > association. That's probably not trivial, but wouldn't be that hard -
> > one disadvantage is that we probably couldn't use that declaration
> > for automated asserts etc.
> >
> > 3) We could introduce a few macros to create array types that include the
> > length of the members. That'd duplicate the length for each of those
> > arrays, but I have a bit of a hard time believing that that's a
> > meaningful enough overhead.
> >
> > I'm thinking of a macro that'd be used like
> > ARRAY_FIELD(AttrNumber) *sortColIdx;
> > that'd generate code like
> > struct
> > {
> > size_t nmembers;
> > AttrNumber members[FLEXIBLE_ARRAY_MEMBER];
> > } *sortColIdx;
> >
> > plus a set of macros (or perhaps inline functions + macros) to access
> > them.
>
> I've implemented 3), which seems to work well. But it's a fair bit of
> macro magic.
>
> Basically, one can define a type to be array supported, by once using
> PGARR_DEFINE_TYPE(element_type); which defines a struct type that has a
> members array of type element_type. After that variables of the array
> type can be defined using PGARR(element_type) (as members in a struct,
> variables, ...).
>
> Macros like pgarr_size(arr), pgarr_empty(arr), pgarr_at(arr, at) can be
> used to query (and in the last case also modify) the array.
>
> pgarr_append(element_type, arr, newel) can be used to append to the
> array. Unfortunately I haven't figured out a satisfying way to write
> pgarr_append() without specifying the element_type. Either there's
> multiple-evaluation of any of the types (for checking whether the
> capacity needs to be increased), only `newel`s that can have their
> address taken are supported (so it can be passed to a helper function),
> or compiler specific magic has to be used (__typeof__ solves this
> nicely).
>
> The array can be allocated (using pgarr_alloc_ro(type, capacity)) so
> that a certain number of elements fit inline.
>
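
For readers who want something concrete, below is a minimal sketch of macros in the spirit of the interface described above. It is not the implementation from the patchset: the demoarr_* names, the shared header struct, the growth policy, and the use of realloc() are all assumptions made for illustration. As in the description, the append macro takes the element type explicitly, which avoids both __typeof__ and taking the address of the new element. The element type has to be a single identifier because it is pasted into the struct tag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header shared by all array types (hypothetical layout). */
typedef struct demoarr_hdr
{
	size_t		size;			/* number of elements in use */
	size_t		capacity;		/* number of elements allocated */
} demoarr_hdr;

/* Name of the array type for a given element type. */
#define DEMOARR(elemtype) struct demoarr_##elemtype

/* Define the array type once per element type. */
#define DEMOARR_DEFINE_TYPE(elemtype) \
	DEMOARR(elemtype) { demoarr_hdr hdr; elemtype members[]; }

#define demoarr_size(arr)	((arr) == NULL ? (size_t) 0 : (arr)->hdr.size)
#define demoarr_empty(arr)	(demoarr_size(arr) == 0)
/* Yields an lvalue, so it can be used to read and to modify. Unchecked. */
#define demoarr_at(arr, i)	((arr)->members[(i)])

/* Type-agnostic helper: make room for one more element. */
static void *
demoarr_reserve(void *arr, size_t members_off, size_t elemsize)
{
	demoarr_hdr *hdr = arr;
	size_t		newcap;

	assert(elemsize > 0);
	if (hdr != NULL && hdr->size < hdr->capacity)
		return arr;
	newcap = (hdr == NULL || hdr->capacity == 0) ? 4 : hdr->capacity * 2;
	hdr = realloc(arr, members_off + newcap * elemsize);
	if (hdr == NULL)
		abort();				/* sketch only: no real error handling */
	if (arr == NULL)
		hdr->size = 0;
	hdr->capacity = newcap;
	return hdr;
}

/* Append; `arr` is evaluated more than once, `newel` exactly once. */
#define demoarr_append(elemtype, arr, newel) \
	do { \
		(arr) = demoarr_reserve((arr), \
								offsetof(DEMOARR(elemtype), members), \
								sizeof(elemtype)); \
		(arr)->members[(arr)->hdr.size++] = (newel); \
	} while (0)

/* Example use as a struct member (DemoSort is a hypothetical node type). */
DEMOARR_DEFINE_TYPE(int);

typedef struct DemoSort
{
	int			dummy;
	DEMOARR(int) *sortColIdx;
} DemoSort;
```

Because the length lives inside the array itself, generic code only needs to know that a field is a DEMOARR to copy or serialize it, which is the property the discussion is after.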

Greetings,

Andres Freund

In response to

Re: WIP: Generic functions for Node types using generated metadata at 2019-10-02 18:47:22 from Tom Lane

Responses

Re: WIP: Generic functions for Node types using generated metadata at 2019-10-03 14:18:04 from Robert Haas
