Re: Partial aggregates pushdown

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, "Fujii(dot)Yuki(at)df(dot)MitsubishiElectric(dot)co(dot)jp" <Fujii(dot)Yuki(at)df(dot)mitsubishielectric(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, vignesh C <vignesh21(at)gmail(dot)com>, Alexander Pyhalov <a(dot)pyhalov(at)postgrespro(dot)ru>
Subject: Re: Partial aggregates pushdown
Date: 2024-08-22 18:31:11
Message-ID: ea31e84d-609d-47f5-9490-2d1c2551d587@vondra.me
Lists: pgsql-hackers

On 8/22/24 19:22, Bruce Momjian wrote:
> On Wed, Aug 21, 2024 at 04:59:12PM +0200, Tomas Vondra wrote:
>> On 8/20/24 20:41, Bruce Momjian wrote:
>>> SELECT (oid, relname) FROM pg_class LIMIT 1;
>>> row
>>> ---------------------
>>> (2619,pg_statistic)
>>>
>>> SELECT pg_typeof((oid, relname)) FROM pg_class LIMIT 1;
>>> pg_typeof
>>> -----------
>>> record
>>>
>>> SELECT pg_typeof(oid) FROM pg_class LIMIT 1;
>>> pg_typeof
>>> -----------
>>> oid
>>>
>>> SELECT pg_typeof(relname) FROM pg_class LIMIT 1;
>>> pg_typeof
>>> -----------
>>> name
>>>
>>
>> How would this help with the AVG(bigint) case? We don't have int128 as
>> an SQL type, so what would be part of the record? Also, which part of
>> the code would produce the record? If the internal state is "internal",
>> that would probably need to be something aggregate-specific, and that's
>> kinda what this patch series is adding, no?
>>
>> Or am I missing some cases where the record would make it work?
>
> Right now, my goal in this thread is to try to concretely explain what
> is being proposed. Therefore, I think I need to go back and make four
> categories instead of two:
>
> 1. cases like MAX(int), where we return only one value, and the FDW
> return value is an existing data type, e.g., int
>
> 2. cases like AVG(int) where we return multiple FDW values of the same
> type and can use an array, e.g., bigint array
>
> 3. cases like AVG(bigint) where we return multiple FDW values of the
> same type (or can), but one of the FDW return values is not an
> existing data type, e.g. int128
>
> 4. cases like AVG(interval) where we return multiple FDW values of
> different types, e.g. interval and an integral count
>
> For #1, all MAX cases have aggregate input parameters the same as the
> FDW return types (aggtranstype):
>
> SELECT proargtypes[0]::regtype, aggtranstype::regtype
> FROM pg_aggregate a JOIN pg_proc p ON a.aggfnoid = p.oid
> WHERE proname = 'max' AND proargtypes[0] != aggtranstype;
>
> proargtypes | aggtranstype
> -------------+--------------
>
> For #2-4, we have for AVG:
>
> SELECT proargtypes[0]::regtype, aggtranstype::regtype
> FROM pg_aggregate a JOIN pg_proc p ON a.aggfnoid = p.oid
> WHERE proname = 'avg';
>
> proargtypes | aggtranstype
> ------------------+--------------------
> 3-> bigint | internal
> 2-> integer | bigint[]
> 2-> smallint | bigint[]
> 3-> numeric | internal
> 2-> real | double precision[]
> 2-> double precision | double precision[]
> 4-> interval | internal
>
> You can see which AVG items fall into which categories. It seems we
> have #1 and #2 handled cleanly in the patch.
>

Agreed.

> My question is related to #3 and #4. For #3, if we are going to be
> building infrastructure to handle passing int128 for AVG, wouldn't it
> be wiser to create an int128 type and an int128 array type, and then
> use method #2 to handle those, rather than creating custom code just to
> read/write int128 values for FDW aggregate passing alone?
>

Yep, adding int128 as a data type would extend this to aggregates that
store state as int128 (or array of int128).
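
To make that concrete (just a sketch, in the same style as the catalog
queries above), the avg() variants still relying on an "internal"
transition state, and the serial/deserial functions they use today, can
be listed with:

    SELECT proargtypes[0]::regtype, aggserialfn, aggdeserialfn
    FROM pg_aggregate a JOIN pg_proc p ON a.aggfnoid = p.oid
    WHERE proname = 'avg' AND aggtranstype = 'internal'::regtype;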

> For #4, can we use or improve the RECORD data type to handle it --- that
> seems preferable to creating custom FDW aggregate passing code.
>
> I know the open question was whether we should create custom FDW
> aggregate passing functions or custom data types for FDW aggregate
> passing, but I am asking if we can improve existing facilities, like
> int128 or record passing, to reduce the need for some of these.
>

But which code would produce the record? AFAIK it can't happen in some
generic executor code, because that only sees "internal" for each
aggregate. The exact structure of the aggstate is private to the code of
each aggregate - the record would have to be built there, no?

I imagine we'd add this for each aggregate as a new pair of functions to
build/parse the record, but that's kinda the serial/deserial way we
discussed earlier.

Or are you suggesting we'd actually say:

CREATE AGGREGATE xyz(...) (
    STYPE = record,
    ...
)

or something like that? I have no idea if that would work; maybe it
would. In a way it'd not be that different from the "custom data type"
approach, except that it doesn't actually need a custom data type
(assuming we know how to serialize such "generic" records - I'm not
familiar enough with this code to answer that).
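
FWIW, a quick way to check whether any built-in aggregate already uses
the record pseudo-type as its transition type (just a sanity-check
query; I'd expect it to return no rows, but I haven't verified that):

    SELECT aggfnoid
    FROM pg_aggregate
    WHERE aggtranstype = 'record'::regtype;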

The reason why I found the "custom data type" approach interesting is
that it sets clear guarantees / expectations about the stability of the
output between versions. For serial/deserial we make no guarantees; the
format can change even in a minor release. But here we need something
stable even across major releases, to allow pushdown when querying an
older server. We have that strong expectation for data types, and
everyone knows it.

But I guess if we are OK with just sending the array aggregates, or even
just plain scalar types, we kinda shift this responsibility to the
aggregates, because while the data type in/out functions will stay the
same, we require the aggregate not to switch to some other state.
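
As a trivial illustration of the kind of stability I mean (assuming
avg(int4) really does keep its state as a bigint[] of {count, sum}),
such an array state already has a stable textual form through the
ordinary array I/O functions:

    SELECT '{3,42}'::bigint[]::text;   -- count = 3, sum = 42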

regards

--
Tomas Vondra
