Re: range_agg

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Fetter <david(at)fetter(dot)org>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: range_agg
Date: 2019-07-06 06:33:55
Message-ID: CAFj8pRCUQxPwkoGqDfN-BSqCXTrc6KVBJRP_yQRsthOdx=uC7A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

pá 5. 7. 2019 v 19:22 odesílatel Paul A Jungwirth <
pj(at)illuminatedcomputing(dot)com> napsal:

> On Fri, Jul 5, 2019 at 4:31 AM Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
> wrote:
> > The first issue is unstable regress tests - there is a problem with
> opr_sanity
>
> I would prefer to avoid needing to add anything to opr_sanity really.
> A multirange would let me achieve that I think. But otherwise I'll add
> the ordering. Thanks!
>
> > + rangeTypeId = get_fn_expr_argtype(fcinfo->flinfo, 1);
> > + if (!type_is_range(rangeTypeId))
> > + {
> > + ereport(ERROR, (errmsg("range_agg must be called with a
> range")));
> > + }
>
> I think this was left over from my attempts to support user-defined
> ranges. I believe I can remove it now.
>
> > +# Making 2- and 3-param range_agg polymorphic is difficult
> > +# because it would take an anyrange and return an anyrange[],
> > +# which doesn't exist.
> > +# As a workaround we define separate functions for each built-in range
> type.
> > +# This is what causes the mess in
> src/test/regress/expected/opr_sanity.out.
> > +{ oid => '8003', descr => 'aggregate transition function',
> >
> > This is strange. Does it means so range_agg will not work with custom
> range types?
>
> You would have to define your own range_agg with your own custom type
> in the signature. I gave instructions about that in my range_agg
> extension (https://github.com/pjungwir/range_agg) but it's not as
> easy with range_agg in core because I don't think you can define a
> custom function that is backed by a built-in C function. At least I
> couldn't get that to work.
>

I understand so anybody can implement own function, but this is fundamental
problem in implementation.

In this case I prefer to start implement anyrangearray type first. It is
not hard work.

>
> The biggest argument for a multirange to me is that we could have
> anymultirange, with similar rules as anyarray. That way I could have
> `range_agg(r anyrange) RETURNS anymultirange`. [1] is a conversation
> where I asked about doing this before multiranges were suggested. Also
> I'm aware of your own recent work on polymorphic types at [2] but I
> haven't read it closely enough to see if it would help me here. Do you
> think it applies?
>

I don't see any difference between anymultirange or anyrangearray - it is
just name

ANSI SQL has a special kind for this purpose - "set". It is maybe better,
because PostgreSQL's arrays revoke a idea of ordering, but it is hard to do
some generic order for ranges.

Functionality is important - if you can do some special functionality, that
cannot be implemented as "array of ranges", then the new type is necessary.
If all functionality can be covered by array of ranges, then there is not
necessity of new type.

I don't talk about polymorphic types - probably it needs one -
"anymultirange" or "anyrangearray".

If some operation can be done smarter with new type, then I am ok for new
type, if not, then we should to use array of ranges.

> > I am not sure about implementation. I think so accumulating all ranges,
> sorting and processing on the end can be memory and CPU expensive.
>
> I did some research and couldn't find any algorithm that was better
> than O(n log n), but if you're aware of any I'd like to know about it.
> Assuming we can't improve on the complexity bounds, I think a sort +
> iteration is desirable because it keeps things simple and benefits
> from past & future work on the sorting code. I care more about
> optimizing time here than RAM, especially since we'll use this in FK
> checks, where the inputs will rarely be more than a few and probably
> never in the millions.
>
> We especially want to avoid an O(n^2) algorithm, which would be easy
> to stumble into if we naively accumulated the result as we go. For
> example if we had multiranges you could imagine an implementation that
> just did `result + r` for all inputs r. But that would have the same
> n^2 pattern as looping over strcat.
>
> We could use a tree to keep things sorted and accumulate as we go, but
> the implementation would be more complicated and I think slower.
>

I don't think - working with large arrays is slow, due often cache miss. I
understand so it depends on data, but in this area, I can imagine
significant memory reduction based on running processing.

array builder doesn't respect work_men, and I think so any different way is
safer.

Regards

Pavel

>
> Thanks for the review!
>
> Paul
>
> [1]
> https://www.postgresql.org/message-id/CA%2BrenyVOjb4xQZGjdCnA54-1nzVSY%2B47-h4nkM-EP5J%3D1z%3Db9w%40mail.gmail.com
>
> [2]
> https://www.postgresql.org/message-id/CAFj8pRDna7VqNi8gR+Tt2Ktmz0cq5G93guc3Sbn_NVPLdXAkqA@mail.gmail.com
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2019-07-06 07:05:06 Re: Comment typo in tableam.h
Previous Message Pavel Stehule 2019-07-06 06:09:40 Re: range_agg