Re: Parallel Aggregates for string_agg and array_agg

From: Tomer Praizler <tomer(dot)praizler(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Parallel Aggregates for string_agg and array_agg
Date: 2019-03-13 23:43:12
Message-ID: CAD=kdR8qQehNXLCXj5tM92ZWh8cB4OFVNNQjC-T6ZRgZiOghtg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Thank you, David!
I guess I wasn't 100% clear, an example will do the trick.

So let's take for example the following records:

timestamp port super_name name ids count
2019-03-06 10:00:00 22 ssh abc 1,2 10
2019-03-06 10:00:00 22 ssh xyz 3,4
20
2019-03-06 10:00:00 22 ssh abc 5,6 30
2019-03-06 10:00:00 22 ssh abc 7,8 40
2019-03-06 10:00:00 22 ssh foo 9,10 50

The primary key is combined of 6 columns in this example: (timestamp, port,
super_name), I have around 8 values for every key.
My table is partitioned by day, and I have around 2-3M rows in each
partition. I am trying to aggregate on the last 7 days, which is around 23M
rows.

my query looks something like this:

SELECT x.timestamp, x.port, x.super_name, max(x.timestamp) AS last_seen,
coalesce(array_length(array_merge_agg(x.ids), 1), 0) AS my_count,
coalesce(array_length(array_agg(DISTINCT x.name), 1), 0) AS names,
sum(x.count) AS final_count
FROM x
GROUP BY x.timestamp, x.port, x.super_name
ORDER BY sum(x.count)

This result in a plan without parallel execution because of the array_agg
on a string field. when I remove it the query planner spawns a
parallel execution plan.
It reduces the time from 5 minutes to around 1 minute which is also a lot.
(if there is any idea on how to optimize farther please help:) btw,
hardware resources is not a problem)

Thanks!

On Thu, 14 Mar 2019 at 01:02 David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
wrote:

> On Thu, 14 Mar 2019 at 09:53, Tomer Praizler <tomer(dot)praizler(at)gmail(dot)com>
> wrote:
> > Just wanted to check if there is any magic that can be done to make this
> happen.
> > The closest thing I ran into was this guy patch -
> https://www.postgresql.org/message-id/CAKJS1f8LV7AT%3DAAhdYGKtGrGkSkEgO6C_SW2Ztz1sR3encisqw%40mail.gmail.com
>
> If you're able to change the SQLs and point them at some other
> aggregate function, then you could create an extension with the code
> from that patch and CREATE AGGREGATE your own version of those
> functions using the combine and [de]serial functions. I know that
> creating your own extension is pretty out there for the novice mailing
> list, but if I thought of another easier way if have told you that
> instead.
>
> > I didn't try it, but wanted to check if there is any way to deal with
> the need to aggregate on a string, by creating an array while doing a group
> by? Should I manipulate my data to be able to do it, maybe by generating an
> int out of those strings? Any other idea?
>
> Well, array_agg is non-parallel too, so don't see how having an array
> and converting that into a string later would help. The other
> aggregates that are parallel aware don't really let you get individual
> values back out of the aggregated state, so there's not really a way
> to turn that into an array or a string containing all the values that
> were aggregated.
>
> Does your use-case really need parallel versions of these aggregates?
> I imagined that these would perform best when the underlying scan had
> to skip lots of values, or when the aggregate had a FILTER (WHERE ...)
> clause. Maybe if the filtering can be done by using an index then
> performance would up to the level you need?
>
> If you could share a simplified version of your use case perhaps
> someone can suggest a way to speed it up another way.
>
> --
> David Rowley http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>

In response to

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message David Rowley 2019-03-14 00:16:59 Re: Parallel Aggregates for string_agg and array_agg
Previous Message David Rowley 2019-03-13 23:01:48 Re: Parallel Aggregates for string_agg and array_agg