Re: Faster distinct query?

From: Israel Brewster <ijbrewster(at)alaska(dot)edu>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, Michael Lewis <mlewis(at)entrata(dot)com>, PostgreSQL Mailing Lists <pgsql-general(at)postgresql(dot)org>
Subject: Re: Faster distinct query?
Date: 2021-09-23 01:21:53
Message-ID: 3D6B751F-C6F4-4C56-8DF8-3F88F5E2F78D@alaska.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

> On Sep 22, 2021, at 5:10 PM, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
> On Thu, 23 Sept 2021 at 08:27, David G. Johnston
> <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
>>
>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis(at)entrata(dot)com> wrote:
>>> If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices
>>
>>
>> There is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.
>
> ndistinct extended statistics would be used to estimate the number of
> groups in the GROUP BY clause for the version of the query that
> performs GROUP BY station, channel. We've not seen the EXPLAIN
> ANALYZE for that query, so don't know if there's any use for extended
> statistics there. However, if the planner was to think there were
> more groups than there actually are, then it would be less inclined to
> do parallelise the GROUP BY. I think writing the query in such a way
> that allows it to be parallelised is likely going to result in some
> quite good performance improvements. i.e:
>
> select station, array_agg(channel) as channels
> FROM(
> SELECT station,channel FROM data GROUP BY station,channel
> ) AS sub
> group by station;
>

Ah, yes indeed. That version runs in about 30 seconds rather than 5 minutes! See the explain analyze output here: https://explain.depesz.com/s/L5Bf <https://explain.depesz.com/s/L5Bf> It looks more complicated, but being able to run parallel definitely makes a difference, and there may be some other improvements in there that I’m not aware of as well!

Still not quite fast enough for real-time queries, but certainly fast enough to keep a materialized view updated.

And this is why I love postgresql and this community - when something isn’t working as well as I would like, there is usually a way to improve it drastically :-)

Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

> which is just the same as Michael's version but without DISTINCT.
> Also, Tom's point about lots of heap fetches is going to count for
> quite a bit too, especially so if I/O plays a large part in the total
> query time.
>
> David

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message David G. Johnston 2021-09-23 01:28:32 Re: Faster distinct query?
Previous Message David Rowley 2021-09-23 01:10:38 Re: Faster distinct query?