Re: Faster distinct query?

From: Ryan Booz <ryan(at)timescale(dot)com>
To: Israel Brewster <ijbrewster(at)alaska(dot)edu>
Cc: PostgreSQL Mailing Lists <pgsql-general(at)postgresql(dot)org>
Subject: Re: Faster distinct query?
Date: 2021-09-22 22:05:19
Message-ID: CADyMnExT5aVP2Jsn-Ve_q6cuXxzOvO2N0kubHHK2am1nwsNEzQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as
one of the execution nodes. I also realize I was making a few assumptions
about your data, are channels shared among stations, or are all channels
unique (like an ID) per station? That would impact the index and approach.

Something like:

station | channel
----------|-----------
1 1
1 2
2 3
2 4

or:
station | channel
----------|-----------
1 1
1 2
2 1
2 2

On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster <ijbrewster(at)alaska(dot)edu>
wrote:

> On Sep 22, 2021, at 1:50 PM, Ryan Booz <ryan(at)timescale(dot)com> wrote:
>
> Cool. I'd be interested to see the explain on it if you ever try it again.
> On that cardinality, I'd expect it to be really fast, so I'm interested to
> see if the (SkipScan) nodes were actually used.
>
>
> With timescaledb extension installed, the explain is what I posted in the
> original message (https://explain.depesz.com/s/mtxB#html) Without
> timescaledb installed, the explain looks the same, except it takes twice as
> long to run.
>
> Unless I missed something in your message, i.e. some sort of tweak to the
> query to get it to use the timescaledb features?
>
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell: 907-328-9145
>
>
> On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster <ijbrewster(at)alaska(dot)edu>
> wrote:
>
>>
>> On Sep 22, 2021, at 12:49 PM, Ryan Booz <ryan(at)timescale(dot)com> wrote:
>>
>> [Timescale Dev Advocate here]
>> I realize this might not be the most accepted answer (could be
>> interpreted as trying to "sell" something), but feels like an opportunity
>> to talk about DISTINCT queries and opportunities. Because you have that
>> index, Timescale 2.3 added a "Skip Scan" query planner node that works on
>> regular BTree indexes (it doesn't have to be time-series/TimescaleDB
>> Hypertable data at all). In this case, your distinct query would likely run
>> in a few milliseconds based on the counts you mention (170 stations, 3
>> channels per station), and then the outer aggregation would do the GROUP
>> BY. So, you **could** add the TimescaleDB extension to your database (or a
>> copy of) and give it a try. You don't actually need to use any TimescaleDB
>> features otherwise.
>>
>>
>> I had actually already done that, as I was considering, in spite of past
>> negative experiences with timescaledb, experimenting with it on this DB to
>> see if it worked any better with this data. Out of curiosity, I tried
>> removing the timescaledb extension, whereupon the query in question took
>> roughly twice as long. So you are right that installing timescaledb speeds
>> things up, even when not using any timescaledb specific functions. So that
>> was a good call. Thanks!
>>
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell: 907-328-9145
>>
>>
>>
>> - A writeup of why this kind of DISTINCT query is slow in PostgreSQL
>> (for now) and what we did to overcome it:
>> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
>> - Plans for a similar feature in PostgreSQL proper that we'd totally
>> support but hasn't made forward progress yet:
>> https://commitfest.postgresql.org/19/1741/
>>
>> Anyway, it might be worth a shot. HTH
>>
>> Ryan B
>>
>> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <
>> david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
>>
>>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis(at)entrata(dot)com>
>>> wrote:
>>>
>>>> In the future, please share the plan returned by explain analyze, and
>>>> some data about how many rows in the involved tables,
>>>>
>>>
>>> I believe we consider it acceptable to link to an explain viewer, which
>>> is what the OP did. Reading explain output in email has its own
>>> challenges, and I'd rather have the website than a text attachment.
>>>
>>>
>>>> How does the below work? It should do a very simple index scan only,
>>>> then aggregate the relative few rows after the fact.
>>>>
>>>> select station, array_agg(distinct(channel)) as channels
>>>> FROM(
>>>> SELECT station,channel FROM data GROUP BY station,channel
>>>> ) AS sub
>>>> group by station;
>>>>
>>>
>>> Yeah, am pondering this too, though seems like the queries should be
>>> identical so the plan/execution should be the same either way.
>>>
>>>
>>>> If there is correlation between station & channel, then you might look
>>>> at creating a multivariate statistics object and analyzing the table so the
>>>> planner can make better choices
>>>>
>>>
>>> There is no where clause so I'm doubtful there is much to be gained
>>> going down this path. The Index-Only scan seems like an optimal way to
>>> obtain this data and the existing query already does that. The aggregation
>>> path might vary though it seems like that shouldn't be the case here.
>>>
>>> David J.
>>>
>>
>>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Israel Brewster 2021-09-22 22:22:47 Re: Faster distinct query?
Previous Message Israel Brewster 2021-09-22 21:53:48 Re: Faster distinct query?