Re: Performance with high correlation in group by on PK

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Alban Hertroys <haramrae(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Performance with high correlation in group by on PK
Date: 2017-09-07 22:23:09
Message-ID: CAMkU=1zUOM5GOXZKvvFbtgU53s7U5vBAwKrkJTzzvZFcbzjh-g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Aug 29, 2017 at 1:20 AM, Alban Hertroys <haramrae(at)gmail(dot)com> wrote:

> On 28 August 2017 at 21:32, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On Mon, Aug 28, 2017 at 5:22 AM, Alban Hertroys <haramrae(at)gmail(dot)com>
> wrote:
> >>
> >> Hi all,
> >>
> >> It's been a while since I actually got to use PG for anything serious,
> >> but we're finally doing some experimentation @work now to see if it is
> >> suitable for our datawarehouse. So far it's been doing well, but there
> >> is a particular type of query I run into that I expect we will
> >> frequently use and that's choosing a sequential scan - and I can't
> >> fathom why.
> >>
> >> This is on:
> >>
> >>
> >> The query in question is:
> >> select "VBAK_MANDT", max("VBAK_VBELN")
> >> from staging.etl00001_vbak
> >> group by "VBAK_MANDT";
> >>
> >> This is the header-table for another detail table, and in this case
> >> we're already seeing a seqscan. The thing is, there are 15M rows in
> >> the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
> >> "VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
> >> we only have 1 at the moment!).
> >
> >
> > You need an "index skip-scan" or "loose index scan". PostgreSQL doesn't
> > currently detect and implement them automatically, but you can use a
> > recursive CTE to get it to work. There are some examples at
> > https://wiki.postgresql.org/wiki/Loose_indexscan
>
> Thanks Jeff, that's an interesting approach. It looks very similar to
> correlated subqueries.
>
> Unfortunately, it doesn't seem to help with my issue. The CTE is
> indeed fast, but when querying the results from the 2nd level ov the
> PK with the CTE results, I'm back at a seqscan on pdw00002_vbak again.
>

Something like this works:

create table foo as select trunc(random()*5) as col1, random() as col2 from
generate_series(1,100000000);
create index on foo (col1, col2);
vacuum analyze foo;

with recursive t as (
select * from (select col1, col2 from foo order by col1 desc, col2 desc
limit 1) asdfsaf
union all
select
(select col1 from foo where foo.col1 < t.col1 order by col1 desc, col2
desc limit 1) as col1,
(select col2 from foo where foo.col1 < t.col1 order by col1 desc, col2
desc limit 1) as col2
from t where t.col1 is not null
)
select * from t where t is not null;

It is pretty ugly that you need one subquery in the select list for each
column to be returned. Maybe someone can find a way to avoid that part. I
tried using lateral joins to get around it, but couldn't make that work.

Cheers,

Jeff

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Michael Paquier 2017-09-07 23:04:30 Re: pg_rewind issue
Previous Message Ron Johnson 2017-09-07 22:10:13 Re: WAL & ready files retained after turning off log shipping