From: | Alban Hertroys <haramrae(at)gmail(dot)com> |
---|---|
To: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
Cc: | "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Performance with high correlation in group by on PK |
Date: | 2017-09-15 15:13:47 |
Message-ID: | CAF-3MvPTg6p=5dofWHOyLgr5fGJT9z21D1ywxE4bAQ766MFKkw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On 8 September 2017 at 00:23, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tue, Aug 29, 2017 at 1:20 AM, Alban Hertroys <haramrae(at)gmail(dot)com> wrote:
>>
>> On 28 August 2017 at 21:32, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> > On Mon, Aug 28, 2017 at 5:22 AM, Alban Hertroys <haramrae(at)gmail(dot)com>
>> > wrote:
>> >>
>> >> Hi all,
>> >>
>> >> It's been a while since I actually got to use PG for anything serious,
>> >> but we're finally doing some experimentation @work now to see if it is
>> >> suitable for our datawarehouse. So far it's been doing well, but there
>> >> is a particular type of query I run into that I expect we will
>> >> frequently use and that's choosing a sequential scan - and I can't
>> >> fathom why.
>> >>
>> >> This is on:
>> >>
>> >>
>> >> The query in question is:
>> >> select "VBAK_MANDT", max("VBAK_VBELN")
>> >> from staging.etl00001_vbak
>> >> group by "VBAK_MANDT";
>> >>
>> >> This is the header-table for another detail table, and in this case
>> >> we're already seeing a seqscan. The thing is, there are 15M rows in
>> >> the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
>> >> "VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
>> >> we only have 1 at the moment!).
>> >
>> >
>> > You need an "index skip-scan" or "loose index scan". PostgreSQL doesn't
>> > currently detect and implement them automatically, but you can use a
>> > recursive CTE to get it to work. There are some examples at
>> > https://wiki.postgresql.org/wiki/Loose_indexscan
>>
>> Thanks Jeff, that's an interesting approach. It looks very similar to
>> correlated subqueries.
>>
>> Unfortunately, it doesn't seem to help with my issue. The CTE is
>> indeed fast, but when querying the results from the 2nd level ov the
>> PK with the CTE results, I'm back at a seqscan on pdw00002_vbak again.
>
>
> Something like this works:
>
> create table foo as select trunc(random()*5) as col1, random() as col2 from
> generate_series(1,100000000);
> create index on foo (col1, col2);
> vacuum analyze foo;
>
>
> with recursive t as (
> select * from (select col1, col2 from foo order by col1 desc, col2 desc
> limit 1) asdfsaf
> union all
> select
> (select col1 from foo where foo.col1 < t.col1 order by col1 desc, col2
> desc limit 1) as col1,
> (select col2 from foo where foo.col1 < t.col1 order by col1 desc, col2
> desc limit 1) as col2
> from t where t.col1 is not null
> )
> select * from t where t is not null;
>
> It is pretty ugly that you need one subquery in the select list for each
> column to be returned. Maybe someone can find a way to avoid that part. I
> tried using lateral joins to get around it, but couldn't make that work.
>
> Cheers,
>
> Jeff
Thanks Jeff. That does indeed look ugly.
Since we're dealing with a 4GL language (FOCUS) that translates to
SQL, I don't think we'll attempt your workaround, even though we can
use SQL directly in our reports if we want to.
But, I just remembered giving someone else in a similar situation some
advice on this very list; Obviously, when my first primary key field
is not very selective, I should change the order of the fields in the
PK!
But let's first enjoy the weekend.
Alban.
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
From | Date | Subject | |
---|---|---|---|
Next Message | Rob Sargent | 2017-09-15 15:30:47 | Re: looking for a globally unique row ID |
Previous Message | Francisco Olarte | 2017-09-15 14:54:34 | Re: Table partition - parent table use |