Quick Links

Performance with high correlation in group by on PK

From:	Alban Hertroys <haramrae(at)gmail(dot)com>
To:	"pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	Performance with high correlation in group by on PK
Date:	2017-08-28 12:22:50
Message-ID:	CAF-3MvPrZfO-vEfvrSTwEfhvgG1Ft7HFF5JoGZYa+f7wJYWskg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Hi all,

It's been a while since I actually got to use PG for anything serious,
but we're finally doing some experimentation @work now to see if it is
suitable for our datawarehouse. So far it's been doing well, but there
is a particular type of query I run into that I expect we will
frequently use and that's choosing a sequential scan - and I can't
fathom why.

This is on:

The query in question is:
select "VBAK_MANDT", max("VBAK_VBELN")
from staging.etl00001_vbak
group by "VBAK_MANDT";

This is the header-table for another detail table, and in this case
we're already seeing a seqscan. The thing is, there are 15M rows in
the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
"VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
we only have 1 at the moment!).

Explain analyze says the following about this query:
warehouse=# explain (analyze, buffers) select "VBAK_MANDT",
max("VBAK_VBELN") from staging.etl00001_vbak group by "VBAK_MANDT";
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1990054.08..1990054.09 rows=1 width=36) (actual
time=38723.602..38723.602 rows=1 loops=1)
Group Key: "VBAK_MANDT"
Buffers: shared hit=367490 read=1409344
-> Seq Scan on etl00001_vbak (cost=0.00..1918980.72 rows=14214672
width=15) (actual time=8.886..31317.283 rows=14214672 loops=1)
Buffers: shared hit=367490 read=1409344
Planning time: 0.126 ms
Execution time: 38723.674 ms
(7 rows)

As you can see, a sequential scan. The statistics seem quite correct.

If I change the query to select a single value of "VBAK_MANDT" we get:
warehouse=# explain (analyze, buffers) select max("VBAK_VBELN") from
staging.etl00001_vbak where "VBAK_MANDT" = '300';

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=1.37..1.38 rows=1 width=32) (actual time=14.911..14.911
rows=1 loops=1)
Buffers: shared hit=2 read=3
InitPlan 1 (returns $0)
-> Limit (cost=0.56..1.37 rows=1 width=11) (actual
time=14.907..14.908 rows=1 loops=1)
Buffers: shared hit=2 read=3
-> Index Only Scan Backward using etl00001_vbak_pkey on
etl00001_vbak (cost=0.56..11498362.31 rows=14214672 width=11) (actual
time=14.906..14.906 rows=1 loops=1)
Index Cond: (("VBAK_MANDT" = '300'::bpchar) AND
("VBAK_VBELN" IS NOT NULL))
Heap Fetches: 1
Buffers: shared hit=2 read=3
Planning time: 0.248 ms
Execution time: 14.945 ms
(11 rows)

That is more in line with my expectations.

Oddly enough, adding "MANDT_VBAK" and the group by back into that last
query, the result is a seqscan again.

For "VBAK_MANDT" we see these statistics:
Null fraction: 0
Average width: 4
Distinct values: 1
Most common values: {300}
Most common frequencies: {1}
Histogram bounds :
Correlation :1

A final remark: The table definition was generated by our
reporting/ETL software, hence the case-sensitive column names and the
use of the character type instead of varchar (or text).

What can be done to help the planner choose a smarter plan?
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

Responses

Re: Performance with high correlation in group by on PK at 2017-08-28 12:48:50 from Alban Hertroys
Re: Performance with high correlation in group by on PK at 2017-08-28 19:32:19 from Jeff Janes

Browse pgsql-general by date

	From	Date	Subject
Next Message	Alban Hertroys	2017-08-28 12:48:50	Re: Performance with high correlation in group by on PK
Previous Message	Christoph Moench-Tegeder	2017-08-28 11:06:31	Re: Logging the fact that a log was shipped