From: | Alban Hertroys <haramrae(at)gmail(dot)com> |
---|---|
To: | "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Performance with high correlation in group by on PK |
Date: | 2017-08-28 12:22:50 |
Message-ID: | CAF-3MvPrZfO-vEfvrSTwEfhvgG1Ft7HFF5JoGZYa+f7wJYWskg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hi all,
It's been a while since I actually got to use PG for anything serious,
but we're finally doing some experimentation @work now to see if it is
suitable for our datawarehouse. So far it's been doing well, but there
is a particular type of query I run into that I expect we will
frequently use and that's choosing a sequential scan - and I can't
fathom why.
This is on:
The query in question is:
select "VBAK_MANDT", max("VBAK_VBELN")
from staging.etl00001_vbak
group by "VBAK_MANDT";
This is the header-table for another detail table, and in this case
we're already seeing a seqscan. The thing is, there are 15M rows in
the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
"VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
we only have 1 at the moment!).
Explain analyze says the following about this query:
warehouse=# explain (analyze, buffers) select "VBAK_MANDT",
max("VBAK_VBELN") from staging.etl00001_vbak group by "VBAK_MANDT";
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1990054.08..1990054.09 rows=1 width=36) (actual
time=38723.602..38723.602 rows=1 loops=1)
Group Key: "VBAK_MANDT"
Buffers: shared hit=367490 read=1409344
-> Seq Scan on etl00001_vbak (cost=0.00..1918980.72 rows=14214672
width=15) (actual time=8.886..31317.283 rows=14214672 loops=1)
Buffers: shared hit=367490 read=1409344
Planning time: 0.126 ms
Execution time: 38723.674 ms
(7 rows)
As you can see, a sequential scan. The statistics seem quite correct.
If I change the query to select a single value of "VBAK_MANDT" we get:
warehouse=# explain (analyze, buffers) select max("VBAK_VBELN") from
staging.etl00001_vbak where "VBAK_MANDT" = '300';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=1.37..1.38 rows=1 width=32) (actual time=14.911..14.911
rows=1 loops=1)
Buffers: shared hit=2 read=3
InitPlan 1 (returns $0)
-> Limit (cost=0.56..1.37 rows=1 width=11) (actual
time=14.907..14.908 rows=1 loops=1)
Buffers: shared hit=2 read=3
-> Index Only Scan Backward using etl00001_vbak_pkey on
etl00001_vbak (cost=0.56..11498362.31 rows=14214672 width=11) (actual
time=14.906..14.906 rows=1 loops=1)
Index Cond: (("VBAK_MANDT" = '300'::bpchar) AND
("VBAK_VBELN" IS NOT NULL))
Heap Fetches: 1
Buffers: shared hit=2 read=3
Planning time: 0.248 ms
Execution time: 14.945 ms
(11 rows)
That is more in line with my expectations.
Oddly enough, adding "MANDT_VBAK" and the group by back into that last
query, the result is a seqscan again.
For "VBAK_MANDT" we see these statistics:
Null fraction: 0
Average width: 4
Distinct values: 1
Most common values: {300}
Most common frequencies: {1}
Histogram bounds :
Correlation :1
The table definition is:
Table "staging.etl00001_vbak"
Column | Type | Modifiers
---------------------+---------------+-----------
VBAK_MANDT | character(3) | not null
VBAK_VBELN | character(10) | not null
VBAK_ERDAT | date | not null
VBAK_ERZET | character(6) | not null
VBAK_ERNAM | character(12) | not null
VBAK_ANGDT | date | not null
VBAK_BNDDT | date | not null
VBAK_AUDAT | date | not null
...
VBAK_MULTI | character(1) | not null
VBAK_SPPAYM | character(2) | not null
Indexes:
"etl00001_vbak_pkey" PRIMARY KEY, btree ("VBAK_MANDT", "VBAK_VBELN")
"idx_etl00001_vbak_erdat" btree ("VBAK_ERDAT")
A final remark: The table definition was generated by our
reporting/ETL software, hence the case-sensitive column names and the
use of the character type instead of varchar (or text).
What can be done to help the planner choose a smarter plan?
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
From | Date | Subject | |
---|---|---|---|
Next Message | Alban Hertroys | 2017-08-28 12:48:50 | Re: Performance with high correlation in group by on PK |
Previous Message | Christoph Moench-Tegeder | 2017-08-28 11:06:31 | Re: Logging the fact that a log was shipped |