Understanding GROUP BY Plan in PostgreSQL

From: Sameer Kumar <sameer(dot)kumar(at)ashnik(dot)com>
To: "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org>
Subject: Understanding GROUP BY Plan in PostgreSQL
Date: 2015-05-14 01:01:33
Message-ID: CADp-Sm7dOrsyW9jaQCXMQ0isuLnmS0b8BeZeSNm9tO2-ywn2Qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

I am trying to understand PostgreSQL's plan for GROUP BY query.

Let me first share some details of the env-

RAM
edbstore=# \! free
total used free shared buffers cached
Mem: 1915456 1800936 114520 46268 98000 1194272
-/+ buffers/cache: 508664 1406792
Swap: 2097148 76 2097072

CPU
edbstore=# \! lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2

OS-

cat /etc/redhat-release
CentOS release 6.6 (Final)

usnam -a gives me below info-
2.6.32-504.el6.x86_64

Here is my table structure-
edbstore=# \d cust_hist
Table "edbstore.cust_hist"
Column | Type | Modifiers
------------+---------+-----------
customerid | integer | not null
orderid | integer | not null
prod_id | integer | not null
Indexes:
"idx_cust_hist_ord" btree (orderid)
"ix_cust_hist_customerid" btree (customerid)
Foreign-key constraints:
"fk_cust_hist_customerid" FOREIGN KEY (customerid) REFERENCES
customers(customerid) ON DELETE CASCADE

The size of table
edbstore=# select pg_size_pretty(pg_relation_size('cust_hist'));
pg_size_pretty
----------------
2616 kB
(1 row)

edbstore=# select pg_size_pretty(pg_relation_size('idx_cust_hist_ord'));
pg_size_pretty
----------------
1336 kB
(1 row)

Stats in pg_class
edbstore=# select relpages, reltuples from pg_class where
relname='cust_hist';
relpages | reltuples
----------+-----------
327 | 60350
(1 row)

edbstore=# select relpages, reltuples from pg_class where
relname='idx_cust_hist_ord';
relpages | reltuples
----------+-----------
167 | 60350
(1 row)

edbstore=# select count(*) from cust_hist;
count
-------
60350
(1 row)

My DB parameters-
edbstore=# show work_mem;
work_mem
----------
1MB
(1 row)

edbstore=# show shared_buffers;
shared_buffers
----------------
128MB
(1 row)

edbstore=# show random_page_cost ;
random_page_cost
------------------
2
(1 row)

edbstore=# show seq_page_cost ;
seq_page_cost
---------------
1
(1 row)

edbstore=# select version();
version
---------------------------------------------------------------------------------------------------------------
PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2
20080704 (Red Hat 4.1.2-52), 64-bit
(1 row)

The explain plan-

edbstore=# set enable_seqscan to on;
SET

edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist
group by orderid;
QUERY PLAN

------------------------------------------------------------------------------------------------------------------
--
HashAggregate (cost=1232.25..1343.88 rows=11163 width=4) (actual
time=49.494..55.302 rows=12000 loops=1)
-> Seq Scan on cust_hist (cost=0.00..930.50 rows=60350 width=4)
(actual time=0.016..14.353 rows=60350 loops=1
)
Total runtime: 56.749 ms
(3 rows)

edbstore=# set enable_seqscan to off;
SET
edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist
group by orderid;
QUERY
PLAN

------------------------------------------------------------------------------------------------------------------
----------------------------------
GroupAggregate (cost=0.29..1980.92 rows=11163 width=4) (actual
time=0.055..51.766 rows=12000 loops=1)
-> Index Only Scan using idx_cust_hist_ord on cust_hist
(cost=0.29..1567.54 rows=60350 width=4) (actual time=
0.041..28.532 rows=60350 loops=1)
Heap Fetches: 60350
Total runtime: 53.227 ms
(4 rows)

I tried to increase data to 4 times (by copying the same set of data to the
table 3 more times).

edbstore=# set enable_seqscan to on;
SET
edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist
group by orderid;
QUERY PLAN

------------------------------------------------------------------------------------------------------------------
-----
HashAggregate (cost=4926.00..5035.16 rows=10916 width=4) (actual
time=242.233..247.740 rows=12000 loops=1)
-> Seq Scan on cust_hist (cost=0.00..3719.00 rows=241400 width=4)
(actual time=0.040..96.845 rows=241400 loop
s=1)
Total runtime: 249.415 ms
(3 rows)

edbstore=# set enable_seqscan to off;
SET
edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist
group by orderid;
QUERY
PLAN

------------------------------------------------------------------------------------------------------------------
-------------------------------------
GroupAggregate (cost=0.42..8796.20 rows=10916 width=4) (actual
time=0.228..217.513 rows=12000 loops=1)
-> Index Only Scan using idx_cust_hist_ord on cust_hist
(cost=0.42..7480.04 rows=241400 width=4) (actual time
=0.190..145.130 rows=241400 loops=1)
Heap Fetches: 241400
Total runtime: 219.003 ms
(4 rows)

It seems PostgreSQL always prefers to choose Sequential Scan over Index
only scan. Also the cost of just the specific step involving Index Only
Scan seems to be higher than sequential scan. I am also not able to
understand the plan involving the index only scan where the Index Only scan
starts with 0.42 cost and even the next step GroupAggregate starts with
same cost.

Can someone please help me understand this plan also why PostgreSQL prefers
the Sequential Scan for GROUP BY clause?

Browse pgsql-novice by date

  From Date Subject
Next Message Wei Shan 2015-05-14 15:38:41 Managing PostgreSQL Streaming Replication Cluster
Previous Message Luca Ferrari 2015-05-12 07:07:16 Re: Why is Hash index not transaction safe.