Table Partitions / Partial Indexes

From: Mike C <smith(dot)not(dot)western(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Table Partitions / Partial Indexes
Date: 2005-12-12 02:07:59
Message-ID: bd0eabd0512111807q6b54115q2477c7c2a37f1c1a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,

I've got a table with ~60 Million rows and am having performance problems
querying it. Disks are setup as 4x10K SCSI 76GB, RAID 1+0. The table is
being inserted into multiple times every second of the day, with no updates
and every 2nd day we delete 1/60th of the data (as it becomes old). Vacuum
analyze is scheduled to run 3 times a day.

Query:

select sum(TOTAL_FROM) as TOTAL_IN, sum(TOTAL_TO) as TOTAL_OUT, SOURCE_MAC
from PC_TRAFFIC where FK_DEVICE = 996 and TRAFFIC_DATE >= '2005-10-14
00:00:00' and TRAFFIC_DATE <= '2005-11-13 23:59:59' group by SOURCE_MAC
order by 1 desc

Table:

CREATE TABLE PC_TRAFFIC (
PK_PC_TRAFFIC INTEGER NOT NULL,
TRAFFIC_DATE TIMESTAMP NOT NULL,
SOURCE_MAC CHAR(20) NOT NULL,
DEST_IP CHAR(15),
DEST_PORT INTEGER,
TOTAL_TO DOUBLE PRECISION,
TOTAL_FROM DOUBLE PRECISION,
FK_DEVICE SMALLINT,
PROTOCOL_TYPE SMALLINT
);

CREATE INDEX pc_traffic_pkidx ON pc_traffic (pk_pc_traffic);
CREATE INDEX pc_traffic_idx3 ON pc_traffic (fk_device, traffic_date);

Plan:
Sort (cost=76650.58..76650.58 rows=2 width=40)
Sort Key: sum(total_from)
-> HashAggregate (cost=76650.54..76650.57 rows=2 width=40)
-> Bitmap Heap Scan on pc_traffic
(cost=534.64..76327.03rows=43134 width=40)
Recheck Cond: ((fk_device = 996) AND (traffic_date >=
'2005-10-01 00:00:00'::timestamp without time zone) AND (traffic_date <=
'2005-10-31 23:59:59'::timestamp without time zone))
-> Bitmap Index Scan on pc_traffic_idx3
(cost=0.00..534.64rows=43134 width=0)
Index Cond: ((fk_device = 996) AND (traffic_date >=
'2005-10-01 00:00:00'::timestamp without time zone) AND (traffic_date <=
'2005-10-31 23:59:59'::timestamp without time zone))
(7 rows)

CLUSTER on PC_TRAFFIC_IDX3 gives me significantly improved performance:

Sort (cost=39886.65..39886.66 rows=2 width=40)
Sort Key: sum(total_from)
-> HashAggregate (cost=39886.61..39886.64 rows=2 width=40)
-> Index Scan using pc_traffic_idx3 on pc_traffic (cost=
0.00..39551.26 rows=44714 width=40)
Index Cond: ((fk_device = 996) AND (traffic_date >=
'2005-10-01 00:00:00'::timestamp without time zone) AND (traffic_date <=
'2005-10-31 23:59:59'::timestamp without time zone))
(5 rows)

However the clustering is only effective on the first shot. Because of the
constant usage of the table we can't perform a vacuum full nor any exclusive
lock function.

Would table partitioning/partial indexes help much? Partitioning on date
range doesn't make much sense for this setup, where a typical 1-month query
spans both tables (as the billing month for the customer might start midway
through a calendar month).

Noting that the index scan was quicker than the bitmap, I'm trying to make
the indexes smaller/more likely to index scan. I have tried partitioning
against fk_device, with 10 child tables. I'm using fk_device % 10 = 1,
fk_device % 10 = 2, fk_device % 10 = 3, etc... as the check constraint.

CREATE TABLE pc_traffic_0 (CHECK(FK_DEVICE % 10 = 0)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_1 (CHECK(FK_DEVICE % 10 = 1)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_2 (CHECK(FK_DEVICE % 10 = 2)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_3 (CHECK(FK_DEVICE % 10 = 3)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_4 (CHECK(FK_DEVICE % 10 = 4)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_5 (CHECK(FK_DEVICE % 10 = 5)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_6 (CHECK(FK_DEVICE % 10 = 6)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_7 (CHECK(FK_DEVICE % 10 = 7)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_8 (CHECK(FK_DEVICE % 10 = 8)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_9 (CHECK(FK_DEVICE % 10 = 9)) INHERITS (pc_traffic);

... indexes now look like:
CREATE INDEX pc_traffic_6_idx3 ON pc_traffic_6 (fk_device, traffic_date);

To take advantage of the query my SQL now has to include the mod operation
(so the query planner picks up the correct child tables):

select sum(TOTAL_FROM) as TOTAL_IN, sum(TOTAL_TO) as TOTAL_OUT, SOURCE_MAC
from PC_TRAFFIC where FK_DEVICE = 996 and FK_DEVICE % 10 = 6 and
TRAFFIC_DATE >= '2005-10-14 00:00:00' and TRAFFIC_DATE <= '2005-11-13
23:59:59' group by SOURCE_MAC order by 1 desc

Sorry I would show the plan but I'm rebuilding the dev database atm. It was
faster though and did pick up the correct child table. It was also a bitmap
scan on the index IIRC.

Would I be better off creating many partial indexes instead of multiple
tables AND multiple indexes?
Am I using a horrid method for partitioning the data? (% 10)
Should there be that big of an improvement for multiple tables given that
all the data is still stored on the same filesystem?
Any advice on table splitting much appreciated.

Cheers,

Mike C.

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2005-12-12 02:39:08 Re: Table Partitions / Partial Indexes
Previous Message Luke Lonergan 2005-12-11 21:15:14 Re: Should Oracle outperform PostgreSQL on a complex