From: m_lists(at)yahoo(dot)it
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Performance problem with low correlation data
Date: 2009-07-09 07:38:56
Message-ID: 486059.1675.qm@web24607.mail.ird.yahoo.com
Lists: pgsql-general
Since no one replied to http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg133360.html, I tried another approach:
I can't cluster the whole table every day; it would take too long (as I said, the table has 60M rows, and I have hundreds of tables like it).
Plus, it wouldn't really make much sense: the only portion of the table that needs clustering is the part written after the last CLUSTER command (no row is ever deleted or updated; rows are only inserted, every 15 minutes).
So I thought: I'll "cluster" only the part that was written during the day:
begin;
lock table testinsert in ACCESS EXCLUSIVE MODE;
-- re-insert the day's rows in (ne_id, t) order, shifted by 100000 so they don't collide with the originals
insert into testinsert select ne_id+100000, t, v from testinsert where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00' order by ne_id, t;
-- drop the original, unordered rows for that day
delete from testinsert where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00' and ne_id < 100000;
-- shift the copies back to their real ne_id values
update testinsert set ne_id = ne_id - 100000 where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00';
commit;
This would run just after midnight on 2009-08-02; the next day's run would use the next day's time range.
What I'm trying to do here is cluster, on (ne_id, t), the portion of the table written each day.
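To check that a day's rows really end up in (ne_id, t) order on disk, I can eyeball ctid (a quick, hypothetical spot check: a plain seq scan usually returns rows in physical order, though that is not guaranteed):
select ctid, ne_id, t
from testinsert
where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00'
limit 20;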
Well, as far as I can tell the table is laid out as expected, but in pg_stats the correlation for the ne_id column is still VERY low:
select attname, n_distinct, correlation from pg_stats where tablename = 'testinsert3';
 attname | n_distinct | correlation
---------+------------+-------------
 ne_id   |      20000 |    0.111041   <---- low value
 t       |        864 |    0.987778
 v       |          1 |           1
This leads the planner to a sequential scan of the table as soon as about 10% of it has to be read:
explain select * from idtable as g inner join testinsert on id = ne_id
  where groupid between 1 and 4
    and t between '2009-08-01 00:00:00' and '2009-08-09 00:00:00';
Hash Join  (cost=134.45..2127071.28 rows=614331 width=244)
  Hash Cond: (testinsert3.ne_id = g.id)
  ->  Seq Scan on testinsert  (cost=0.00..2063200.00 rows=15358272 width=236)
        Filter: ((t >= '2009-08-01 00:00:00'::timestamp without time zone) AND (t <= '2009-08-09 00:00:00'::timestamp without time zone))
  ->  Hash  (cost=124.45..124.45 rows=800 width=8)
        ->  Bitmap Heap Scan on idtable g  (cost=24.45..124.45 rows=800 width=8)
              Recheck Cond: ((groupid >= 1) AND (groupid <= 4))
              ->  Bitmap Index Scan on idtable_pk  (cost=0.00..24.25 rows=800 width=0)
                    Index Cond: ((groupid >= 1) AND (groupid <= 4))
Which is a terrible plan!
testinsert contains t values between '2009-08-01' and '2009-08-09', and ne_id values from 1 to 20000. But only 800 of the 20000 ne_id values have to be read; there's no need for a full table scan!
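Just to see the alternative plan, I can temporarily turn off sequential scans for my session (diagnostic only; enable_seqscan is a standard planner setting):
set enable_seqscan = off;
explain select * from idtable as g inner join testinsert on id = ne_id
  where groupid between 1 and 4
    and t between '2009-08-01 00:00:00' and '2009-08-09 00:00:00';
reset enable_seqscan;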
I guess this is a reflection of the poor "correlation" on ne_id; but, as I said, I don't think ne_id is really that badly correlated.
In fact, running "select ne_id, t from testinsert limit 100000" I can see that the data is laid out pretty much by (ne_id, t), grouped by day: all of one day's rows for an ne_id, then the next ne_id, and so on until the next day.
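For what it's worth, a toy model of exactly that layout (synthetic data, not my real table; the 20000 and 8 below just mirror my ne_id count and my 8 days of data) reproduces a similarly low number:
select corr(ne_id, pos) as global_correlation
from (
    select (i % 20000) + 1 as ne_id,                -- one day's ids, repeated 8 times
           row_number() over (order by i) as pos    -- stand-in for physical position
    from generate_series(0, 8*20000 - 1) as i
) t;
-- comes out around 0.125, i.e. roughly 1/(number of days), even though
-- each day is internally perfectly sorted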
How is the "correlation" calculated? Can someone explain why, after the procedure above, it is still so low?
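My rough understanding (please correct me if this is wrong) is that it is something like the Pearson correlation between the logical order of a column's sampled values and their physical position in the heap. A crude full-table approximation (a hypothetical sketch, not the actual ANALYZE code; the ctid-to-point cast is just a trick to turn '(block,offset)' into a sortable number):
select corr(value_rank, phys_pos) as approx_ne_id_correlation
from (
    select rank() over (order by ne_id) as value_rank,
           -- physical position: block number * big constant + line pointer
           (ctid::text::point)[0] * 65536 + (ctid::text::point)[1] as phys_pos
    from testinsert
) s;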