From: m_lists(at)yahoo(dot)it
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Performance problem with low correlation data
Date: 2009-07-09 07:38:56
Message-ID: 486059.1675.qm@web24607.mail.ird.yahoo.com
Lists: pgsql-general
Since no one replied to http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg133360.html, I tried another approach:
I can't cluster the whole table every day; it would take too long (as I said, the table has 60M rows, and I have hundreds of tables like it).
Plus, it wouldn't really make much sense: the only portion of the table that needs clustering is the part written after the last CLUSTER command (no row is ever deleted or updated; rows are only inserted, every 15 minutes).
So I thought: I'll "cluster" only the part that was written during the day:
begin;
lock table testinsert in ACCESS EXCLUSIVE MODE;
-- re-insert the day's rows in (ne_id, t) order, shifted by 100000 so they don't collide with the originals
insert into testinsert select ne_id+100000, t, v from testinsert where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00' order by ne_id, t;
-- drop the original, unordered rows for that day
delete from testinsert where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00' and ne_id < 100000;
-- shift the copies back to their real ne_id values
update testinsert set ne_id = ne_id - 100000 where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00';
commit;
This would run just after midnight on 2009-08-02; the next day's run would use the next day's time range.
What I'm trying to do here is cluster, on (ne_id, t), the portion of the table written each day.
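To check that a day's rows really end up in (ne_id, t) order on disk, I can eyeball ctid (a quick, hypothetical spot check: a plain seq scan usually returns rows in physical order, though that is not guaranteed):
select ctid, ne_id, t
from testinsert
where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00'
limit 20;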
Well, as far as I can tell the table is laid out as expected, but in pg_stats the correlation for the ne_id column is still VERY low:
select attname, n_distinct, correlation from pg_stats where tablename = 'testinsert3';
 attname | n_distinct | correlation
---------+------------+-------------
 ne_id   |      20000 |    0.111041   <---- low value
 t       |        864 |    0.987778
 v       |          1 |           1
This leads the planner to a sequential scan of the table as soon as about 10% of it has to be read:
explain select * from idtable as g inner join testinsert on id = ne_id
  where groupid between 1 and 4
    and t between '2009-08-01 00:00:00' and '2009-08-09 00:00:00';
Hash Join  (cost=134.45..2127071.28 rows=614331 width=244)
  Hash Cond: (testinsert3.ne_id = g.id)
  ->  Seq Scan on testinsert  (cost=0.00..2063200.00 rows=15358272 width=236)
        Filter: ((t >= '2009-08-01 00:00:00'::timestamp without time zone) AND (t <= '2009-08-09 00:00:00'::timestamp without time zone))
  ->  Hash  (cost=124.45..124.45 rows=800 width=8)
        ->  Bitmap Heap Scan on idtable g  (cost=24.45..124.45 rows=800 width=8)
              Recheck Cond: ((groupid >= 1) AND (groupid <= 4))
              ->  Bitmap Index Scan on idtable_pk  (cost=0.00..24.25 rows=800 width=0)
                    Index Cond: ((groupid >= 1) AND (groupid <= 4))
Which is a terrible plan!
testinsert contains t values between '2009-08-01' and '2009-08-09', and ne_id values from 1 to 20000. But only 800 of the 20000 ne_id values have to be read; there's no need for a full table scan!
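Just to see the alternative plan, I can temporarily turn off sequential scans for my session (diagnostic only; enable_seqscan is a standard planner setting):
set enable_seqscan = off;
explain select * from idtable as g inner join testinsert on id = ne_id
  where groupid between 1 and 4
    and t between '2009-08-01 00:00:00' and '2009-08-09 00:00:00';
reset enable_seqscan;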
I guess this is a reflection of the poor "correlation" on ne_id; but, as I said, I don't think ne_id is really that badly correlated.
In fact, running "select ne_id, t from testinsert limit 100000" I can see that the data is laid out pretty much by (ne_id, t), grouped by day: all of one day's rows for an ne_id, then the next ne_id, and so on until the next day.
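For what it's worth, a toy model of exactly that layout (synthetic data, not my real table; the 20000 and 8 below just mirror my ne_id count and my 8 days of data) reproduces a similarly low number:
select corr(ne_id, pos) as global_correlation
from (
    select (i % 20000) + 1 as ne_id,                -- one day's ids, repeated 8 times
           row_number() over (order by i) as pos    -- stand-in for physical position
    from generate_series(0, 8*20000 - 1) as i
) t;
-- comes out around 0.125, i.e. roughly 1/(number of days), even though
-- each day is internally perfectly sorted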
How is the "correlation" calculated? Can someone explain why, after the procedure above, it is still so low?
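My rough understanding (please correct me if this is wrong) is that it is something like the Pearson correlation between the logical order of a column's sampled values and their physical position in the heap. A crude full-table approximation (a hypothetical sketch, not the actual ANALYZE code; the ctid-to-point cast is just a trick to turn '(block,offset)' into a sortable number):
select corr(value_rank, phys_pos) as approx_ne_id_correlation
from (
    select rank() over (order by ne_id) as value_rank,
           -- physical position: block number * big constant + line pointer
           (ctid::text::point)[0] * 65536 + (ctid::text::point)[1] as phys_pos
    from testinsert
) s;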