From: John A Meinel <john(at)arbash-meinel(dot)com>
To: lutzeb(at)aeccom(dot)com
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: query produces 1 GB temp file
Date: 2005-02-06 16:46:06
Message-ID: 420649CE.5030101@arbash-meinel.com
Lists: pgsql-performance
Dirk Lutzebaeck wrote:
> Greg Stark wrote:
>
>> I gave a bunch of "explain analyze select" commands to test estimates
>> for individual columns. What results do they come up with? If those
>> are inaccurate then raising the statistics target is a good route. If
>> those are accurate individually but the combination is inaccurate then
>> you have a more difficult problem.
>>
>>
>>
> After setting the new statistics target to 200 they did slightly
> better, but were still not accurate. The results were attached to my
> last post. Here is a copy:
>
>
It does seem that setting the statistics target to a higher value would
help, since rc=130170467 accounts for almost 1/3 of the data and you
probably have other values that are much less common. A higher
statistics target would help the planner realize that this value occurs
at a different frequency from the others. Can you try other values and
see what the counts are?
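Something along these lines (just a sketch; adjust the limit to taste)
would show which rc values dominate:

select rc, count(*) as cnt
  from bi
 group by rc
 order by cnt desc
 limit 10;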
I assume you did a VACUUM ANALYZE after adjusting the statistics target.
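If not, the per-column targets can be set explicitly and the table
re-analyzed, roughly like this (column names taken from your queries);
pg_stats then shows what the planner actually collected:

alter table bi alter column rc set statistics 200;
alter table bi alter column co set statistics 200;
analyze bi;

select attname, n_distinct, most_common_vals, most_common_freqs
  from pg_stats
 where tablename = 'bi' and attname in ('rc', 'co');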
It's also interesting that in the time it took you to run these queries,
you had received 26 new rows.
And finally, what is the row count if you do:

explain analyze select * from bi
 where rc=130170467::oid and co=117305223::oid;
If this is a lot less than, say, 500k, then you probably aren't going to
be helped a lot. The PostgreSQL statistics engine doesn't generate
cross-column statistics; it always assumes the columns are independent,
so if two columns are correlated (or anti-correlated) it won't realize
that.
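A rough illustration using the counts you posted (so the numbers are
only approximate): for the combined condition the planner just
multiplies the individual selectivities, so it will expect something
like

  1841966 * (513732 / 1841966) * (945503 / 1841966) =~ 264000 rows

If rc and co are strongly correlated, the real count could be as high
as ~514k (the full rc=130170467 set), i.e. roughly twice what the
planner guesses, and no per-column statistics target will close that
gap.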
Even so, your original desire was to reduce the size of the intermediate
step (where you have 700k rows). So you need to try to design a
subselect on bi which is as restrictive as possible, so that you don't
pull in all of those rows. With any luck, the planner will realize ahead
of time that there won't be that many rows and can use indexes, etc. But
even if it doesn't use an index scan, a query that doesn't produce a lot
of rows won't need a lot of disk space.
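I don't have your full query in front of me, so this is only the shape
of the thing (the outer join/aggregation is a placeholder for whatever
you are actually doing with bi):

select *
  from (select *
          from bi
         where rc = 130170467::oid
           and co = 117305223::oid
           -- push any other restrictions on bi down here too
       ) as b;
-- then join / group / sort b as in your original query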
John
=:->
>
> explain analyze select * from bi where rc=130170467;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------
>
> Seq Scan on bi (cost=0.00..41078.76 rows=190960 width=53) (actual
> time=0.157..3066.028 rows=513724 loops=1)
> Filter: (rc = 130170467::oid)
> Total runtime: 4208.663 ms
> (3 rows)
>
>
> explain analyze select * from bi where co=117305223;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------
>
> Seq Scan on bi (cost=0.00..41078.76 rows=603988 width=53) (actual
> time=0.021..3692.238 rows=945487 loops=1)
> Filter: (co = 117305223::oid)
> Total runtime: 5786.268 ms
> (3 rows)
>
> Here is the distribution of the data in bi:
> select count(*) from bi;
>
> 1841966
>
>
> select count(*) from bi where rc=130170467::oid;
>
> 513732
>
>
> select count(*) from bi where co=117305223::oid;
>
> 945503
>
>
>