Re: query produces 1 GB temp file

From: Dirk(dot)Lutzebaeck(at)t-online(dot)de (Dirk Lutzebaeck)
To: John A Meinel <john(at)arbash-meinel(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: query produces 1 GB temp file
Date: 2005-02-06 17:18:35
Message-ID: 4206516B.9010704@aeccom.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

John A Meinel wrote:

> Dirk Lutzebaeck wrote:
>
>> Greg Stark wrote:
>>
>>> I gave a bunch of "explain analyze select" commands to test
>>> estimates for
>>> individual columns. What results do they come up with? If those are
>>> inaccurate
>>> then raising the statistics target is a good route. If those are
>>> accurate
>>> individually but the combination is inaccurate then you have a more
>>> difficult
>>> problem.
>>>
>>>
>>>
>> After setting the new statistics target to 200 they did slightly
>> better but not accurate. The results were attached to my last post.
>> Here is a copy:
>>
>>
> It does seem that setting the statistics to a higher value would help.
> Since rc=130170467 seems to account for almost 1/3 of the data.
> Probably you have other values that are much less common. So setting a
> high statistics target would help the planner realize that this value
> occurs at a different frequency from the other ones. Can you try other
> numbers and see what the counts are?

There is not much effect when increasing statistics target much higher.
I guess this is because rc=130170467 takes a large portion of the column
distribution.

> I assume you did do a vacuum analyze after adjusting the statistics
> target.

Yes.

> Also interesting that in the time it took you to place these queries,
> you had received 26 new rows.

Yes, it's a live system...

> And finally, what is the row count if you do
> explain analyze select * from bi where rc=130170467::oid and
> co=117305223::oid;

explain analyze select * from bi where rc=130170467::oid and
co=117305223::oid;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on bi (cost=0.00..43866.19 rows=105544 width=51) (actual
time=0.402..3724.222 rows=513732 loops=1)
Filter: ((rc = 130170467::oid) AND (co = 117305223::oid))

Well both columns data take about 1/4 of the whole table. There is not
much distributed data. So it needs to do full scans...

> If this is a lot less than say 500k, then probably you aren't going to
> be helped a lot. The postgresql statistics engine doesn't generate
> cross column statistics. It always assumes random distribution of
> data. So if two columns are correlated (or anti-correlated), it won't
> realize that.

105k, that seems to be may problem. No much random data. Does 8.0
address this problem?

> Even so, your original desire was to reduce the size of the
> intermediate step (where you have 700k rows). So you need to try and
> design a subselect on bi which is as restrictive as possible, so that
> you don't get all of these rows. With any luck, the planner will
> realize ahead of time that there won't be that many rows, and can use
> indexes, etc. But even if it doesn't use an index scan, if you have a
> query that doesn't use a lot of rows, then you won't need a lot of
> disk space.

I'll try that. What I have already noticed it that one of my output
column is quite large so that's why it uses so much temp space. I'll
probably need to sort without that output column and read it in
afterwards using a subselect on the limted result.

Thanks for your help,

Dirk

>
> John
> =:->
>
>>
>> explain analyze select * from bi where rc=130170467;
>> QUERY PLAN
>> -------------------------------------------------------------------------------------------------------------------
>>
>> Seq Scan on bi (cost=0.00..41078.76 rows=190960 width=53) (actual
>> time=0.157..3066.028 rows=513724 loops=1)
>> Filter: (rc = 130170467::oid)
>> Total runtime: 4208.663 ms
>> (3 rows)
>>
>>
>> explain analyze select * from bi where co=117305223;
>> QUERY PLAN
>> -------------------------------------------------------------------------------------------------------------------
>>
>> Seq Scan on bi (cost=0.00..41078.76 rows=603988 width=53) (actual
>> time=0.021..3692.238 rows=945487 loops=1)
>> Filter: (co = 117305223::oid)
>> Total runtime: 5786.268 ms
>> (3 rows)
>>
>> Here is the distribution of the data in bi:
>> select count(*) from bi;
>>
>> 1841966
>>
>>
>> select count(*) from bi where rc=130170467::oid;
>>
>> 513732
>>
>>
>> select count(*) from bi where co=117305223::oid;
>>
>> 945503
>>
>>
>>
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Dirk Lutzebaeck 2005-02-06 17:26:30 Re: query produces 1 GB temp file
Previous Message Tom Lane 2005-02-06 17:16:54 Re: query produces 1 GB temp file