From: | Merlin Moncure <mmoncure(at)gmail(dot)com> |
---|---|
To: | Claudio Freire <klaussfreire(at)gmail(dot)com> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Performance regression from 8.3.7 to 9.0.3 |
Date: | 2011-03-14 17:34:31 |
Message-ID: | AANLkTikM2u89RntrMXeNsWncoJWT91UOxtcRXA_fOtjE@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
On Mon, Mar 14, 2011 at 10:54 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> Nothing?
>
> No ideas?
>
> Did I forget to include some useful bit?
>
> On Fri, Mar 4, 2011 at 8:22 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> Hello, first post to this list.
>>
>> I have this query that ran in milliseconds in postgres 8.3.7 (usually 50,
>> 100ms), and now it takes a full 10 minutes to complete.
>>
>> I tracked the problem to the usage of hash aggregates to resolve EXISTS
>> clauses in conjunction with large IN clauses, which seem to reverse the
>> execution plan - in 8.3.7, it would use indices to fetch the rows from the
>> IN, then compute the exists with a nested loop, never doing big sequential
>> scans. In 9.0.3, it computes the set of applicable entries with a hash
>> aggregate, but in order to do that it performs a huge index scan - no
>> sequential scans either, but the big index scan is worse.
>>
>> 9.0.3 always misses the estimate of how many rows will come out the hash
>> aggregate, always estimating 200, while in fact the real count is more like
>> 300.000. I've tried increasing statistics in all the columns involved, up to
>> 4000 for each, to the point where it accurately estimates the input to the
>> hash agg, but the output is always estimated to be 200 rows.
>>
>> Rewriting the query to use 0 < (select count(*)..) instead of EXISTS (select
>> * ..) does revert to the old postgres 8.3 plan, although intuitively I would
>> think it to be sub-optimal.
>>
>> The tables in question receive many updates, but never in such a volume as
>> to create enough bloat - plus, the tests I've been running are on a
>> pre-production server without much traffic (so not many updates - probably
>> none in weeks).
>>
>> The server is a Core 2 E7400 dual core with 4GB of ram running linux and a
>> pg 9.0.3 / 8.3.7 (both there, doing migration testing) built from source.
>> Quite smaller than our production server, but I've tested the issue on
>> higher-end hardware and it produces the same results.
>>
>> Any ideas as to how to work around this issue?
>>
>> I can't plug the select count() version everywhere, since I won't be using
>> this form of the query every time (it's generated programatically with an
>> ORM), and some forms of it perform incredibly worse with the select count().
>>
>> Also, any help I can provide to fix it upstream I'll be glad to - I believe
>> (I would have to check) I can even create a dump of the tables (stripping
>> sensitive info of course) - only, well, you'll see the size below - a tad
>> big to be mailing it ;-)
>>
>> pg 9.0 is configured with:
>>
>> work_mem = 64M
>> shared_buffers = 512M
>> temp_buffers = 64M
>> effective_cache_size = 128M
>>
>> pg 8.3.7 is configured with:
>>
>> work_mem = 64M
>> shared_buffers = 100M
>> temp_buffers = 64M
>> effective_cache_size = 128M
>>
>>
>> The query in question:
>>
>>> SELECT member_statistics.member_id
>>> FROM member_statistics
>>> WHERE member_statistics.member_id IN ( <<400 ids>> ) AND (EXISTS
>>> (SELECT mat1.tag_id
>>> FROM member_all_tags_v AS mat1
>>> WHERE mat1.member_id = member_statistics.member_id AND mat1.tag_id
>>> IN (640, 641, 3637, 3638, 637, 638, 639) AND mat1.polarity >= 90))
hm the regression in and of itself is interesting, but I wonder if you
can get past your issue like this:
SELECT member_statistics.member_id
FROM member_statistics
WHERE member_statistics.member_id IN ( <<400 ids>> ) AND (EXISTS
(SELECT mat1.tag_id
FROM member_all_tags_v AS mat1
WHERE mat1.member_id = member_statistics.member_id AND mat1.tag_id
IN (640, 641, 3637, 3638, 637, 638, 639) AND mat1.polarity >= 90))
changes to:
SELECT member_statistics.member_id
FROM member_statistics
WHERE EXISTS
(
SELECT mat1.tag_id
FROM member_all_tags_v AS mat1
WHERE mat1.member_id = member_statistics.member_id
AND mat1.tag_id
IN (640, 641, 3637, 3638, 637, 638, 639) AND
mat1.polarity >= 90
AND mat1.member_id IN ( <<400 ids>> )
)
also, always try to compare vs straight join version:
SELECT member_statistics.member_id
FROM member_statistics
JOIN VALUES ( <<400 ids>> ) q(member_id) using (member_id)
JOIN
(
SELECT mat1.member_id
FROM member_all_tags_v AS mat1
WHERE mat1.tag_id IN (640, 641, 3637, 3638, 637, 638, 639)
AND mat1.polarity >= 90) p
USING(member_id)
) p using(member_id);
merlin
From | Date | Subject | |
---|---|---|---|
Next Message | Merlin Moncure | 2011-03-14 17:59:59 | Re: Performance regression from 8.3.7 to 9.0.3 |
Previous Message | Tom Lane | 2011-03-14 17:24:33 | Re: Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1 |