Re: Performance regression from 8.3.7 to 9.0.3

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance regression from 8.3.7 to 9.0.3
Date: 2011-03-14 17:34:31
Message-ID: AANLkTikM2u89RntrMXeNsWncoJWT91UOxtcRXA_fOtjE@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Mon, Mar 14, 2011 at 10:54 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> Nothing?
>
> No ideas?
>
> Did I forget to include some useful bit?
>
> On Fri, Mar 4, 2011 at 8:22 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> Hello, first post to this list.
>>
>> I have this query that ran in milliseconds in postgres 8.3.7 (usually 50,
>> 100ms), and now it takes a full 10 minutes to complete.
>>
>> I tracked the problem to the usage of hash aggregates to resolve EXISTS
>> clauses in conjunction with large IN clauses, which seem to reverse the
>> execution plan - in 8.3.7, it would use indices to fetch the rows from the
>> IN, then compute the exists with a nested loop, never doing big sequential
>> scans. In 9.0.3, it computes the set of applicable entries with a hash
>> aggregate, but in order to do that it performs a huge index scan - no
>> sequential scans either, but the big index scan is worse.
>>
>> 9.0.3 always misses the estimate of how many rows will come out the hash
>> aggregate, always estimating 200, while in fact the real count is more like
>> 300.000. I've tried increasing statistics in all the columns involved, up to
>> 4000 for each, to the point where it accurately estimates the input to the
>> hash agg, but the output is always estimated to be 200 rows.
>>
>> Rewriting the query to use 0 < (select count(*)..) instead of EXISTS (select
>> * ..) does revert to the old postgres 8.3 plan, although intuitively I would
>> think it to be sub-optimal.
>>
>> The tables in question receive many updates, but never in such a volume as
>> to create enough bloat - plus, the tests I've been running are on a
>> pre-production server without much traffic (so not many updates - probably
>> none in weeks).
>>
>> The server is a Core 2 E7400 dual core with 4GB of ram running linux and a
>> pg 9.0.3 / 8.3.7 (both there, doing migration testing) built from source.
>> Quite smaller than our production server, but I've tested the issue on
>> higher-end hardware and it produces the same results.
>>
>> Any ideas as to how to work around this issue?
>>
>> I can't plug the select count() version everywhere, since I won't be using
>> this form of the query every time (it's generated programatically with an
>> ORM), and some forms of it perform incredibly worse with the select count().
>>
>> Also, any help I can provide to fix it upstream I'll be glad to - I believe
>> (I would have to check) I can even create a dump of the tables (stripping
>> sensitive info of course) - only, well, you'll see the size below - a tad
>> big to be mailing it ;-)
>>
>> pg 9.0 is configured with:
>>
>> work_mem = 64M
>> shared_buffers = 512M
>> temp_buffers = 64M
>> effective_cache_size = 128M
>>
>> pg 8.3.7 is configured with:
>>
>> work_mem = 64M
>> shared_buffers = 100M
>> temp_buffers = 64M
>> effective_cache_size = 128M
>>
>>
>> The query in question:
>>
>>> SELECT member_statistics.member_id
>>>         FROM member_statistics
>>>         WHERE member_statistics.member_id IN ( <<400 ids>> ) AND (EXISTS
>>> (SELECT mat1.tag_id
>>>         FROM member_all_tags_v AS mat1
>>>         WHERE mat1.member_id = member_statistics.member_id AND mat1.tag_id
>>> IN (640, 641, 3637, 3638, 637, 638, 639) AND mat1.polarity >= 90))

hm the regression in and of itself is interesting, but I wonder if you
can get past your issue like this:

SELECT member_statistics.member_id
FROM member_statistics
WHERE member_statistics.member_id IN ( <<400 ids>> ) AND (EXISTS
(SELECT mat1.tag_id
FROM member_all_tags_v AS mat1
WHERE mat1.member_id = member_statistics.member_id AND mat1.tag_id
IN (640, 641, 3637, 3638, 637, 638, 639) AND mat1.polarity >= 90))

changes to:

SELECT member_statistics.member_id
FROM member_statistics
WHERE EXISTS
(
SELECT mat1.tag_id
FROM member_all_tags_v AS mat1
WHERE mat1.member_id = member_statistics.member_id
AND mat1.tag_id
IN (640, 641, 3637, 3638, 637, 638, 639) AND
mat1.polarity >= 90
AND mat1.member_id IN ( <<400 ids>> )
)

also, always try to compare vs straight join version:

SELECT member_statistics.member_id
FROM member_statistics
JOIN VALUES ( <<400 ids>> ) q(member_id) using (member_id)
JOIN
(
SELECT mat1.member_id
FROM member_all_tags_v AS mat1
WHERE mat1.tag_id IN (640, 641, 3637, 3638, 637, 638, 639)
AND mat1.polarity >= 90) p
USING(member_id)
) p using(member_id);

merlin

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Merlin Moncure 2011-03-14 17:59:59 Re: Performance regression from 8.3.7 to 9.0.3
Previous Message Tom Lane 2011-03-14 17:24:33 Re: Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1