From: | "Chris Kratz" <chris(dot)kratz(at)vistashare(dot)com> |
---|---|
To: | Decibel! <decibel(at)decibel(dot)org> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Incorrect estimates on correlated filters |
Date: | 2008-08-13 18:45:05 |
Message-ID: | 3642025c0808131145x61b9aec5l2d9cb0268d28e9ff@mail.gmail.com |
Lists: pgsql-performance
On Wed, Aug 13, 2008 at 10:59 AM, Decibel! <decibel(at)decibel(dot)org> wrote:
> On Aug 12, 2008, at 4:59 PM, Chris Kratz wrote:
>
>> Ran into a recurring performance problem with some report queries again
>> today. In a nutshell, we have filters on either multiple joined tables, or
>> multiple columns on a single table that are highly correlated. So, the
>> estimates come out grossly incorrect (the planner has no way to know they
>> are correlated). 2000:1 for one I'm looking at right now. Generally this
>> doesn't matter, except in complex reporting queries like these when this is
>> the first join of 40 other joins. Because the estimate is wrong at the
>> lowest level, it snowballs up through the rest of the joins causing the
>> query to run very, very slowly. In many of these cases, forcing nested
>> loops off for the duration of the query fixes the problem. But I have a
>> couple that still are painfully slow and shouldn't be.
>>
>> I've been reading through the archives with others having similar problems
>> (including myself a year ago). Am I right in assuming that at this point
>> there is still little we can do in postgres to speed up this kind of query?
>> Right now the planner has no way to know the correlation between different
>> columns in the same table, let alone columns in different tables. So, it
>> just assumes no correlation and returns incorrectly low estimates in cases
>> like these.
>>
>> The only solution I've come up with so far is to materialize portions of
>> the larger query into subqueries with these correlated filters which are
>> indexed and analyzed before joining into the larger query. This would keep
>> the incorrect estimates from snowballing up through the chain of joins.
>>
>> Are there any other solutions to this problem?
>>
>
>
> Well... you could try and convince certain members of the community that we
> actually do need some kind of a query hint mechanism... ;)
>
> I did make a suggestion a few months ago that involved sorting a table on
> different columns and recording the correlation of other columns. The scheme
> isn't perfect, but it would help detect cases like a field populated by a
> sequence and another field that's an insert timestamp; those two fields would
> correlate highly, and you should even be able to correlate the two
> histograms; that would allow you to infer that most of the insert times for
> _id's between 100 and 200 will be between 2008-01-01 00:10 and 2008-01-01
> 00:20, for example.
> --
> Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
> Give your computer some brain candy! www.distributed.net Team #1828
>
>
Thanks for the reply,
Yes, I know hints are frowned upon around here, though I'd love to have them
or something equivalent for this particular query, just so the customer can
run their important reports. As it is, the query is unrunnable.
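For the record, the nested loop workaround from my original mail is just a
per-transaction planner toggle, something along these lines (the table and
column names here are made up; the SET LOCAL line is the only real point):

BEGIN;
-- turn off nested loop joins for this transaction only
SET LOCAL enable_nestloop = off;
-- the actual report query goes here; names are illustrative
SELECT r.client_id, count(*)
  FROM registration r
  JOIN activity a ON a.registration_id = r.id
 WHERE r.enrolled_date >= '2008-01-01'
   AND a.activity_date >= '2008-01-01'
 GROUP BY r.client_id;
COMMIT;  -- enable_nestloop reverts automatically at commit or rollback

That is enough for most of these reports, but as I said, a couple of them are
still painfully slow even with nested loops off.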
Unfortunately, I don't think the sorting idea would help in the one case I'm
looking at, which involves filters on two tables that are joined together.
The filters happen to be correlated such that about 95% of the rows from each
filtered table are actually returned after the join, yet the planner thinks
we will get only 1 row back.
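So far the only thing that has really helped is the materialization approach
from my first mail: pull the correlated filter out into its own step so the
planner gets real statistics before the rest of the joins. Roughly (again,
the names are only illustrative):

-- materialize the correlated portion first
CREATE TEMP TABLE filtered_reg AS
SELECT r.*
  FROM registration r
 WHERE r.enrolled_date >= '2008-01-01'
   AND EXISTS (SELECT 1 FROM activity a
                WHERE a.registration_id = r.id
                  AND a.activity_date >= '2008-01-01');

CREATE INDEX filtered_reg_id_idx ON filtered_reg (id);
ANALYZE filtered_reg;  -- the planner now sees the true row count

-- the rest of the report then joins against filtered_reg instead of
-- repeating the original two-table join, so the bad estimate can't
-- snowball up through the remaining joins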
I do have to find a way to make these queries runnable. I'll keep looking.
Thanks,
-Chris