From: | Anne Rosset <arosset(at)collab(dot)net> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Unexpected query plan results |
Date: | 2009-06-02 15:16:09 |
Message-ID: | 4A254239.7070306@collab.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Robert Haas wrote:
>On Mon, Jun 1, 2009 at 4:53 PM, Anne Rosset <arosset(at)collab(dot)net> wrote:
>
>
>>>On Mon, Jun 1, 2009 at 2:14 PM, Anne Rosset <arosset(at)collab(dot)net> wrote:
>>>
>>>
>>>>SELECT SUM(1) FROM item WHERE is_deleted = 'f'; sum --------- 1824592 (1
>>>>row)
>>>>SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641
>>>></sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701>'; sum --------
>>>>122412 (1 row)
>>>>SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641
>>>></sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701>' AND is_deleted
>>>>=
>>>>'f'; sum ----- 71 (1 row)
>>>>SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641
>>>></sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701>' AND is_deleted
>>>>=
>>>>'t'; sum -------- 122341 (1 row)
>>>>
>>>>
>>The item table has 2324829 rows
>>
>>
>
>So 1824592/2324829 = 78.4% of the rows have is_deleted = false, and
>0.06709% of the rows have the relevant folder_id. Therefore the
>planner assumes that there will be 2324829 * 78.4% * 0.06709% =~
>96,000 rows that satisfy both criteria (the original explain had
>97,000; there's some variability due to the fact that the analyze only
>samples a random subset of pages), but the real number is 71, leading
>it to make a very bad decision. This is a classic "hidden
>correlation" problem, where two columns are correlated but the planner
>doesn't notice, and you get a terrible plan.
>
>Unfortunately, I'm not aware of any real good solution to this
>problem. The two obvious approaches are multi-column statistics and
>planner hints; PostgreSQL supports neither. There are various
>possible hacks that aren't very satisfying, such as:
>
>1. Redesign the application to put the deleted records in a separate
>table from the non-deleted records. But if the deleted records still
>have child records in other tables, this won't fly due to foreign key
>problems.
>
>2. Inserting a clause that the optimizer doesn't understand to fool it
>into thinking that the scan on the item table is much more selective
>than is exactly the case. I think adding (item.id + 0) = (item.id +
>0) to the WHERE clause will work; the planner will brilliantly
>estimate the selectivity of that expression as one in 200. The
>problem with this is that it will likely lead to a better plan in this
>particular case, but for other folder_ids it may make things worse.
>There's also no guarantee that a future version of PostgreSQL won't be
>smart enough to see through this type of sophistry, though I think
>you're safe as far as the forthcoming 8.4 release is concerned.
>
>3. A hack that makes me gag, but it actually seems to work...
>
>CREATE OR REPLACE FUNCTION item_squash(varchar, boolean) RETURNS varchar[] AS $$
>SELECT array[$1, CASE WHEN $2 THEN 'true' ELSE 'false' END]
>$$ LANGUAGE sql IMMUTABLE;
>
>CREATE INDEX item_squash_idx ON item (item_squash(folder_id, is_deleted));
>
>...and then remove "folder_id = XXX AND is_deleted = YYY" from your
>query and substitute "item_squash(folder_id, is_deleted) =
>item_squash(XXX, YYY)". The expresson index forces the planner to
>gather statistics on the distribution of values for that expression,
>and if you then write a query using that exact same expression the
>planner can take advantage of it.
>
>...Robert
>
>
Thanks a lot Robert. Not sure how we will tackle this but at least now
we have an explanation. From what I read, results won't improved in 8.4.
Is that correct?
Thanks,
Anne
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2009-06-02 15:41:11 | Re: Unexpected query plan results |
Previous Message | Shaul Dar | 2009-06-02 13:31:03 | Re: Best way to load test a postgresql server |