From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Jan Kort <jan(dot)kort(at)genetics(dot)nl>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE
Date: 2021-05-19 02:13:09
Message-ID: CAApHDvpD1Vuhd3kSKjV-7Ae_8PMPcPG=kxfJOS+ZWVkutCOXgg@mail.gmail.com
Lists: pgsql-bugs
On Wed, 19 May 2021 at 12:35, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> I'll debug this and see if I can see what's going on.
This is down to the fact that when there are 2 rows in the "three"
table we have a histogram in the stats, and we can get the upper bound
just by looking at the final histogram bucket.
However, when there are 3 rows, or more accurately, when there's a
duplicate value, we don't build a histogram and just have a
most-common-values list instead. I've not yet checked the logic for
what we include in the MCV list, but it appears that in this case we
only store the value that's duplicated, i.e. the value 1. We calculate
the end bound of the merge join with that value, which is a bad choice
as it makes the end bound appear very close to, or the same as, the
start bound, making the merge join seem very cheap.
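
(For anyone trying this at home: the original table definition isn't quoted
in this message, so here's a minimal sketch of the shape the snippets below
assume. The first column name is a guess; only million_id matters for the
stats being inspected.)

create table three (
    id int,          -- guessed column name; not referenced below
    million_id int   -- the column whose stats we inspect in pg_stats
);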
truncate three;
insert into three values(1,1),(2,1),(3,1000000);
analyze three;
select most_common_vals, histogram_bounds from pg_stats
where tablename = 'three' and attname = 'million_id';
-[ RECORD 1 ]----+----
most_common_vals | {1}
histogram_bounds |
truncate three;
insert into three values(1,1),(3,1000000);
analyze three;
select most_common_vals, histogram_bounds from pg_stats
where tablename = 'three' and attname = 'million_id';
-[ RECORD 1 ]----+------------
most_common_vals |
histogram_bounds | {1,1000000}
I'm not really sure of the best way to make this better. It seems like
an unfortunate bad case. I'm not sure if it'd be better to try to be
more inclusive when building MCV lists, but then, what would the logic
be when we don't have enough slots to store more items?
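
Just to sketch what "more inclusive" could buy us, here's a variation of the
case above where the high value is also duplicated, so it gets into the MCV
list as well. The output in the comments is what I'd expect from the current
ANALYZE logic rather than verified output:

truncate three;
insert into three values(1,1),(2,1),(3,1000000),(4,1000000);
analyze three;
select most_common_vals, histogram_bounds from pg_stats
where tablename = 'three' and attname = 'million_id';
-- expected (unverified):
-- most_common_vals | {1,1000000}
-- histogram_bounds |
-- With 1000000 present in the MCV list, the upper bound is again visible
-- to the planner even though there's no histogram.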
There is some code in get_variable_range() that's #ifdef'd out which
would get the actual range, when possible. There's a comment explaining
why we don't do that:
/*
 * XXX It's very tempting to try to use the actual column min and max, if
 * we can get them relatively-cheaply with an index probe. However, since
 * this function is called many times during join planning, that could
 * have unpleasant effects on planning speed. Need more investigation
 * before enabling this.
 */
#ifdef NOT_USED
    if (get_actual_variable_range(root, vardata, sortop, collation, min, max))
        return true;
#endif
For this case, it might have been ok to do that, since we have
RestrictInfo.scansel_cache, so it does not seem as likely that we'd end
up doing get_actual_variable_range() too often. However, there are
likely other cases that are not as well cached, which we could make too
slow if we did call get_actual_variable_range().
That makes me think the best fix would be to do something better
during ANALYZE and maybe try to include some more upper-bound values in
the MCV list. I'm not yet too sure what drawbacks there might be from
doing that.
David