From: | Clem Dickey <dickeycl(at)us(dot)ibm(dot)com> |
---|---|
To: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Re: GROUP BY with reasonable timings in PLAN but unreasonable execution time |
Date: | 2011-08-04 01:53:19 |
Message-ID: | j1cu2g$1r76$1@news.hub.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
On 08/03/2011 06:29 AM, Robert Haas wrote:
>> b. the Merge Join cost estimator did a poor job with the data it was given:
>>
>> In function eqjoinsel_inner there are two cases (1) ANALYZE data is
>> available for both sides of the join and (2) ANALYZE data is missing for one
>> or both sides. Due to the GROUP BY processing described above, ANALYZE data
>> was available for "t" but not for "SELECT * FROM t GROUP BY ...".
>>
>> The logic in that case is "use the column with the most distinct values" to
>> estimate selectivity. The default number of distinct values for a column
>> with no data (DEFAULT_NUM_DISTINCT) is 200. In my join the number of values
>> was:
>>
>> col in GROUP BY in table t
>> j 200 1
>> k 200 1
>> x 200 10
>> y 200 1000
>> z 200 30
>>
>> In 4 of the 5 columns the default value had more distinct values, and the
>> combined selectivity (chance that two arbitrary rows would have a join
>> match) was (1/200)^4 * 1/1000. Very small. The error is, IMO, that the code
>> does not distinguish known numbers from default numbers. A comment in the
>> code acknowledges this:
>
> I'm not sure I understand what you're getting at here, unless the idea
> is to make get_variable_numdistinct() somehow indicate to the caller
> whether it had to punt. That might be worth doing.
Yes, the first step is to make "punt" a separate indicator. The second
would be to make good use of that indicator. As it is now, with "punt"
being a possible data value, there two types of errors:
False negative (code treats DEFAULT_NUM_DISTINCT as ordinary case when
it is special):
I wanted eqjoinsel_inner() to treat "punt" specially: to use the value
from the known side of the JOIN when the other side is unknown. The
current behavior, although not ideal, is the expected use of a default
value.
False positive (code treats DEFAULT_NUM_DISTINCT as special case when it
is ordinary):
eqjoinsel_semi() and estimate_hash_bucketsize() treat
DEFAULT_NUM_DISTINCT specially. This behavior is less defensible than
false positive, since a valid numeric value is being re-used as a flag.
I suggest wrapping the value in a struct (to avoid accidental use) and
using macros for read access.
typedef struct {
double value; // negative means "unknown"
} num_distinct_t;
#define IS_NUM_DISTINCT_DEFINED(nd) ((nd).value >= 0)
#define NUM_DISTINCT_VALUE(nd) ((nd).value)
- Clem Dickey
P.S. Congratulations on displacing MySQL in Mac OS X Lion Server.
From | Date | Subject | |
---|---|---|---|
Next Message | Jayadevan M | 2011-08-04 03:42:28 | Re: Parameters for PostgreSQL |
Previous Message | Robert Klemme | 2011-08-03 17:30:46 | Re: Performance penalty when using WITH |