Quick Links

Re: Re: GROUP BY with reasonable timings in PLAN but unreasonable execution time

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Clem Dickey <dickeycl(at)us(dot)ibm(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Re: GROUP BY with reasonable timings in PLAN but unreasonable execution time
Date:	2011-08-03 13:29:16
Message-ID:	CA+TgmoaZtmkotTT-PMB7iNupzrHEe4Hp+MRpD7zcfcxmrLqStA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

On Fri, Jul 8, 2011 at 9:33 PM, Clem Dickey <dickeycl(at)us(dot)ibm(dot)com> wrote:
> a. The Join cost estimators could have been given more information
>
> The functions which estimate JOIN selectivity (e.g. the chance that tuples
> will match in an equijoin, for instance) use data produced by ANALYZE. But
> the SELECT .. GROUP BY does not propagate ANALYZE data from the columns of
> its input relation to its output relation. That is too bad, because the
> column value statistics (number of unique values) would have improved
> selectivity estimates for all three join plans (merge join, nested loop, and
> hash join).

Yeah, I've had this same thought. In fact, I think that it would
probably be an improvement to pass through not just the number of
unique values but the MCVs and frequencies of the non-GROUP-BY
columns. Of course, for the grouping columns, we ought to let
n_distinct = -1 pop out. Granted, the GROUP BY might totally change
the data distribution, so relying on the input column statistics to be
meaningful could be totally wrong, but on average it seems more likely
to give a useful answer than a blind stab in the dark. I haven't
gotten around to doing anything about this, but it seems like a good
idea.

> b. the Merge Join cost estimator did a poor job with the data it was given:
>
> In function eqjoinsel_inner there are two cases (1) ANALYZE data is
> available for both sides of the join and (2) ANALYZE data is missing for one
> or both sides. Due to the GROUP BY processing described above, ANALYZE data
> was available for "t" but not for "SELECT * FROM t GROUP BY ...".
>
> The logic in that case is "use the column with the most distinct values" to
> estimate selectivity. The default number of distinct values for a column
> with no data (DEFAULT_NUM_DISTINCT) is 200. In my join the number of values
> was:
>
> col in GROUP BY in table t
> j 200 1
> k 200 1
> x 200 10
> y 200 1000
> z 200 30
>
> In 4 of the 5 columns the default value had more distinct values, and the
> combined selectivity (chance that two arbitrary rows would have a join
> match) was (1/200)^4 * 1/1000. Very small. The error is, IMO, that the code
> does not distinguish known numbers from default numbers. A comment in the
> code acknowledges this:
>
> "XXX Can we be smarter if we have an MCV list for just one side?"
>
> But it concludes
>
> "It seems that if we assume equal distribution for the other side, we end up
> with the same answer anyway."
>
> I don't think that is the case. Preferring a known value, where one exists,
> would provide a better estimate of the actual range of the data. Indeed, the
> var_eq_non_const in the same file (used by the nested loop join estimator)
> does essentially that.

I'm not sure I understand what you're getting at here, unless the idea
is to make get_variable_numdistinct() somehow indicate to the caller
whether it had to punt. That might be worth doing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: GROUP BY with reasonable timings in PLAN but unreasonable execution time at 2011-07-09 01:33:15 from Clem Dickey

Responses

Re: Re: GROUP BY with reasonable timings in PLAN but unreasonable execution time at 2011-08-04 01:53:19 from Clem Dickey

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Robert Ayrapetyan	2011-08-03 14:39:16	Re: Performance die when COPYing to table with bigint PK
Previous Message	Li Jin	2011-08-03 13:27:11	Re: Performance penalty when using WITH