Re: FETCH FIRST clause WITH TIES option

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Surafel Temesgen <surafel3000(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, andrew(at)tao11(dot)riddles(dot)org(dot)uk, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FETCH FIRST clause WITH TIES option
Date: 2019-01-15 11:51:29
Message-ID: c1430163-a5b0-2adb-c81c-29d1029cb8cf@2ndquadrant.com
Lists: pgsql-hackers

On 1/15/19 11:07 AM, Surafel Temesgen wrote:
>
>
> On Wed, Jan 2, 2019 at 6:19 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this
> patch should tweak estimates in some way. Currently, the cardinality
> estimate is the same as for plain LIMIT, using the requested number of
> rows. But let's say there are very few large groups - that will
> naturally increase the number of rows produced.
>
> As an example, let's say the subplan produces 1M rows, and there are
> 1000 groups (when split according to the ORDER BY clause).
>
> Can we reliably use the raw statistics of the ORDER BY column in the
> Limit node? It seems to me they can be affected by other operations
> in the subplan, like a filter condition.
>

What do you mean by "raw statistics"? Using stats from the underlying
table is not quite possible, because you might be operating on top of a
join or something like that.

IMHO the best thing you can do is call estimate_num_groups() and combine
that with the number of input rows. That will benefit from ndistinct
coefficients when available, etc. Given the unreliability of grouping
estimates, I've been thinking we should use a multiple of the average
group size (because there may be much larger groups), but that's quite
unprincipled and I'd much rather try without it first.
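
Something like this minimal sketch, say (assuming the PG11-era
estimate_num_groups() signature; the helper name and the padding rule
are just my assumptions, not actual patch code):

#include "postgres.h"
#include "utils/selfuncs.h"		/* estimate_num_groups() */

/*
 * Hypothetical helper: estimate how many rows LIMIT ... WITH TIES
 * returns, given the sort-key expressions and the subplan's row count.
 */
static double
estimate_with_ties_rows(PlannerInfo *root, List *sortExprs,
						double input_rows, double limit_count)
{
	double		ngroups;
	double		avg_group_size;

	ngroups = estimate_num_groups(root, sortExprs, input_rows, NULL);
	avg_group_size = input_rows / Max(ngroups, 1.0);

	/*
	 * WITH TIES emits the requested rows plus the remainder of the
	 * group containing the last row, so pad the plain LIMIT estimate
	 * by one average-sized group, capped at the subplan's output.
	 */
	return Min(input_rows, limit_count + avg_group_size);
}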

But maybe we can do better when there really is a single table to
consider, in which case we might look at MCV lists and estimate the
largest group. That would give us a much better idea of the worst case.
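
Roughly along these lines (again just a sketch, reusing the existing
examine_variable() / get_attstatsslot() machinery; the helper and the
assumption that the sort key is a single expression with stats are
hypothetical):

#include "postgres.h"
#include "catalog/pg_statistic.h"	/* STATISTIC_KIND_MCV */
#include "utils/lsyscache.h"		/* get_attstatsslot() */
#include "utils/selfuncs.h"			/* examine_variable() */

/*
 * Hypothetical helper: estimate the largest group size for a sort-key
 * expression from the most common value in its MCV list, falling back
 * to the average group size when no MCV stats are available.
 */
static double
estimate_largest_group(PlannerInfo *root, Node *sortExpr,
					   double input_rows, double avg_group_size)
{
	VariableStatData vardata;
	double		result = avg_group_size;

	examine_variable(root, sortExpr, 0, &vardata);

	if (HeapTupleIsValid(vardata.statsTuple))
	{
		AttStatsSlot sslot;

		if (get_attstatsslot(&sslot, vardata.statsTuple,
							 STATISTIC_KIND_MCV, InvalidOid,
							 ATTSTATSSLOT_NUMBERS))
		{
			double		max_freq = 0.0;
			int			i;

			/*
			 * The largest MCV frequency bounds the largest group.
			 * NB: frequencies are fractions of the whole table, so
			 * with filters above the scan this is an upper bound.
			 */
			for (i = 0; i < sslot.nnumbers; i++)
				max_freq = Max(max_freq, (double) sslot.numbers[i]);

			result = Max(result, max_freq * input_rows);
			free_attstatsslot(&sslot);
		}
	}

	ReleaseVariableStats(vardata);
	return result;
}

That it over-estimates under filtering is arguably fine here, since the
point is bounding the worst case.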

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
