Re: distinct estimate of a hard-coded VALUES list

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: distinct estimate of a hard-coded VALUES list
Date: 2016-08-22 22:38:53
Message-ID: 27c2ed94-1cb3-45d8-36b9-0734c1a3dce0@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/22/2016 07:42 PM, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Sat, Aug 20, 2016 at 4:58 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>>>> On Thu, Aug 18, 2016 at 2:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>>> It does know it, what it doesn't know is how many duplicates there are.
>>>
>>>> Does it know whether the count comes from a parsed query-string list/array,
>>>> rather than being an estimate from something else? If it came from a join,
>>>> I can see why it would be dangerous to assume they are mostly distinct.
>>>> But if someone throws 6000 things into a query string and only 200 distinct
>>>> values among them, they have no one to blame but themselves when it makes
>>>> bad choices off of that.
>>>
>>> I am not exactly sold on this assumption that applications have
>>> de-duplicated the contents of a VALUES or IN list. They haven't been
>>> asked to do that in the past, so why do you think they are doing it?
>>
>> It's hard to know, but my intuition is that most people would
>> deduplicate. I mean, nobody is going to want to their query generator
>> to send X IN (1, 1, <repeat a zillion more times>) to the server if it
>> could have just sent X IN (1).
>
> Also, if we patch it this way and somebody has a slow query because of a
> lot of duplicate values, it's easy to solve the problem by
> de-duplicating. But with the current code, people that have the
> opposite problem has no way to work around it.
>

I certainly agree it's better when a smart user can fix his query plan
by deduplicating the values than when we end up generating a poor plan
due to assuming some users are somewhat dumb.

I wonder how expensive would it be to actually count the number of
distinct values - there certainly are complex data types where the
comparisons are fairly expensive, but I would not expect those to be
used in explicit VALUES lists. Also, maybe there's some sufficiently
accurate estimation approach - e.g. for small number of values we can
compute the number of distinct values directly (and it's still going to
be fairly cheap), while for larger number we could probably sample the
values similarly to what ANALYZE does.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2016-08-22 22:47:37 Re: UTF-8 docs?
Previous Message Gabriele Bartolini 2016-08-22 22:34:58 pg_receivexlog does not report flush position with --synchronous