Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date: 2018-01-29 19:32:20
Message-ID: 20180129193219.fk5vnliembhacuys@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

On 2018-01-29 13:45:10 -0500, Tom Lane wrote:
> Also, the fact that you needed one more ORDER BY in testing on a
> single machine is already proof of my worry about the regression
> tests, and it'll only get worse when testing across a variety
> of machines.

That's an entirely unconvincing argument. A lot of changes to hashtable
code will result in ordering changes, and has done so in the past. Sure
there occasionally was need for a followup ORDER BY addition, but that's
not that bad. The fact that only a single order by was needed when
entirely changing the hash function output (by randomizing it) doesn't
at all seem to suggest that it's a huge problem.

On 2018-01-29 14:11:26 -0500, Tom Lane wrote:
> One other point here is that it's not really clear to me what a randomly
> varying IV is supposed to accomplish. Surely we're not intending that
> it prevents somebody from crafting a data set that causes bad hash
> performance.

I do think we should make that harder.

> If a user with DB access wants to cause a performance
> problem, there are and always will be plenty of other avenues to making
> that happen.

Sure, but that's different from somebody triggering, e.g. by website
input, such behaviour.

> If the idea is that for a data set that otherwise would have
> bad hash performance, choosing a different IV would (almost always) fix
> it, that sounds good but you're ignoring the inverse case: for a data set
> that works fine, there would be some choices of IV that create a problem
> where there was none before. I see no reason to think that the
> probability of the former kind of situation is higher than the latter.

It's easily intentionally triggerable vs. not.

> So I'm on board with using the extended hash functions when available,
> but I'm not convinced that a varying IV buys us anything but trouble.

I think we really need both. A constant IV of e.g. 0 for a single-column
hashtable doesn't really fix much, even though it does improve hash
quality a bit for multi-column indexes. A random IV does address issues
due to a lot of tuples falling into the same bucket, but does not
address actual collisions. Both together are pretty convincing tho.

Greetings,

Andres Freund

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Greg Stark 2018-01-29 22:16:24 Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Previous Message Andres Freund 2018-01-29 19:26:47 Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop