Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date: 2017-11-28 04:51:50
Message-ID: CAEepm=10VqZm7JF2YSC9CV2uYzVvVeVUB2yGtXYf5TZGmCsSmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Nov 28, 2017 at 5:03 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> You beat me to it --- after looking at simplehash.h I'd guessed that
> either the SH_GROW_MAX_DIB or SH_GROW_MAX_MOVE code path was causing
> an infinite loop, but I'd not gotten to determining which one yet.
>
> I'd ask what's the theory behind SH_GROW_MAX_MOVE, as well. Neither
> of them are obviously loop-proof.
>
> Note that the sample data has a lot of collisions:
>
> regression=# select hashint8(val), count(*) from reproducer group by 1 order by 2 desc;
> hashint8 | count
> -------------+-------
> 441526644 | 2337
> -1117776826 | 1221
> -1202007016 | 935
> -2068831050 | 620
> 1156644653 | 538
> 553783815 | 510
> 259780770 | 444
> 371047036 | 394
> 915722575 | 359
> ... etc etc ...
>
> It's evidently more complicated than just that the code fails with
> more than SH_GROW_MAX_DIB duplicate hashcodes, but I suspect not
> by a lot. There needs to be a safety valve that prevents letting
> the hash fill factor approach zero, which is what's happening in
> this test case.

Yeah. If you count distinct hashint8(val) of *distinct* values, you get:

postgres=# select hashint8(val), count(*) from (select distinct val
from reproducer) ss group by 1 order by 2 desc limit 1;
hashint8 | count
------------+-------
1288181237 | 26
(1 row)

It doesn't matter how many bits of that you use, you'll get at least
26 collisions, so our loop can't terminate just by increasing the mask
size. My guess is that you'd either need a defence based on something
like a secondary hash function, or at least a dynamic SH_GROW_MAX_DIB
that wouldn't allow the fill factor to plummet as you say. A dataset
could be found that would exceed any static value of SH_GROW_MAX_DIB
by brute force. In this case, considering the definition of hashint8,
there are more straightforward ways to find distinct bigint values
that hash to the same value... just swap some bits between the high
and low words, or something like that.

--
Thomas Munro
http://www.enterprisedb.com

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Tomas Vondra 2017-11-28 13:08:10 Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Previous Message Tom Lane 2017-11-28 04:03:03 Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop