From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date: 2017-12-06 19:32:24
Message-ID: 263b03b1-3e1c-49ca-165a-8ac6751419c4@2ndquadrant.com
Lists: pgsql-bugs
On 12/05/2017 11:01 PM, Tomas Vondra wrote:
>
> Based on the initial discussion, the problem here is twofold.
>
> Firstly, the code assumes that if it gets a certain number of bucket
> collisions (essentially, the initial bucket and a certain number of
> neighboring buckets already being full), making the table larger is
> guaranteed to reduce the number of collisions.
>
> That is obviously untrue for duplicate hash values, but it may also
> happen for distinct hash values that form a chain (think of a sequence
> of hash values 1, 2, 3, 4, 5, 6, 7, ..., K - growing the table is
> never going to fix that).
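To make the first point concrete, here's a toy model of the heuristic
(all names below are made up for the demo, this is not the actual
simplehash.h code): grow the table whenever an insert has to probe more
than a fixed number of occupied buckets. With linear probing, k entries
sharing a hash value always end up in k consecutive buckets, so the
probe distance of the k-th duplicate is k-1 regardless of the table
size, and the grow requests never stop:

/*
 * Toy model (hypothetical names, not simplehash.h): grow whenever an
 * insert probes more than MAX_DISPLACEMENT occupied buckets.  With
 * duplicate hashes the probe distance does not depend on table size,
 * so every size hits the threshold after the same number of inserts.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_DISPLACEMENT 8

typedef struct { uint32_t hash; int used; } Bucket;

/* Returns the probe distance used, or -1 to request a table grow. */
static int
insert(Bucket *tab, uint32_t mask, uint32_t hash)
{
    uint32_t pos = hash & mask;
    int dist = 0;

    while (tab[pos].used)
    {
        if (++dist > MAX_DISPLACEMENT)
            return -1;      /* caller would double the table and retry */
        pos = (pos + 1) & mask;
    }
    tab[pos].hash = hash;
    tab[pos].used = 1;
    return dist;
}

int
main(void)
{
    for (uint32_t size = 1024; size <= 1048576; size *= 2)
    {
        Bucket *tab = calloc(size, sizeof(Bucket));
        int i;

        /* keep inserting entries that all hash to the same value */
        for (i = 0; i < 1000; i++)
            if (insert(tab, size - 1, 42) < 0)
                break;

        printf("size %7u: grow requested after %d inserts\n", size, i);
        free(tab);
    }
    return 0;
}

Every table size prints the same number - which is exactly the resize
loop from the bug report: for duplicates, growing never helps.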
I've managed to construct (by sheer brute force) an example that breaks
simplehash via such a chain. It triggers the growth by having a
sequence of distinct but consecutive hash values, e.g. 1, 2, 3, 4, ...,
1000, and then trying to insert a value with a duplicate hash (say, 4).
That triggers the "emptydist" condition (as opposed to "insertdist", as
in the other example).
This means there are very few collisions (no hash value matches more
than 2 distinct values), and all the hash values are "at the beginning
of the range", so growing the hash table can't possibly break up the
hash sequences.
The generated datasets are quite large (16MB each), so I've placed them
here (along with the ugly C code to generate them):
https://github.com/tvondra/simplehash-break
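The generator is nothing clever - it's plain brute force, roughly like
the following sketch. (FNV-1a stands in for the hash function here; the
real code at the link above targets PostgreSQL's actual hash, and the
names below are made up for illustration.)

/*
 * Rough sketch of the brute-force idea (stand-in hash, made-up names;
 * see the repository above for the real generator): enumerate candidate
 * strings and emit one per hash value in [0, 2^20) until ~97.5% of the
 * range is covered.
 */
#include <stdio.h>
#include <stdint.h>

#define RANGE (1 << 20)

static uint32_t
fnv1a(const char *s)
{
    uint32_t h = 2166136261u;

    for (; *s; s++)
    {
        h ^= (uint8_t) *s;
        h *= 16777619u;
    }
    return h;
}

int
main(void)
{
    static char seen[RANGE];
    long covered = 0;
    char buf[32];

    for (long i = 0; covered < RANGE * 0.975; i++)
    {
        uint32_t h;

        snprintf(buf, sizeof(buf), "val%ld", i);
        h = fnv1a(buf) % RANGE;
        if (!seen[h])
        {
            seen[h] = 1;
            covered++;
            puts(buf);      /* one string per covered hash value */
        }
    }
    fprintf(stderr, "covered %ld of %d hash values\n", covered, RANGE);
    return 0;
}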
Each CSV file contains ~1M strings with distinct hashes between 0 and
1048576 (essentially covering about 97.5% of the range). There are
"chains" of up to ~300 consecutive hash values. This works:
CREATE TABLE hash_text (val text);
COPY hash_text FROM '/tmp/data1.csv';
SET work_mem = '1GB';
SELECT DISTINCT val FROM hash_text;
but this does not:
CREATE TABLE hash_text (val text);
COPY hash_text FROM '/tmp/data1.csv';
COPY hash_text FROM '/tmp/data2.csv';
SET work_mem = '1GB';
SELECT DISTINCT val FROM hash_text;
In fact, only a small subset of the second data set should be needed.
FWIW I first tried to do the same thing with bigint values, but no
matter what I did I was unable to generate a dataset covering more than
~60% of the range (essentially, hashint8 only ever produces about 60%
of the values in [0,1000000], which likely increases the collision
rate).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services