Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

From: "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date: 2017-12-19 18:52:19
Message-ID: 0f552dee-2b87-2a74-7862-7b0ff095fc82@blackducksoftware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 12/10/17 18:14, Tomas Vondra wrote:
>
> I've done a simple experiment today - computing a hash for every uint32
> value, and checking how many distinct hash values that produces. For the
> 4.2 billion values the results look like this:
>
> second key ndistinct ndistinct/nkeys
> 3.380 i=100000000 nset=98829159 0.99
> 6.240 i=200000000 nset=195351181 0.98
> 9.106 i=300000000 nset=289623103 0.97
> 11.932 i=400000000 nset=381695003 0.95
> 14.814 i=500000000 nset=471621355 0.94
> 17.706 i=600000000 nset=559452278 0.93
> 20.577 i=700000000 nset=645242762 0.92
> 23.496 i=800000000 nset=729036217 0.91
> 26.460 i=900000000 nset=810879821 0.90
> 29.380 i=1000000000 nset=890818778 0.89
> 32.331 i=1100000000 nset=968898478 0.88
> 35.282 i=1200000000 nset=1045164189 0.87
> 38.262 i=1300000000 nset=1119654810 0.86
> 41.251 i=1400000000 nset=1192424766 0.85
> 44.258 i=1500000000 nset=1263482593 0.84
> 47.268 i=1600000000 nset=1332897449 0.83
> 50.305 i=1700000000 nset=1400717923 0.82
> 53.356 i=1800000000 nset=1466962823 0.81
> 56.425 i=1900000000 nset=1531660191 0.81
> 59.489 i=2000000000 nset=1594856205 0.80
> 62.593 i=2100000000 nset=1656588855 0.79
> 65.706 i=2200000000 nset=1716895530 0.78
> 68.829 i=2300000000 nset=1775809288 0.77
> 71.966 i=2400000000 nset=1833353377 0.76
> 75.123 i=2500000000 nset=1889558599 0.76
> 78.271 i=2600000000 nset=1944463039 0.75
> 81.418 i=2700000000 nset=1998096496 0.74
> 84.574 i=2800000000 nset=2050490745 0.73
> 87.711 i=2900000000 nset=2101666393 0.72
> 90.852 i=3000000000 nset=2151669155 0.72
> 93.986 i=3100000000 nset=2200517580 0.71
> 97.084 i=3200000000 nset=2248232772 0.70
> 100.172 i=3300000000 nset=2294849331 0.70
> 103.232 i=3400000000 nset=2340389131 0.69
> 106.273 i=3500000000 nset=2384875319 0.68
> 109.272 i=3600000000 nset=2428339639 0.67
> 112.260 i=3700000000 nset=2470798655 0.67
> 115.199 i=3800000000 nset=2512271730 0.66
> 118.125 i=3900000000 nset=2552788321 0.65
> 121.029 i=4000000000 nset=2592379529 0.65
> 123.895 i=4100000000 nset=2631056297 0.64
> 126.726 i=4200000000 nset=2668843329 0.64
> 129.397 loops 4294967295 set 2703915179 (0.63)
>
> That means we only ever generate about 64% of the possible hash keys.
> That doesn't seem particularly healthy I guess, but for small hash
> tables (with fewer than 2^32 buckets) that may not be an issue, as some
> of the values will wrap because of the modulo.

FWIW, changing hashint8() to

int64 val = PG_GETARG_INT64(0);

if (val <= UINT32_MAX && val >= 0)
return hash_uint32((uint32) val);
else
return hash_any((unsigned char *) &val, sizeof(val));

fixes the problem for me on all of my data sets. My conjecture is that the
existing implementation

uint32 lohalf = (uint32) val;
uint32 hihalf = (uint32) (val >> 32);

lohalf ^= (val >= 0) ? hihalf : ~hihalf;

biases lohalf when hashing collections of large-magnitude negative numbers
(like my data sets) because the individual bits of lohalf do not have equal
probability of being toggled by the exclusive-or. For example, given a
value of -9223261146507558080 (0x8000770b48a68340), 15 of the 16 most
significant bits in lohalf will toggle, but only half of the least significant
will toggle.

-- todd

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Robert Haas 2017-12-19 20:01:03 Re: vacuum vs heap_update_tuple() and multixactids
Previous Message Andres Freund 2017-12-19 18:42:11 Re: vacuum vs heap_update_tuple() and multixactids