From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.1 support for hashing arrays
Date: 2011-05-20 05:43:39
Message-ID: BANLkTinGShCT-+bBDDuGPP=N3PQKYZ-6eQ@mail.gmail.com
Lists: pgsql-hackers
On 19 May 2011 15:33, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, May 17, 2011 at 2:44 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>> The algorithm for this was discussed in the original thread
>> (http://archives.postgresql.org/pgsql-hackers/2010-10/msg02050.php)
>> but I don't think that a satisfactory conclusion was really reached.
>> In particular, it is way too easy to come up with pathological cases
>> that defeat the hashing algorithm, for example:
>>
>> CREATE TABLE foo(a int[][]);
>> INSERT INTO foo SELECT array_fill(i, ARRAY[8,8])
>> FROM generate_series(1,10000) g(i);
>>
>> All 10000 arrays are different, but they all have the same hash value
>> (0), so if the query optimiser chooses to hash the arrays, the
>> performance will be very poor.
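[A minimal C sketch of why this example collapses, assuming the pre-patch combining step rotated the running hash left by one bit and XORed in each element's hash — an assumption for illustration, not taken from the patch. With 8x8 = 64 identical element hashes, each of the 32 rotation amounts occurs exactly twice, so the XORs cancel pairwise and the result is 0 for any element value:]

```c
#include <stdint.h>

/* Hypothetical reconstruction of the pre-patch combining step:
 * rotate the running result left one bit, then XOR in the next
 * element's hash.  With 64 identical element hashes, every rotation
 * amount mod 32 occurs exactly twice, so the XORs cancel pairwise
 * and the result is 0 regardless of the element value. */
uint32_t hash_array_rotxor(const uint32_t *hashes, int n)
{
    uint32_t result = 0;

    for (int i = 0; i < n; i++)
    {
        result = (result << 1) | (result >> 31);
        result ^= hashes[i];
    }
    return result;
}
```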
>>
>> A few people on that thread (myself included -
>> http://archives.postgresql.org/pgsql-hackers/2010-11/msg00123.php)
>> suggested using the multiply-by-31 algorithm but I think I failed to
>> properly make the case for it. Having given it some further thought, I
>> think there are some very sound mathematical reasons why that
>> algorithm performs well:
>>
>> The algorithm is to take the current hash total, multiply it by 31 and
>> then add on the hash of the next element. The final result is a
>> polynomial sum, where each element's hash value is multiplied by a
>> different power of 31.
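[The combining step described above can be sketched in C as follows — a sketch of the technique, not the patch itself; unsigned overflow supplies the mod-2^32 arithmetic:]

```c
#include <stdint.h>

/* Sketch of the multiply-by-31 combining step.  The final result is
 * the polynomial sum hashes[0]*31^(n-1) + ... + hashes[n-1]*31^0,
 * with all arithmetic implicitly modulo 2^32 (unsigned wraparound).
 * 31*x is computed as (x << 5) - x. */
uint32_t hash_array_mult31(const uint32_t *hashes, int n)
{
    uint32_t result = 0;

    for (int i = 0; i < n; i++)
        result = (result << 5) - result + hashes[i];
    return result;
}
```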
>>
>> Since this is all modulo 2^32 arithmetic, the powers of 31 will
>> eventually start repeating, and at that point the hashing algorithm
>> could be defeated by transpositions. However, the number 31 has the
>> property that its powers don't repeat for a long time - the powers of
>> 31 modulo 2^32 form a cyclic group with a multiplicative order of 2^27
>> (134217728). In other words 31^134217728 = 1 mod 2^32, and there are
>> no smaller (strictly positive) powers of 31 for which this is the
>> case.
>>
>> So the multiply-by-31 algorithm is only vulnerable to transpositions
>> once the arrays reach 134217728 elements.
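[The order claim is cheap to check by repeated squaring — 31^(2^27) wraps back to 1 modulo 2^32 while 31^(2^26) does not — with a small helper (illustrative, not from the patch):]

```c
#include <stdint.h>

/* Compute b^(2^k) mod 2^32 by k successive squarings; unsigned
 * overflow performs the mod-2^32 reduction for free. */
uint32_t pow_2k_mod_2_32(uint32_t b, int k)
{
    uint32_t r = b;

    for (int i = 0; i < k; i++)
        r *= r;
    return r;
}
```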
>>
>> For all smaller arrays, each array element's hash value is multiplied
>> by a different number from all the other elements, and since
>> all the multipliers are odd numbers, *all* the individual bits from
>> each element's hash value are distributed (differently) in the final
>> value.
>>
>> Of course there are still going to be pathological cases, but they are
>> very difficult to construct deliberately, and extremely unlikely to
>> occur randomly. ISTM that this has all the properties of a good
>> hashing algorithm (possibly the Java folks did a similar analysis and
>> came to the same conclusion).
>
> Yes, I never was very happy with the way that we were doing this, and
> I think you make a coherent argument for why we should do it
> differently.
>
Doh! I forgot one important piece of this algorithm - it is necessary
to initialise the result to something non-zero at the start so that
adding leading nulls to an array changes the final result.
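A sketch of the difference the non-zero seed makes (the seed value 1 and treating a null as contributing 0 to the sum are illustrative assumptions, not necessarily what the patch does):

```c
#include <stdint.h>
#include <stdbool.h>

/* Multiply-by-31 combine with a non-zero seed.  A null element adds
 * nothing, but still multiplies the running result by 31; because
 * the seed is non-zero, a leading null shifts every later element's
 * multiplier and so changes the final value.  With a zero seed, a
 * leading null would leave the result unchanged. */
uint32_t hash_array_seeded(const uint32_t *hashes, const bool *isnull, int n)
{
    uint32_t result = 1;                    /* non-zero seed */

    for (int i = 0; i < n; i++)
    {
        result = (result << 5) - result;    /* result *= 31 */
        if (!isnull[i])
            result += hashes[i];
    }
    return result;
}
```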
Regards,
Dean
Attachment: array-hashing2.patch (application/octet-stream, 1.7 KB)