Re: 9.1 support for hashing arrays

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.1 support for hashing arrays
Date: 2011-05-20 05:43:39
Message-ID: BANLkTinGShCT-+bBDDuGPP=N3PQKYZ-6eQ@mail.gmail.com
Lists: pgsql-hackers

On 19 May 2011 15:33, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, May 17, 2011 at 2:44 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>> The algorithm for this was discussed in the original thread
>> (http://archives.postgresql.org/pgsql-hackers/2010-10/msg02050.php)
>> but I don't think a satisfactory conclusion was really reached.
>> In particular, it is way too easy to come up with pathological cases
>> that defeat the hashing algorithm, for example:
>>
>> CREATE TABLE foo(a int[][]);
>> INSERT INTO foo SELECT array_fill(i, ARRAY[8,8])
>>  FROM generate_series(1,10000) g(i);
>>
>> All 10000 arrays are different, but they all have the same hash value
>> (0), so if the query optimiser chooses to hash the arrays, the
>> performance will be very poor.
>>
>> A few people on that thread (myself included -
>> http://archives.postgresql.org/pgsql-hackers/2010-11/msg00123.php)
>> suggested using the multiply-by-31 algorithm but I think I failed to
>> properly make the case for it. Having given it some further thought, I
>> think there are some very sound mathematical reasons why that
>> algorithm performs well:
>>
>> The algorithm is to take the current hash total, multiply it by 31 and
>> then add on the hash of the next element. The final result is a
>> polynomial sum, where each element's hash value is multiplied by a
>> different power of 31.
>>
>> Since this is all modulo 2^32 arithmetic, the powers of 31 will
>> eventually start repeating, and at that point the hashing algorithm
>> could be defeated by transpositions. However, the number 31 has the
>> property that its powers don't repeat for a long time - the powers of
>> 31 modulo 2^32 form a cyclic group with a multiplicative order of 2^27
>> (134217728). In other words 31^134217728 = 1 mod 2^32, and there are
>> no smaller (strictly positive) powers of 31 for which this is the
>> case.
>>
>> So the multiply-by-31 algorithm is only vulnerable to transpositions
>> once the arrays reach 134217728 elements.
>>
>> For all smaller arrays, each array element's hash value is multiplied
>> by a different number from all the other elements, and since
>> all the multipliers are odd numbers, *all* the individual bits from
>> each element's hash value are distributed (differently) in the final
>> value.
>>
>> Of course there are still going to be pathological cases, but they are
>> very difficult to construct deliberately, and extremely unlikely to
>> occur randomly. ISTM that this has all the properties of a good
>> hashing algorithm (possibly the Java folks did a similar analysis and
>> came to the same conclusion).
>
> Yes, I never was very happy with the way that we were doing this, and
> I think you make a coherent argument for why we should do it
> differently.
>
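
As a quick sanity check on the 2^27 claim above, the multiplicative
order of 31 modulo 2^32 can be verified by brute force with a few lines
of C (verification only, not part of the patch):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint32_t    x = 31;
    uint64_t    order = 1;

    /* Repeatedly multiply by 31; uint32_t wraps, i.e. arithmetic mod 2^32. */
    while (x != 1)
    {
        x *= 31;
        order++;
    }

    /* Prints 134217728, i.e. 2^27. */
    printf("multiplicative order of 31 mod 2^32 = %llu\n",
           (unsigned long long) order);
    return 0;
}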

Doh! I forgot one important piece of this algorithm - it is necessary
to initialise the result to something non-zero at the start so that
adding leading nulls to an array changes the final result.
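
For illustration, the combine step (including the non-zero initial
value) boils down to something like this - a sketch only, not the code
in the attached patch, and the function and parameter names are made up:

#include <stddef.h>
#include <stdint.h>

/*
 * Combine per-element hash values into a single array hash.
 * Starting from a non-zero seed means that prepending elements whose
 * hash is zero (e.g. leading NULLs) still changes the final result.
 */
static uint32_t
combine_element_hashes(const uint32_t *elem_hashes, size_t nelems)
{
    uint32_t    result = 1;     /* non-zero initial value */
    size_t      i;

    for (i = 0; i < nelems; i++)
        result = result * 31 + elem_hashes[i];  /* wraps mod 2^32 */

    return result;
}

Unrolling the loop gives the polynomial form described above: for n
elements with hashes h1..hn, the result is
31^n + 31^(n-1)*h1 + ... + 31*h(n-1) + hn, all modulo 2^32.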

Regards,
Dean

Attachment Content-Type Size
array-hashing2.patch application/octet-stream 1.7 KB
