From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: 9.1 support for hashing arrays |
Date: | 2011-05-19 14:33:25 |
Message-ID: | BANLkTikx8zcek855j=t8tv-pS1xyq6KFiQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, May 17, 2011 at 2:44 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> The algorithm for this was discussed in the original thread
> (http://archives.postgresql.org/pgsql-hackers/2010-10/msg02050.php)
> but I don't that think a satisfactory conclusion was really reached.
> In particular, it is way too easy to come up with pathological cases
> that defeat the hashing algorithm, for example:
>
> CREATE TABLE foo(a int[][]);
> INSERT INTO foo SELECT array_fill(i, ARRAY[8,8])
> FROM generate_series(1,10000) g(i);
>
> All 10000 arrays are different, but they all have the same hash value
> (0), so if the query optimiser chooses to hash the arrays, the
> performance will be very poor.
>
> A few people on that thread (myself included -
> http://archives.postgresql.org/pgsql-hackers/2010-11/msg00123.php)
> suggested using the multiply-by-31 algorithm but I think I failed to
> properly make the case for it. Having given it some further thought, I
> think there are some very sound mathematical reasons why that
> algorithm performs well:
>
> The algorithm is to take the current hash total, multiply it by 31 and
> then add on the hash of the next element. The final result is a
> polynomial sum, where each element's hash value is multiplied by a
> different power of 31.
>
> Since this is all modulo 2^32 arithmetic, the powers of 31 will
> eventually start repeating, and at that point the hashing algorithm
> could be defeated by transpositions. However, the number 31 has the
> property that its powers don't repeat for a long time - the powers of
> 31 modulo 2^32 form a cyclic group with a multiplicative order of 2^27
> (134217728). In other words 31^134217728 = 1 mod 2^32, and there are
> no smaller (strictly positive) powers of 31 for which this is the
> case.
>
> So the multiply-by-31 algorithm is only vulnerable to transpositions
> once the arrays reach 134217728 elements.
>
> For all smaller arrays, each array element's hash value is multiplied
> by a number different number from all the other elements, and since
> all the multipliers are odd numbers, *all* the individual bits from
> each element's hash value are distributed (differently) in the final
> value.
>
> Of course there are still going to be pathological cases, but they are
> very difficult to construct deliberately, and extremely unlikely to
> occur randomly. ISTM that this has all the properties of a good
> hashing algorithm (possibly the Java folks did a similar analysis and
> came to the same conclusion).
Yes, I never was very happy with the way that we were doing this, and
I think you make a coherent argument for why we should do it
differently.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Selena Deckelmann | 2011-05-19 14:44:40 | Re: Adding an example for replication configuration to pg_hba.conf |
Previous Message | Robert Haas | 2011-05-19 14:32:32 | Re: Cannot build docs of 9.1 on Windows |