Re: Hash Functions

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Joe Conway <mail(at)joeconway(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, amul sul <sulamul(at)gmail(dot)com>
Subject: Re: Hash Functions
Date: 2017-08-16 17:16:50
Message-ID: CA+TgmoaBy0qJfYKo9XK2==kfiED1DbiKeKDVk0DbpcJiCDBg5A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Aug 16, 2017 at 12:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> After some further thought, I propose the following approach to the
>> issues raised on this thread:
>
>> 1. Allow hash functions to have a second, optional support function,
>> similar to what we did for btree opclasses in
>> c6e3ac11b60ac4a8942ab964252d51c1c0bd8845. The second function will
>> have a signature of (opclass_datatype, int64) and should return int64.
>> The int64 argument is a salt. When the salt is 0, the low 32 bits of
>> the return value should match what the existing hash support function
>> returns. Otherwise, the salt should be used to perturb the hash
>> calculation.
>
> +1

Cool.

>> 2. Introduce a new hash opfamilies here which are more faster, more
>> portable, and/or better in other ways than the ones we have today.
>
> This part seems, uh, under-defined and/or over-ambitious and/or unrelated
> to the problem at hand. What are the concrete goals?

In my view, it's for the person who proposed a new opclass to say what
goals they're trying to satisfy with that opclass. A variety of goals
that could justify a new opclass have been proposed upthread --
especially in Jeff Davis's original post (q.v.). Such goals could
include endian-ness independence, collation-independence, speed,
better bit-mixing, and opfamilies that span more types than our
currently ones. These goals are probably mutually exclusive, in the
sense that endian-ness independence and collation-independence are
probably pulling in the exact opposite direction as speed, so
conceivably there could be multiple opclasses proposed with different
trade-offs. I take no position for the present on which of those
would be worth accepting into core.

I agree with you that all of this is basically unrelated to the
problem at hand, if "the problem at hand" means "hash partitioning".
In my mind, there are two really serious issues that have been raised
on that front. One is the problem of hash joins/aggregates/indexes on
hash-partitioned tables coming out lopsided, and adding an optional
salt will let us fix that problem. The other is that if you migrate
your data to a different encoding or endianness, you might have
dump/reload problems, but IMHO the already-committed patch for
--load-via-partition-root is as much as really *needs* to be done
there. I am less skeptical about the idea of endianness-independent
hash functions than he is, but "we can't have
$FEATURE_MANY_PEOPLE_WANT until we solve
$PROBLEM_ANDRES_THINKS_IS_NOT_PRACTICALLY_SOLVABLE" is not a route to
swift victory.

In short, I'm proposing to add a method to seed the existing hash
functions and get 64 bits out of them, and leaving any other potential
improvements to someone who wants to argue for them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2017-08-16 17:19:29 Re: Atomics for heap_parallelscan_nextpage()
Previous Message Andres Freund 2017-08-16 17:15:51 Re: Orphaned files in base/[oid]