Re: Hash Functions

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, amul sul <sulamul(at)gmail(dot)com>
Subject: Re: Hash Functions
Date: 2017-05-12 18:23:14
Message-ID: CA+TgmobDuGr+cUv4d_LJncbaSM-tYumF8Tp83QyF1Fq+BSrtFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 12, 2017 at 1:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I'd vote that it's not, which means that this whole approach to hash
> partitioning is unworkable. I agree with Andres that demanding hash
> functions produce architecture-independent values will not fly.

If we can't produce architecture-independent hash values, then what's
the other option?

One alternative would be to change the way that we dump and restore
the data. Instead of dumping the data with the individual partitions,
dump it all out for the parent and let tuple routing sort it out at
restore time. Of course, this isn't very satisfying because now
dump-and-restore hasn't really preserved the state of the database;
indeed, you could make it into a hard failure by creating a foreign
key referencing a partition hash partition. After dump-and-restore,
the row ends up in some other partition and the foreign key can't be
recreated because the relationship no longer holds. This isn't
limited to foreign keys, either; similar problems could be created
with CHECK constraints or other per-table properties that can vary
between one child and another.

I basically think it's pretty futile to suppose that we can get away
with having a dump and restore move rows around between partitions
without having that blow up in some cases.

> Maintaining such a property for float8 (and the types that depend on it)
> might be possible if you believe that nobody ever uses anything but IEEE
> floats, but we've never allowed that as a hard assumption before.

I don't know how standard that is. Is there any hardware that
anyone's likely to be using that doesn't? TBH, I don't really care if
support for obscure, nearly-dead platforms like VAX or whatever don't
quite work with hash-partitioned tables. In practice, PostgreSQL only
sorta works on that kind of platform anyway; there are far bigger
problems than this. On the other hand, if there are servers being
shipped in 2017 that don't use IEEE floats, that's another problem.

What about integers? I think we're already assuming two's-complement
arithmetic, which I think means that the only problem with making the
hash values portable for integers is big-endian vs. little-endian.
That's sounds solveable-ish.

> Even architecture dependence isn't the whole scope of the problem.
> Consider for example dumping a LATIN1-encoded database and trying
> to reload it into a UTF8-encoded database. People will certainly
> expect that to be possible, and do you want to guarantee that the
> hash of a text value is encoding-independent?

No, I think that's expecting too much. I'd be just fine telling
people that if you hash-partition on a text column, it may not load
into a database with another encoding. If you care about that, don't
use hash-partitioning, or don't change the encoding, or dump out the
partitions one by one and reload all the roads into the parent table
for re-routing, solving whatever problems come up along the way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-05-12 18:45:47 Re: Hash Functions
Previous Message Peter Eisentraut 2017-05-12 17:54:08 Re: renaming "transaction log"