Question regarding fast-hashing in PGSQL

From: Stephen Conley <cheetah(at)tanabi(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Question regarding fast-hashing in PGSQL
Date: 2019-09-18 16:41:28
Message-ID: CAFbQXuFVR6E0vXgn2Pmz9zZj5wLG56aPg1GptUfW7NtYP6NNSA@mail.gmail.com
Lists: pgsql-performance

Hey there;

I have a weird use case where I am basically taking data from many
different sources and merging it into a single table, while trying to avoid
duplicates as much as possible. None of the sources share any kind of
primary key, but I have determined 3 columns that, together, will almost
always be unique, so I am planning on using those 3 columns as a composite
primary key.

Two of those columns are integers, which is great. The third column is a
UTF-8 string, which may be quite long (though probably no longer than 50
characters -- on average probably around 10 - 30 characters). The strings
could be practically anything, and they absolutely will not be unique on
their own. (These three data points are basically x, y coordinates and
then some name; for a given x, y coordinate there may be multiple names,
but the likelihood of the same name at the same x, y is almost zero.)
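
To make that concrete, the table I have in mind looks roughly like this
(table and column names invented for illustration):

CREATE TABLE placenames (
    x    integer NOT NULL,
    y    integer NOT NULL,
    name text    NOT NULL,
    PRIMARY KEY (x, y, name)
);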

I really don't want to do a string comparison if possible, because this DB
will be getting very large, very quickly -- 20 million or so rows
anticipated in the near future (i.e. the next few weeks), with possible
growth up to around 200 million (a year or more out).

My idea was to hash the string to a bigint, because the likelihood of all 3
columns colliding is almost 0, and if a duplicate does crop up, it isn't
the end of the world.
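
In other words, I would keep the name around for display but key on a
64-bit hash of it instead, something like (again, names invented):

CREATE TABLE placenames (
    x         integer NOT NULL,
    y         integer NOT NULL,
    name_hash bigint  NOT NULL,  -- 64-bit hash of name
    name      text    NOT NULL,
    PRIMARY KEY (x, y, name_hash)
);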

However, PostgreSQL doesn't seem to have any 'native' hashing calls that
result in a bigint. The closest I've found is pgcrypto's 'digest' call --
I could theoretically take an MD5 hash and just use the first 8 bytes of
it to make a bigint.

HOWEVER... there is no straightforward way to do this. The most
straightforward way I've seen is md5 -> hex string -> substring -> bigint.
This is ridiculous to me -- I'm basically converting binary to text, then
converting text back to binary. And as far as I can tell, you can't convert
a bytea to a bigint directly, so I have to eat a bunch of overhead for fun.
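
For the record, the least-bad version of that dance I've come up with uses
the (as far as I know undocumented) 'x' prefix trick for casting a hex
string to bit(64):

SELECT ('x' || substr(md5('some name'), 1, 16))::bit(64)::bigint;

md5() returns 32 hex characters; I keep the first 16 (8 bytes), and the
bit(64) cast reinterprets those 64 bits as a signed bigint.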

What would be the fastest way to do this? I will be generating potentially
a LOT of these keys, so I want to do it the least dumb way. I am using
Digital Ocean's hosted PostgreSQL, so I can't use my own C code -- but I
can use PL/pgSQL, PL/Perl, or any of these extensions:

https://www.digitalocean.com/docs/databases/postgresql/resources/supported-extensions/
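
If it helps frame the question: whatever expression wins, my plan would be
to wrap it in an immutable SQL function and use that to populate the key
column, e.g. (function name is mine):

CREATE FUNCTION name_hash(t text) RETURNS bigint
    LANGUAGE sql IMMUTABLE STRICT
AS $$ SELECT ('x' || substr(md5(t), 1, 16))::bit(64)::bigint $$;

-- used like:
-- INSERT INTO placenames (x, y, name_hash, name)
-- VALUES (1, 2, name_hash('some name'), 'some name');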

If my concerns about string comparisons are unfounded and I'm working way
too hard to avoid something that doesn't matter ... feel free to tell me
that as well. Basically, PostgreSQL performance guys, how would you tackle
this one?

Thanks,

Stephen
