Re: Storing and comparing columns of cryptographic hashes?

From: Justin <zzzzz(dot)graf(at)gmail(dot)com>
To: Josh Triplett <josh(at)joshtriplett(dot)org>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Storing and comparing columns of cryptographic hashes?
Date: 2024-04-10 00:06:53
Message-ID: CALL-XeNCDzR-Kk49Z9SLwZ7zJ1jAQLd+FXrGtkL+oxRSYr3_Dw@mail.gmail.com
Lists: pgsql-general

On Mon, Apr 8, 2024 at 10:08 AM Josh Triplett <josh(at)joshtriplett(dot)org> wrote:

>
>
> - Is there a way to tell postgresql "this column contains cryptographic
> hashes, so you can do hash joins using any subset of the bits, without
> having to hash them again"? If not, should there be?
>

If you know the specific subset of the bits from the hash value ahead of
time, you can create an index on an expression that extracts that subset,
and the planner will use it. Consider:

create table dd2 (id integer, hash_b bytea); -- create example table
insert into dd2 values (1, '\x1234567890abcdef'::bytea); -- throw a record at it
create index on dd2 (substr(hash_b, 1, 5)); -- btree index on a subset of the value
select * from dd2 where substr(hash_b, 1, 5) = '\x1234567890';
-- this query will use the index once there is enough data to justify an index scan
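
To confirm the planner actually uses the expression index once the table
is big enough, a quick check along these lines works (the index name shown
is the auto-generated one and may differ):

explain select * from dd2 where substr(hash_b, 1, 5) = '\x1234567890';
-- expect an Index Scan or Bitmap Index Scan on dd2_substr_idx rather than a Seq Scan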

> - Is `bit(256)` the right type to use to store 32-byte hash values with
> no additional overhead?
>

Keep them in the bytea format; you can easily encode bytea as hex for
debugging purposes.

select encode(hash_b, 'hex') from dd2
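
Going the other way, hex text coming from the application side can be
turned back into bytea with decode(), for example (using the example
table above):

insert into dd2 values (2, decode('deadbeef', 'hex'));
select * from dd2 where hash_b = decode('deadbeef', 'hex');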

The larger the data size, the slower all operations are; it is best to
keep the values in the same data type instead of casting in and out of
different data types.

> - What would be the simplest way, given an input array of hashes (which
> I may have to pass in as an array and use `unnest`), to filter out all
> the values that already exist, *and* generate a corresponding bitmap
> in the same order for present/not-present for the entire array (to
> send back to the user)? Filtering seems easy enough, but generating
> the bitmap less so.
>

If you have a unique index on the bytea hash column, you can unnest the
array and insert the values with ON CONFLICT:

CREATE unique index on dd2 (hash_b);
INSERT into dd2 (hash_b)
SELECT * from unnest(array['\x12345678', '\x123456', '\x123457',
    '\x123458', '\x123459']::bytea[])
ON CONFLICT (hash_b) DO NOTHING
RETURNING hash_b; -- this will return only the rows that got inserted

Then you can compare the returned rows against the input array to know
which values are present vs not present.
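
To also get a present/not-present flag in the same order as the input
array (the bitmap part of the question), one option is a sketch like the
following; unnest ... WITH ORDINALITY preserves the input order, and the
array literal here just stands in for whatever the application passes:

SELECT t.ord, t.h,
       EXISTS (SELECT 1 FROM dd2 d WHERE d.hash_b = t.h) AS present
FROM unnest(array['\x12345678', '\xdeadbeef']::bytea[])
     WITH ORDINALITY AS t(h, ord)
ORDER BY t.ord;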

> - Does it make more sense to store the values as one row per value, or
> as one row per group of values? I know that postgresql can store an
> entire array in one column; could that efficiently support operations
> like "tell me which of these objects don't exist in any array in this
> column" or "for all of these objects, tell me all the group-id values
> for rows containing them"?
>

Putting them into arrays to keep them grouped together adds additional
overhead to unnest for searching or join operations. If the data set is
going to be large and you have to scan over arrays containing many hash
values, that is a lot of data processing. Arrays can be indexed, but you
have to use a GIN index, which has many drawbacks compared to btree. None
of the above queries are possible with GIN indexes or array columns
without a lot more code.
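
For comparison, a grouped-array layout would look something like this
hypothetical sketch (dd2_groups is an invented name); the "which group-ids
contain this hash" question then becomes an array containment check served
by the GIN index:

CREATE TABLE dd2_groups (group_id integer, hashes bytea[]);
CREATE INDEX ON dd2_groups USING gin (hashes);
INSERT INTO dd2_groups VALUES (1, array['\x12345678', '\xdeadbeef']::bytea[]);
SELECT group_id FROM dd2_groups
WHERE hashes @> array['\x12345678']::bytea[]; -- which groups contain this hash?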

Arrays are not data sets; if the design needs to access a specific hash
value to update it, delete it, or append new values, an array is probably
not the best solution.
Hope this helps
Justin
