Re: [HACKERS] How to implement a SP-GiST index as a extension module?

From: Connor Wolf <connorw(at)imaginaryindustries(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] How to implement a SP-GiST index as a extension module?
Date: 2017-11-14 03:08:28
Message-ID: CAAVqP=ra29L6bUc5n3mjzG5+Mv37i+uaGCvh_kzL6x1FSo1hXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 13, 2017 at 2:09 AM, Alexander Korotkov <
a(dot)korotkov(at)postgrespro(dot)ru> wrote:

> Hi!
>
> On Mon, Nov 13, 2017 at 6:47 AM, Connor Wolf <
> connorw(at)imaginaryindustries(dot)com> wrote:
>
>> Ok, I've managed to get my custom index working.
>>
>
> Good!
>
> It's all on github here: https://github.com/fake-name/pg-spgist_hamming,
>> if anyone else needs a fuzzy-image searching system
>> that can integrate into postgresql..
>>
>> It should be a pretty good basis for anyone else to use if they want to
>> implement a SP-GiST index too.
>>
>
> I took a look at the code, and I feel myself a bit confused :)
> It appears that you're indexing int8 values. That seems like unrealistic
> short representation for image signature.
>

It is a int8, and nope, it's a surprisingly robust and functional
signature. There's a document describing the hashing mechanism here:
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html

Functionally, the procedure is relatively simple:

- Convert to greyscale
- Resize to intermediate resolution (32x32 is common)
- Perform DCT on 32x32 image.
- Crop 32x32 image to 8x8 by throwing away the high-frequency components
- Threshold the 8x8 image by it's average
- Serialize the 64 binary values into a int8

In my case, the actual implementation is here:
https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/scanner/hashFile.py#L95-L102

> Also, name of repository make me think that hamming distance would be used
> to compare signatures. But after look at the code, I see that plain
> absolute value of difference is used for that purpose.
>
> static double
> getDistance(Datum v1, Datum v2)
> {
> int64_t a1 = DatumGetInt64(v1);
> int64_t a2 = DatumGetInt64(v2);
> int64_t diff = Abs(a1 - a2);
> fprintf_to_ereport("getDistance %ld <-> %ld : %ld", a1, a2, diff);
> return diff;
> }
>
> For such notion of distance, you don't need a VP-tree or another complex
> indexing. B-tree is quite enough in this case. Alternatively, distance
> function is not what it meant to be.
>
>
You're looking in the wrong place.

https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree is the
code you sent me, with some simplification to make it only work on single
integers. Basically,
before I started on my own stuff, I wanted to make sure I could at least
implement a functional index using a much more basic structure.

https://github.com/fake-name/pg-spgist_hamming/tree/master/bktree is the
actual BK-tree index, and it does indeed use hamming distance for the
search metric:

static int64_t
f_hamming(int64_t a_int, int64_t b_int)
{
/*
Compute number of bits that are not common between `a` and `b`.
return value is a plain integer
*/
uint64_t x = (a_int ^ b_int);
uint64_t ret = __builtin_popcountll (x);

return ret;
}
(see
https://github.com/fake-name/pg-spgist_hamming/blob/master/bktree/bktree.c#L38-L58
).

> It would be useful if you provide complete usage example of this
> extension: from image to signature conversion to search queries.
>

Actual usage is done with this project:
https://github.com/fake-name/IntraArchiveDeduplicator, which
also has the older in-memory BK tree
<https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp>
I've
implemented, and it's actually used here
<https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/dbPhashApi.py#L173>
.
I also have unit tests that sit on top of this here
<https://github.com/fake-name/IntraArchiveDeduplicator/tree/master/Tests> (see
all the files that are named "Test_db_BKTree...".

Connor

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2017-11-14 04:00:12 Re: [HACKERS] path toward faster partition pruning
Previous Message Andres Freund 2017-11-14 03:03:41 Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple