| From: | Connor Wolf <connorw(at)imaginaryindustries(dot)com> | 
|---|---|
| To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: [HACKERS] How to implement a SP-GiST index as a extension module? | 
| Date: | 2017-11-14 03:08:28 | 
| Message-ID: | CAAVqP=ra29L6bUc5n3mjzG5+Mv37i+uaGCvh_kzL6x1FSo1hXw@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On Mon, Nov 13, 2017 at 2:09 AM, Alexander Korotkov <
a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> Hi!
>
> On Mon, Nov 13, 2017 at 6:47 AM, Connor Wolf <
> connorw(at)imaginaryindustries(dot)com> wrote:
>
>> Ok, I've managed to get my custom index working.
>>
>
> Good!
>
> It's all on github here: https://github.com/fake-name/pg-spgist_hamming,
>> if anyone else needs a fuzzy-image searching system
>> that can integrate into postgresql..
>>
>> It should be a pretty good basis for anyone else to use if they want to
>> implement a SP-GiST index too.
>>
>
> I took a look at the code, and I feel myself a bit confused :)
> It appears that you're indexing int8 values.  That seems like unrealistic
> short representation for image signature.
>
It is a int8, and nope, it's a surprisingly robust and functional
signature. There's a document describing the hashing mechanism here:
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html
Functionally, the procedure is relatively simple:
   - Convert to greyscale
   - Resize to intermediate resolution (32x32 is common)
   - Perform DCT on 32x32 image.
   - Crop 32x32 image to 8x8 by throwing away the high-frequency components
   - Threshold the 8x8 image by it's average
   - Serialize the 64 binary values into a int8
In my case, the actual implementation is here:
https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/scanner/hashFile.py#L95-L102
> Also, name of repository make me think that hamming distance would be used
> to compare signatures.  But after look at the code, I see that plain
> absolute value of difference is used for that purpose.
>
> static double
> getDistance(Datum v1, Datum v2)
> {
>     int64_t a1 = DatumGetInt64(v1);
>     int64_t a2 = DatumGetInt64(v2);
>     int64_t diff = Abs(a1 - a2);
>     fprintf_to_ereport("getDistance %ld <-> %ld : %ld", a1, a2, diff);
>     return diff;
> }
>
> For such notion of distance, you don't need a VP-tree or another complex
> indexing.  B-tree is quite enough in this case.  Alternatively, distance
> function is not what it meant to be.
>
>
You're looking in the wrong place.
https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree is the
code you sent me, with some simplification to make it only work on single
integers. Basically,
before I started on my own stuff, I wanted to make sure I could at least
implement a functional index using a much more basic structure.
https://github.com/fake-name/pg-spgist_hamming/tree/master/bktree is the
actual BK-tree index, and it does indeed use hamming distance for the
search metric:
static int64_t
f_hamming(int64_t a_int, int64_t b_int)
{
/*
Compute number of bits that are not common between `a` and `b`.
return value is a plain integer
*/
uint64_t x = (a_int ^ b_int);
uint64_t ret = __builtin_popcountll (x);
return ret;
}
(see
https://github.com/fake-name/pg-spgist_hamming/blob/master/bktree/bktree.c#L38-L58
).
> It would be useful if you provide complete usage example of this
> extension: from image to signature conversion to search queries.
>
Actual usage is done with this project:
https://github.com/fake-name/IntraArchiveDeduplicator, which
also has the older in-memory BK tree
<https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp>
I've
implemented, and it's actually used here
<https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/dbPhashApi.py#L173>
.
I also have unit tests that sit on top of this here
<https://github.com/fake-name/IntraArchiveDeduplicator/tree/master/Tests> (see
all the files that are named "Test_db_BKTree...".
Connor
| From | Date | Subject | |
|---|---|---|---|
| Next Message | David Rowley | 2017-11-14 04:00:12 | Re: [HACKERS] path toward faster partition pruning | 
| Previous Message | Andres Freund | 2017-11-14 03:03:41 | Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple |