From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
---|---|
To: | Connor Wolf <connorw(at)imaginaryindustries(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] How to implement a SP-GiST index as a extension module? |
Date: | 2017-11-14 09:53:53 |
Message-ID: | CAPpHfduY_JFqe35dz_T9L=WhovUxJYMYPw7w2moYSxO-eQuUOw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Nov 14, 2017 at 6:08 AM, Connor Wolf <
connorw(at)imaginaryindustries(dot)com> wrote:
>
> On Mon, Nov 13, 2017 at 2:09 AM, Alexander Korotkov <
> a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
>> Hi!
>>
>> On Mon, Nov 13, 2017 at 6:47 AM, Connor Wolf <
>> connorw(at)imaginaryindustries(dot)com> wrote:
>>
>>> Ok, I've managed to get my custom index working.
>>>
>>
>> Good!
>>
>> It's all on github here: https://github.com/fake-name/pg-spgist_hamming,
>>> if anyone else needs a fuzzy-image searching system
>>> that can integrate into postgresql..
>>>
>>> It should be a pretty good basis for anyone else to use if they want to
>>> implement a SP-GiST index too.
>>>
>>
>> I took a look at the code, and I feel myself a bit confused :)
>> It appears that you're indexing int8 values. That seems like unrealistic
>> short representation for image signature.
>>
>
> It is a int8, and nope, it's a surprisingly robust and functional
> signature. There's a document describing the hashing mechanism here:
> http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html
>
> Functionally, the procedure is relatively simple:
>
> - Convert to greyscale
> - Resize to intermediate resolution (32x32 is common)
> - Perform DCT on 32x32 image.
> - Crop 32x32 image to 8x8 by throwing away the high-frequency
> components
> - Threshold the 8x8 image by it's average
> - Serialize the 64 binary values into a int8
>
> In my case, the actual implementation is here: https://github.com/fake-
> name/IntraArchiveDeduplicator/blob/master/scanner/hashFile.py#L95-L102
>
>
>> Also, name of repository make me think that hamming distance would be
>> used to compare signatures. But after look at the code, I see that plain
>> absolute value of difference is used for that purpose.
>>
>> static double
>> getDistance(Datum v1, Datum v2)
>> {
>> int64_t a1 = DatumGetInt64(v1);
>> int64_t a2 = DatumGetInt64(v2);
>> int64_t diff = Abs(a1 - a2);
>> fprintf_to_ereport("getDistance %ld <-> %ld : %ld", a1, a2, diff);
>> return diff;
>> }
>>
>> For such notion of distance, you don't need a VP-tree or another complex
>> indexing. B-tree is quite enough in this case. Alternatively, distance
>> function is not what it meant to be.
>>
>>
> You're looking in the wrong place.
>
> https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree is the
> code you sent me, with some simplification to make it only work on single
> integers. Basically,
> before I started on my own stuff, I wanted to make sure I could at least
> implement a functional index using a much more basic structure.
>
> https://github.com/fake-name/pg-spgist_hamming/tree/master/bktree is the
> actual BK-tree index, and it does indeed use hamming distance for the
> search metric:
>
> static int64_t
> f_hamming(int64_t a_int, int64_t b_int)
> {
> /*
> Compute number of bits that are not common between `a` and `b`.
> return value is a plain integer
> */
> uint64_t x = (a_int ^ b_int);
> uint64_t ret = __builtin_popcountll (x);
>
> return ret;
> }
> (see https://github.com/fake-name/pg-spgist_hamming/blob/
> master/bktree/bktree.c#L38-L58).
>
>
>> It would be useful if you provide complete usage example of this
>> extension: from image to signature conversion to search queries.
>>
>
>
>
> Actual usage is done with this project: https://github.com/fake-name/
> IntraArchiveDeduplicator, which
> also has the older in-memory BK tree
> <https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp> I've
> implemented, and it's actually used here
> <https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/dbPhashApi.py#L173>
> .
> I also have unit tests that sit on top of this here
> <https://github.com/fake-name/IntraArchiveDeduplicator/tree/master/Tests> (see
> all the files that are named "Test_db_BKTree...".
>
OK. That explains the things, thank you.
For such kind of index, it's probably not even necessary to use SP-GiST.
GiST signature tree could work in this case as well (it would be probably
even better).
It would be nice if you write about it some blog post to planet PostgreSQL
<https://planet.postgresql.org/>.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2017-11-14 10:42:46 | Re: [HACKERS] Pluggable storage |
Previous Message | David Rowley | 2017-11-14 09:51:47 | Re: [HACKERS] Proposal: Local indexes for partitioned table |