Re: [HACKERS] How to implement a SP-GiST index as a extension module?

From: Connor Wolf <connorw(at)imaginaryindustries(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] How to implement a SP-GiST index as a extension module?
Date: 2017-11-15 06:00:46
Message-ID: CAAVqP=q=xuC+SsDc9fwMUFPzJ39obsZURriFjTnKTLxb-N5AWA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I initially assumed that as well, but I found some somewhat confusing
documentation about implementing this as a plain GiST tree. Mostly, the
BK-Tree is a inherently unbalanced tree, and some of the documentation for
plain GiST indexes claims that GiST indexes can only be created on balanced
tree structures.

> GiST stands for Generalized Search Tree. It is a *balanced*,
tree-structured access method,

Sidenote: What is a "signature tree?"

I also had a fair bit of trouble getting going due to the strange
terminology. The use of "consistent" for "matches condition" is
particularly odd. I'm not sure if this is a translation oddment, but it
makes following things a lot more confusing. I suspect this is set match
bleeding into the implementation, but I don't see any reason the
"consistent" functions couldn't be simply named "matches_query_conditions",
with the benefit of no longer requiring familiarity with mathematics
terminology to easily understand how things are supposed to work.

Additionally, I don't see how to map some of the calls to a BK-tree. In
particular, there's no way to implement the GiST "union" call in the
context of a b-tree, since a single tree node can only represent a single
value.

Also, the lack of decent examples for the GiST index is challenging.
Really, it'd be really helpful if there was a e a GiST and SP-GiST index
example that only operates on one simple, one-dimensional datatype. I'm not
sure about other people, but I basically always find it much more effective
to start with a working project, and modify it rather then try to start
something from scratch.

As it is, there is no example that is a good starting basis for a custom
GiST index, and there was no example that was a good starting basis for a
SP-GiST index. This is particularly true for people (like me) who haven't
worked with postgresql's source or extensions at all before this.

I actually spent some time trying to simplify the GiST b-tree example to
the point where it was possible to follow it, but all the complexity
introduced by the fact that the b-tree example works for basically every
data type (and the dynamic dispatch that entails) makes it very hard to
follow for someone new to the codebase.

I'd still like to look at converting to a pure GiST based index, but the
first step of that will probably be implementing a very basic, boring 2-ary
B-tree index across a integral data-type. That way, I can separate
implementation issues with logic issues.

Connor

On Tue, Nov 14, 2017 at 1:53 AM, Alexander Korotkov <
a(dot)korotkov(at)postgrespro(dot)ru> wrote:

> On Tue, Nov 14, 2017 at 6:08 AM, Connor Wolf <connorw(at)imaginaryindustries(dot)
> com> wrote:
>
>>
>> On Mon, Nov 13, 2017 at 2:09 AM, Alexander Korotkov <
>> a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>>
>>> Hi!
>>>
>>> On Mon, Nov 13, 2017 at 6:47 AM, Connor Wolf <
>>> connorw(at)imaginaryindustries(dot)com> wrote:
>>>
>>>> Ok, I've managed to get my custom index working.
>>>>
>>>
>>> Good!
>>>
>>> It's all on github here: https://github.com/fake-name/pg-spgist_hamming,
>>>> if anyone else needs a fuzzy-image searching system
>>>> that can integrate into postgresql..
>>>>
>>>> It should be a pretty good basis for anyone else to use if they want to
>>>> implement a SP-GiST index too.
>>>>
>>>
>>> I took a look at the code, and I feel myself a bit confused :)
>>> It appears that you're indexing int8 values. That seems like
>>> unrealistic short representation for image signature.
>>>
>>
>> It is a int8, and nope, it's a surprisingly robust and functional
>> signature. There's a document describing the hashing mechanism here:
>> http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html
>>
>> Functionally, the procedure is relatively simple:
>>
>> - Convert to greyscale
>> - Resize to intermediate resolution (32x32 is common)
>> - Perform DCT on 32x32 image.
>> - Crop 32x32 image to 8x8 by throwing away the high-frequency
>> components
>> - Threshold the 8x8 image by it's average
>> - Serialize the 64 binary values into a int8
>>
>> In my case, the actual implementation is here: https://github.com/fake-
>> name/IntraArchiveDeduplicator/blob/master/scanner/hashFile.py#L95-L102
>>
>>
>>> Also, name of repository make me think that hamming distance would be
>>> used to compare signatures. But after look at the code, I see that plain
>>> absolute value of difference is used for that purpose.
>>>
>>> static double
>>> getDistance(Datum v1, Datum v2)
>>> {
>>> int64_t a1 = DatumGetInt64(v1);
>>> int64_t a2 = DatumGetInt64(v2);
>>> int64_t diff = Abs(a1 - a2);
>>> fprintf_to_ereport("getDistance %ld <-> %ld : %ld", a1, a2, diff);
>>> return diff;
>>> }
>>>
>>> For such notion of distance, you don't need a VP-tree or another complex
>>> indexing. B-tree is quite enough in this case. Alternatively, distance
>>> function is not what it meant to be.
>>>
>>>
>> You're looking in the wrong place.
>>
>> https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree is the
>> code you sent me, with some simplification to make it only work on single
>> integers. Basically,
>> before I started on my own stuff, I wanted to make sure I could at least
>> implement a functional index using a much more basic structure.
>>
>> https://github.com/fake-name/pg-spgist_hamming/tree/master/bktree is the
>> actual BK-tree index, and it does indeed use hamming distance for the
>> search metric:
>>
>> static int64_t
>> f_hamming(int64_t a_int, int64_t b_int)
>> {
>> /*
>> Compute number of bits that are not common between `a` and `b`.
>> return value is a plain integer
>> */
>> uint64_t x = (a_int ^ b_int);
>> uint64_t ret = __builtin_popcountll (x);
>>
>> return ret;
>> }
>> (see https://github.com/fake-name/pg-spgist_hamming/blob/mas
>> ter/bktree/bktree.c#L38-L58).
>>
>>
>>> It would be useful if you provide complete usage example of this
>>> extension: from image to signature conversion to search queries.
>>>
>>
>>
>>
>> Actual usage is done with this project: https://github.com/fa
>> ke-name/IntraArchiveDeduplicator, which
>> also has the older in-memory BK tree
>> <https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp> I've
>> implemented, and it's actually used here
>> <https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/dbPhashApi.py#L173>
>> .
>> I also have unit tests that sit on top of this here
>> <https://github.com/fake-name/IntraArchiveDeduplicator/tree/master/Tests> (see
>> all the files that are named "Test_db_BKTree...".
>>
>
> OK. That explains the things, thank you.
> For such kind of index, it's probably not even necessary to use SP-GiST.
> GiST signature tree could work in this case as well (it would be probably
> even better).
> It would be nice if you write about it some blog post to planet PostgreSQL
> <https://planet.postgresql.org/>.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2017-11-15 06:17:28 Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath
Previous Message Fabien COELHO 2017-11-15 05:56:38 Re: [HACKERS] pgbench regression test failure