From: | Connor Wolf <wolf(at)imaginaryindustries(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [GENERAL] Understanding and implementing a GiST Index |
Date: | 2014-10-10 07:03:56 |
Message-ID: | 543784DC.9020107@imaginaryindustries.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general pgsql-hackers |
I'd be ok with either SP-GiST or GiST, though there are tree structures
(BK tree) that are particularly suited to this application that I
/think/ map to GiST better then SP-GiST.
Re: hamming in the RD-tree implementation: Where? I cloned the smlar git
repo, and poked around, and it looks like it only can operate on set
intersections, which is a non-starter for what I need to do. I spent a
while looking through the source, and I didn't see anything that looked
like hamming distance calculation (through I probably missed some stuff.
The indirection through `FCall2()` makes things very hard to follow).
Anyways, right now, smlar is not useful to me, because it completely
ignores the structure of incoming arrays:
postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# );
smlar
-------
1
(1 row)
For two radically different hashes (shortened for example purposes)
which compare as identical.
I spent some time trying to see if I could easily disable the array
uniquefication, and by disabling the calls to `qsort_arg`, and modifying
`numOfIntersect` so it just counts the number of times the array
elements do not match, and I'm getting the behaviour I want out of
`smlar()` at this point:
postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# );
smlar
--------
0.4375
(1 row)
postgres=# SELECT smlar(
'{0,1,1,0}'::int[],
'{1,0,1,0}'
);
smlar
-------
0.5
(1 row)
But the index does not seem to work after the changes I made, and the
debugging print statements (well, `elog()` statements) I inserted into
`cmpArrayElem()` and `numOfIntersect()` are not being hit, so I don't
understand how the index is even being built.
Anyways, I'm rambling a bit. Any input would be great.
Thanks,
Connor
On 10/9/2014 8:11 AM, Oleg Bartunov wrote:
> Just quick recommendation.
> You need to ask -hackers mailing list for that. Also, do you want GiST
> or SP-GiST ?
> We already use hamming distance in rd-tree, which implemented in GiST,
> see our presentation
> http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf
> <http://www.sai.msu.su/%7Emegera/postgres/talks/pgcon-2012.pdf>, for
> example.
>
> Oleg
>
> On Thu, Oct 9, 2014 at 11:09 AM, Connor Wolf
> <wolf(at)imaginaryindustries(dot)com <mailto:wolf(at)imaginaryindustries(dot)com>>
> wrote:
>
> Hi there!
>
> I'm trying to implement a custom indexing system using the GiST
> index framework, and I have a few questions.
> Basically, I'm trying to implement a system that allows me to
> search across a set of 64 bit integers by hamming distance. This
> is intended to be used in searching for similar images, where the
> 64 bit integer is actually a phash
> <http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html>
> of an image, and similarity between two images is reflected in the
> hamming distance between two integers.
>
> Anyways, The appropriate approach here is to use something called
> a BK tree
> <http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>,
> for which I've written some test code
> <https://github.com/fake-name/MangaCMS/blob/master/deduplicator/hamDb.py#L27>
> and I think I have a decent grip of (my test code seems to work,
> in any event).
>
> That said:
>
> * Is there a /simple/ piece of example-code for implementing a
> GiST index for a single index? I've been looking through the
> /contrib/ directory, particularly the /contrib/btree_gist/
> files, but it looks like 90% of the complexity there is
> related to supporting all the column types Postgres has,
> rather then actually tied to producing a functional index.
> * Once I have something compiling, how can I check to be sure
> that I'm actually using the indexing module I created? I can
> enable my compiled extension using `CREATE EXTENSION`, but is
> there an option for `CREATE INDEX test_index ON testing USING
> gist (val);` that lets me specify *which* GiST index is
> actually employed? Is this even a valid question?
> This is probably something that's available in one of the
> system tables somewhere, but I haven't had much luck with
> google finding out where.
> * Testing: What's the appropriate way to examine the generated
> tree structure of the index? I certainly went through a few
> bugs with my test tree system (and that was in python!). Is
> there any way to examine the index structure for debugging
> purposes?
> * Also, it looks like `ereport()` is the proper way to emit
> debugging information. Is this correct?
> * In that vein, is there any way to have information that is on
> the operations of an entire query? Looking at the number of
> tree nodes touched for a scan would be nice (and I would not
> be surprised if there is already a facility for it).
>
> Project code is here
> <https://github.com/fake-name/pg-spgist_hamming> if anyone is
> interested, any help would be great. I have very little idea what
> I'm doing.
>
> Thanks,
> Connor
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Nick Barnes | 2014-10-10 11:15:17 | FK check implementation |
Previous Message | Connor Wolf | 2014-10-10 03:09:07 | Re: Understanding and implementing a GiST Index |
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Geoghegan | 2014-10-10 07:33:08 | Re: Obsolete reference to _bt_tuplecompare() within tuplesort.c |
Previous Message | Heikki Linnakangas | 2014-10-10 06:58:30 | Re: Obsolete reference to _bt_tuplecompare() within tuplesort.c |