From: | Jay Levitt <jay(dot)levitt(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | PG Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Designing an extension for feature-space similarity search |
Date: | 2012-02-16 18:18:24 |
Message-ID: | 4F3D4870.3020704@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Tom Lane wrote:
> Jay Levitt<jay(dot)levitt(at)gmail(dot)com> writes:
>> - I'm not sure how to represent arbitrary column-like features without
>> reinventing the wheel and putting a database in the database.
>
> ISTM you could define a composite type and then create operators and an
> operator class over that type. If you were trying to make a btree
> opclass there might be a conflict with the built-in record_ops opclass,
> but since you're only interested in GIST I don't see any real
> roadblocks.
Perfect. Composite types are exactly what I need here; the application can
declare its composite type and provide distance functions for each member,
and the extension can use those to calculate similarity. How do I introspect
the composite type's pg_class to see what it contains? I assume there's a
better way than SPI on system catalogs :) Should I be using systable_*
functions from genam, or is there an in-memory tree? I feel like funcapi
gets me partway there but there's magic in the middle.
Can you think of any code that would serve as a sample, maybe whatever
creates the output for psql's \d?
> The main potential disadvantage of this is that you'd have
> the standard tuple header as overhead in index entries --- but maybe the
> entries are large enough that that doesn't matter, and in any case you
> could probably make use of the GIST "compress" method to get rid of most
> of the header. Maybe convert to MinimalTuple, for instance, if you want
> to still be able to leverage existing support code for field extraction.
Probably not worth it to save the 8 bytes; we're starting out at about 20
floats per row. But good to know for later optimization...
>
>> - Can domains have operators, or are operators defined on types?
>
> I think the current state of play is that you can have such things but
> the system will only consider them for exact type matches, so you might
> need more explicit casts than you ordinarily would. However, we only
> support domains over base types not composites, so this isn't really
> going to be a profitable direction for you anyway.
Actually, as mentioned to Alexander, I'm thinking of domains per feature,
not for the overall tuple, so birthdate<->birthdate differs from
now()<->posting_date. Sounds like that might work - I'll play.
>
>> - Does KNN-GiST run into problems when<-> returns values that don't "make
>> sense" in the physical world?
>
> Wouldn't surprise me. In general, non-strict index operators are a bad
> idea. However, if the indexed entities are records, it would be
> entirely your own business how you handled individual fields being NULL.
Yeah, that example conflated NULLs in the feature fields (we don't know your
birthdate) with <-> on the whole tuple. Oops.
I guess I can just test this by verifying that KNN-GiST ordered by distance
returns the same results as without the index.
Thanks for your help here.
Jay
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2012-02-16 18:29:10 | Re: patch for parallel pg_dump |
Previous Message | Dimitri Fontaine | 2012-02-16 17:42:26 | Re: Command Triggers |